An important use of HMMs is to decode or parse a genome into its biological components: exons, introns, regulatory regions, etc.
In this section, we will use GC content (the fraction of letters that are a C or a G) to classify the genome into high-GC regions (on average 60% G or C) and Low-GC regions (on average 60% A or T). These have different melting temperatures, different replication times across the cell cycle, and different gene density. They have also been hypothesized to have different evolutionary origins (see isochores), but this hypothesis remains controversial.
library(Biostrings)
genome_seq <- readDNAStringSet("genome.fa") #Import genome sequence
genome_seq #A DNAstringSet object
## DNAStringSet object of length 1:
## width seq names
## [1] 10000 GCTGATATTAAAGACGAAACCAA...TACCGAATACTACTGAAGCACGC seq1
After loading the examplar genome sequence (stored in a FASTA file), its rolling GC contents is quantified via the sliding window of length 100 (realized through slidingViews()
).
GC_cont_windows <- letterFrequency(DNAStringSet(slidingViews(genome_seq[[1]], 100)), "GC", as.prob = TRUE)
plot(GC_cont_windows, type="l", col="blue", xlab = "Genome position", ylab = "GC content")
From the above diagram, there are clearly several spikes of high GC regions, and our goal is to build a model that can predict those GC clusters (called CpG islands) along the genome sequence.
In the following question, we will fit a Hidden Markov Model with R package HMM
. Since HMM
only recognize an input sequence in the form of a character vector, we will first convert the letters in the DNA string into the distinct vector elements; this operation is made by stringr::str_split()
.
library(HMM)
observation <- stringr::str_split(as.character( genome_seq[[1]] ),"")[[1]]
str(observation)
## chr [1:10000] "G" "C" "T" "G" "A" "T" "A" "T" "T" "A" "A" "A" "G" "A" "C" ...
initHMM()
. Referring to the sample output, complete the following code chunk so that the transition and emission matrices are parameterized in the same way we have demonstrated in the lecture slides.hmm = initHMM(c("Background","CpG"), c("A","T","C","G"), transProbs=matrix(c(.95,.05,.1,.9),2, byrow = T),
emissionProbs=matrix(c(.25,.25,.25,.25,.2,.2,.3,.3),2,byrow = T))
print(hmm)
## $States
## [1] "Background" "CpG"
##
## $Symbols
## [1] "A" "T" "C" "G"
##
## $startProbs
## Background CpG
## 0.5 0.5
##
## $transProbs
## to
## from Background CpG
## Background 0.95 0.05
## CpG 0.10 0.90
##
## $emissionProbs
## symbols
## states A T C G
## Background 0.25 0.25 0.25 0.25
## CpG 0.20 0.20 0.30 0.30
HMM::
, and select the appropriate functions to realize the 3 key algorithms we have discussed in the lecture.P.S:
When training the HMM, use the initialized model in the previous question.
Make inferences only with the “learned” HMM parameters, not the initialized one.
The training function will return an object that need to be subsetted by $hmm
to get the HMM object.
learned_hmm <- baumWelch(hmm, observation)$hmm
post <- posterior(learned_hmm, observation)
MAP <- viterbi(learned_hmm, observation)
print(learned_hmm) # Estimated HMM parameters from data
## $States
## [1] "Background" "CpG"
##
## $Symbols
## [1] "A" "T" "C" "G"
##
## $startProbs
## Background CpG
## 0.5 0.5
##
## $transProbs
## to
## from Background CpG
## Background 0.999037656 0.0009623438
## CpG 0.002139854 0.9978601464
##
## $emissionProbs
## symbols
## states A T C G
## Background 0.2965490 0.3044674 0.2026804 0.1963033
## CpG 0.2108789 0.1925011 0.3106051 0.2860150
str(post) # The state posterior
## num [1:2, 1:10000] 0.9524 0.0476 0.9553 0.0447 0.9599 ...
## - attr(*, "dimnames")=List of 2
## ..$ states: chr [1:2] "Background" "CpG"
## ..$ index : chr [1:10000] "1" "2" "3" "4" ...
str(MAP) # The optimal state assignment
## chr [1:10000] "Background" "Background" "Background" "Background" ...
plot_df <- data.frame(pos = 1:10000,
value = c(post[2,],(MAP=="CpG")*1),
group = rep(c("Forward-backward algorithm","Viterbi algorithm"),each = 10000))
library(ggplot2)
ggplot(plot_df, aes(x=pos,y=value,group=group))+geom_line(colour = "blue")+facet_grid(~group)+theme_classic()
From the figures above, the values on the left are the probabilistic estimates of CpG island, while the values on the right are the Viterbi assignments. The output of the Viterbi algorithm can be viewed as the Bayes classification that provides binary predictions of the biological states along the genome sequence.
Given the parameters of a HMM, a meaningful question one can ask is the average proportion of states along the sequence. For example, in the GC case, one can ask what fraction of the genome are regions of CpG island.
To address this question, we need to rely on the fact that the latent states of a HMM is a Markov chain, and the states over a Markov chain reaches a stationary distribution after large number of transitions.
Since the state transition over a Markov chain can be calculated by matrix multiplication:
\[\pi^t=\pi^{(t-1)}A\] Where \(\pi^t\) is the state distribution at iteration \(t\), and \(A\) is the transition matrix.
A stationary state \(\pi\) is defined by:
\[\pi=\pi A\]
We can solve this problem from 2 angles:
A <- learned_hmm$transProbs
pi <- c(0.5,0.5) #Initial state distribution
pi2 <- pi %*% A
while( mean(abs(pi2 - pi)) > 1e-30 ){
pi <- pi2
pi2 <- pi %*% A
}
pi2 #Stationary distribution
## to
## Background CpG
## [1,] 0.6897864 0.3102136
#Alternative method
eigenvector <- eigen(t(A))$vectors[,1] #Calculate the eigen vector of A
eigenvector/sum(eigenvector)
## [1] 0.6897864 0.3102136
We could see that both of the methods yield the same result of the state stationary distribution in our CpG island model. One can conclude that, from our fitted HMM, around 0.382% of the genome over the long range will be predicted as the GC-enriched regions.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggplot2_3.3.0 HMM_1.0 Biostrings_2.57.0
## [4] XVector_0.29.0 IRanges_2.23.4 S4Vectors_0.27.5
## [7] BiocGenerics_0.35.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 compiler_4.0.3 pillar_1.4.4 tools_4.0.3
## [5] zlibbioc_1.35.0 digest_0.6.25 evaluate_0.14 lifecycle_0.2.0
## [9] tibble_3.0.1 gtable_0.3.0 pkgconfig_2.0.3 rlang_1.0.5
## [13] cli_2.0.2 rstudioapi_0.11 yaml_2.2.1 xfun_0.13
## [17] withr_2.2.0 dplyr_0.8.5 stringr_1.4.0 knitr_1.28
## [21] vctrs_0.3.0 tidyselect_1.1.0 grid_4.0.3 glue_1.4.0
## [25] R6_2.4.1 fansi_0.4.1 rmarkdown_2.1 farver_2.0.3
## [29] purrr_0.3.4 magrittr_1.5 scales_1.1.1 codetools_0.2-16
## [33] htmltools_0.4.0 ellipsis_0.3.0 assertthat_0.2.1 colorspace_1.4-1
## [37] labeling_0.3 stringi_1.4.6 munsell_0.5.0 crayon_1.3.4