1. Identify GpG islands through a sequence generating model

An important use of HMMs is to decode or parse a genome into its biological components: exons, introns, regulatory regions, etc.

In this section, we will use GC content (the fraction of letters that are a C or a G) to classify the genome into high-GC regions (on average 60% G or C) and Low-GC regions (on average 60% A or T). These have different melting temperatures, different replication times across the cell cycle, and different gene density. They have also been hypothesized to have different evolutionary origins (see isochores), but this hypothesis remains controversial.

library(Biostrings)
genome_seq <- readDNAStringSet("genome.fa") #Import genome sequence
genome_seq #A DNAstringSet object
## DNAStringSet object of length 1:
##     width seq                                               names               
## [1] 10000 GCTGATATTAAAGACGAAACCAA...TACCGAATACTACTGAAGCACGC seq1

After loading the examplar genome sequence (stored in a FASTA file), its rolling GC contents is quantified via the sliding window of length 100 (realized through slidingViews()).

GC_cont_windows <- letterFrequency(DNAStringSet(slidingViews(genome_seq[[1]], 100)), "GC", as.prob = TRUE)
plot(GC_cont_windows, type="l", col="blue", xlab = "Genome position", ylab = "GC content")

From the above diagram, there are clearly several spikes of high GC regions, and our goal is to build a model that can predict those GC clusters (called CpG islands) along the genome sequence.

In the following question, we will fit a Hidden Markov Model with R package HMM. Since HMM only recognize an input sequence in the form of a character vector, we will first convert the letters in the DNA string into the distinct vector elements; this operation is made by stringr::str_split().

library(HMM)
observation <- stringr::str_split(as.character( genome_seq[[1]] ),"")[[1]]
str(observation)
##  chr [1:10000] "G" "C" "T" "G" "A" "T" "A" "T" "T" "A" "A" "A" "G" "A" "C" ...

hmm = initHMM(c("Background","CpG"), c("A","T","C","G"), transProbs=matrix(c(.95,.05,.1,.9),2, byrow = T),
emissionProbs=matrix(c(.25,.25,.25,.25,.2,.2,.3,.3),2,byrow = T))
print(hmm)
## $States
## [1] "Background" "CpG"       
## 
## $Symbols
## [1] "A" "T" "C" "G"
## 
## $startProbs
## Background        CpG 
##        0.5        0.5 
## 
## $transProbs
##             to
## from         Background  CpG
##   Background       0.95 0.05
##   CpG              0.10 0.90
## 
## $emissionProbs
##             symbols
## states          A    T    C    G
##   Background 0.25 0.25 0.25 0.25
##   CpG        0.20 0.20 0.30 0.30

P.S:

learned_hmm <- baumWelch(hmm, observation)$hmm
post <- posterior(learned_hmm, observation)
MAP <- viterbi(learned_hmm, observation)

print(learned_hmm) # Estimated HMM parameters from data
## $States
## [1] "Background" "CpG"       
## 
## $Symbols
## [1] "A" "T" "C" "G"
## 
## $startProbs
## Background        CpG 
##        0.5        0.5 
## 
## $transProbs
##             to
## from          Background          CpG
##   Background 0.999037656 0.0009623438
##   CpG        0.002139854 0.9978601464
## 
## $emissionProbs
##             symbols
## states               A         T         C         G
##   Background 0.2965490 0.3044674 0.2026804 0.1963033
##   CpG        0.2108789 0.1925011 0.3106051 0.2860150
str(post) # The state posterior
##  num [1:2, 1:10000] 0.9524 0.0476 0.9553 0.0447 0.9599 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ states: chr [1:2] "Background" "CpG"
##   ..$ index : chr [1:10000] "1" "2" "3" "4" ...
str(MAP) # The optimal state assignment 
##  chr [1:10000] "Background" "Background" "Background" "Background" ...

plot_df <- data.frame(pos = 1:10000,
                      value = c(post[2,],(MAP=="CpG")*1),
                      group = rep(c("Forward-backward algorithm","Viterbi algorithm"),each = 10000))

library(ggplot2)
ggplot(plot_df, aes(x=pos,y=value,group=group))+geom_line(colour = "blue")+facet_grid(~group)+theme_classic()

From the figures above, the values on the left are the probabilistic estimates of CpG island, while the values on the right are the Viterbi assignments. The output of the Viterbi algorithm can be viewed as the Bayes classification that provides binary predictions of the biological states along the genome sequence.

2. What is the average frequency of CpG island?

Given the parameters of a HMM, a meaningful question one can ask is the average proportion of states along the sequence. For example, in the GC case, one can ask what fraction of the genome are regions of CpG island.

To address this question, we need to rely on the fact that the latent states of a HMM is a Markov chain, and the states over a Markov chain reaches a stationary distribution after large number of transitions.

Since the state transition over a Markov chain can be calculated by matrix multiplication:

\[\pi^t=\pi^{(t-1)}A\] Where \(\pi^t\) is the state distribution at iteration \(t\), and \(A\) is the transition matrix.

A stationary state \(\pi\) is defined by:

\[\pi=\pi A\]

We can solve this problem from 2 angles:

  1. we run the state transitions over large # of iterations until the state distribution is fixed.
  2. we can compute the eigen decomposition on A, and the stationary distribution is obtained by the normalized eigen vector. This is true because \(\pi=\pi A\) implies \(\pi\) is the Eigen vector of \(A\).
A <- learned_hmm$transProbs
pi <- c(0.5,0.5) #Initial state distribution

pi2 <- pi %*% A
while( mean(abs(pi2 - pi)) > 1e-30 ){
  pi <- pi2
  pi2 <- pi %*% A
}

pi2 #Stationary distribution
##       to
##        Background       CpG
##   [1,]  0.6897864 0.3102136
#Alternative method
eigenvector <- eigen(t(A))$vectors[,1] #Calculate the eigen vector of A
eigenvector/sum(eigenvector)
## [1] 0.6897864 0.3102136

We could see that both of the methods yield the same result of the state stationary distribution in our CpG island model. One can conclude that, from our fitted HMM, around 0.382% of the genome over the long range will be predicted as the GC-enriched regions.

Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.0       HMM_1.0             Biostrings_2.57.0  
## [4] XVector_0.29.0      IRanges_2.23.4      S4Vectors_0.27.5   
## [7] BiocGenerics_0.35.2
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9       compiler_4.0.3   pillar_1.4.4     tools_4.0.3     
##  [5] zlibbioc_1.35.0  digest_0.6.25    evaluate_0.14    lifecycle_0.2.0 
##  [9] tibble_3.0.1     gtable_0.3.0     pkgconfig_2.0.3  rlang_1.0.5     
## [13] cli_2.0.2        rstudioapi_0.11  yaml_2.2.1       xfun_0.13       
## [17] withr_2.2.0      dplyr_0.8.5      stringr_1.4.0    knitr_1.28      
## [21] vctrs_0.3.0      tidyselect_1.1.0 grid_4.0.3       glue_1.4.0      
## [25] R6_2.4.1         fansi_0.4.1      rmarkdown_2.1    farver_2.0.3    
## [29] purrr_0.3.4      magrittr_1.5     scales_1.1.1     codetools_0.2-16
## [33] htmltools_0.4.0  ellipsis_0.3.0   assertthat_0.2.1 colorspace_1.4-1
## [37] labeling_0.3     stringi_1.4.6    munsell_0.5.0    crayon_1.3.4