The Outline for Lab Practical 5.

Lab Practical 5 aims to discover a set of genomic features that could explain and predict the pattern of epigenetic status on the genome. Firstly, we will try to apply basic Bioconductor tools to extract the genomic land markers from the gene annotations. Then, we will analyze a published epigenetic modification data using the extracted annotations. Explicitly, we will investigate its consensus motifs, exon lengths, as well as its meta distributions on genes. In the end, we will try to develop a prediction tool for the epigenetic modification using machine learning, and see if we can integrate the genomic topologies discovered from the EDA to enhance its prediction performance.

This project tries to reproduce the key findings in the following papers, please check them if you need more background information:

1.Dominissini D, Moshitch-Moshkovitz S, Schwartz S, Salmon-Divon M, Ungar L, Osenberg S, Cesarkas K, Jacob-Hirsch J, Amariglio N, Kupiec M: Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. Nature 2012, 485(7397):201-206.

2.Meyer KD, Yogesh S, Paul Z, Olivier E, Mason CE, Jaffrey SR: Comprehensive analysis of mRNA methylation reveals enrichment in 3’ UTRs and near stop codons. Cell 2012, 149(7):1635–1646.

1. Extract Transcriptomic Landmarkers (10%)

First, our goal is to extract some important genomic features for the later analysis. Retrieve the 4 transcriptomic landmarks shown below from the TxDb package of hg19.

To achieve this, we need to rely on the combination of a set of “intra-range methods”. The extracted landmarks should be GRanges object with width = 1. Please store the results into the variables of TSS, TES, Start_codon, and Stop_codon, respectively.

Hint: resize() can be used to retrieve the start / end of a genomic reanges, transcripts() can extract the GRanges of full transcript (with introns) from TxDb object, and cdsBy() can retrieve a GRangesList that contains exons of CDS for each transcript / genes. Please try to understand these data structures, and be careful that the cdsBy() returns multiple exons of a CDS, not the full CDS with introns.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
## === Hint code, fill the "___" ============= ##
TSS <- resize(transcripts(txdb_hg19), width = 1, fix = "start")
TES <- resize(transcripts(txdb_hg19), width = 1, fix = "end")
Start_codon <- resize(unlist(range(cdsBy(txdb_hg19, by = "tx"))), width = 3, fix = "start")
Stop_codon <- resize(unlist(range(cdsBy(txdb_hg19, by = "tx"))), width = 3, fix = "end")
#Some extra hints, the missing functions above are all inter/intra-range methods

P.S. The uniqueness is defined as not having duplicated sites that get the same coordinate with the existing ones. You can apply the intra-range method reduce() to merge the duplicated GRanges.

Examine the Sequence Around the Start and Stop Codons (10%)

In this step, we will extract the sequences within the 3 bases windows containing Start and Stop condons, then, we will plot the sequence logo around the stop and start codons.

Plot the sequence logo for the start and stop codons using ggseqlogo(), set the method = “prob”, so that it will display the proportion of bases at each nucleotide position.

## === Hint code, fill the "___" ============= ##
library(ggseqlogo)
library(BSgenome.Hsapiens.UCSC.hg19)
ggseqlogo(as.vector(DNAStringSet(Views(Hsapiens, Start_codon))), method = "prob")

ggseqlogo(as.vector(DNAStringSet(Views(Hsapiens, Stop_codon))), method = "prob")

Then, check if the generated sequence logos being mostly consistent with the codon sequences in the common knowledge.

2. Motif Finding over Epigenetic Modification Peaks (20%)

Next, we will analyze a BED file named GSE63753.bed stored under the project directory. The BED file contains sites of a epigenetic marker downloaded from GEO under: (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63753)

The data was originally reported and uploaded by the following paper:

Linder, B., Grozhik, A. V., Olarerin-George, A. O., Meydan, C., Mason, C. E., & Jaffrey, S. R. (2015). Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nature methods, 12(8), 767-772.

The BED file contains the “single-based” sites of RNA m6A modification identified by a technique called miCLIP. m6A is a crucial covalent modification regulating mRNA localization, translation, and turn-over. Like many other key gene expression regulators, m6A sites are deposited via a consensus motifs along the genome.

In the following code chunk, we will first examine the sequence composition around the m6A modification sites, then, we will visualize the motif PPM using sequence logo plot.

library(rtracklayer)
## === Hint code, fill the "___" ============= ##
m6A_sites <- import("GSE63753.bed") #Load the BED file for single-based m6A
dss <- DNAStringSet(Views(Hsapiens, promoters(m6A_sites, upstream=2, downstream=3))) #Extract the underlying sequences of sites flanked by 2bp (+2)
dss
## DNAStringSet object of length 9536:
##        width seq
##    [1]     5 AGACT
##    [2]     5 GGACT
##    [3]     5 GAACG
##    [4]     5 TGACA
##    [5]     5 GGACT
##    ...   ... ...
## [9532]     5 TAACT
## [9533]     5 GGACT
## [9534]     5 CGACG
## [9535]     5 CGACA
## [9536]     5 ACACT
csm <- consensusMatrix(dss)[1:4,] #Calculate the consensus Matrix over nucleotides and positions
csm
##   [,1] [,2] [,3] [,4] [,5]
## A 2548 2551 9536    0 2819
## C  593  553    0 9536 1131
## G 4821 5567    0    0  607
## T 1574  865    0    0 4979
ggseqlogo(as.vector(dss), method = "prob") #visualize motif PPM with sequence logo 

Next, we will analyze motif using the m6A peaks identified from MeRIP-Seq (a CHIP-Seq like technique for RNA modification identification). The m6A peaks are stored in the BED file GSE54365.bed. In the code chunk below, extract the sequence behind the peaks, save the DNAStringSet as a fasta file in the current working directory.

## === Hint code, fill the "___" ============= ##
m6Apeaks <- import("GSE54365.bed")
peaks_dss <- DNAStringSet(Views(Hsapiens, m6Apeaks))
export(peaks_dss, "m6Apeaks.fasta")

Then, go to the following MEME website, enter the MEME page, upload the fasta file and run the MEME analysis, set both the minimum and maximum motif lengths equal to 5.

https://meme-suite.org/meme

3. Analysis of Epigenetic Modification using Genomic Features (35%)

Length of Exons (10%)

Subsequently, we will explore the correlation between exon lengths and m6A modification. One of the approaches to do it is to plot the distribution of exon lengths for exons containing or not containing the m6A sites.

Fill in the code below, so that we will create 2 vectors of the same lengths. The first vector named length_exon contains length for each exon on hg19, while the second vector named overlap_m6A is a dummy variable (logical), and it will be true if the exon contains m6A. Then, run the code afterward to plot the densities of exon lengths stratified by the overlapping status with the modification.

## === Hint code, fill the "___" ============= ##
ex_hg19 <- exons(txdb_hg19) #extract exons of hg19 using exons()
length_exon <- width(ex_hg19) #extract the lengths for each of the exons
overlap_m6A <- ex_hg19%over%m6A_sites #retrieve a dummy variable, true if exon overlapps with m6A
# ===== Your code is finished ================ ##
library(ggplot2)
library(ggsci)
ggplot(data.frame(length_exon,overlap_m6A)) +
                   geom_density(aes(x = log(length_exon),
                                    colour = overlap_m6A,
                                    fill = overlap_m6A), alpha = 0.5) + 
                   theme_classic() + scale_color_npg()

  • SAQ3: Interpret the density plot, what is the difference between the exons containing the modification and the exons without the modification? Do we more likely to observe m6A on long exons? If there is a boundary of exon length to classify the exons containing or not containing the m6A, what boundary would you choose?

  • Answer:
    • I think the main difference between the exons containing the modification and the exons without the modification is that the modificated one is more likely to be longer.
    • Yes, from the plot we can easily find that with the increase of exons length the more likely to obtain TURE of overlap_m6A
    • I’d like to choose 5.8 of the log length of exons length to be the boundary.

Topology (Distribution) of Markers on Genes (25%)

Next, we want to draw a meta-gene plot. The plot summarizes the spatial distribution of genomic features over transcript coordinates in relation to the start and the stop codons. You could see Figure 5D in Meyer KD.et.al Nature(2012) for a reference.

It turns out the information required to generate this figure is not complex: the only values needed are the relative positions of each site on 5’UTR, CDS, and 3’UTR. After computing the relative positions, the only step needed is to draw 3 histograms of the relative positions for each region, and the gene-wised distribution is just the concatenated result of the 3 histograms.

In the following code chunk, you need to create a function called relative_pos_on_region(). The function will return the relative position of its input GRanges (specified by parameter x) on the GRangesList (specified parameter region). The relative position on a region is defined as the distance toward the 5’ Start of the region divided by the full length of the region.

For example, if a site is located on the 200bp downstream of the start of a 5’UTR which has length of 1000bp, then its relative position on the 5’UTR is 200/1000 = 0.2. For the elements of x that do not overlap with the region, the function should behave by skipping those elements, and only the relative positions for the overlapped x are returned.

## === fill the "___" to complete the following function ============= ##
relative_pos_on_region <- function(x, region){
# `x` parameter shouuld be a GRanges of the queried sites on the Genome coordinate.
# `region` parameter should be a GRangesList of the transcript/5'UTR/CDS/3'UTR, the elements within it will be exons.
# The line below map the GRanges in x into the coordinate of the transcript/5'UTR/CDS/3'UTR (specified by region).
region_map <- mapToTranscripts(x, region)
# The line below calculates the exonic length of the transcript/5'UTR/CDS/3'UTR. You need to index the vector to make it match to the GRanges in x.
region_width <- sum(width(region))[region_map$transcriptsHits]
# The line below extract a vector for the start coordinate of transcript/5'UTR/CDS/3'UTR.
start_on_region <- region_map@ranges@start
# The line below returns a vector of the relative position by dividing the vectors for start coordinate and the region length.
region_width <- sum(width(region))[region_map$transcriptsHits]
}

plot_tx <- function(marker,utr5_gr,cds_gr,utr3_gr,marker_name){
  utr5_pos <- relative_pos_on_region(marker,utr5_gr)
  cds_pos <- relative_pos_on_region(marker,cds_gr)
  utr3_pos <- relative_pos_on_region(marker,utr3_gr)

  pldf <- data.frame(relative_pos = c(utr5_pos, cds_pos, utr3_pos),
                     tx_region = rep(c("5'UTR","CDS","3'UTR"),
                                     c(length(utr5_pos),length(cds_pos),length(utr3_pos)))
             )

  pldf$tx_region = factor(pldf$tx_region,levels = c("5'UTR","CDS","3'UTR"))

  ggplot(pldf) +
    geom_histogram(aes(x=relative_pos),bins = 50,
                   colour = "black", fill = "black") +
    facet_wrap(~tx_region) +
    theme_classic() +
    labs(title = marker_name, x = "Relative Position")
} #This function just organize the data and plot the histogram.

#Extract the regions for 5'UTR, CDS, and 3'UTR
UTR5 <- fiveUTRsByTranscript(txdb_hg19) 
CDS <- cdsBy(txdb_hg19, by = "tx")
UTR3 <- threeUTRsByTranscript(txdb_hg19)

#Then we will generate a serious of plot for the topology of previously extracted tx land markers
plot_tx(TSS,UTR5,CDS,UTR3,"TSS")

plot_tx(TES,UTR5,CDS,UTR3,"TES")

plot_tx(Stop_codon,UTR5,CDS,UTR3,"Stop Codon")

plot_tx(Start_codon,UTR5,CDS,UTR3,"Start Codon")

plot_tx(m6A_sites,UTR5,CDS,UTR3,"m6A")

  • SAQ4: Please interpret the computed topology distributions for m6A, which transcript regions are more likely to contain the modification sites? Do the relative positions on the regions important for m6A? The modification is mostly enriched around which transcript landmarks (e.x. TSS, start codon)? Please explain your reasoning.

  • Answer:
    • For m6A, CDS and 3’UTR position are the most likely to contain the modification sites. Especially the region of the end of CDS and the begain of 3’UTR. This is because the majority of the peaks appear to be located in the middle to the end of the transcripts, with only a few located near the 5’end.
    • Yes, those relative positions are very important for m6A, likely due to the presence of consensus motifs for the m6A methyltransferase complex in these regions.
    • Like the 3’UTR start codon are known to play important roles in regulating mRNA stability, so it is possible that the enrichment of m6A in these regions contributes to these regulatory functions.

4. Engineering Genomic Features for Site Prediction of Epigentic Markers (25%)

In lab practical 2, we have introduced the unsupervised learning methods for the prediction of tissue labels from gene expression data. Now, we want to again use a machine learning model to predict the modification status on consensus motifs. In this situation, we want to use a supervised learning approach, which means the prediction target is provided to the model during the training process.

Our training data is stored in the mcols() of m6A_ml, which contains 1000 training instances for both the positive and negative data. The metadata is a DataFrame that has columns for target and sequence-based features. The sequence-based features are 30 variables extracted only from the nucleotide sequences surrounding the modification sites, and their calculation is NOT dependent on the gene annotations (from only BSgenome but not from TxDb). What you need to do is to create the following annotation based genomic features as the additional metadata columns in m6A_ml:

Please fill the following code chunk to add the above mentioned genomic features into the prediction models.

After adding the features, we will run a high-level package to automatically build 4 prediction models and report their performance using 5 folds cross-validation. The 4 models are the combination of 2 ML algorithms (SVM and RandomForest) and 2 feature types (sequence feature along and sequence + genomic features).

You will get full mark for the coding if the genomic features you engineered can lead to an AUROC of more than 0.70 in any one of the ML algorithms.

m6A_ml <- readRDS("m6A_ml.rds") #m6A_ml is a GRanges, its metadata columns are features used in the prediction model

## === Hint code, fill the "___" ============= ##
m6A_ml$UTR5 <- m6A_ml %over% UTR5
m6A_ml$CDS <- m6A_ml %over% CDS 
m6A_ml$UTR3 <- m6A_ml %over% UTR3 
m6A_ml$long_exon <- m6A_ml %over% ex_hg19[log(width(ex_hg19)) > 5.95] 
m6A_ml$Stop_codon <- m6A_ml %over% Stop_codon 
m6A_ml$UTR3_pos <- RegionPropertiesFeatures::extractRegionRelativePosition(m6A_ml, UTR3, nomapValue = "0") 
# ===== Your code is finished ================ ##
library(perflite)
library(knitr)
set.seed(102)

perf_results <- performance_class(
  y = list(
    target_1 = as.factor(m6A_ml$m6A),
    target_2 = as.factor(m6A_ml$m6A)
  ), #list of response vectors
  X = list(
    sequence_feature = data.frame( mcols(m6A_ml)[,2:31] ),
    add_genomic_feature =  data.frame( mcols(m6A_ml)[,-1] ) 
  ), #list of feature matrixes
  k = 5, #number of folds in cross validation
  p = 1, #number of parallel computation
  cv_f = c(svm = svm_class,
           randomForest = randomForest_class)  #list of classifier functions.
)
## [1] "Start to test on learning method: svm"
## [1] "Using data pairs: sequence_feature and target_1."
## [1] "Fold 1 training..."
## [1] "Fold 2 training..."
## [1] "Fold 3 training..."
## [1] "Fold 4 training..."
## [1] "Fold 5 training..."
## [1] "Using data pairs: add_genomic_feature and target_2."
## [1] "Fold 1 training..."
## [1] "Fold 2 training..."
## [1] "Fold 3 training..."
## [1] "Fold 4 training..."
## [1] "Fold 5 training..."
## [1] "Start to test on learning method: randomForest"
## [1] "Using data pairs: sequence_feature and target_1."
## [1] "Fold 1 training..."
## [1] "Fold 2 training..."
## [1] "Fold 3 training..."
## [1] "Fold 4 training..."
## [1] "Fold 5 training..."
## [1] "Using data pairs: add_genomic_feature and target_2."
## [1] "Fold 1 training..."
## [1] "Fold 2 training..."
## [1] "Fold 3 training..."
## [1] "Fold 4 training..."
## [1] "Fold 5 training..."
kable(perf_results[[1]], 
       caption = names(perf_results)[1]) 
svm
AUROC ACC ERR SENS SPEC MCC
sequence_feature 0.6053 0.587 0.413 0.603 0.571 0.1741
add_genomic_feature 0.7000 0.648 0.352 0.640 0.656 0.2960
kable(perf_results[[2]],
       caption = names(perf_results)[2]) 
randomForest
AUROC ACC ERR SENS SPEC MCC
sequence_feature 0.6077 0.5770 0.4230 0.565 0.589 0.154
add_genomic_feature 0.6911 0.6445 0.3555 0.638 0.651 0.289


Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] knitr_1.28                             
##  [2] perflite_1.0.0                         
##  [3] ggsci_2.9                              
##  [4] ggplot2_3.3.0                          
##  [5] BSgenome.Hsapiens.UCSC.hg19_1.4.3      
##  [6] BSgenome_1.57.0                        
##  [7] rtracklayer_1.49.1                     
##  [8] Biostrings_2.57.0                      
##  [9] XVector_0.29.0                         
## [10] ggseqlogo_0.1                          
## [11] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [12] GenomicFeatures_1.41.0                 
## [13] AnnotationDbi_1.51.0                   
## [14] Biobase_2.49.0                         
## [15] GenomicRanges_1.41.1                   
## [16] GenomeInfoDb_1.25.0                    
## [17] IRanges_2.23.4                         
## [18] S4Vectors_0.27.5                       
## [19] BiocGenerics_0.35.2                    
## 
## loaded via a namespace (and not attached):
##   [1] AnnotationHub_2.21.0            BiocFileCache_1.13.0           
##   [3] plyr_1.8.6                      lazyeval_0.2.2                 
##   [5] splines_4.0.3                   BiocParallel_1.23.0            
##   [7] digest_0.6.25                   foreach_1.5.0                  
##   [9] ensembldb_2.13.1                htmltools_0.4.0                
##  [11] magrittr_1.5                    memoise_1.1.0                  
##  [13] ROCR_1.0-11                     recipes_0.1.12                 
##  [15] gower_0.2.1                     matrixStats_0.56.0             
##  [17] askpass_1.1                     prettyunits_1.1.1              
##  [19] colorspace_1.4-1                blob_1.2.1                     
##  [21] rappdirs_0.3.1                  xfun_0.13                      
##  [23] dplyr_0.8.5                     crayon_1.3.4                   
##  [25] RCurl_1.98-1.2                  survival_3.2-7                 
##  [27] iterators_1.0.12                glue_1.4.0                     
##  [29] gtable_0.3.0                    ipred_0.9-9                    
##  [31] zlibbioc_1.35.0                 DelayedArray_0.15.1            
##  [33] Rhdf5lib_1.11.0                 RegionPropertiesFeatures_0.99.6
##  [35] HDF5Array_1.17.0                scales_1.1.1                   
##  [37] DBI_1.1.0                       Rcpp_1.0.4.6                   
##  [39] xtable_1.8-4                    progress_1.2.2                 
##  [41] bit_1.1-15.2                    lava_1.6.7                     
##  [43] prodlim_2019.11.13              httr_1.4.1                     
##  [45] RColorBrewer_1.1-2              ellipsis_0.3.0                 
##  [47] pkgconfig_2.0.3                 XML_3.99-0.3                   
##  [49] farver_2.0.3                    nnet_7.3-14                    
##  [51] dbplyr_1.4.3                    caret_6.0-88                   
##  [53] tidyselect_1.1.0                labeling_0.3                   
##  [55] rlang_0.4.6                     reshape2_1.4.4                 
##  [57] later_1.0.0                     munsell_0.5.0                  
##  [59] BiocVersion_3.12.0              tools_4.0.3                    
##  [61] generics_0.0.2                  RSQLite_2.2.0                  
##  [63] evaluate_0.14                   stringr_1.4.0                  
##  [65] fastmap_1.0.1                   yaml_2.2.1                     
##  [67] ModelMetrics_1.2.2.2            bit64_0.9-7                    
##  [69] purrr_0.3.4                     randomForest_4.6-14            
##  [71] AnnotationFilter_1.13.0         nlme_3.1-149                   
##  [73] mime_0.9                        biomaRt_2.45.0                 
##  [75] compiler_4.0.3                  curl_4.3                       
##  [77] interactiveDisplayBase_1.27.0   e1071_1.7-3                    
##  [79] tibble_3.0.1                    stringi_1.4.6                  
##  [81] highr_0.8                       lattice_0.20-41                
##  [83] ProtGenerics_1.21.0             Matrix_1.2-18                  
##  [85] vctrs_0.3.0                     pillar_1.4.4                   
##  [87] lifecycle_0.2.0                 BiocManager_1.30.10            
##  [89] data.table_1.12.8               bitops_1.0-6                   
##  [91] httpuv_1.5.2                    R6_2.4.1                       
##  [93] promises_1.1.0                  codetools_0.2-16               
##  [95] MASS_7.3-53                     gtools_3.8.2                   
##  [97] assertthat_0.2.1                rhdf5_2.33.0                   
##  [99] SummarizedExperiment_1.19.2     openssl_1.4.1                  
## [101] GenomicScores_2.1.0             withr_2.2.0                    
## [103] GenomicAlignments_1.25.0        Rsamtools_2.5.0                
## [105] GenomeInfoDbData_1.2.3          hms_0.5.3                      
## [107] grid_4.0.3                      rpart_4.1-15                   
## [109] timeDate_3043.102               class_7.3-17                   
## [111] rmarkdown_2.1                   pROC_1.16.2                    
## [113] shiny_1.4.0.2                   lubridate_1.7.8