Lab Practical 5 aims to discover a set of genomic features that could explain and predict the pattern of epigenetic status on the genome. Firstly, we will try to apply basic Bioconductor tools to extract the genomic land markers from the gene annotations. Then, we will analyze a published epigenetic modification data using the extracted annotations. Explicitly, we will investigate its consensus motifs, exon lengths, as well as its meta distributions on genes. In the end, we will try to develop a prediction tool for the epigenetic modification using machine learning, and see if we can integrate the genomic topologies discovered from the EDA to enhance its prediction performance.
This project tries to reproduce the key findings in the following papers, please check them if you need more background information:
1.Dominissini D, Moshitch-Moshkovitz S, Schwartz S, Salmon-Divon M, Ungar L, Osenberg S, Cesarkas K, Jacob-Hirsch J, Amariglio N, Kupiec M: Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. Nature 2012, 485(7397):201-206.
2.Meyer KD, Yogesh S, Paul Z, Olivier E, Mason CE, Jaffrey SR: Comprehensive analysis of mRNA methylation reveals enrichment in 3’ UTRs and near stop codons. Cell 2012, 149(7):1635–1646.
First, our goal is to extract some important genomic features for the later analysis. Retrieve the 4 transcriptomic landmarks shown below from the TxDb
package of hg19.
To achieve this, we need to rely on the combination of a set of “intra-range methods”. The extracted landmarks should be GRanges
object with width = 1. Please store the results into the variables of TSS
, TES
, Start_codon
, and Stop_codon
, respectively.
Hint: resize()
can be used to retrieve the start / end of a genomic reanges, transcripts()
can extract the GRanges
of full transcript (with introns) from TxDb
object, and cdsBy()
can retrieve a GRangesList
that contains exons of CDS for each transcript / genes. Please try to understand these data structures, and be careful that the cdsBy()
returns multiple exons of a CDS, not the full CDS with introns.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
## === Hint code, fill the "___" ============= ##
TSS <- resize(transcripts(txdb_hg19), width = 1, fix = "start")
TES <- resize(transcripts(txdb_hg19), width = 1, fix = "end")
Start_codon <- resize(unlist(range(cdsBy(txdb_hg19, by = "tx"))), width = 3, fix = "start")
Stop_codon <- resize(unlist(range(cdsBy(txdb_hg19, by = "tx"))), width = 3, fix = "end")
#Some extra hints, the missing functions above are all inter/intra-range methods
P.S. The uniqueness is defined as not having duplicated sites that get the same coordinate with the existing ones. You can apply the intra-range method reduce()
to merge the duplicated GRanges.
In this step, we will extract the sequences within the 3 bases windows containing Start and Stop condons, then, we will plot the sequence logo around the stop and start codons.
Plot the sequence logo for the start and stop codons using ggseqlogo()
, set the method = “prob”, so that it will display the proportion of bases at each nucleotide position.
## === Hint code, fill the "___" ============= ##
library(ggseqlogo)
library(BSgenome.Hsapiens.UCSC.hg19)
ggseqlogo(as.vector(DNAStringSet(Views(Hsapiens, Start_codon))), method = "prob")
ggseqlogo(as.vector(DNAStringSet(Views(Hsapiens, Stop_codon))), method = "prob")
Then, check if the generated sequence logos being mostly consistent with the codon sequences in the common knowledge.
Next, we will analyze a BED file named GSE63753.bed
stored under the project directory. The BED file contains sites of a epigenetic marker downloaded from GEO under: (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63753)
The data was originally reported and uploaded by the following paper:
Linder, B., Grozhik, A. V., Olarerin-George, A. O., Meydan, C., Mason, C. E., & Jaffrey, S. R. (2015). Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nature methods, 12(8), 767-772.
The BED file contains the “single-based” sites of RNA m6A modification identified by a technique called miCLIP. m6A is a crucial covalent modification regulating mRNA localization, translation, and turn-over. Like many other key gene expression regulators, m6A sites are deposited via a consensus motifs along the genome.
In the following code chunk, we will first examine the sequence composition around the m6A modification sites, then, we will visualize the motif PPM using sequence logo plot.
library(rtracklayer)
## === Hint code, fill the "___" ============= ##
m6A_sites <- import("GSE63753.bed") #Load the BED file for single-based m6A
dss <- DNAStringSet(Views(Hsapiens, promoters(m6A_sites, upstream=2, downstream=3))) #Extract the underlying sequences of sites flanked by 2bp (+2)
dss
## DNAStringSet object of length 9536:
## width seq
## [1] 5 AGACT
## [2] 5 GGACT
## [3] 5 GAACG
## [4] 5 TGACA
## [5] 5 GGACT
## ... ... ...
## [9532] 5 TAACT
## [9533] 5 GGACT
## [9534] 5 CGACG
## [9535] 5 CGACA
## [9536] 5 ACACT
csm <- consensusMatrix(dss)[1:4,] #Calculate the consensus Matrix over nucleotides and positions
csm
## [,1] [,2] [,3] [,4] [,5]
## A 2548 2551 9536 0 2819
## C 593 553 0 9536 1131
## G 4821 5567 0 0 607
## T 1574 865 0 0 4979
ggseqlogo(as.vector(dss), method = "prob") #visualize motif PPM with sequence logo
Next, we will analyze motif using the m6A peaks identified from MeRIP-Seq (a CHIP-Seq like technique for RNA modification identification). The m6A peaks are stored in the BED file GSE54365.bed
. In the code chunk below, extract the sequence behind the peaks, save the DNAStringSet as a fasta file in the current working directory.
## === Hint code, fill the "___" ============= ##
m6Apeaks <- import("GSE54365.bed")
peaks_dss <- DNAStringSet(Views(Hsapiens, m6Apeaks))
export(peaks_dss, "m6Apeaks.fasta")
Then, go to the following MEME website, enter the MEME page, upload the fasta file and run the MEME analysis, set both the minimum and maximum motif lengths equal to 5.
SAQ2: Based on the MEME report, is there a similar motif identified as we have seen in the miCLIP data? What is the E-value of the m6A motif?
Subsequently, we will explore the correlation between exon lengths and m6A modification. One of the approaches to do it is to plot the distribution of exon lengths for exons containing or not containing the m6A sites.
Fill in the code below, so that we will create 2 vectors of the same lengths. The first vector named length_exon
contains length for each exon on hg19, while the second vector named overlap_m6A
is a dummy variable (logical), and it will be true if the exon contains m6A. Then, run the code afterward to plot the densities of exon lengths stratified by the overlapping status with the modification.
## === Hint code, fill the "___" ============= ##
ex_hg19 <- exons(txdb_hg19) #extract exons of hg19 using exons()
length_exon <- width(ex_hg19) #extract the lengths for each of the exons
overlap_m6A <- ex_hg19%over%m6A_sites #retrieve a dummy variable, true if exon overlapps with m6A
# ===== Your code is finished ================ ##
library(ggplot2)
library(ggsci)
ggplot(data.frame(length_exon,overlap_m6A)) +
geom_density(aes(x = log(length_exon),
colour = overlap_m6A,
fill = overlap_m6A), alpha = 0.5) +
theme_classic() + scale_color_npg()
SAQ3: Interpret the density plot, what is the difference between the exons containing the modification and the exons without the modification? Do we more likely to observe m6A on long exons? If there is a boundary of exon length to classify the exons containing or not containing the m6A, what boundary would you choose?
overlap_m6A
Next, we want to draw a meta-gene plot. The plot summarizes the spatial distribution of genomic features over transcript coordinates in relation to the start and the stop codons. You could see Figure 5D in Meyer KD.et.al Nature(2012) for a reference.
It turns out the information required to generate this figure is not complex: the only values needed are the relative positions of each site on 5’UTR, CDS, and 3’UTR. After computing the relative positions, the only step needed is to draw 3 histograms of the relative positions for each region, and the gene-wised distribution is just the concatenated result of the 3 histograms.
In the following code chunk, you need to create a function called relative_pos_on_region()
. The function will return the relative position of its input GRanges (specified by parameter x
) on the GRangesList (specified parameter region
). The relative position on a region is defined as the distance toward the 5’ Start of the region divided by the full length of the region.
For example, if a site is located on the 200bp downstream of the start of a 5’UTR which has length of 1000bp, then its relative position on the 5’UTR is 200/1000 = 0.2. For the elements of x that do not overlap with the region, the function should behave by skipping those elements, and only the relative positions for the overlapped x are returned.
## === fill the "___" to complete the following function ============= ##
relative_pos_on_region <- function(x, region){
# `x` parameter shouuld be a GRanges of the queried sites on the Genome coordinate.
# `region` parameter should be a GRangesList of the transcript/5'UTR/CDS/3'UTR, the elements within it will be exons.
# The line below map the GRanges in x into the coordinate of the transcript/5'UTR/CDS/3'UTR (specified by region).
region_map <- mapToTranscripts(x, region)
# The line below calculates the exonic length of the transcript/5'UTR/CDS/3'UTR. You need to index the vector to make it match to the GRanges in x.
region_width <- sum(width(region))[region_map$transcriptsHits]
# The line below extract a vector for the start coordinate of transcript/5'UTR/CDS/3'UTR.
start_on_region <- region_map@ranges@start
# The line below returns a vector of the relative position by dividing the vectors for start coordinate and the region length.
region_width <- sum(width(region))[region_map$transcriptsHits]
}
plot_tx <- function(marker,utr5_gr,cds_gr,utr3_gr,marker_name){
utr5_pos <- relative_pos_on_region(marker,utr5_gr)
cds_pos <- relative_pos_on_region(marker,cds_gr)
utr3_pos <- relative_pos_on_region(marker,utr3_gr)
pldf <- data.frame(relative_pos = c(utr5_pos, cds_pos, utr3_pos),
tx_region = rep(c("5'UTR","CDS","3'UTR"),
c(length(utr5_pos),length(cds_pos),length(utr3_pos)))
)
pldf$tx_region = factor(pldf$tx_region,levels = c("5'UTR","CDS","3'UTR"))
ggplot(pldf) +
geom_histogram(aes(x=relative_pos),bins = 50,
colour = "black", fill = "black") +
facet_wrap(~tx_region) +
theme_classic() +
labs(title = marker_name, x = "Relative Position")
} #This function just organize the data and plot the histogram.
#Extract the regions for 5'UTR, CDS, and 3'UTR
UTR5 <- fiveUTRsByTranscript(txdb_hg19)
CDS <- cdsBy(txdb_hg19, by = "tx")
UTR3 <- threeUTRsByTranscript(txdb_hg19)
#Then we will generate a serious of plot for the topology of previously extracted tx land markers
plot_tx(TSS,UTR5,CDS,UTR3,"TSS")
plot_tx(TES,UTR5,CDS,UTR3,"TES")
plot_tx(Stop_codon,UTR5,CDS,UTR3,"Stop Codon")
plot_tx(Start_codon,UTR5,CDS,UTR3,"Start Codon")
plot_tx(m6A_sites,UTR5,CDS,UTR3,"m6A")
SAQ4: Please interpret the computed topology distributions for m6A, which transcript regions are more likely to contain the modification sites? Do the relative positions on the regions important for m6A? The modification is mostly enriched around which transcript landmarks (e.x. TSS, start codon)? Please explain your reasoning.
In lab practical 2, we have introduced the unsupervised learning methods for the prediction of tissue labels from gene expression data. Now, we want to again use a machine learning model to predict the modification status on consensus motifs. In this situation, we want to use a supervised learning approach, which means the prediction target is provided to the model during the training process.
Our training data is stored in the mcols()
of m6A_ml
, which contains 1000 training instances for both the positive and negative data. The metadata is a DataFrame
that has columns for target and sequence-based features. The sequence-based features are 30 variables extracted only from the nucleotide sequences surrounding the modification sites, and their calculation is NOT dependent on the gene annotations (from only BSgenome but not from TxDb). What you need to do is to create the following annotation based genomic features as the additional metadata columns in m6A_ml
:
UTR5
: a logical variable indicating the site overlapping with 5’UTR.CDS
: a logical variable indicating the site overlapping with CDS.UTR3
: a logical variable indicating the site overlapping with 3’UTR.long_exon
: a logical variable for sites overlapping with long exons (exon length > XXX, XXX is a boundary number choosen by you).Stop_codon
: a logical variable for sites overlapping with the XXX bp centered by a stop codon. (XXX is also choosen by you)UTR3_pos
: a real number value for the relative position of the site on 3’UTR (0 if not on 3’UTR).CDS_pos
: a real number value for the relative position of the site on CDS (0 if not on CDS).Please fill the following code chunk to add the above mentioned genomic features into the prediction models.
After adding the features, we will run a high-level package to automatically build 4 prediction models and report their performance using 5 folds cross-validation. The 4 models are the combination of 2 ML algorithms (SVM and RandomForest) and 2 feature types (sequence feature along and sequence + genomic features).
You will get full mark for the coding if the genomic features you engineered can lead to an AUROC of more than 0.70 in any one of the ML algorithms.
m6A_ml <- readRDS("m6A_ml.rds") #m6A_ml is a GRanges, its metadata columns are features used in the prediction model
## === Hint code, fill the "___" ============= ##
m6A_ml$UTR5 <- m6A_ml %over% UTR5
m6A_ml$CDS <- m6A_ml %over% CDS
m6A_ml$UTR3 <- m6A_ml %over% UTR3
m6A_ml$long_exon <- m6A_ml %over% ex_hg19[log(width(ex_hg19)) > 5.95]
m6A_ml$Stop_codon <- m6A_ml %over% Stop_codon
m6A_ml$UTR3_pos <- RegionPropertiesFeatures::extractRegionRelativePosition(m6A_ml, UTR3, nomapValue = "0")
# ===== Your code is finished ================ ##
library(perflite)
library(knitr)
set.seed(102)
perf_results <- performance_class(
y = list(
target_1 = as.factor(m6A_ml$m6A),
target_2 = as.factor(m6A_ml$m6A)
), #list of response vectors
X = list(
sequence_feature = data.frame( mcols(m6A_ml)[,2:31] ),
add_genomic_feature = data.frame( mcols(m6A_ml)[,-1] )
), #list of feature matrixes
k = 5, #number of folds in cross validation
p = 1, #number of parallel computation
cv_f = c(svm = svm_class,
randomForest = randomForest_class) #list of classifier functions.
)
## [1] "Start to test on learning method: svm"
## [1] "Using data pairs: sequence_feature and target_1."
## [1] "Fold 1 training..."
## [1] "Fold 2 training..."
## [1] "Fold 3 training..."
## [1] "Fold 4 training..."
## [1] "Fold 5 training..."
## [1] "Using data pairs: add_genomic_feature and target_2."
## [1] "Fold 1 training..."
## [1] "Fold 2 training..."
## [1] "Fold 3 training..."
## [1] "Fold 4 training..."
## [1] "Fold 5 training..."
## [1] "Start to test on learning method: randomForest"
## [1] "Using data pairs: sequence_feature and target_1."
## [1] "Fold 1 training..."
## [1] "Fold 2 training..."
## [1] "Fold 3 training..."
## [1] "Fold 4 training..."
## [1] "Fold 5 training..."
## [1] "Using data pairs: add_genomic_feature and target_2."
## [1] "Fold 1 training..."
## [1] "Fold 2 training..."
## [1] "Fold 3 training..."
## [1] "Fold 4 training..."
## [1] "Fold 5 training..."
kable(perf_results[[1]],
caption = names(perf_results)[1])
AUROC | ACC | ERR | SENS | SPEC | MCC | |
---|---|---|---|---|---|---|
sequence_feature | 0.6053 | 0.587 | 0.413 | 0.603 | 0.571 | 0.1741 |
add_genomic_feature | 0.7000 | 0.648 | 0.352 | 0.640 | 0.656 | 0.2960 |
kable(perf_results[[2]],
caption = names(perf_results)[2])
AUROC | ACC | ERR | SENS | SPEC | MCC | |
---|---|---|---|---|---|---|
sequence_feature | 0.6077 | 0.5770 | 0.4230 | 0.565 | 0.589 | 0.154 |
add_genomic_feature | 0.6911 | 0.6445 | 0.3555 | 0.638 | 0.651 | 0.289 |
SAQ5: For the prediction made by SVM, after adding genomic features (annotation-based), how much the AUROC is improved compared with only using the sequence features? How could you explain such an improvement? If an annotation-based feature is important at improving the prediction performance, could we infer the annotation being biologically/scientifically significant for the predicted epigenetic marker? Please explain your reasoning.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] knitr_1.28
## [2] perflite_1.0.0
## [3] ggsci_2.9
## [4] ggplot2_3.3.0
## [5] BSgenome.Hsapiens.UCSC.hg19_1.4.3
## [6] BSgenome_1.57.0
## [7] rtracklayer_1.49.1
## [8] Biostrings_2.57.0
## [9] XVector_0.29.0
## [10] ggseqlogo_0.1
## [11] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [12] GenomicFeatures_1.41.0
## [13] AnnotationDbi_1.51.0
## [14] Biobase_2.49.0
## [15] GenomicRanges_1.41.1
## [16] GenomeInfoDb_1.25.0
## [17] IRanges_2.23.4
## [18] S4Vectors_0.27.5
## [19] BiocGenerics_0.35.2
##
## loaded via a namespace (and not attached):
## [1] AnnotationHub_2.21.0 BiocFileCache_1.13.0
## [3] plyr_1.8.6 lazyeval_0.2.2
## [5] splines_4.0.3 BiocParallel_1.23.0
## [7] digest_0.6.25 foreach_1.5.0
## [9] ensembldb_2.13.1 htmltools_0.4.0
## [11] magrittr_1.5 memoise_1.1.0
## [13] ROCR_1.0-11 recipes_0.1.12
## [15] gower_0.2.1 matrixStats_0.56.0
## [17] askpass_1.1 prettyunits_1.1.1
## [19] colorspace_1.4-1 blob_1.2.1
## [21] rappdirs_0.3.1 xfun_0.13
## [23] dplyr_0.8.5 crayon_1.3.4
## [25] RCurl_1.98-1.2 survival_3.2-7
## [27] iterators_1.0.12 glue_1.4.0
## [29] gtable_0.3.0 ipred_0.9-9
## [31] zlibbioc_1.35.0 DelayedArray_0.15.1
## [33] Rhdf5lib_1.11.0 RegionPropertiesFeatures_0.99.6
## [35] HDF5Array_1.17.0 scales_1.1.1
## [37] DBI_1.1.0 Rcpp_1.0.4.6
## [39] xtable_1.8-4 progress_1.2.2
## [41] bit_1.1-15.2 lava_1.6.7
## [43] prodlim_2019.11.13 httr_1.4.1
## [45] RColorBrewer_1.1-2 ellipsis_0.3.0
## [47] pkgconfig_2.0.3 XML_3.99-0.3
## [49] farver_2.0.3 nnet_7.3-14
## [51] dbplyr_1.4.3 caret_6.0-88
## [53] tidyselect_1.1.0 labeling_0.3
## [55] rlang_0.4.6 reshape2_1.4.4
## [57] later_1.0.0 munsell_0.5.0
## [59] BiocVersion_3.12.0 tools_4.0.3
## [61] generics_0.0.2 RSQLite_2.2.0
## [63] evaluate_0.14 stringr_1.4.0
## [65] fastmap_1.0.1 yaml_2.2.1
## [67] ModelMetrics_1.2.2.2 bit64_0.9-7
## [69] purrr_0.3.4 randomForest_4.6-14
## [71] AnnotationFilter_1.13.0 nlme_3.1-149
## [73] mime_0.9 biomaRt_2.45.0
## [75] compiler_4.0.3 curl_4.3
## [77] interactiveDisplayBase_1.27.0 e1071_1.7-3
## [79] tibble_3.0.1 stringi_1.4.6
## [81] highr_0.8 lattice_0.20-41
## [83] ProtGenerics_1.21.0 Matrix_1.2-18
## [85] vctrs_0.3.0 pillar_1.4.4
## [87] lifecycle_0.2.0 BiocManager_1.30.10
## [89] data.table_1.12.8 bitops_1.0-6
## [91] httpuv_1.5.2 R6_2.4.1
## [93] promises_1.1.0 codetools_0.2-16
## [95] MASS_7.3-53 gtools_3.8.2
## [97] assertthat_0.2.1 rhdf5_2.33.0
## [99] SummarizedExperiment_1.19.2 openssl_1.4.1
## [101] GenomicScores_2.1.0 withr_2.2.0
## [103] GenomicAlignments_1.25.0 Rsamtools_2.5.0
## [105] GenomeInfoDbData_1.2.3 hms_0.5.3
## [107] grid_4.0.3 rpart_4.1-15
## [109] timeDate_3043.102 class_7.3-17
## [111] rmarkdown_2.1 pROC_1.16.2
## [113] shiny_1.4.0.2 lubridate_1.7.8