In Lecture Practical 5, our target is to perform comparative genomic analysis between large animal genomes using Ensembl and UCSC databases. The first part of Lecture Practical 5 extracts the orthologous mapping between genes defined in the Ensembl database. Subsequently, the association between the conservation and protein-coding status of different genes is examined. The second half of Lecture Practical 5 will use the liftOver function to convert the genomic locations of epigenetic makers between human and mouse. In the end, statistical analysis will be performed to compare the PhastCons scores under the genomic regions around the epigenetic modification sites.
EnsDb
can be understood as the Ensemble version of the TxDb
object. The TxDb stores the transcript annotation of UCSC KnownGenes, while EnsDb stores the gene annotations from Ensemble. Compared to UCSC, Ensemble collects more genes, transcripts, and supplementary information (e.x. biotypes). All of the analysis performed in this section is based on the EnsemblDb package of GRCh38 (hg38).
The code below load the human EnsemblDb package:
## Load the EnsemblDb package for hg38:
library(EnsDb.Hsapiens.v86)
EnsDb.Hsapiens.v86
## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.3.0
## |Creation time: Thu May 18 16:32:27 2017
## |ensembl_version: 86
## |ensembl_host: localhost
## |Organism: homo_sapiens
## |taxonomy_id: 9606
## |genome_build: GRCh38
## |DBSCHEMAVERSION: 2.0
## | No. of genes: 63970.
## | No. of transcripts: 216741.
## |Protein data available.
Question 1.1: Based on the printed message of the EnsDb object, tell the version of its genome build. How many genes and transcripts are defined by this package? Explain why there are more transcripts than genes in the human genome?
Answer:
Task 1.2: Extract GRanges
of genes from the EnsDb object using the function genes()
, and store the object into the variable named by genes_hg38
.
## === Hint code, fill the "___" ============= ##
#genes_hg38 <- ___(___)
## ===== Enter your code below =============== ##
genes_hg38
## GRanges object with 63970 ranges and 6 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000223972 1 11869-14409 + | ENSG00000223972
## ENSG00000227232 1 14404-29570 - | ENSG00000227232
## ENSG00000278267 1 17369-17436 - | ENSG00000278267
## ENSG00000243485 1 29554-31109 + | ENSG00000243485
## ENSG00000237613 1 34554-36081 - | ENSG00000237613
## ... ... ... ... . ...
## ENSG00000224240 Y 26549425-26549743 + | ENSG00000224240
## ENSG00000227629 Y 26586642-26591601 - | ENSG00000227629
## ENSG00000237917 Y 26594851-26634652 - | ENSG00000237917
## ENSG00000231514 Y 26626520-26627159 - | ENSG00000231514
## ENSG00000235857 Y 56855244-56855488 + | ENSG00000235857
## gene_name gene_biotype
## <character> <character>
## ENSG00000223972 DDX11L1 transcribed_unprocessed_pseudogene
## ENSG00000227232 WASH7P unprocessed_pseudogene
## ENSG00000278267 MIR6859-1 miRNA
## ENSG00000243485 MIR1302-2 lincRNA
## ENSG00000237613 FAM138A lincRNA
## ... ... ...
## ENSG00000224240 CYCSP49 processed_pseudogene
## ENSG00000227629 SLC25A15P1 unprocessed_pseudogene
## ENSG00000237917 PARP4P1 unprocessed_pseudogene
## ENSG00000231514 FAM58CP processed_pseudogene
## ENSG00000235857 CTBP2P1 processed_pseudogene
## seq_coord_system symbol entrezid
## <character> <character> <list>
## ENSG00000223972 chromosome DDX11L1 100287596,100287102,727856,...
## ENSG00000227232 chromosome WASH7P NA
## ENSG00000278267 chromosome MIR6859-1 102466751
## ENSG00000243485 chromosome MIR1302-2 105376912,100302278
## ENSG00000237613 chromosome FAM138A 654835,645520,641702
## ... ... ... ...
## ENSG00000224240 chromosome CYCSP49 NA
## ENSG00000227629 chromosome SLC25A15P1 NA
## ENSG00000237917 chromosome PARP4P1 NA
## ENSG00000231514 chromosome FAM58CP NA
## ENSG00000235857 chromosome CTBP2P1 NA
## -------
## seqinfo: 357 sequences from GRCh38 genome
Question 1.3: Please try to interpret the gene_biotype
field in the metadata column.
Answer:
P.S. Just like a vector data type in R, the names of a GRanges object can be accessed with names()
. Also, you can subset the GRanges by its names using [ ]
.
The othologous relationships between species are stored in the Ensembl database. To make an online query to the Ensembl database within R, the biomaRt
package is required.
The mart
object stored in m
represents the connection to the human gene data sets on Ensembl database.
library(biomaRt)
m <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
m
## Object of class 'Mart':
## Using the ENSEMBL_MART_ENSEMBL BioMart database
## Using the hsapiens_gene_ensembl dataset
The keys (ID types can be recognized for conversion) of Ensembl database can be accessed with: grep( "gene_id", listFilters(m)$name , value = TRUE)
.
The columns (ID types can be converted into) of Ensembl database can be accessed with: grep( "mmusculus", listAttributes(m)$name , value = TRUE)
.
grep( "gene_id", listFilters(m)$name , value = TRUE)
## [1] "ensembl_gene_id" "ensembl_gene_id_version"
## [3] "entrezgene_id" "wikigene_id"
grep( "mmusculus", listAttributes(m)$name , value = TRUE) #Subset the results by regular expressions to narrow the range of choice.
## [1] "mmusculus_homolog_ensembl_gene"
## [2] "mmusculus_homolog_associated_gene_name"
## [3] "mmusculus_homolog_ensembl_peptide"
## [4] "mmusculus_homolog_chromosome"
## [5] "mmusculus_homolog_chrom_start"
## [6] "mmusculus_homolog_chrom_end"
## [7] "mmusculus_homolog_canonical_transcript_protein"
## [8] "mmusculus_homolog_subtype"
## [9] "mmusculus_homolog_orthology_type"
## [10] "mmusculus_homolog_perc_id"
## [11] "mmusculus_homolog_perc_id_r1"
## [12] "mmusculus_homolog_goc_score"
## [13] "mmusculus_homolog_wga_coverage"
## [14] "mmusculus_homolog_orthology_confidence"
Within the listed choices of attributes and filters, a mapping can be made between human’s Ensembl gene id and the mouse homologous genes using the function getBM()
. The result of the query is stored into map
.
To understand more about the database accession in R, you may check the “annotation” section in the R functions Reference List
(Attached under lab 1).
map <- getBM(mart = m,
attributes = c("ensembl_gene_id","mmusculus_homolog_ensembl_gene"),
filters = "ensembl_gene_id",
values = names(genes_hg38))
Examine the mapping results with:
table(table(map$ensembl_gene_id)) #Check the frequencies of duplication in the query
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 61042 688 216 104 55 28 25 23 18 14 8 4 7
## 14 15 16 17 18 19 20 21 23 25 27 28 29
## 26 5 14 1 2 1 1 1 1 2 3 1 6
## 30 58 70 93 105
## 1 1 1 22 1
Question 2.1: Is the mapping between human genes and the mouse homologous genes injective (one-to-one) or not? If not, why some human genes can have more than one homologous genes in mouse?
Answer:
After getting the homologous map between human genes and mouse genes, we want to see if the conserved genes between human and mouse are more likely to be protein-coding genes.
The code chunk bellow extracts the human Ensembl gene ids for the conserved human genes (on mouse). The conserved human genes are defined by the genes having 1 or more homologous genes on mouse, and the result of the homologous gene cannot be "" (it means not having confirmed homologous on mouse genome).
After extracting the conserved human gene id, a dummy variable (TRUE or FALSE variable) column is attached to the metadata of the genes_hg38
GRanges object, and the additional metadata column is named by conserved_on_mm10
. The dummy variable is TRUE if the corresponding gene is conserved in mouse, and FALSE otherwise.
conserved_human_id <- map$ensembl_gene_id[map$mmusculus_homolog_ensembl_gene != ""]
genes_hg38$conserved_on_mm10 <- names(genes_hg38) %in% conserved_human_id
table(genes_hg38$conserved_on_mm10) #Evaluate annotated metadata column with table()
##
## FALSE TRUE
## 44831 19139
Question 3.1: How many human genes are conserved on mouse genome?
Answer:
Task 3.2: Create the following 2 dummy vectors with equal lengths.
coding_genes
: an indicator variable for the metadata column of gene_biotype
equals “protein_coding”.
conserved_genes
: a copy of the metadata column of the conserved_on_mm10
.
Compute a (2*2) contingency table from 2 logical vectors. Examine the contingency table, and then tests the association on the contingency table using the function fisher.test()
.
As we have learned in the 4th lecture, Fisher’s exact test can be viewed as the permutation test of a pair of equally-lengthed binary vectors. In other words, fixing one binary vector and permute the other can simulate the null distribution of a Fisher’s exact test. Therefore, the test is essentially detecting the non-random association of 1s and 0s on the corresponding positions of the vectors.
## === Hint code, fill the "___" ============= ##
#coding_genes <- genes_hg38$___ == ___
#conserved_genes <- genes_hg38$conserved_on_mm10
#contingency_table <- table(___, ___)
#contingency_table
#fisher.test(___)
## ===== Enter your code below =============== ##
## conserved_genes
## coding_genes FALSE TRUE
## FALSE 40032 1653
## TRUE 4799 17486
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 83.03389 93.60773
## sample estimates:
## odds ratio
## 88.45853
Question 3.3: Interpret the result of the Fisher’s exact test. Is the protein-coding status associate with the conservation status across all human genes? If so, the association is in what direction and of what effect size (measured by the odds ratio)? How statistically significant is the observed association?
Answer:
The aim of this part is to compare the locus of an epigenetic marker over human and mouse.
We have the modification data in BED format on both the human and mouse genomes, and the files are stored under the directory of Lecture Practical 5. The names of the 2 files are: mod_hg38.bed
and mod_mm10.bed
.
Both the human and mouse data are the single based annotation sites on the genome, i.e. the wideth of the ranges are all equal to 1.
The import()
function defined in rtracklayer
package can be used to import the 2 BED files into R. The variable names of them should be mod_hg38_gr
and mod_hg19_gr
library(rtracklayer)
mod_hg38_gr <- import("mod_hg38.bed")
mod_mm10_gr <- import("mod_mm10.bed")
We still need to use the genes_hg38
to analyze the modification data. However, the chromosome names of it is not in the UCSC style (the style used by the BED files), and we need to run the following code to convert it.
re_genes_hg38 <- keepStandardChromosomes(genes_hg38, pruning.mode = "coarse")
seqlevels(re_genes_hg38) = paste0("chr",seqlevels(re_genes_hg38))
seqlevels(re_genes_hg38)[23] <- "chrM"
Task 4.1: Perform a Fisher’s exact test for the association between the modified genes and the conserved genes. The index of the gene’s conservation is defined by the previous analysis. The %over%
function (one of the intra-range method) is useful to reach this goal.
## === Hint code, fill the "___" ============= ##
#modified_index <- re_genes_hg38 %over% ___
#conserved_index <- re_genes_hg38$conserved_on_mm10
#contingency_table <- table(___, ___)
#contingency_table
#fisher.test(___)
## ===== Enter your code below =============== ##
## conserved_index
## modified_index FALSE TRUE
## FALSE 30976 4072
## TRUE 8553 15049
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 12.82807 13.96557
## sample estimates:
## odds ratio
## 13.38389
Question 4.2: Interpret the result of Fisher’s exact test. Is the epigenetic modification more likely to happen on the conserved genes or the unconserved genes?
Answer:
LiftOver is a tool developed by UCSC to map genome coordinates between different genome assemblies by sequence homology. The functionality is realized by the index file (called “chain file”) generated through pair-wised genome alignment.
Under this section, we will first convert the coordinates of the human epigenetic modification sites into the “homologous” sites on mouse genome using liftOver. Then, we will compare the “lifted sites” with the experimentally determined modification sites on the mouse genome. If the number of overlapped sites between the 2 groups is more than the expectation of a random permutation, we can conclude that the epigenetic marker is conserved between human and mouse.
The chain files used for the coordinate conversion between genomes can be downloaded under this link: UCSC
The chain file to convert the hg38 into mm10 is named as: hg38ToMm10.over.chain
. The file is already downloaded and saved under your project directory.
import.chain()
function in package rtracklayer
, then lift the human modification sites (mod_hg38_gr
) from hg38 to mm10 with function liftOver()
. Store the lifted sites into the variable of mod_mm10_lift
.## === Hint code, fill the "___" ============= ##
#chain <- ___
#mod_mm10_lift <- ___
## ===== Enter your code below =============== ##
mod_mm10_lift
## GRangesList object of length 480506:
## [[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr17 66119166 + | rmbase2_hg19_1 0
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[2]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr17 66119141 + | rmbase2_hg19_2 0
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## [[3]]
## GRanges object with 0 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
##
## ...
## <480503 more elements>
We will then check how many lifted sites can be overlapped with the mouse modification sites. The result can be clearly visualized using a Venn diagram.
A Venn diagram with 2 circles are drawn, the circle on the left is the mouse modification sites, and the circle on the right is the lifted human sites. The intersection is the overlapped sites between the lifted human sites and the mouse sites.
library(VennDiagram)
Overlap_index <- mod_mm10_lift %over% mod_mm10_gr
grid.newpage()
draw.pairwise.venn(area1 = length(mod_hg38_gr),
area2 = length(mod_mm10_gr),
cross.area = sum(Overlap_index),
category = c("Mouse_mod","Human_mod"),
lty = rep("blank",2), fill = c("light blue", "pink"),
alpha = rep(0.5, 2), cat.pos = c(0,0),
cat.dist = rep(0.025, 2), scaled = FALSE)
## (polygon[GRID.polygon.11], polygon[GRID.polygon.12], polygon[GRID.polygon.13], polygon[GRID.polygon.14], text[GRID.text.15], text[GRID.text.16], text[GRID.text.17], text[GRID.text.18], text[GRID.text.19])
Then, we will draw the same diagram after shifting the mod_mm10_lift
toward the left by 100 bp.
shift()
to move mod_mm10_lift
horizontally by 100 bp along the genome coordinate. Compute again the overlapping index based on the shifted ranges.## === Hint code, fill the "___" ============= ##
#Overlap_index <- shift(___) %over% mod_mm10_gr
## ===== Enter your code below =============== ##
grid.newpage()
draw.pairwise.venn(area1 = length(mod_hg38_gr),
area2 = length(mod_mm10_gr),
cross.area = sum(Overlap_index),
category = c("Human_mod", "Mouse_mod"),
lty = rep("blank",2), fill = c("light blue", "pink"),
alpha = rep(0.5, 2), cat.pos = c(0,0),
cat.dist = rep(0.025, 2), scaled = FALSE)
## (polygon[GRID.polygon.20], polygon[GRID.polygon.21], polygon[GRID.polygon.22], polygon[GRID.polygon.23], text[GRID.text.24], text[GRID.text.25], text[GRID.text.26], text[GRID.text.27], text[GRID.text.28])
Question 5.3: What is the overlap number between lifted human sites and mouse sites? How does the number changed after shifting the lifted human sites? Do you expect the change in the overlap number caused by chance? From an evolutionary perspective, what conclusion can be derived?
Answer:
PhastCons is a program to extrapolate the likelihood of evolutionary conservation on genomes. Given a phylogenetic tree, the PhastCons score is inferred by an HMM in multiple alignments. The value of the PhastCons score is between 0 - 1, which is the posterior probability of the high conservation state given the observed multiple alignment patterns among species.
Our target is to retrieve the average PhastCons score within the 51bp regions centered (i.e. 25 bp flanking) by the human modification sites on hg38. The same scores are retrieved based on the flanked sites left-shifted by 100bp. At last, the score distributions among the two groups are compared by analyzing their empirical CDFs.
We will use the package phastCons100way.UCSC.hg38
to extract the PhastCons scores on hg38. As the package name suggested, the conservation scores are calculated from a phylogenetic tree built from 100 species. The package exports a GSscores
object under its own package name.
gscores()
to retrieve the PhastCons scores from the GScores
object under the original and the shifted regions. Store the scores into 2 variables named by PCscore_mod_51
and PCscore_shifted_51
, respectively.library(phastCons100way.UCSC.hg38)
## === Hint code, fill the "___" ============= ##
#PCscore_mod_51 <- gscores(___, ___ + 25)$default
#PCscore_shifted_51 <- gscores(___, shift(___ + 25, 100))$default
## ===== Enter your code below =============== ##
## ===== Your code is finished =============== ##
library(ggplot2)
plot_df <- data.frame(PhastCons = c(PCscore_mod_51,PCscore_shifted_51),
Shifted = rep(c(F,T),each = length(mod_hg38_gr)))
ggplot(plot_df, aes(x = PhastCons,
colour = Shifted)) +
stat_ecdf(geom = "step") +
labs(y = "Quantiles", x = "PhastCons Scores")
After running the code above, you will obtain the empirical CDFs for the shifted and original sites.
Question 6.2: Is one distribution establishing first-degree stochastic dominance over the other? If Yes, which group (shifted or original) is dominating over the other? What could we conclude from this on the relationship between the 2 distributions?
Answer:
See the following web pages if you are not familiar with some of the terms:
At last, perform a KS test to compare the distributions of the orginal and shifted Phastcons scores. Please check the function ks.test()
for more information.
## === Hint code, fill the "___" ============= ##
#ks.test(___, ___)
## ===== Enter your code below =============== ##
##
## Two-sample Kolmogorov-Smirnov test
##
## data: PCscore_mod_51 and PCscore_shifted_51
## D = 0.12361, p-value < 2.2e-16
## alternative hypothesis: two-sided
Question 6.3: What null hypothesis can be rejected by the KS test if the p-value is less than 0.05? Relating to our data, what could we conclude from the outcome of the KS test?
Answer:
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.0 phastCons100way.UCSC.hg38_3.7.1
## [3] GenomicScores_2.1.0 VennDiagram_1.6.20
## [5] futile.logger_1.4.3 rtracklayer_1.49.1
## [7] biomaRt_2.45.0 EnsDb.Hsapiens.v86_2.99.0
## [9] ensembldb_2.13.1 AnnotationFilter_1.13.0
## [11] GenomicFeatures_1.41.0 AnnotationDbi_1.51.0
## [13] Biobase_2.49.0 GenomicRanges_1.41.1
## [15] GenomeInfoDb_1.25.0 IRanges_2.23.4
## [17] S4Vectors_0.27.5 BiocGenerics_0.35.2
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.21.0 bitops_1.0-6
## [3] matrixStats_0.56.0 bit64_0.9-7
## [5] progress_1.2.2 httr_1.4.1
## [7] tools_4.0.0 R6_2.4.1
## [9] HDF5Array_1.17.0 colorspace_1.4-1
## [11] DBI_1.1.0 lazyeval_0.2.2
## [13] withr_2.2.0 tidyselect_1.1.0
## [15] prettyunits_1.1.1 bit_1.1-15.2
## [17] curl_4.3 compiler_4.0.0
## [19] formatR_1.7 DelayedArray_0.15.1
## [21] labeling_0.3 scales_1.1.1
## [23] askpass_1.1 rappdirs_0.3.1
## [25] stringr_1.4.0 digest_0.6.25
## [27] Rsamtools_2.5.0 rmarkdown_2.1
## [29] XVector_0.29.0 pkgconfig_2.0.3
## [31] htmltools_0.4.0 dbplyr_1.4.3
## [33] fastmap_1.0.1 BSgenome_1.57.0
## [35] rlang_0.4.6 RSQLite_2.2.0
## [37] shiny_1.4.0.2 farver_2.0.3
## [39] BiocParallel_1.23.0 dplyr_0.8.5
## [41] RCurl_1.98-1.2 magrittr_1.5
## [43] GenomeInfoDbData_1.2.3 Matrix_1.2-18
## [45] munsell_0.5.0 Rcpp_1.0.4.6
## [47] Rhdf5lib_1.11.0 lifecycle_0.2.0
## [49] stringi_1.4.6 yaml_2.2.1
## [51] SummarizedExperiment_1.19.2 zlibbioc_1.35.0
## [53] rhdf5_2.33.0 BiocFileCache_1.13.0
## [55] AnnotationHub_2.21.0 blob_1.2.1
## [57] promises_1.1.0 crayon_1.3.4
## [59] lattice_0.20-41 Biostrings_2.57.0
## [61] hms_0.5.3 knitr_1.28
## [63] pillar_1.4.4 codetools_0.2-16
## [65] futile.options_1.0.1 XML_3.99-0.3
## [67] glue_1.4.0 BiocVersion_3.12.0
## [69] evaluate_0.14 lambda.r_1.2.4
## [71] BiocManager_1.30.10 vctrs_0.3.0
## [73] httpuv_1.5.2 gtable_0.3.0
## [75] openssl_1.4.1 purrr_0.3.4
## [77] assertthat_0.2.1 xfun_0.13
## [79] mime_0.9 xtable_1.8-4
## [81] later_1.0.0 tibble_3.0.1
## [83] GenomicAlignments_1.25.0 memoise_1.1.0
## [85] ellipsis_0.3.0 interactiveDisplayBase_1.27.0