The outline for Lecture Practical 5

In Lecture Practical 5, our target is to perform comparative genomic analysis between large animal genomes using Ensembl and UCSC databases. The first part of Lecture Practical 5 extracts the orthologous mapping between genes defined in the Ensembl database. Subsequently, the association between the conservation and protein-coding status of different genes is examined. The second half of Lecture Practical 5 will use the liftOver function to convert the genomic locations of epigenetic makers between human and mouse. In the end, statistical analysis will be performed to compare the PhastCons scores under the genomic regions around the epigenetic modification sites.

1. EnsemblDb package

EnsDb can be understood as the Ensemble version of the TxDb object. The TxDb stores the transcript annotation of UCSC KnownGenes, while EnsDb stores the gene annotations from Ensemble. Compared to UCSC, Ensemble collects more genes, transcripts, and supplementary information (e.x. biotypes). All of the analysis performed in this section is based on the EnsemblDb package of GRCh38 (hg38).

The code below load the human EnsemblDb package:

## Load the EnsemblDb package for hg38:
library(EnsDb.Hsapiens.v86)
EnsDb.Hsapiens.v86
## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.3.0
## |Creation time: Thu May 18 16:32:27 2017
## |ensembl_version: 86
## |ensembl_host: localhost
## |Organism: homo_sapiens
## |taxonomy_id: 9606
## |genome_build: GRCh38
## |DBSCHEMAVERSION: 2.0
## | No. of genes: 63970.
## | No. of transcripts: 216741.
## |Protein data available.
## === Hint code, fill the "___"  ============= ##
#genes_hg38 <- ___(___)
## ===== Enter your code below  =============== ##
genes_hg38
## GRanges object with 63970 ranges and 6 metadata columns:
##                   seqnames            ranges strand |         gene_id
##                      <Rle>         <IRanges>  <Rle> |     <character>
##   ENSG00000223972        1       11869-14409      + | ENSG00000223972
##   ENSG00000227232        1       14404-29570      - | ENSG00000227232
##   ENSG00000278267        1       17369-17436      - | ENSG00000278267
##   ENSG00000243485        1       29554-31109      + | ENSG00000243485
##   ENSG00000237613        1       34554-36081      - | ENSG00000237613
##               ...      ...               ...    ... .             ...
##   ENSG00000224240        Y 26549425-26549743      + | ENSG00000224240
##   ENSG00000227629        Y 26586642-26591601      - | ENSG00000227629
##   ENSG00000237917        Y 26594851-26634652      - | ENSG00000237917
##   ENSG00000231514        Y 26626520-26627159      - | ENSG00000231514
##   ENSG00000235857        Y 56855244-56855488      + | ENSG00000235857
##                     gene_name                       gene_biotype
##                   <character>                        <character>
##   ENSG00000223972     DDX11L1 transcribed_unprocessed_pseudogene
##   ENSG00000227232      WASH7P             unprocessed_pseudogene
##   ENSG00000278267   MIR6859-1                              miRNA
##   ENSG00000243485   MIR1302-2                            lincRNA
##   ENSG00000237613     FAM138A                            lincRNA
##               ...         ...                                ...
##   ENSG00000224240     CYCSP49               processed_pseudogene
##   ENSG00000227629  SLC25A15P1             unprocessed_pseudogene
##   ENSG00000237917     PARP4P1             unprocessed_pseudogene
##   ENSG00000231514     FAM58CP               processed_pseudogene
##   ENSG00000235857     CTBP2P1               processed_pseudogene
##                   seq_coord_system      symbol                       entrezid
##                        <character> <character>                         <list>
##   ENSG00000223972       chromosome     DDX11L1 100287596,100287102,727856,...
##   ENSG00000227232       chromosome      WASH7P                             NA
##   ENSG00000278267       chromosome   MIR6859-1                      102466751
##   ENSG00000243485       chromosome   MIR1302-2            105376912,100302278
##   ENSG00000237613       chromosome     FAM138A           654835,645520,641702
##               ...              ...         ...                            ...
##   ENSG00000224240       chromosome     CYCSP49                             NA
##   ENSG00000227629       chromosome  SLC25A15P1                             NA
##   ENSG00000237917       chromosome     PARP4P1                             NA
##   ENSG00000231514       chromosome     FAM58CP                             NA
##   ENSG00000235857       chromosome     CTBP2P1                             NA
##   -------
##   seqinfo: 357 sequences from GRCh38 genome

P.S. Just like a vector data type in R, the names of a GRanges object can be accessed with names(). Also, you can subset the GRanges by its names using [ ].

2. Retrieve the homologous genes of human in mouse

The othologous relationships between species are stored in the Ensembl database. To make an online query to the Ensembl database within R, the biomaRt package is required.

The mart object stored in m represents the connection to the human gene data sets on Ensembl database.

library(biomaRt)
m <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
m
## Object of class 'Mart':
##   Using the ENSEMBL_MART_ENSEMBL BioMart database
##   Using the hsapiens_gene_ensembl dataset

The keys (ID types can be recognized for conversion) of Ensembl database can be accessed with: grep( "gene_id", listFilters(m)$name , value = TRUE).

The columns (ID types can be converted into) of Ensembl database can be accessed with: grep( "mmusculus", listAttributes(m)$name , value = TRUE).

grep( "gene_id", listFilters(m)$name , value = TRUE)
## [1] "ensembl_gene_id"         "ensembl_gene_id_version"
## [3] "entrezgene_id"           "wikigene_id"
grep( "mmusculus", listAttributes(m)$name , value = TRUE) #Subset the results by regular expressions to narrow the range of choice.
##  [1] "mmusculus_homolog_ensembl_gene"                
##  [2] "mmusculus_homolog_associated_gene_name"        
##  [3] "mmusculus_homolog_ensembl_peptide"             
##  [4] "mmusculus_homolog_chromosome"                  
##  [5] "mmusculus_homolog_chrom_start"                 
##  [6] "mmusculus_homolog_chrom_end"                   
##  [7] "mmusculus_homolog_canonical_transcript_protein"
##  [8] "mmusculus_homolog_subtype"                     
##  [9] "mmusculus_homolog_orthology_type"              
## [10] "mmusculus_homolog_perc_id"                     
## [11] "mmusculus_homolog_perc_id_r1"                  
## [12] "mmusculus_homolog_goc_score"                   
## [13] "mmusculus_homolog_wga_coverage"                
## [14] "mmusculus_homolog_orthology_confidence"

Within the listed choices of attributes and filters, a mapping can be made between human’s Ensembl gene id and the mouse homologous genes using the function getBM(). The result of the query is stored into map.

To understand more about the database accession in R, you may check the “annotation” section in the R functions Reference List (Attached under lab 1).

map <- getBM(mart = m,
             attributes = c("ensembl_gene_id","mmusculus_homolog_ensembl_gene"),
             filters = "ensembl_gene_id",
             values = names(genes_hg38))

Examine the mapping results with:

table(table(map$ensembl_gene_id)) #Check the frequencies of duplication in the query 
## 
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 61042   688   216   104    55    28    25    23    18    14     8     4     7 
##    14    15    16    17    18    19    20    21    23    25    27    28    29 
##    26     5    14     1     2     1     1     1     1     2     3     1     6 
##    30    58    70    93   105 
##     1     1     1    22     1

3. Association between the conserved genes and the protein coding genes

After getting the homologous map between human genes and mouse genes, we want to see if the conserved genes between human and mouse are more likely to be protein-coding genes.

The code chunk bellow extracts the human Ensembl gene ids for the conserved human genes (on mouse). The conserved human genes are defined by the genes having 1 or more homologous genes on mouse, and the result of the homologous gene cannot be "" (it means not having confirmed homologous on mouse genome).

After extracting the conserved human gene id, a dummy variable (TRUE or FALSE variable) column is attached to the metadata of the genes_hg38 GRanges object, and the additional metadata column is named by conserved_on_mm10. The dummy variable is TRUE if the corresponding gene is conserved in mouse, and FALSE otherwise.

conserved_human_id <- map$ensembl_gene_id[map$mmusculus_homolog_ensembl_gene != ""]
genes_hg38$conserved_on_mm10 <- names(genes_hg38) %in% conserved_human_id
table(genes_hg38$conserved_on_mm10) #Evaluate annotated metadata column with table()
## 
## FALSE  TRUE 
## 44831 19139

Compute a (2*2) contingency table from 2 logical vectors. Examine the contingency table, and then tests the association on the contingency table using the function fisher.test().

As we have learned in the 4th lecture, Fisher’s exact test can be viewed as the permutation test of a pair of equally-lengthed binary vectors. In other words, fixing one binary vector and permute the other can simulate the null distribution of a Fisher’s exact test. Therefore, the test is essentially detecting the non-random association of 1s and 0s on the corresponding positions of the vectors.

## === Hint code, fill the "___"  ============= ##
#coding_genes <- genes_hg38$___ == ___
#conserved_genes <- genes_hg38$conserved_on_mm10
#contingency_table <- table(___, ___)
#contingency_table
#fisher.test(___)
## ===== Enter your code below  =============== ##
##             conserved_genes
## coding_genes FALSE  TRUE
##        FALSE 40032  1653
##        TRUE   4799 17486
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  83.03389 93.60773
## sample estimates:
## odds ratio 
##   88.45853

4. Evolutionary conservation of epigenetic modification

The aim of this part is to compare the locus of an epigenetic marker over human and mouse.

We have the modification data in BED format on both the human and mouse genomes, and the files are stored under the directory of Lecture Practical 5. The names of the 2 files are: mod_hg38.bed and mod_mm10.bed.

Both the human and mouse data are the single based annotation sites on the genome, i.e. the wideth of the ranges are all equal to 1.

The import() function defined in rtracklayer package can be used to import the 2 BED files into R. The variable names of them should be mod_hg38_gr and mod_hg19_gr

library(rtracklayer)
mod_hg38_gr <- import("mod_hg38.bed")
mod_mm10_gr <- import("mod_mm10.bed")

We still need to use the genes_hg38 to analyze the modification data. However, the chromosome names of it is not in the UCSC style (the style used by the BED files), and we need to run the following code to convert it.

re_genes_hg38 <- keepStandardChromosomes(genes_hg38, pruning.mode = "coarse")
seqlevels(re_genes_hg38) = paste0("chr",seqlevels(re_genes_hg38))
seqlevels(re_genes_hg38)[23] <- "chrM"

Task 4.1: Perform a Fisher’s exact test for the association between the modified genes and the conserved genes. The index of the gene’s conservation is defined by the previous analysis. The %over% function (one of the intra-range method) is useful to reach this goal.

## === Hint code, fill the "___"  ============= ##
#modified_index <- re_genes_hg38 %over% ___
#conserved_index <- re_genes_hg38$conserved_on_mm10
#contingency_table <- table(___, ___)
#contingency_table
#fisher.test(___)
## ===== Enter your code below  =============== ##
##               conserved_index
## modified_index FALSE  TRUE
##          FALSE 30976  4072
##          TRUE   8553 15049
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  12.82807 13.96557
## sample estimates:
## odds ratio 
##   13.38389

5. LiftOver

LiftOver is a tool developed by UCSC to map genome coordinates between different genome assemblies by sequence homology. The functionality is realized by the index file (called “chain file”) generated through pair-wised genome alignment.

Under this section, we will first convert the coordinates of the human epigenetic modification sites into the “homologous” sites on mouse genome using liftOver. Then, we will compare the “lifted sites” with the experimentally determined modification sites on the mouse genome. If the number of overlapped sites between the 2 groups is more than the expectation of a random permutation, we can conclude that the epigenetic marker is conserved between human and mouse.

The chain files used for the coordinate conversion between genomes can be downloaded under this link: UCSC

The chain file to convert the hg38 into mm10 is named as: hg38ToMm10.over.chain. The file is already downloaded and saved under your project directory.

## === Hint code, fill the "___"  ============= ##
#chain <- ___
#mod_mm10_lift <- ___
## ===== Enter your code below  =============== ##
mod_mm10_lift
## GRangesList object of length 480506:
## [[1]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |           name     score
##          <Rle> <IRanges>  <Rle> |    <character> <numeric>
##   [1]    chr17  66119166      + | rmbase2_hg19_1         0
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[2]]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |           name     score
##          <Rle> <IRanges>  <Rle> |    <character> <numeric>
##   [1]    chr17  66119141      + | rmbase2_hg19_2         0
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## [[3]]
## GRanges object with 0 ranges and 2 metadata columns:
##    seqnames    ranges strand |        name     score
##       <Rle> <IRanges>  <Rle> | <character> <numeric>
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
## 
## ...
## <480503 more elements>

We will then check how many lifted sites can be overlapped with the mouse modification sites. The result can be clearly visualized using a Venn diagram.

A Venn diagram with 2 circles are drawn, the circle on the left is the mouse modification sites, and the circle on the right is the lifted human sites. The intersection is the overlapped sites between the lifted human sites and the mouse sites.

library(VennDiagram)
Overlap_index <- mod_mm10_lift %over% mod_mm10_gr
grid.newpage()
draw.pairwise.venn(area1 = length(mod_hg38_gr),
                   area2 = length(mod_mm10_gr),
                   cross.area = sum(Overlap_index), 
                   category = c("Mouse_mod","Human_mod"), 
                   lty = rep("blank",2), fill = c("light blue", "pink"), 
                   alpha = rep(0.5, 2), cat.pos = c(0,0), 
                   cat.dist = rep(0.025, 2), scaled = FALSE)

## (polygon[GRID.polygon.11], polygon[GRID.polygon.12], polygon[GRID.polygon.13], polygon[GRID.polygon.14], text[GRID.text.15], text[GRID.text.16], text[GRID.text.17], text[GRID.text.18], text[GRID.text.19])

Then, we will draw the same diagram after shifting the mod_mm10_lift toward the left by 100 bp.

## === Hint code, fill the "___"  ============= ##
#Overlap_index <- shift(___) %over% mod_mm10_gr
## ===== Enter your code below  =============== ##
grid.newpage()
draw.pairwise.venn(area1 = length(mod_hg38_gr),
                   area2 = length(mod_mm10_gr),
                   cross.area = sum(Overlap_index), 
                   category = c("Human_mod", "Mouse_mod"), 
                   lty = rep("blank",2), fill = c("light blue", "pink"), 
                   alpha = rep(0.5, 2), cat.pos = c(0,0), 
                   cat.dist = rep(0.025, 2), scaled = FALSE)

## (polygon[GRID.polygon.20], polygon[GRID.polygon.21], polygon[GRID.polygon.22], polygon[GRID.polygon.23], text[GRID.text.24], text[GRID.text.25], text[GRID.text.26], text[GRID.text.27], text[GRID.text.28])

6. PhastCons scores

PhastCons is a program to extrapolate the likelihood of evolutionary conservation on genomes. Given a phylogenetic tree, the PhastCons score is inferred by an HMM in multiple alignments. The value of the PhastCons score is between 0 - 1, which is the posterior probability of the high conservation state given the observed multiple alignment patterns among species.

Our target is to retrieve the average PhastCons score within the 51bp regions centered (i.e. 25 bp flanking) by the human modification sites on hg38. The same scores are retrieved based on the flanked sites left-shifted by 100bp. At last, the score distributions among the two groups are compared by analyzing their empirical CDFs.

We will use the package phastCons100way.UCSC.hg38 to extract the PhastCons scores on hg38. As the package name suggested, the conservation scores are calculated from a phylogenetic tree built from 100 species. The package exports a GSscores object under its own package name.

library(phastCons100way.UCSC.hg38)
## === Hint code, fill the "___"  ============= ##
#PCscore_mod_51 <- gscores(___, ___ + 25)$default
#PCscore_shifted_51 <- gscores(___, shift(___ + 25, 100))$default
## ===== Enter your code below  =============== ##

## ===== Your code is finished  =============== ##
library(ggplot2)
plot_df <- data.frame(PhastCons = c(PCscore_mod_51,PCscore_shifted_51),
                      Shifted = rep(c(F,T),each = length(mod_hg38_gr)))
ggplot(plot_df, aes(x = PhastCons,
                    colour = Shifted)) + 
      stat_ecdf(geom = "step") + 
      labs(y = "Quantiles", x = "PhastCons Scores") 

After running the code above, you will obtain the empirical CDFs for the shifted and original sites.

See the following web pages if you are not familiar with some of the terms:

At last, perform a KS test to compare the distributions of the orginal and shifted Phastcons scores. Please check the function ks.test() for more information.

## === Hint code, fill the "___"  ============= ##
#ks.test(___, ___)
## ===== Enter your code below  =============== ##
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  PCscore_mod_51 and PCscore_shifted_51
## D = 0.12361, p-value < 2.2e-16
## alternative hypothesis: two-sided

Session info

sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] ggplot2_3.3.0                   phastCons100way.UCSC.hg38_3.7.1
##  [3] GenomicScores_2.1.0             VennDiagram_1.6.20             
##  [5] futile.logger_1.4.3             rtracklayer_1.49.1             
##  [7] biomaRt_2.45.0                  EnsDb.Hsapiens.v86_2.99.0      
##  [9] ensembldb_2.13.1                AnnotationFilter_1.13.0        
## [11] GenomicFeatures_1.41.0          AnnotationDbi_1.51.0           
## [13] Biobase_2.49.0                  GenomicRanges_1.41.1           
## [15] GenomeInfoDb_1.25.0             IRanges_2.23.4                 
## [17] S4Vectors_0.27.5                BiocGenerics_0.35.2            
## 
## loaded via a namespace (and not attached):
##  [1] ProtGenerics_1.21.0           bitops_1.0-6                 
##  [3] matrixStats_0.56.0            bit64_0.9-7                  
##  [5] progress_1.2.2                httr_1.4.1                   
##  [7] tools_4.0.0                   R6_2.4.1                     
##  [9] HDF5Array_1.17.0              colorspace_1.4-1             
## [11] DBI_1.1.0                     lazyeval_0.2.2               
## [13] withr_2.2.0                   tidyselect_1.1.0             
## [15] prettyunits_1.1.1             bit_1.1-15.2                 
## [17] curl_4.3                      compiler_4.0.0               
## [19] formatR_1.7                   DelayedArray_0.15.1          
## [21] labeling_0.3                  scales_1.1.1                 
## [23] askpass_1.1                   rappdirs_0.3.1               
## [25] stringr_1.4.0                 digest_0.6.25                
## [27] Rsamtools_2.5.0               rmarkdown_2.1                
## [29] XVector_0.29.0                pkgconfig_2.0.3              
## [31] htmltools_0.4.0               dbplyr_1.4.3                 
## [33] fastmap_1.0.1                 BSgenome_1.57.0              
## [35] rlang_0.4.6                   RSQLite_2.2.0                
## [37] shiny_1.4.0.2                 farver_2.0.3                 
## [39] BiocParallel_1.23.0           dplyr_0.8.5                  
## [41] RCurl_1.98-1.2                magrittr_1.5                 
## [43] GenomeInfoDbData_1.2.3        Matrix_1.2-18                
## [45] munsell_0.5.0                 Rcpp_1.0.4.6                 
## [47] Rhdf5lib_1.11.0               lifecycle_0.2.0              
## [49] stringi_1.4.6                 yaml_2.2.1                   
## [51] SummarizedExperiment_1.19.2   zlibbioc_1.35.0              
## [53] rhdf5_2.33.0                  BiocFileCache_1.13.0         
## [55] AnnotationHub_2.21.0          blob_1.2.1                   
## [57] promises_1.1.0                crayon_1.3.4                 
## [59] lattice_0.20-41               Biostrings_2.57.0            
## [61] hms_0.5.3                     knitr_1.28                   
## [63] pillar_1.4.4                  codetools_0.2-16             
## [65] futile.options_1.0.1          XML_3.99-0.3                 
## [67] glue_1.4.0                    BiocVersion_3.12.0           
## [69] evaluate_0.14                 lambda.r_1.2.4               
## [71] BiocManager_1.30.10           vctrs_0.3.0                  
## [73] httpuv_1.5.2                  gtable_0.3.0                 
## [75] openssl_1.4.1                 purrr_0.3.4                  
## [77] assertthat_0.2.1              xfun_0.13                    
## [79] mime_0.9                      xtable_1.8-4                 
## [81] later_1.0.0                   tibble_3.0.1                 
## [83] GenomicAlignments_1.25.0      memoise_1.1.0                
## [85] ellipsis_0.3.0                interactiveDisplayBase_1.27.0