In lab practical 4, our target is to screen for the thermal stable secondary structures on the transcript of human genome assembly hg19. We will first predict RNA structures using RNAfold from transcripts defined by the UCSC gene annotation on hg19. After the predictions are made by RNAfold, you will map the predicted RNA structure from the transcript coordinate into the genome coordinate. Further analysis and statistical tests will be conducted to compare the sequence content of different RNA structure regions on the genome.
The first step of this practical is to load the transcript annotation and genome sequence for human genome assembly hg19.
To achieve this, load the TxDb and BSgenome libraries for hg19 with library()
. Then, rename the BSgenome and TxDb object into the R variables: genome
and txdb
.
Please write your code in the specified area in the code chunk below.
Furthermore, you could choose to refer to the hint comments above the code region, particularly when you find this question is difficult for you.
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
genome <- BSgenome.Hsapiens.UCSC.hg19
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
Check that if ‘txdb’ and ‘genome’ are based on the hg19 assembly:
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
genome
## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: June 2013
## # release name: Genome Reference Consortium GRCh37.p13
## # 298 sequences:
## # chr1 chr2 chr3
## # chr4 chr5 chr6
## # chr7 chr8 chr9
## # chr10 chr11 chr12
## # chr13 chr14 chr15
## # ... ... ...
## # chr19_gl949749_alt chr19_gl949750_alt chr19_gl949751_alt
## # chr19_gl949752_alt chr19_gl949753_alt chr20_gl383577_alt
## # chr21_gl383578_alt chr21_gl383579_alt chr21_gl383580_alt
## # chr21_gl383581_alt chr22_gl383582_alt chr22_gl383583_alt
## # chr22_kb663609_alt
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)
The next step is to extract the exon regions for the RNA transcripts encoded in the hg19 transcript annotation.
You should use function exonsBy()
, and set the argument by = "tx"
, also, please set the argument use.names = TRUE
.
The input for exonsBy()
should be the TxDb object for transcript annotation, the output for exonsBy()
is a GRangesList
of exons grouped by the levels defined by by =
Store the output into the R variable named with transcripts
.
transcripts <- exonsBy(txdb, by = "tx", use.names = TRUE)
Check the layout of the resulting GRangesList
object:
transcripts
## GRangesList object of length 82960:
## $uc001aaa.3
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr1 11874-12227 + | 1 <NA> 1
## [2] chr1 12613-12721 + | 3 <NA> 2
## [3] chr1 13221-14409 + | 5 <NA> 3
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
##
## $uc010nxq.1
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr1 11874-12227 + | 1 <NA> 1
## [2] chr1 12595-12721 + | 2 <NA> 2
## [3] chr1 13403-14409 + | 6 <NA> 3
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
##
## $uc010nxr.1
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr1 11874-12227 + | 1 <NA> 1
## [2] chr1 12646-12697 + | 4 <NA> 2
## [3] chr1 13221-14409 + | 5 <NA> 3
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
##
## ...
## <82957 more elements>
You should now find the ranges of exons grouped by transcripts (the list elements).
The computational cost for the RNA 2ndary structures is expensive for long transcripts. Therefore, we need to subset the transcripts
GRangelist into the transcripts on chromosome 21 and with transcript width > 100 and <= 500.
P.S. Please use the variable name selectedTranscripts
after the filter.
transcripts <- transcripts[seqnames(transcripts) == "chr21"]
tx_width <- sum(width(transcripts))
selectedTranscripts <- transcripts[tx_width > 100 & tx_width <= 500]
Apply the function extractTranscriptSeqs()
to extract the DNA sequences of the mature RNA transcripts (no introns) from genomes. Please type ?extractTranscriptSeqs()
to learn the usage of the function. The sequence output should be stored into the variable named by tx_seqs
.
tx_seqs <- extractTranscriptSeqs(x = genome, transcripts = selectedTranscripts)
Check the outlook of tx_seqs
:
tx_seqs
## DNAStringSet object of length 69:
## width seq names
## [1] 180 CGCGACTGCGGCGGCGGTGGTGG...CCCGCCGGCCGCCTTTCTCGCG uc021wgv.1
## [2] 477 GTCTGAATTTTTTTTGCTTATTA...AATATTCTTGGTCATTCAGCAG uc021whv.1
## [3] 109 GTCTATGGCCATACCACCCTGAA...AGAAAAAAAGTTGATGTAAATA uc021whx.1
## [4] 312 TAATAACATCTACCAAACAGTTT...TGTGAAGATGAAGGTAGAAATT uc021wid.1
## [5] 483 ATGTCCTACAACTGCTGCTCTAG...CTTCTATCAATTCACCTGCTAA uc011acw.2
## ... ... ...
## [65] 273 ATCCTACGCTATGGAGTACAGAC...CACCCCACGCCCCAGTCGAAAG uc010gny.1
## [66] 492 ATCCCTGACTCGGGGTCGCCTTT...AATTAAAAGAGATCGATATTAA uc002zax.1
## [67] 103 GAGATGCTGAGCAGCAGAAGTGC...AGAATCAGAGGATGATCTGGGA uc002zfd.3
## [68] 447 AGACCAGCCCTGTCCTCTGCGCC...GCCAGCAGGCTCTCTGAACTCC uc002zfs.1
## [69] 339 TCAAGTCTTTCCGAGGGGCCAGT...GATTTTCACAGGTATCCAGCTT uc031rwc.1
# as.character(tx_seqs['uc002ytg.1']$uc002ytg.1)
uc002ytg.1
below.Now, save the DNAStringSet
object of variable tx_seqs
as a FASTA file on your disk, the FASTA file should be named by tx_seqs.fasta
.
This step is easy since the export of DNAStringSet is supported by the function writeXStringSet()
.
writeXStringSet(tx_seqs, "tx_seqs.fasta")
tx_seqs.fasta
file. Please try to explain the components of the observed format.>uc021wgv.1
is the name of this DNA sequence while the following sequence is the DNA sequence itself.RNAfold is a command-line tool in the Vienna RNA package to predict RNA secondary structures with RNA sequences. Hence, for this question, you need to call the Linux bash command within R.
RNAfold can be downloaded at here, and you could install it on your own computer. But by this time, on our Linux server, RNAfold has already been installed.
Please run RNAfold with the previously saved FASTA file tx_seq.fasta
.
Set the temperature of folding into 70 degrees with argument --temp=70
, and then save the standard output of RNAfold into a file named by RNAstructure.txt
.
Please notice that in Linux bash, we can channel the output of a bash command into a text file with >
mark, such as:
COMMAND_NAME --ARGUMENT=XXX INPUT_NAME.fasta > OUTPUT_NAME.txt
Also, you could call the system command line within R using the system()
function.
system("RNAfold --temp=70 tx_seqs.fasta > RNAstructure.txt")
Question 6.2, 5%: Open the pdf file for the predicted RNA structure of the transcript uc002ytg.1
, How many hairpin loops are presented in the RNA structure?
Your answer: 6
P.S. In case your computer cannot open .ps file, please use the ps2pdf converter.
Question 6.3, 5%: Copy the first 3 lines of the file RNAstructure.txt
, please also interpret the format.
Your answer:
>uc021wgv.1
CGCGACUGCGGCGGCGGUGGUGGGGGGAGCCGCGGGGAUCGCCGAGGGCCGGUCGGCCGCCCCGGGUGCCGCGCGGUGCCGCCGGC
GGCGGUGAGGCCCCGCGCGUGUGUCCCGGCUGCGGUCGGCCGCGCUCGAGGGGUCCCCGUGGCGUCCCCUUCCCCGCCGGCCGCCU
UUCUCGCG
The next step is to read the predicted RNA structures, which are stored using parentheses representation, into R. The basic idea for the code below is to match for the “(” or “.” in the output file of RNAfold. Then, for hybridized and non-hybridized regions, it will construct the string matching results into GRanges object on the transcript coordinate.
Please fill in the missing portion of the following code chunk to make it run correctly.
## Read the output file into R:
RNAfold_out <- readLines("RNAstructure.txt")
Struc_pred <- RNAfold_out[seq(3, by = 3, length.out = length(tx_seqs))]
## Remove the energy scores attached at the end:
Struc_pred <- gsub(" .*", "", Struc_pred)
Struc_pred <- BStringSet(Struc_pred)
names(Struc_pred) <- names(tx_seqs)
Struc_pred
## BStringSet object of length 69:
## width seq names
## [1] 180 (((((..((((((((((.((.((...)))))).))))).....))))) uc021wgv.1
## [2] 477 ..............((((................)))))))..)))). uc021whv.1
## [3] 109 ............((((.(((((.......................... uc021whx.1
## [4] 312 ........((((((......(((...))))))......)))))).... uc021wid.1
## [5] 483 ..........(((((((((((((......................... uc011acw.2
## ... ... ...
## [65] 273 .......((..(((.(((((......)...))).))............ uc010gny.1
## [66] 492 ............((((((.............................. uc002zax.1
## [67] 103 ......((((....((((.((((.......)))).............. uc002zfd.3
## [68] 447 .............((((.(((((...(((....)))).)))))..... uc002zfs.1
## [69] 339 .........((((((...((((....)..................... uc031rwc.1
## Construct the GRanges object for hybridized and nonhybridized regions:
nonHyb_irl <- lapply(vmatchPattern(".", Struc_pred), reduce)
Hyb_irl <- c(lapply(vmatchPattern("(", Struc_pred), reduce), lapply(vmatchPattern(")",Struc_pred), reduce))
##===== Your Code is finished till here ========##
##Convert the IrangesList into GRanges
irl2grl <- function(irl) GRangesList( mapply(function(x,y) GRanges(seqnames = y,
ranges = x,
strand = "*"),irl,names(irl)) )
Hyb_gr <- unlist(irl2grl(Hyb_irl))
nonHyb_gr <- unlist(irl2grl(nonHyb_irl))
Hyb_gr ##The Granges for Hybridized (Stem) regions on the transcript
## GRanges object with 1866 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## uc021wgv.1 uc021wgv.1 1-5 *
## uc021wgv.1 uc021wgv.1 8-17 *
## uc021wgv.1 uc021wgv.1 19-20 *
## uc021wgv.1 uc021wgv.1 22-27 *
## uc021wgv.1 uc021wgv.1 29-40 *
## ... ... ... ...
## uc031rwc.1 uc031rwc.1 251-254 *
## uc031rwc.1 uc031rwc.1 284-286 *
## uc031rwc.1 uc031rwc.1 297-299 *
## uc031rwc.1 uc031rwc.1 303-306 *
## uc031rwc.1 uc031rwc.1 313-318 *
## -------
## seqinfo: 69 sequences from an unspecified genome; no seqlengths
nonHyb_gr ##The Granges for non-Hybridized (Looped) regions on the transcript
## GRanges object with 1897 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## uc021wgv.1 uc021wgv.1 6-7 *
## uc021wgv.1 uc021wgv.1 18 *
## uc021wgv.1 uc021wgv.1 21 *
## uc021wgv.1 uc021wgv.1 28 *
## uc021wgv.1 uc021wgv.1 41 *
## ... ... ... ...
## uc031rwc.1 uc031rwc.1 280-283 *
## uc031rwc.1 uc031rwc.1 287-296 *
## uc031rwc.1 uc031rwc.1 300-302 *
## uc031rwc.1 uc031rwc.1 307-312 *
## uc031rwc.1 uc031rwc.1 319-339 *
## -------
## seqinfo: 69 sequences from an unspecified genome; no seqlengths
Question 7.2, 5%: How many hybridized regions and loop regions are predicted by RNAfold on the selected transcripts?
Your answer: There have 1866 hybridized regions and 1897 loop regions.
The next step is to map the GRanges from the transcript coordinates into Genome coordinates.
Whenever we need to do this kind of task in R, we generally rely on 2 powerful Bioconductor functions: mapFromTranscripts
and mapToTranscripts
. If our input range is on the transcript and the target range is on the genome, we should use mapFromTranscripts
, conversely, we should use mapTotranscripts
. Please check the help docomentation using ? + function name
for the detailed explanations of their usage.
Hyb_gr <- mapFromTranscripts(x = Hyb_gr, transcripts = selectedTranscripts)
nonHyb_gr <- mapFromTranscripts(x = nonHyb_gr, transcripts = selectedTranscripts)
##===== Your Code is finished till here ======= ##
Hyb_gr
## GRanges object with 1866 ranges and 2 metadata columns:
## seqnames ranges strand | xHits transcriptsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## uc021wgv.1 chr21 9825832-9825836 + | 1 1
## uc021wgv.1 chr21 9825839-9825848 + | 2 1
## uc021wgv.1 chr21 9825850-9825851 + | 3 1
## uc021wgv.1 chr21 9825853-9825858 + | 4 1
## uc021wgv.1 chr21 9825860-9825871 + | 5 1
## ... ... ... ... . ... ...
## uc031rwc.1 chr21 47705113-47705116 - | 1862 69
## uc031rwc.1 chr21 47705081-47705083 - | 1863 69
## uc031rwc.1 chr21 47705068-47705070 - | 1864 69
## uc031rwc.1 chr21 47705061-47705064 - | 1865 69
## uc031rwc.1 chr21 47705049-47705054 - | 1866 69
## -------
## seqinfo: 93 sequences from an unspecified genome; no seqlengths
nonHyb_gr
## GRanges object with 1897 ranges and 2 metadata columns:
## seqnames ranges strand | xHits transcriptsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## uc021wgv.1 chr21 9825837-9825838 + | 1 1
## uc021wgv.1 chr21 9825849 + | 2 1
## uc021wgv.1 chr21 9825852 + | 3 1
## uc021wgv.1 chr21 9825859 + | 4 1
## uc021wgv.1 chr21 9825872 + | 5 1
## ... ... ... ... . ... ...
## uc031rwc.1 chr21 47705084-47705087 - | 1893 69
## uc031rwc.1 chr21 47705071-47705080 - | 1894 69
## uc031rwc.1 chr21 47705065-47705067 - | 1895 69
## uc031rwc.1 chr21 47705055-47705060 - | 1896 69
## uc031rwc.1 chr21 47705028-47705048 - | 1897 69
## -------
## seqinfo: 93 sequences from an unspecified genome; no seqlengths
Question 8.2, 5%: How the seqnames
field of the GRanges changed before and after the transcript to genome mapping? please explain the reason behind this change.
Your answer: Changes from the genome name to chromosome name, because we just map the GRanges from the transcript coordinates into Genome coordinates.
In this step, we will extract the sequence content, particularly the GC content, behind the hybridized regions and the looped regions. GC content is defined as the proportion of G and C within a given nucleotide sequence.
Then, construct 2 vectors named by Hyb_GC
and nonHyb_GC
storing the GC contents for the previously predicted hybridized and non-hybridized regions.
You may need the help of the following functions in this step:
Views()
: for extracting sequences underneath GRanges on BSgenome
object.DNAStringSet()
: for the convertion of the BSgenome views into a set of DNA strings so that we could calculate their sequence content.letterFrequency()
:for the calculation of the proportion / frequency of nucleotides within XStringSet.Please check the documentation for these functions with ?
to get more clues.
Hyb_GC <- letterFrequency(x = DNAStringSet(Views(genome, Hyb_gr)), letters = "CG", as.prob=TRUE)
nonHyb_GC <- letterFrequency(x = DNAStringSet(Views(genome, nonHyb_gr)), letters = "CG", as.prob=TRUE)
Apply the R base function boxplot()
to draw a box plot based on the constructed data.frame plot_df
.
plot_df <- data.frame(GC = c(Hyb_GC, nonHyb_GC),
group = rep(c("Hyb", "nonHyb"),
c(length(Hyb_GC), length(nonHyb_GC))
))
boxplot(GC ~ group, data = plot_df)
# median(Hyb_GC)
# median(nonHyb_GC)
Question 10.2, 5%: What are the median values of GC contents for the hybridized region and non-hybridized region? Please explain why one group is higher than the other from the molecular perspective.
Your answer:
The median values of GC contents for the hybridized region is 0.6
The median values of GC contents for the non-hybridized region is 0.44
The reason why one group is higher than the other is that in the component of hybridized region there have higher DNA density than non-hybridized region, so that the GC content is higher than another.
Conduct the t.test and wilcox.test to test for the difference of sample means between hybridized GC contents and looped GC contents.
t.test(nonHyb_GC, Hyb_GC)
##
## Welch Two Sample t-test
##
## data: nonHyb_GC and Hyb_GC
## t = -18.441, df = 3374.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1768309 -0.1428428
## sample estimates:
## mean of x mean of y
## 0.4388010 0.5986378
wilcox.test(nonHyb_GC, Hyb_GC)
##
## Wilcoxon rank sum test with continuity correction
##
## data: nonHyb_GC and Hyb_GC
## W = 1129828, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [2] GenomicFeatures_1.41.0
## [3] AnnotationDbi_1.51.0
## [4] Biobase_2.49.0
## [5] BSgenome.Hsapiens.UCSC.hg19_1.4.3
## [6] BSgenome_1.57.0
## [7] rtracklayer_1.49.1
## [8] Biostrings_2.57.0
## [9] XVector_0.29.0
## [10] GenomicRanges_1.41.1
## [11] GenomeInfoDb_1.25.0
## [12] IRanges_2.23.4
## [13] S4Vectors_0.27.5
## [14] BiocGenerics_0.35.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4.6 lattice_0.20-41
## [3] prettyunits_1.1.1 Rsamtools_2.5.0
## [5] assertthat_0.2.1 digest_0.6.25
## [7] BiocFileCache_1.13.0 R6_2.4.1
## [9] RSQLite_2.2.0 evaluate_0.14
## [11] httr_1.4.1 pillar_1.4.4
## [13] zlibbioc_1.35.0 rlang_0.4.6
## [15] progress_1.2.2 curl_4.3
## [17] blob_1.2.1 Matrix_1.2-18
## [19] rmarkdown_2.1 BiocParallel_1.23.0
## [21] stringr_1.4.0 RCurl_1.98-1.2
## [23] bit_1.1-15.2 biomaRt_2.45.0
## [25] DelayedArray_0.15.1 compiler_4.0.3
## [27] xfun_0.13 pkgconfig_2.0.3
## [29] askpass_1.1 htmltools_0.4.0
## [31] openssl_1.4.1 tidyselect_1.1.0
## [33] SummarizedExperiment_1.19.2 tibble_3.0.1
## [35] GenomeInfoDbData_1.2.3 matrixStats_0.56.0
## [37] XML_3.99-0.3 crayon_1.3.4
## [39] dplyr_0.8.5 dbplyr_1.4.3
## [41] rappdirs_0.3.1 GenomicAlignments_1.25.0
## [43] bitops_1.0-6 grid_4.0.3
## [45] lifecycle_0.2.0 DBI_1.1.0
## [47] magrittr_1.5 stringi_1.4.6
## [49] ellipsis_0.3.0 vctrs_0.3.0
## [51] tools_4.0.3 bit64_0.9-7
## [53] glue_1.4.0 purrr_0.3.4
## [55] hms_0.5.3 yaml_2.2.1
## [57] memoise_1.1.0 knitr_1.28