0. The outline for lab practical 4:

In lab practical 4, our target is to screen for the thermal stable secondary structures on the transcript of human genome assembly hg19. We will first predict RNA structures using RNAfold from transcripts defined by the UCSC gene annotation on hg19. After the predictions are made by RNAfold, you will map the predicted RNA structure from the transcript coordinate into the genome coordinate. Further analysis and statistical tests will be conducted to compare the sequence content of different RNA structure regions on the genome.

1. Load the hg19 genome and TxDb:

The first step of this practical is to load the transcript annotation and genome sequence for human genome assembly hg19.

To achieve this, load the TxDb and BSgenome libraries for hg19 with library(). Then, rename the BSgenome and TxDb object into the R variables: genome and txdb.

Please write your code in the specified area in the code chunk below.

Furthermore, you could choose to refer to the hint comments above the code region, particularly when you find this question is difficult for you.

library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene) 
genome <- BSgenome.Hsapiens.UCSC.hg19
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

Check that if ‘txdb’ and ‘genome’ are based on the hg19 assembly:

txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
genome
## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: June 2013
## # release name: Genome Reference Consortium GRCh37.p13
## # 298 sequences:
## #   chr1                  chr2                  chr3                 
## #   chr4                  chr5                  chr6                 
## #   chr7                  chr8                  chr9                 
## #   chr10                 chr11                 chr12                
## #   chr13                 chr14                 chr15                
## #   ...                   ...                   ...                  
## #   chr19_gl949749_alt    chr19_gl949750_alt    chr19_gl949751_alt   
## #   chr19_gl949752_alt    chr19_gl949753_alt    chr20_gl383577_alt   
## #   chr21_gl383578_alt    chr21_gl383579_alt    chr21_gl383580_alt   
## #   chr21_gl383581_alt    chr22_gl383582_alt    chr22_gl383583_alt   
## #   chr22_kb663609_alt                                               
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)

2. Extract the exonic ranges grouped by transcript from ‘txdb’:

The next step is to extract the exon regions for the RNA transcripts encoded in the hg19 transcript annotation.

You should use function exonsBy(), and set the argument by = "tx", also, please set the argument use.names = TRUE.

The input for exonsBy() should be the TxDb object for transcript annotation, the output for exonsBy() is a GRangesList of exons grouped by the levels defined by by = Store the output into the R variable named with transcripts.

transcripts <- exonsBy(txdb, by = "tx", use.names = TRUE)

Check the layout of the resulting GRangesList object:

transcripts
## GRangesList object of length 82960:
## $uc001aaa.3
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames      ranges strand |   exon_id   exon_name exon_rank
##          <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 11874-12227      + |         1        <NA>         1
##   [2]     chr1 12613-12721      + |         3        <NA>         2
##   [3]     chr1 13221-14409      + |         5        <NA>         3
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $uc010nxq.1
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames      ranges strand |   exon_id   exon_name exon_rank
##          <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 11874-12227      + |         1        <NA>         1
##   [2]     chr1 12595-12721      + |         2        <NA>         2
##   [3]     chr1 13403-14409      + |         6        <NA>         3
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $uc010nxr.1
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames      ranges strand |   exon_id   exon_name exon_rank
##          <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 11874-12227      + |         1        <NA>         1
##   [2]     chr1 12646-12697      + |         4        <NA>         2
##   [3]     chr1 13221-14409      + |         5        <NA>         3
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## ...
## <82957 more elements>

You should now find the ranges of exons grouped by transcripts (the list elements).

3. Filter the transcripts by the chromosome number and the transcript lengths:

The computational cost for the RNA 2ndary structures is expensive for long transcripts. Therefore, we need to subset the transcripts GRangelist into the transcripts on chromosome 21 and with transcript width > 100 and <= 500.

P.S. Please use the variable name selectedTranscripts after the filter.

transcripts <- transcripts[seqnames(transcripts) == "chr21"]
tx_width <- sum(width(transcripts))
selectedTranscripts <- transcripts[tx_width > 100 & tx_width <= 500]

4. Extract the transcript sequences from the genome:

Apply the function extractTranscriptSeqs() to extract the DNA sequences of the mature RNA transcripts (no introns) from genomes. Please type ?extractTranscriptSeqs() to learn the usage of the function. The sequence output should be stored into the variable named by tx_seqs.

tx_seqs <- extractTranscriptSeqs(x = genome, transcripts = selectedTranscripts)

Check the outlook of tx_seqs:

tx_seqs
## DNAStringSet object of length 69:
##      width seq                                              names               
##  [1]   180 CGCGACTGCGGCGGCGGTGGTGG...CCCGCCGGCCGCCTTTCTCGCG uc021wgv.1
##  [2]   477 GTCTGAATTTTTTTTGCTTATTA...AATATTCTTGGTCATTCAGCAG uc021whv.1
##  [3]   109 GTCTATGGCCATACCACCCTGAA...AGAAAAAAAGTTGATGTAAATA uc021whx.1
##  [4]   312 TAATAACATCTACCAAACAGTTT...TGTGAAGATGAAGGTAGAAATT uc021wid.1
##  [5]   483 ATGTCCTACAACTGCTGCTCTAG...CTTCTATCAATTCACCTGCTAA uc011acw.2
##  ...   ... ...
## [65]   273 ATCCTACGCTATGGAGTACAGAC...CACCCCACGCCCCAGTCGAAAG uc010gny.1
## [66]   492 ATCCCTGACTCGGGGTCGCCTTT...AATTAAAAGAGATCGATATTAA uc002zax.1
## [67]   103 GAGATGCTGAGCAGCAGAAGTGC...AGAATCAGAGGATGATCTGGGA uc002zfd.3
## [68]   447 AGACCAGCCCTGTCCTCTGCGCC...GCCAGCAGGCTCTCTGAACTCC uc002zfs.1
## [69]   339 TCAAGTCTTTCCGAGGGGCCAGT...GATTTTCACAGGTATCCAGCTT uc031rwc.1
# as.character(tx_seqs['uc002ytg.1']$uc002ytg.1)

5. Save the sequences in the disk as fasta File:

Now, save the DNAStringSet object of variable tx_seqs as a FASTA file on your disk, the FASTA file should be named by tx_seqs.fasta.

This step is easy since the export of DNAStringSet is supported by the function writeXStringSet().

writeXStringSet(tx_seqs, "tx_seqs.fasta")

6. Run RNAfold for the fasta File with bash command on the Linux system:

RNAfold is a command-line tool in the Vienna RNA package to predict RNA secondary structures with RNA sequences. Hence, for this question, you need to call the Linux bash command within R.

RNAfold can be downloaded at here, and you could install it on your own computer. But by this time, on our Linux server, RNAfold has already been installed.

Please run RNAfold with the previously saved FASTA file tx_seq.fasta.

Set the temperature of folding into 70 degrees with argument --temp=70, and then save the standard output of RNAfold into a file named by RNAstructure.txt.

Please notice that in Linux bash, we can channel the output of a bash command into a text file with > mark, such as:

COMMAND_NAME --ARGUMENT=XXX INPUT_NAME.fasta > OUTPUT_NAME.txt

Also, you could call the system command line within R using the system() function.

system("RNAfold  --temp=70 tx_seqs.fasta > RNAstructure.txt")

P.S. In case your computer cannot open .ps file, please use the ps2pdf converter.

7. Read the output of RNAfold into GRanges:

The next step is to read the predicted RNA structures, which are stored using parentheses representation, into R. The basic idea for the code below is to match for the “(” or “.” in the output file of RNAfold. Then, for hybridized and non-hybridized regions, it will construct the string matching results into GRanges object on the transcript coordinate.

Please fill in the missing portion of the following code chunk to make it run correctly.

## Read the output file into R:
RNAfold_out <- readLines("RNAstructure.txt")
Struc_pred <- RNAfold_out[seq(3, by = 3, length.out = length(tx_seqs))]

## Remove the energy scores attached at the end:
Struc_pred <- gsub(" .*", "", Struc_pred)
Struc_pred <- BStringSet(Struc_pred)
names(Struc_pred) <- names(tx_seqs)
Struc_pred
## BStringSet object of length 69:
##      width seq                                              names               
##  [1]   180 (((((..((((((((((.((.((...)))))).))))).....))))) uc021wgv.1
##  [2]   477 ..............((((................)))))))..)))). uc021whv.1
##  [3]   109 ............((((.(((((.......................... uc021whx.1
##  [4]   312 ........((((((......(((...))))))......)))))).... uc021wid.1
##  [5]   483 ..........(((((((((((((......................... uc011acw.2
##  ...   ... ...
## [65]   273 .......((..(((.(((((......)...))).))............ uc010gny.1
## [66]   492 ............((((((.............................. uc002zax.1
## [67]   103 ......((((....((((.((((.......)))).............. uc002zfd.3
## [68]   447 .............((((.(((((...(((....)))).)))))..... uc002zfs.1
## [69]   339 .........((((((...((((....)..................... uc031rwc.1
## Construct the GRanges object for hybridized and nonhybridized regions:
nonHyb_irl  <- lapply(vmatchPattern(".", Struc_pred), reduce)
Hyb_irl <- c(lapply(vmatchPattern("(", Struc_pred), reduce), lapply(vmatchPattern(")",Struc_pred), reduce))

##===== Your Code is finished till here ========##

##Convert the IrangesList into GRanges
irl2grl <- function(irl) GRangesList( mapply(function(x,y) GRanges(seqnames = y,
                                                                   ranges = x,
                                                                   strand = "*"),irl,names(irl)) )

Hyb_gr <- unlist(irl2grl(Hyb_irl))
nonHyb_gr <- unlist(irl2grl(nonHyb_irl))

Hyb_gr ##The Granges for Hybridized (Stem) regions on the transcript
## GRanges object with 1866 ranges and 0 metadata columns:
##                seqnames    ranges strand
##                   <Rle> <IRanges>  <Rle>
##   uc021wgv.1 uc021wgv.1       1-5      *
##   uc021wgv.1 uc021wgv.1      8-17      *
##   uc021wgv.1 uc021wgv.1     19-20      *
##   uc021wgv.1 uc021wgv.1     22-27      *
##   uc021wgv.1 uc021wgv.1     29-40      *
##          ...        ...       ...    ...
##   uc031rwc.1 uc031rwc.1   251-254      *
##   uc031rwc.1 uc031rwc.1   284-286      *
##   uc031rwc.1 uc031rwc.1   297-299      *
##   uc031rwc.1 uc031rwc.1   303-306      *
##   uc031rwc.1 uc031rwc.1   313-318      *
##   -------
##   seqinfo: 69 sequences from an unspecified genome; no seqlengths
nonHyb_gr ##The Granges for non-Hybridized (Looped) regions on the transcript
## GRanges object with 1897 ranges and 0 metadata columns:
##                seqnames    ranges strand
##                   <Rle> <IRanges>  <Rle>
##   uc021wgv.1 uc021wgv.1       6-7      *
##   uc021wgv.1 uc021wgv.1        18      *
##   uc021wgv.1 uc021wgv.1        21      *
##   uc021wgv.1 uc021wgv.1        28      *
##   uc021wgv.1 uc021wgv.1        41      *
##          ...        ...       ...    ...
##   uc031rwc.1 uc031rwc.1   280-283      *
##   uc031rwc.1 uc031rwc.1   287-296      *
##   uc031rwc.1 uc031rwc.1   300-302      *
##   uc031rwc.1 uc031rwc.1   307-312      *
##   uc031rwc.1 uc031rwc.1   319-339      *
##   -------
##   seqinfo: 69 sequences from an unspecified genome; no seqlengths

8. Map from transcript coordinates to genome coordinates:

The next step is to map the GRanges from the transcript coordinates into Genome coordinates.

Whenever we need to do this kind of task in R, we generally rely on 2 powerful Bioconductor functions: mapFromTranscripts and mapToTranscripts. If our input range is on the transcript and the target range is on the genome, we should use mapFromTranscripts, conversely, we should use mapTotranscripts. Please check the help docomentation using ? + function name for the detailed explanations of their usage.

Hyb_gr <- mapFromTranscripts(x = Hyb_gr, transcripts = selectedTranscripts)
nonHyb_gr <- mapFromTranscripts(x = nonHyb_gr, transcripts = selectedTranscripts)
##===== Your Code is finished till here ======= ##
Hyb_gr
## GRanges object with 1866 ranges and 2 metadata columns:
##              seqnames            ranges strand |     xHits transcriptsHits
##                 <Rle>         <IRanges>  <Rle> | <integer>       <integer>
##   uc021wgv.1    chr21   9825832-9825836      + |         1               1
##   uc021wgv.1    chr21   9825839-9825848      + |         2               1
##   uc021wgv.1    chr21   9825850-9825851      + |         3               1
##   uc021wgv.1    chr21   9825853-9825858      + |         4               1
##   uc021wgv.1    chr21   9825860-9825871      + |         5               1
##          ...      ...               ...    ... .       ...             ...
##   uc031rwc.1    chr21 47705113-47705116      - |      1862              69
##   uc031rwc.1    chr21 47705081-47705083      - |      1863              69
##   uc031rwc.1    chr21 47705068-47705070      - |      1864              69
##   uc031rwc.1    chr21 47705061-47705064      - |      1865              69
##   uc031rwc.1    chr21 47705049-47705054      - |      1866              69
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
nonHyb_gr
## GRanges object with 1897 ranges and 2 metadata columns:
##              seqnames            ranges strand |     xHits transcriptsHits
##                 <Rle>         <IRanges>  <Rle> | <integer>       <integer>
##   uc021wgv.1    chr21   9825837-9825838      + |         1               1
##   uc021wgv.1    chr21           9825849      + |         2               1
##   uc021wgv.1    chr21           9825852      + |         3               1
##   uc021wgv.1    chr21           9825859      + |         4               1
##   uc021wgv.1    chr21           9825872      + |         5               1
##          ...      ...               ...    ... .       ...             ...
##   uc031rwc.1    chr21 47705084-47705087      - |      1893              69
##   uc031rwc.1    chr21 47705071-47705080      - |      1894              69
##   uc031rwc.1    chr21 47705065-47705067      - |      1895              69
##   uc031rwc.1    chr21 47705055-47705060      - |      1896              69
##   uc031rwc.1    chr21 47705028-47705048      - |      1897              69
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths

9. Extract the GC content under each regions:

In this step, we will extract the sequence content, particularly the GC content, behind the hybridized regions and the looped regions. GC content is defined as the proportion of G and C within a given nucleotide sequence.

Then, construct 2 vectors named by Hyb_GC and nonHyb_GC storing the GC contents for the previously predicted hybridized and non-hybridized regions.

You may need the help of the following functions in this step:

Please check the documentation for these functions with ? to get more clues.

Hyb_GC <- letterFrequency(x = DNAStringSet(Views(genome, Hyb_gr)), letters = "CG", as.prob=TRUE)
nonHyb_GC <- letterFrequency(x = DNAStringSet(Views(genome, nonHyb_gr)), letters = "CG", as.prob=TRUE)

10. Draw a box plot of the GC content stratified by regions:

Apply the R base function boxplot() to draw a box plot based on the constructed data.frame plot_df.

plot_df <- data.frame(GC = c(Hyb_GC, nonHyb_GC),
                      group = rep(c("Hyb", "nonHyb"),
                                  c(length(Hyb_GC), length(nonHyb_GC))
                                  ))
boxplot(GC ~ group, data = plot_df)

# median(Hyb_GC)
# median(nonHyb_GC)

11. Test the difference in means using t-test and wilcox-test:

Conduct the t.test and wilcox.test to test for the difference of sample means between hybridized GC contents and looped GC contents.

t.test(nonHyb_GC, Hyb_GC)
## 
##  Welch Two Sample t-test
## 
## data:  nonHyb_GC and Hyb_GC
## t = -18.441, df = 3374.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1768309 -0.1428428
## sample estimates:
## mean of x mean of y 
## 0.4388010 0.5986378
wilcox.test(nonHyb_GC, Hyb_GC)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  nonHyb_GC and Hyb_GC
## W = 1129828, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [2] GenomicFeatures_1.41.0                 
##  [3] AnnotationDbi_1.51.0                   
##  [4] Biobase_2.49.0                         
##  [5] BSgenome.Hsapiens.UCSC.hg19_1.4.3      
##  [6] BSgenome_1.57.0                        
##  [7] rtracklayer_1.49.1                     
##  [8] Biostrings_2.57.0                      
##  [9] XVector_0.29.0                         
## [10] GenomicRanges_1.41.1                   
## [11] GenomeInfoDb_1.25.0                    
## [12] IRanges_2.23.4                         
## [13] S4Vectors_0.27.5                       
## [14] BiocGenerics_0.35.2                    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6                lattice_0.20-41            
##  [3] prettyunits_1.1.1           Rsamtools_2.5.0            
##  [5] assertthat_0.2.1            digest_0.6.25              
##  [7] BiocFileCache_1.13.0        R6_2.4.1                   
##  [9] RSQLite_2.2.0               evaluate_0.14              
## [11] httr_1.4.1                  pillar_1.4.4               
## [13] zlibbioc_1.35.0             rlang_0.4.6                
## [15] progress_1.2.2              curl_4.3                   
## [17] blob_1.2.1                  Matrix_1.2-18              
## [19] rmarkdown_2.1               BiocParallel_1.23.0        
## [21] stringr_1.4.0               RCurl_1.98-1.2             
## [23] bit_1.1-15.2                biomaRt_2.45.0             
## [25] DelayedArray_0.15.1         compiler_4.0.3             
## [27] xfun_0.13                   pkgconfig_2.0.3            
## [29] askpass_1.1                 htmltools_0.4.0            
## [31] openssl_1.4.1               tidyselect_1.1.0           
## [33] SummarizedExperiment_1.19.2 tibble_3.0.1               
## [35] GenomeInfoDbData_1.2.3      matrixStats_0.56.0         
## [37] XML_3.99-0.3                crayon_1.3.4               
## [39] dplyr_0.8.5                 dbplyr_1.4.3               
## [41] rappdirs_0.3.1              GenomicAlignments_1.25.0   
## [43] bitops_1.0-6                grid_4.0.3                 
## [45] lifecycle_0.2.0             DBI_1.1.0                  
## [47] magrittr_1.5                stringi_1.4.6              
## [49] ellipsis_0.3.0              vctrs_0.3.0                
## [51] tools_4.0.3                 bit64_0.9-7                
## [53] glue_1.4.0                  purrr_0.3.4                
## [55] hms_0.5.3                   yaml_2.2.1                 
## [57] memoise_1.1.0               knitr_1.28