0. Practical guide for TAs and students

BIO214 will have 5 marked programming practicals that will together account for 30% of the final mark. The Rstudio Server is the required platform for completing these practicals, and students can log in using the account and URL provided on LMO.

Each practical will have a sample HTML file that contains the correct programming output. To complete the programming assignment, students need to fill in the gaps labeled with ___ in the RMD file and obtain the expected result.

Students will receive full scores for their programming questions if their code runs through and yields the standard output. Short answer questions are also included in the programming questions, and students need to type their answers at ___ and submit them together with the knitted report.

Once students have completed all the questions, they need to click the “Knit” button to generate an HTML report. If the report is generated successfully, they should download the HTML file by selecting the file and clicking “More” > “Export”. The submission of the practicals is done by uploading the HTML report to the assignment box on Learning Mall. The deadlines for submissions are set for two weeks after the corresponding lab schedule.

If students encounter coding problems, they can seek help from the following websites: Google, StackOverflow, and Biostars. These websites are excellent resources for proactive learning since most of their skills are developed when they struggle and solve problems in practice.

1. Linux bash basics (Not Marked)

The first lab practical aims to achieve raw read download, read QC, read trimming, read alignment, and read quantification. However, before implementing the entire pipeline, it is necessary to become familiar with several basic bash commands.

To view the output for each command, you can click the rightmost button in the code chunk labeled “Run Current Chunk.” Alternatively, students can enter the bash commands in the Terminal next to the Console. The Terminal is the more standard way of working on Linux/Unix systems.

For this practical, you are required to type the bash command in the code chunk. It is essential to try to understand the code comments (after #) below because knowing bash commands is a fundamental skill for a bioinformatician.

## pwd can retrieve your current working directory
pwd 
## /home/Huangtianchi.Zhu19/BIO214/LabPractical1
## ls can list all the files/folders in your current working directory
ls 
## LabPractical1.html
## LabPractical1_Model.html
## LabPractical1.Rmd
## SRR1039508_filtered.bam
## SRR1039508.sam
## SRR1039508_sub_1.fastq
## SRR1039508_sub_1.fastq_trimming_report.txt
## SRR1039508_sub_1_val_1.fq
## SRR1039508_sub_2.fastq
## SRR1039508_sub_2_fastqc.html
## SRR1039508_sub_2_fastqc.zip
## SRR1039508_sub_2.fastq_trimming_report.txt
## SRR1039508_sub_2_val_2.fq
## SRR1039508_subset.bam
## SRR1039508_subset.bam.bai
## SRR1039508_unfiltered.bam
## Adding argument after ls can return more file information (file size, date of change, & file access)
ls -lh 
## total 39M
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 712K Mar 17 18:15 LabPractical1.html
## -rw-r--rw- 1 root               root               685K Mar  2 19:15 LabPractical1_Model.html
## -rw-r--rw- 1 root               root                20K Mar 17 22:30 LabPractical1.Rmd
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2.5M Mar 17 18:14 SRR1039508_filtered.bam
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19  11M Mar 17 18:14 SRR1039508.sam
## -rw-r--rw- 1 root               root               5.1M Mar  2 19:15 SRR1039508_sub_1.fastq
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2.0K Mar 17 18:14 SRR1039508_sub_1.fastq_trimming_report.txt
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 5.1M Mar 17 18:14 SRR1039508_sub_1_val_1.fq
## -rw-r--rw- 1 root               root               5.1M Mar  2 19:15 SRR1039508_sub_2.fastq
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 620K Mar 17 18:14 SRR1039508_sub_2_fastqc.html
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 309K Mar 17 18:14 SRR1039508_sub_2_fastqc.zip
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2.3K Mar 17 18:14 SRR1039508_sub_2.fastq_trimming_report.txt
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 5.1M Mar 17 18:14 SRR1039508_sub_2_val_2.fq
## -rw-r--rw- 1 root               root               631K Mar  2 19:15 SRR1039508_subset.bam
## -rw-r--rw- 1 root               root               6.9K Mar  2 19:15 SRR1039508_subset.bam.bai
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2.7M Mar 17 18:14 SRR1039508_unfiltered.bam
## Create a directory called testfolder with mkdir
mkdir ~/BIO214/LabPractical1/testfolder 

## Check the current directory to see the newly created folder
ls 
## LabPractical1.html
## LabPractical1_Model.html
## LabPractical1.Rmd
## SRR1039508_filtered.bam
## SRR1039508.sam
## SRR1039508_sub_1.fastq
## SRR1039508_sub_1.fastq_trimming_report.txt
## SRR1039508_sub_1_val_1.fq
## SRR1039508_sub_2.fastq
## SRR1039508_sub_2_fastqc.html
## SRR1039508_sub_2_fastqc.zip
## SRR1039508_sub_2.fastq_trimming_report.txt
## SRR1039508_sub_2_val_2.fq
## SRR1039508_subset.bam
## SRR1039508_subset.bam.bai
## SRR1039508_unfiltered.bam
## testfolder

~” at here is used to represent the home directory (/Users/YOUR_USER_NAME/).

One can also use “.” to represent the current directory, by doing so the mkdir command is changed to mkdir ./testfolder. However, it is always safe to type the “full directory” rather than use “.”, because it can avoid potential mistakes after changing the directory of the script.

## Create a text message with content 'This is my first file' and store it in a file named by 'HelloWorld.txt'
echo This is my first file > HelloWorld.txt
## Check the content of HelloWorld.txt
cat HelloWorld.txt 
## This is my first file
## Move the file toward the folder with mv
mv  HelloWorld.txt ~/BIO214/LabPractical1/testfolder

## Change the current working directory toward the folder with cd
cd ~/BIO214/LabPractical1/testfolder

## Re-check the existing files in current working directory
ls
## HelloWorld.txt
## Set the directory to ~/BIO214/LabPractical1
cd ~/BIO214/LabPractical1

## Remove the entire test folder with rm
rm -r ~/BIO214/LabPractical1/testfolder

## Re-check the content of the LabPractical1 directory, see that the testfolder has been removed
ls
## LabPractical1.html
## LabPractical1_Model.html
## LabPractical1.Rmd
## SRR1039508_filtered.bam
## SRR1039508.sam
## SRR1039508_sub_1.fastq
## SRR1039508_sub_1.fastq_trimming_report.txt
## SRR1039508_sub_1_val_1.fq
## SRR1039508_sub_2.fastq
## SRR1039508_sub_2_fastqc.html
## SRR1039508_sub_2_fastqc.zip
## SRR1039508_sub_2.fastq_trimming_report.txt
## SRR1039508_sub_2_val_2.fq
## SRR1039508_subset.bam
## SRR1039508_subset.bam.bai
## SRR1039508_unfiltered.bam

Please be careful when running rm, because everything deleted by rm is NOT recoverable.

To get more knowledge on linux bash usage, go visit this link.

2. Download and analyze RNA-Seq raw reads (10%)

Here we will download the RNA-Seq sample SRR1039508 from GEO.

# fastq-dump --split-3 SRR1039508

Adding argument --split-3 will ensure the paired-end library being split into 2 fastq files.

The download is time consuming, so we will skip it in this practice. If you accidentally type the download command in terminal, it will likely to take hours to finish. Please then use ‘Ctrl + C’ to cancel the running process in the terminal.

We can use nohup fastq-dump --split-3 SRR1039508 > SRR1039508_dld.out & to download the fastq file in the backstage. By doing so, multiple download processes can be submitted at once. P.S. nohup command > xxx.out & can be used to send any bash command to run at the backstage.

The files already presented in your LabPractical1 directory contains the first 20000 read pairs for each of the downloaded fastq files:

Please fill the following code chunk to check the first 12 lines of the file SRR1039508_sub_1.fastq.

head -n 12 SRR1039508_sub_1.fastq
## @SRR1039508.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130 length=63
## CATTGCTGATACCAANNNNNNNNGCATTCCTCAAGGTCTTCCTCCTTCCCTTACGGAATTACA
## +SRR1039508.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130 length=63
## HJJJJJJJJJJJJJJ########00?GHIJJJJJJJIJJJJJJJJJJJJJJJJJHHHFFFFFD
## @SRR1039508.2 HWI-ST177:290:C0TECACXX:1:1101:1311:2131 length=63
## CCCTGGACTGCTTCTTGAAAAGTGCCATCCAAACTCTATCTTTGGGGAGAGTATGATAGAGAT
## +SRR1039508.2 HWI-ST177:290:C0TECACXX:1:1101:1311:2131 length=63
## HJJJJJJJJJJJJJJJJJIIJIGHIJJJJJJJJJJJJJJJJJJJJJJGHHIDHIJJHHHHHHF
## @SRR1039508.3 HWI-ST177:290:C0TECACXX:1:1101:1414:2163 length=63
## TCGATCCATCGATTGGAAGGCACTGATCTGGACTGTCAGGTTGGTGGTCTTATTTGCAAGTCC
## +SRR1039508.3 HWI-ST177:290:C0TECACXX:1:1101:1414:2163 length=63
## HJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJGJJIGHIJJBGIJCGIAHIJHHHHHHHFFFFF

Here we see three reads, with names, sequence, +, then quality score.

We will try to find the position of the third read using a simple BLAST search to the human genome (build hg19 / GRCh37).

On the BLAST search page, under Choose Search Set > Database > choose Genome (GRCh37.p13 reference assembly…)

Question 2.1: The third read aligns uniquely to chromosome 16 of the human genome, falling in the second exon of the gene DHX38, which is on the plus strand (transcribed from left to right). What is the genomic position to which the first basepair of the third read aligns? (Use human genome build: hg19 / GRCh37)

Your Answer: 72,130,180

Next, check the first 12 lines of the file SRR1039508_sub_2.fastq.

head -n 12 SRR1039508_sub_2.fastq
## @SRR1039508.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130 length=63
## CAGATGAGGCGTGTTGGCCAGAGAGCCATTGTCAACAGCAGAGATGNNNNNNNNNNNNAATCC
## +SRR1039508.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130 length=63
## HJJJJJJJJJJHIIIJJJJJJJJJJJJJJJJJJJJJJJHIJIJHII#################
## @SRR1039508.2 HWI-ST177:290:C0TECACXX:1:1101:1311:2131 length=63
## TACTCCGGAGAACAGATGGGATTCCCTAGGAGACCCTTGAGGGAAAAGGGAGCCCCAATCTCT
## +SRR1039508.2 HWI-ST177:290:C0TECACXX:1:1101:1311:2131 length=63
## FJJJJJJJFHEHJJJHIIJJGGIIJJGIIJGJHJJJJJHGIJJIGIHHHHFFFDDDDDDDDDE
## @SRR1039508.3 HWI-ST177:290:C0TECACXX:1:1101:1414:2163 length=63
## TCGCTCTCTCCGTTTCAGGGAAGCCAGCAAGTCCAGTCCGAGTAATGAAGGGCGGGGAGCAGG
## +SRR1039508.3 HWI-ST177:290:C0TECACXX:1:1101:1414:2163 length=63
## HJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJIJJJJJJFHIJFHJJJJJIHHHFDDDDDDDDD

Question 2.2: The third read in the second file also aligns uniquely to chromosome 16 of the human genome, aligning to the minus strand, also in the second exon of the gene DHX38. What is the genomic position to which the first basepair of the third read in the second file aligns?

Your Answer: 72,130,081

Above we saw the two reads of a paired-end fragment, both aligning within an exon of a human gene.

For a strand-specific RNA-sequencing protocol, we could have seen this kind of alignment to a + strand gene:

[1st read + strand] … [2nd read - strand]

…or we could have seen the 1st read aligning on the right side:

[2nd read + strand] … [1st read - strand]

A strand-specific protocol means that we only observe fragments with the same strand as the gene.

However, many experiments, including the one we are examining are not strand-specific. This means that, for a plus strand gene, we will observe, in addition to the above two kinds of paired-end fragments, two more:

[1st read - strand] … [2nd read + strand]

[2nd read - strand] … [1st read + strand]

3. Run fastqc and Trim Galore (15%)

Subsequently, we are going to check the qualities of the fastq file SRR1039508_sub_2.fastq with fastqc.

Please type your code below; if you run it successfully, you can find the output a HTML report in the current directory.

fastqc SRR1039508_sub_2.fastq
## Started analysis of SRR1039508_sub_2.fastq
## Approx 5% complete for SRR1039508_sub_2.fastq
## Approx 10% complete for SRR1039508_sub_2.fastq
## Approx 15% complete for SRR1039508_sub_2.fastq
## Approx 20% complete for SRR1039508_sub_2.fastq
## Approx 25% complete for SRR1039508_sub_2.fastq
## Approx 30% complete for SRR1039508_sub_2.fastq
## Approx 35% complete for SRR1039508_sub_2.fastq
## Approx 40% complete for SRR1039508_sub_2.fastq
## Approx 45% complete for SRR1039508_sub_2.fastq
## Approx 50% complete for SRR1039508_sub_2.fastq
## Approx 55% complete for SRR1039508_sub_2.fastq
## Approx 60% complete for SRR1039508_sub_2.fastq
## Approx 65% complete for SRR1039508_sub_2.fastq
## Approx 70% complete for SRR1039508_sub_2.fastq
## Approx 75% complete for SRR1039508_sub_2.fastq
## Approx 85% complete for SRR1039508_sub_2.fastq
## Approx 90% complete for SRR1039508_sub_2.fastq
## Approx 100% complete for SRR1039508_sub_2.fastq
## Analysis complete for SRR1039508_sub_2.fastq

The directory can be navigated with the “Files” panel on the right. Click and open the HTML report in a new page.

Question 3.1: Which basepair has the lowest 10% quality score?

Your Answer: 63

Question 3.2: What base is most common in the first position of the read?

Your Answer: T

Next, using the code demonstrated in the course slides, apply Trim Galore to cut the paired end reads.

trim_galore --paired SRR1039508_sub_1.fastq SRR1039508_sub_2.fastq

Question 3.3: What percentage of the read-pairs are dropped after running trim_galore? (infer it from trim_galore stdout)

Your Answer: 0.92%

4. Perform genome alignment with hisat2 (15%)

The hisat2 index of hg19 can be directly downloaded from the hisat2 webpage

I store the previously downloaded index under the directory of /data/hg19_idx/genome.

Next, you need to align the pairs of “trimmed reads” using the code demonstrated in the lecture, name the output SAM file as SRR1039508.sam

hisat2 -x /data/hg19_idx/genome -1 SRR1039508_sub_1_val_1.fq -2 SRR1039508_sub_2_val_2.fq -S SRR1039508.sam 
## 19815 reads; of these:
##   19815 (100.00%) were paired; of these:
##     796 (4.02%) aligned concordantly 0 times
##     18162 (91.66%) aligned concordantly exactly 1 time
##     857 (4.33%) aligned concordantly >1 times
##     ----
##     796 pairs aligned concordantly 0 times; of these:
##       152 (19.10%) aligned discordantly 1 time
##     ----
##     644 pairs aligned 0 times concordantly or discordantly; of these:
##       1288 mates make up the pairs; of these:
##         699 (54.27%) aligned 0 times
##         512 (39.75%) aligned exactly 1 time
##         77 (5.98%) aligned >1 times
## 98.24% overall alignment rate

The mapping efficiency is a crucial statistics for NGS data analysis; a high mapping rate (> 80 %) usually indicate a fine library quality.

It is defined as: \[\text{Mapping efficiency}= \frac{\text{# of uniquely best mapped reads}}{\text{# of total reads}} \]

In case of the alignment of paired-end reads, the numerator of interest is the number of uniquely best concordant aligned reads.

Question 4.1: Please enter the mapping efficiency returned by the hisat2 report.

Your Answer: 98.24%

5. SAM file (20%)

The SAM format and its compressed version, BAM, contains information about the alignment of sequences to a reference genome.

Section 1.3 describes the header of the SAM file.

Check the head and end of the SAM file generated in last section with head and tail:

head SRR1039508.sam
tail SRR1039508.sam
## @HD  VN:1.0  SO:unsorted
## @SQ  SN:chr1 LN:249250621
## @SQ  SN:chr10    LN:135534747
## @SQ  SN:chr11    LN:135006516
## @SQ  SN:chr11_gl000202_random    LN:40103
## @SQ  SN:chr12    LN:133851895
## @SQ  SN:chr13    LN:115169878
## @SQ  SN:chr14    LN:107349540
## @SQ  SN:chr15    LN:102531392
## @SQ  SN:chr16    LN:90354753
## SRR1039508.19996 99  chr10   45953638    60  62M =   45953676    101 TTCCTTCAGCTCTTCATGGATGTCCAGGAAAATGACAACCCGCACACAATCAGACGTGAATG  FE@FFGEGGIIAHDHI>BE@DDEEEBFHGIIIEG@FHGED@FGIGIHGIIIIBACABB?BB@  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:62 YS:i:0  YT:Z:CP NH:i:1
## SRR1039508.19996 147 chr10   45953676    60  63M =   45953638    -101    CCCGCACACAATCAGACGTGAATGATTTCTGCTCCAGTGTCTTCAGGCTCTGTGCAACAAGAA DFHHA>EDCEGGGCFFC==@GB8?HIJHGHGAGFD>GGCEIGGDDGHGF?FAHHGCFCFHHFB AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:63 YS:i:0  YT:Z:CP NH:i:1
## SRR1039508.19997 83  chr9    4741096 60  63M =   4740970 -189    ACTGAGGCCCGCACCGCGCGGGTACCAGGGCTTTGGCCTGGCCTGCGCGCTCACCCGCTCGGC ADCBDBBBDBD<5DDDDDDDDC>3DBCDDDDDDDDDDDDDDDDDDDHHIIGH@JJJJJJJJJH AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:63 YS:i:0  YT:Z:CP NH:i:1
## SRR1039508.19997 163 chr9    4740970 60  63M =   4741096 189 CCCCGCTGGAGAGGTGCTTCAGCTCGAAGTGTGTAGTGATGCGCGACGACACGGTGCCCTTGC HJJJJJJJJHIHJJBGHJJJJJJJJJHIJDG@GHIIGIIHHHHFFDDDDDDDDD@DBDDDDDD AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:63 YS:i:0  YT:Z:CP NH:i:1
## SRR1039508.19998 83  chr7    140154370   60  62M =   140154314   -118    CATCTTCAGAGTCTGTTAGTTCGTCGTCCCCACCTGCAGCCAAAAGCATAAGCAACATCTCG  ::BBBB@:((<3>AAA>2B?>DB=EA55</0(B?3AD?:<FFFGFFGCGBED@?HFC1E1BF  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:62 YS:i:0  YT:Z:CP NH:i:1
## SRR1039508.19998 163 chr7    140154314   60  63M =   140154370   118 TTGCTATAGATCCAAGTCATAAAAATCTTCCAGCTCATCATGAAACAAGTCCCACTCATCTTC :??EFGGGG+AFBHD?9?C>DDFHJID<HIIJAEGIEIFHCF@EBFGCG@DHGGGHGHHHHHH AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:63 YS:i:0  YT:Z:CP NH:i:1
## SRR1039508.19999 83  chr10   64957302    60  22M949N41M  =   64957229    -136    GCTCTTTTCCAGGCAATTTTGGCATCCTTTTTCACCCAGGACAAAGCTGTTTTTTCAGATGTT FFFFHHHHHJJJJJJJJJJIJJJJJJJJJJIGGDJJJJJHGJJJJJJJJJJJJJJJJJJJJJH AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:63 YS:i:0  YT:Z:CP XS:A:-  NH:i:1
## SRR1039508.19999 163 chr10   64957229    60  63M =   64957302    136 TTTTTGGCAGACCCAGTGAATGTTAAACAATGTTGCTTCACATGCATCACACATCTCCCGGAC HJJJJJJJJJJJJJJJIJJJJJIJJJJJJJJJHIJJJJHIJJJJJJJJJJJJJJJJJJJHHFF AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:63 YS:i:0  YT:Z:CP NH:i:1
## SRR1039508.20000 99  chr7    128399866   60  41M7603N22M =   128407696   290 TAAGAATGCTGATGGTTTCATTGATCTAGAAGAGTATATTGGTGACATGTACAGCCATGATGG HJJJJJJJJJJIJJJHJJJJJJIJJJJJJIJJJJFHIJJJJJIIJJIJJIIIIJIJJJJIJJJ AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:63 YS:i:0  YT:Z:CP XS:A:+  NH:i:1
## SRR1039508.20000 147 chr7    128407696   60  14M1407N49M =   128399866   -290    CAGACCAAAACAAGGATGGCAAGCTTACCAAGGAGGAGATCGTTGACAAGTATGACTTATTTG FHHHHIJJJHJJJJJJJJJJJJIJIJHJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJH AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:63 YS:i:0  YT:Z:CP XS:A:+  NH:i:1

Question 5.1: What is the length of chr10 according to the SAM file?

Answer: 135,534,747

The CIGAR string tells us how the read aligned to the genome, using numbers and letters, M for match, I for insertion, D for deletion, N for skipped bases on the reference genome.

Question 5.2: What is the CIGAR string for the last read? Try to interpret on the presence of a long gap region reflected by the CIGAR string.

Answer: chr11_gl000202_random A CIGAR string is a way of representing how a query sequence aligns to a reference genome1. It uses different operators to indicate matches, mismatches, insertions, deletions and gaps2. A long gap region in a CIGAR string means that there is a large distance between two aligned segments of the query sequence on the reference genome3. This could be due to splicing events or structural variations3.

Bitwise FLAG is an integer which encodes possible YES/NO answers to the choices below. The bitwise FLAG is most easily decoded at this link, Explain SAM Flags, for the PICARD software.

Question 5.3: Refer to the PICARD webpage, explain the SAM FLAG of the last line of read.

Answer: 147 read paired, read mapped in proper pair, read reverse strand, second in pair

SAM can be converted into BAM using samtools:

samtools view -b SRR1039508.sam -o SRR1039508_unfiltered.bam

A filter for FLAG score can be passed with -F INT argument, and the output will only include reads with none of the FLAGS in INT present.

Findout the FLAG score that have the following attributes, and then pass it to the following code chunk.

samtools view -F 3852 -b SRR1039508.sam -o SRR1039508_filtered.bam
ls -l SRR1039508_unfiltered.bam
ls -l SRR1039508_filtered.bam
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2777669 Mar 17 22:31 SRR1039508_unfiltered.bam
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2598575 Mar 17 22:31 SRR1039508_filtered.bam

Question 5.4: Compare the file sizes (use ls -l) with and without the FLAG filter, what proportion of the reads are dropped?

Answer: 179,094

6. IGV (20%)

In this assessment we will use IGV to visualize the BAM file SRR1039508_subset.bam under your current directory. Make sure to download the BAM and its index (+.bai) into your own computer. The download will be triggered by selecting the file and clicking More > Export.

IGV can be downloaded from this link.

Once you have started IGV, you may have to click “More…” under genomes to get hg19.

Zoom in to gene SRM which is within the range chr1:11,112,000-11,122,000.

Question 6.1: Which strand is this gene: plus or minus?

Answer: minus

Question 6.2: In the first exon (rightmost) there is an alternate allele (C) to the reference (T). What is the genomic location of this SNP? (The number of the basepair only, don’t include the chromosome.)

Answer: 11,119,978

Look up the region around the SNP: chr1:11,119,891-11,119,910 in the UCSC Genome Browser (hg19).

You can right click and “Show details” for objects in the UCSC Genome Browser.

This appears to be a known and synonymous variant (meaning it does not change the amino acid sequence in the resulting protein because both codons code for the same amino acid: S (TCA) –> S (TCG).

Back in IGV, right click on the BAM file track and choose Squished. We want to count how many reads are in a particular intron. First we will make sure that we see all the possible reads.

Click View > Preferences > Alignments tab. At the top, make sure that “Downsample reads” is NOT checked. Later if you are using IGV for your own work, you may want to check this again, as it speeds up the visualization. But for this question we want to see all the reads.

Question 6.3: How many reads are in the first intron (the rightmost intron)? (Just count by eye the number of reads falling entirely in this intron.)

Answer: 4

7. Quantify gene expression level (20%)

In this question, we want to quantify gene expression levels by counting reads from a set of aligned RNA-Seq samples. The codes in the chunk bellow are R codes, as we will use the SummarizedOverlaps function in R to perform read count.

library(airway) #The airway package is used here since it contains a collection of exemplar RNA-seq BAM files
dir <- system.file("extdata", package="airway", mustWork=TRUE)  
list.files(dir) #To see all the example data stored in airway package
##  [1] "GSE52778_series_matrix.txt"        "Homo_sapiens.GRCh37.75_subset.gtf"
##  [3] "quants"                            "sample_table.csv"                 
##  [5] "SraRunInfo_SRP033351.csv"          "SRR1039508_subset.bam"            
##  [7] "SRR1039509_subset.bam"             "SRR1039512_subset.bam"            
##  [9] "SRR1039513_subset.bam"             "SRR1039516_subset.bam"            
## [11] "SRR1039517_subset.bam"             "SRR1039520_subset.bam"            
## [13] "SRR1039521_subset.bam"
csvfile <- file.path(dir,"sample_table.csv")  
(sampleTable <- read.csv(csvfile,row.names=1)) #Load and examine the table for the sample information  
##            SampleName    cell   dex albut        Run avgLength Experiment
## SRR1039508 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345
## SRR1039509 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349
## SRR1039513 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353
## SRR1039517 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520       101  SRX384357
## SRR1039521 GSM1275875 N061011   trt untrt SRR1039521        98  SRX384358
##               Sample    BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
## SRR1039520 SRS508579 SAMN02422683
## SRR1039521 SRS508580 SAMN02422677
filenames <- file.path(dir, paste0(sampleTable$Run, "_subset.bam")) 
file.exists(filenames) 
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

After checking the directories of the BAM files on the genome, we will use Rsamtools package to scan all the BAM files and combine them into a BamFileList object.

library(Rsamtools)  
bamfiles <- BamFileList(filenames, yieldSize=2000000) #yieldSize argument can control memory usage
seqinfo(bamfiles[1]) #Checking the sequence info stored in the first BAM files
## Seqinfo object with 84 sequences from an unspecified genome:
##   seqnames   seqlengths isCircular genome
##   1           249250621       <NA>   <NA>
##   10          135534747       <NA>   <NA>
##   11          135006516       <NA>   <NA>
##   12          133851895       <NA>   <NA>
##   13          115169878       <NA>   <NA>
##   ...               ...        ...    ...
##   GL000210.1      27682       <NA>   <NA>
##   GL000231.1      27386       <NA>   <NA>
##   GL000229.1      19913       <NA>   <NA>
##   GL000226.1      15008       <NA>   <NA>
##   GL000207.1       4262       <NA>   <NA>

Next, we will make an annotation object (TxDb) directly from the GTF files using makeTxDbFromGFF(). Again, the GRangesList containing the exons grouped by genes is extracted by the function exonsBy().

library(GenomicFeatures)
library(GenomicAlignments)
gtffile <- file.path(dir,"Homo_sapiens.GRCh37.75_subset.gtf") 
(txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character())) 
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: /usr/local/lib/R/site-library/airway/extdata/Homo_sapiens.GRCh37.75_subset.gtf
## # Organism: NA
## # Taxonomy ID: NA
## # miRBase build ID: NA
## # Genome: NA
## # Nb of transcripts: 65
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2023-03-17 22:31:26 +0800 (Fri, 17 Mar 2023)
## # GenomicFeatures version at creation time: 1.41.0
## # RSQLite version at creation time: 2.2.0
## # DBSCHEMAVERSION: 1.2
(exbg <- exonsBy(txdb, by="gene")) #GrangesList object stores the gene features
## GRangesList object of length 20:
## $ENSG00000009724
## GRanges object with 18 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        1 11086580-11087705      - |        98 ENSE00000818830
##    [2]        1 11090233-11090307      - |        99 ENSE00000472123
##    [3]        1 11090805-11090939      - |       100 ENSE00000743084
##    [4]        1 11094885-11094963      - |       101 ENSE00000743085
##    [5]        1 11097750-11097868      - |       102 ENSE00003482788
##    ...      ...               ...    ... .       ...             ...
##   [14]        1 11106948-11107176      - |       111 ENSE00003467404
##   [15]        1 11106948-11107176      - |       112 ENSE00003489217
##   [16]        1 11107260-11107280      - |       113 ENSE00001833377
##   [17]        1 11107260-11107284      - |       114 ENSE00001472289
##   [18]        1 11107260-11107290      - |       115 ENSE00001881401
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## ...
## <19 more elements>
exbg #Try to understand the structure of this object; each row/ranges is an exon, and $`X` is a gene id.
## GRangesList object of length 20:
## $ENSG00000009724
## GRanges object with 18 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        1 11086580-11087705      - |        98 ENSE00000818830
##    [2]        1 11090233-11090307      - |        99 ENSE00000472123
##    [3]        1 11090805-11090939      - |       100 ENSE00000743084
##    [4]        1 11094885-11094963      - |       101 ENSE00000743085
##    [5]        1 11097750-11097868      - |       102 ENSE00003482788
##    ...      ...               ...    ... .       ...             ...
##   [14]        1 11106948-11107176      - |       111 ENSE00003467404
##   [15]        1 11106948-11107176      - |       112 ENSE00003489217
##   [16]        1 11107260-11107280      - |       113 ENSE00001833377
##   [17]        1 11107260-11107284      - |       114 ENSE00001472289
##   [18]        1 11107260-11107290      - |       115 ENSE00001881401
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## ...
## <19 more elements>

The BamFileList, together with the GRangesList for gene features is now passed to the summarizedOverlaps() function. The function is an R equivalent implementation of the HTSeq package. You can type ?summarizedOverlaps() in the console to see the documentation on its arguments.

se <- summarizeOverlaps(features=exbg,           # The gene features of interest, in our case exonic regions of genes.
                        reads=bamfiles,        # The BAM file list for the aligned reads
                        mode="Union",         # describes the overlapping rule applied during read count.
                        singleEnd=FALSE,     # Should the read being treated as single end
                        ignore.strand=TRUE, # Should the read count ignoring strands (FALSE for strand specific library only)
                        fragments=TRUE )  # Should read be counted as fragment (TRUE for paired-end only)

The output of the summarizedOverlaps() is a SummarizedExperiment object. It is a very powerful R object for genomics having a matrix form (associated with dim). The row features / gene features in SummarizedExperiment can be accessed with rowRanges(se), and the count matrix can be accessed with assay(se).

dim(se)               ## 20 genes, 8 samples / libraries
## [1] 20  8
head(assay(se), 3)    ## See the count of the first 3 rows
##                 SRR1039508_subset.bam SRR1039509_subset.bam
## ENSG00000009724                    38                    28
## ENSG00000116649                  1004                  1255
## ENSG00000120942                   218                   256
##                 SRR1039512_subset.bam SRR1039513_subset.bam
## ENSG00000009724                    66                    24
## ENSG00000116649                  1122                  1313
## ENSG00000120942                   233                   252
##                 SRR1039516_subset.bam SRR1039517_subset.bam
## ENSG00000009724                    42                    41
## ENSG00000116649                  1100                  1879
## ENSG00000120942                   269                   465
##                 SRR1039520_subset.bam SRR1039521_subset.bam
## ENSG00000009724                    47                    36
## ENSG00000116649                   745                  1536
## ENSG00000120942                   207                   400
colSums(assay(se))    ## Calculate sum of reads in each sample / library
## SRR1039508_subset.bam SRR1039509_subset.bam SRR1039512_subset.bam 
##                  6478                  6501                  7699 
## SRR1039513_subset.bam SRR1039516_subset.bam SRR1039517_subset.bam 
##                  6801                  8009                 10849 
## SRR1039520_subset.bam SRR1039521_subset.bam 
##                  5254                  9168
rowRanges(se)         ## The GRangesList for the row gene features
## GRangesList object of length 20:
## $ENSG00000009724
## GRanges object with 18 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        1 11086580-11087705      - |        98 ENSE00000818830
##    [2]        1 11090233-11090307      - |        99 ENSE00000472123
##    [3]        1 11090805-11090939      - |       100 ENSE00000743084
##    [4]        1 11094885-11094963      - |       101 ENSE00000743085
##    [5]        1 11097750-11097868      - |       102 ENSE00003482788
##    ...      ...               ...    ... .       ...             ...
##   [14]        1 11106948-11107176      - |       111 ENSE00003467404
##   [15]        1 11106948-11107176      - |       112 ENSE00003489217
##   [16]        1 11107260-11107280      - |       113 ENSE00001833377
##   [17]        1 11107260-11107284      - |       114 ENSE00001472289
##   [18]        1 11107260-11107290      - |       115 ENSE00001881401
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## ...
## <19 more elements>
colData(se)   ## The table for the experimental design (column data table)
## DataFrame with 8 rows and 0 columns
(colData(se) <- DataFrame(sampleTable))
## DataFrame with 8 rows and 9 columns
##             SampleName        cell         dex       albut         Run
##            <character> <character> <character> <character> <character>
## SRR1039508  GSM1275862      N61311       untrt       untrt  SRR1039508
## SRR1039509  GSM1275863      N61311         trt       untrt  SRR1039509
## SRR1039512  GSM1275866     N052611       untrt       untrt  SRR1039512
## SRR1039513  GSM1275867     N052611         trt       untrt  SRR1039513
## SRR1039516  GSM1275870     N080611       untrt       untrt  SRR1039516
## SRR1039517  GSM1275871     N080611         trt       untrt  SRR1039517
## SRR1039520  GSM1275874     N061011       untrt       untrt  SRR1039520
## SRR1039521  GSM1275875     N061011         trt       untrt  SRR1039521
##            avgLength  Experiment      Sample    BioSample
##            <integer> <character> <character>  <character>
## SRR1039508       126   SRX384345   SRS508568 SAMN02422669
## SRR1039509       126   SRX384346   SRS508567 SAMN02422675
## SRR1039512       126   SRX384349   SRS508571 SAMN02422678
## SRR1039513        87   SRX384350   SRS508572 SAMN02422670
## SRR1039516       120   SRX384353   SRS508575 SAMN02422682
## SRR1039517       126   SRX384354   SRS508576 SAMN02422673
## SRR1039520       101   SRX384357   SRS508579 SAMN02422683
## SRR1039521        98   SRX384358   SRS508580 SAMN02422677

The quantified gene expression values are usually normalized in the forms of FPKM and RPKM.

The FPKM (Fragment Per Kb per Million Fragments) value for a gene feature is defined by the following equation: \[\text{FPKM}=\frac{\text{Fragment count of the gene feature} \times 10^3\times10^6}{\text{Total fragment count of the library} \times \text{Length of the gene feature}}\]

Likewise, the RPKM (Read Per Kb per Million Reads) value for a gene feature is described by the equation below: \[\text{RPKM}=\frac{\text{Read count of the gene feature} \times 10^3\times10^6}{\text{Total read count of the library} \times \text{Length of the gene feature}}\]

Using the count assay generated above, compute the matrix of FPKM value, save it in the FPKM variable. Hint: The length of the gene features (exonic regions of genes) in the GRangeList of exons grouped by genes can be calculated by sum(width(GRangeList)).

You can write more than 1 line of code in "___" if you want.

FPKM <- (assay(se)*(10^9))/(sum(width(exbg)) %o% colSums(assay(se)))
head(FPKM)
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513  SRR1039516
## ENSG00000009724  1115.4227   818.9826  1630.0707   671.0197   997.16684
## ENSG00000116649 41121.2807 51219.7462 38666.2807 51223.0948 36440.82948
## ENSG00000120942  5094.2116  5961.0289  4581.2399  5609.0485  5084.34974
## ENSG00000120948 49861.2314 37566.1356 51134.3320 27864.0415 51588.66933
## ENSG00000171819   217.0385  2703.3829   867.4351 28063.6906    43.88736
## ENSG00000171824 12345.5127 15218.0357 13328.1799 14221.9858 10847.35866
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000009724   718.6062  1701.0012   746.6632
## ENSG00000116649 45952.6894 37621.8430 44451.9148
## ENSG00000120942  6488.2067  5964.0559  6604.6045
## ENSG00000120948 40216.0441 49610.8032 25485.4217
## ENSG00000171819   323.9873   936.6033 40907.9338
## ENSG00000171824 11918.3714 13102.1277 15960.7325

Modify the parameters used in summarizeOverlaps(), so that the resulting count matrix can be used to calculate the RPKM. Afterward, calculate the RPKM value and save it in the RPKM variable.

se2 <- summarizeOverlaps(features=exbg, reads=bamfiles, 
                         mode="Union",
                         singleEnd=TRUE,
                         ignore.strand=TRUE,
                         fragments=FALSE)
RPKM <- (assay(se2)*(10^9))/(sum(width(exbg)) %o% colSums(assay(se2)))
head(RPKM)
##                 SRR1039508_subset.bam SRR1039509_subset.bam
## ENSG00000009724             1024.7184              804.9903
## ENSG00000116649            40397.4700            50324.6007
## ENSG00000120942             5009.5336             5859.1849
## ENSG00000120948            48961.1245            36817.8068
## ENSG00000171819              213.4308             2657.1957
## ENSG00000171824            13397.6392            16349.4812
##                 SRR1039512_subset.bam SRR1039513_subset.bam
## ENSG00000009724             1458.2072              589.7619
## ENSG00000116649            37980.7069            50178.5376
## ENSG00000120942             4517.7217             5503.0495
## ENSG00000120948            50204.8351            27312.0640
## ENSG00000171819              853.5766            27533.3468
## ENSG00000171824            14467.9357            15433.5100
##                 SRR1039516_subset.bam SRR1039517_subset.bam
## ENSG00000009724             910.31222              672.9454
## ENSG00000116649           35825.76608            45179.4393
## ENSG00000120942            4989.24302             6387.5276
## ENSG00000120948           50602.63362            39496.1119
## ENSG00000171819              43.14661              318.9599
## ENSG00000171824           12042.49410            13023.6910
##                 SRR1039520_subset.bam SRR1039521_subset.bam
## ENSG00000009724             1585.0304              681.2869
## ENSG00000116649            36951.7991            43529.8942
## ENSG00000120942             5869.6544             6467.9492
## ENSG00000120948            48660.5916            24895.1609
## ENSG00000171819              921.7784            40055.2616
## ENSG00000171824            14265.2391            17288.8793
hist(assay(se2)/assay(se)) #Draw a histogram of ratios between 2 count matrixes

hist(RPKM/FPKM)

Question 7: What have you discovered from the ratio of two count matrices (assays)? Please explain the observed pattern. If you divide the RPKM matrix by the FPKM matrix, would you expect to see the same phenomenon?

Answer: The images tend to be normally distributed, this suggests that the two sets of values are similar or exhibit a consistent scaling factor. In other words, the fold-change distribution will be centered around 1. Also, I divide the RPKM matrix by the FPKM matrix and then get the similar phenomenon.

SessionInfo:

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.3 (2020-10-10)
##  os       Ubuntu 18.04.5 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Asia/Shanghai               
##  date     2023-03-17                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package              * version  date       lib source        
##  airway               * 1.10.0   2020-10-29 [2] Bioconductor  
##  AnnotationDbi        * 1.51.0   2020-04-27 [2] Bioconductor  
##  askpass                1.1      2019-01-13 [2] CRAN (R 4.0.0)
##  assertthat             0.2.1    2019-03-21 [2] CRAN (R 4.0.0)
##  Biobase              * 2.49.0   2020-04-27 [2] Bioconductor  
##  BiocFileCache          1.13.0   2020-04-27 [2] Bioconductor  
##  BiocGenerics         * 0.35.2   2020-05-06 [2] Bioconductor  
##  BiocParallel           1.23.0   2020-04-27 [2] Bioconductor  
##  biomaRt                2.45.0   2020-04-27 [2] Bioconductor  
##  Biostrings           * 2.57.0   2020-04-27 [2] Bioconductor  
##  bit                    1.1-15.2 2020-02-10 [2] CRAN (R 4.0.0)
##  bit64                  0.9-7    2017-05-08 [2] CRAN (R 4.0.0)
##  bitops                 1.0-6    2013-08-17 [2] CRAN (R 4.0.0)
##  blob                   1.2.1    2020-01-20 [2] CRAN (R 4.0.0)
##  cli                    2.0.2    2020-02-28 [2] CRAN (R 4.0.0)
##  crayon                 1.3.4    2017-09-16 [2] CRAN (R 4.0.0)
##  curl                   4.3      2019-12-02 [2] CRAN (R 4.0.0)
##  DBI                    1.1.0    2019-12-15 [2] CRAN (R 4.0.0)
##  dbplyr                 1.4.3    2020-04-19 [2] CRAN (R 4.0.0)
##  DelayedArray         * 0.15.1   2020-05-01 [2] Bioconductor  
##  digest                 0.6.25   2020-02-23 [2] CRAN (R 4.0.0)
##  dplyr                  0.8.5    2020-03-07 [2] CRAN (R 4.0.0)
##  ellipsis               0.3.0    2019-09-20 [2] CRAN (R 4.0.0)
##  evaluate               0.14     2019-05-28 [2] CRAN (R 4.0.0)
##  fansi                  0.4.1    2020-01-08 [2] CRAN (R 4.0.0)
##  GenomeInfoDb         * 1.25.0   2020-04-27 [2] Bioconductor  
##  GenomeInfoDbData       1.2.3    2020-05-12 [2] Bioconductor  
##  GenomicAlignments    * 1.25.0   2020-04-27 [2] Bioconductor  
##  GenomicFeatures      * 1.41.0   2020-04-27 [2] Bioconductor  
##  GenomicRanges        * 1.41.1   2020-05-02 [2] Bioconductor  
##  glue                   1.4.0    2020-04-03 [2] CRAN (R 4.0.0)
##  hms                    0.5.3    2020-01-08 [2] CRAN (R 4.0.0)
##  htmltools              0.4.0    2019-10-04 [2] CRAN (R 4.0.0)
##  httr                   1.4.1    2019-08-05 [2] CRAN (R 4.0.0)
##  IRanges              * 2.23.4   2020-05-03 [2] Bioconductor  
##  knitr                  1.28     2020-02-06 [2] CRAN (R 4.0.0)
##  lattice                0.20-41  2020-04-02 [4] CRAN (R 4.0.0)
##  lifecycle              0.2.0    2020-03-06 [2] CRAN (R 4.0.0)
##  magrittr               1.5      2014-11-22 [2] CRAN (R 4.0.0)
##  Matrix                 1.2-18   2019-11-27 [4] CRAN (R 4.0.0)
##  matrixStats          * 0.56.0   2020-03-13 [2] CRAN (R 4.0.0)
##  memoise                1.1.0    2017-04-21 [2] CRAN (R 4.0.0)
##  openssl                1.4.1    2019-07-18 [2] CRAN (R 4.0.0)
##  pillar                 1.4.4    2020-05-05 [2] CRAN (R 4.0.0)
##  pkgconfig              2.0.3    2019-09-22 [2] CRAN (R 4.0.0)
##  prettyunits            1.1.1    2020-01-24 [2] CRAN (R 4.0.0)
##  progress               1.2.2    2019-05-16 [2] CRAN (R 4.0.0)
##  purrr                  0.3.4    2020-04-17 [2] CRAN (R 4.0.0)
##  R6                     2.4.1    2019-11-12 [2] CRAN (R 4.0.0)
##  rappdirs               0.3.1    2016-03-28 [2] CRAN (R 4.0.0)
##  Rcpp                   1.0.4.6  2020-04-09 [2] CRAN (R 4.0.0)
##  RCurl                  1.98-1.2 2020-04-18 [2] CRAN (R 4.0.0)
##  rlang                  0.4.6    2020-05-02 [2] CRAN (R 4.0.0)
##  rmarkdown              2.1      2020-01-20 [2] CRAN (R 4.0.0)
##  Rsamtools            * 2.5.0    2020-04-27 [2] Bioconductor  
##  RSQLite                2.2.0    2020-01-07 [2] CRAN (R 4.0.0)
##  rtracklayer            1.49.1   2020-05-02 [2] Bioconductor  
##  S4Vectors            * 0.27.5   2020-05-06 [2] Bioconductor  
##  sessioninfo            1.1.1    2018-11-05 [2] CRAN (R 4.0.0)
##  stringi                1.4.6    2020-02-17 [2] CRAN (R 4.0.0)
##  stringr                1.4.0    2019-02-10 [2] CRAN (R 4.0.0)
##  SummarizedExperiment * 1.19.2   2020-05-01 [2] Bioconductor  
##  tibble                 3.0.1    2020-04-20 [2] CRAN (R 4.0.0)
##  tidyselect             1.1.0    2020-05-11 [2] CRAN (R 4.0.0)
##  vctrs                  0.3.0    2020-05-11 [2] CRAN (R 4.0.0)
##  withr                  2.2.0    2020-04-20 [2] CRAN (R 4.0.0)
##  xfun                   0.13     2020-04-13 [2] CRAN (R 4.0.0)
##  XML                    3.99-0.3 2020-01-20 [2] CRAN (R 4.0.0)
##  XVector              * 0.29.0   2020-04-27 [2] Bioconductor  
##  yaml                   2.2.1    2020-02-01 [2] CRAN (R 4.0.0)
##  zlibbioc               1.35.0   2020-04-27 [2] Bioconductor  
## 
## [1] /home/Huangtianchi.Zhu19/R/x86_64-pc-linux-gnu-library/4.0
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library