BIO214 will have 5 marked programming practicals that will together account for 30% of the final mark. The Rstudio Server is the required platform for completing these practicals, and students can log in using the account and URL provided on LMO.
Each practical will have a sample HTML file that contains the correct programming output. To complete the programming assignment, students need to fill in the gaps labeled with ___
in the RMD file and obtain the expected result.
Students will receive full scores for their programming questions if their code runs through and yields the standard output. Short answer questions are also included in the programming questions, and students need to type their answers at ___
and submit them together with the knitted report.
Once students have completed all the questions, they need to click the “Knit” button to generate an HTML report. If the report is generated successfully, they should download the HTML file by selecting the file and clicking “More” > “Export”. The submission of the practicals is done by uploading the HTML report to the assignment box on Learning Mall. The deadlines for submissions are set for two weeks after the corresponding lab schedule.
If students encounter coding problems, they can seek help from the following websites: Google, StackOverflow, and Biostars. These websites are excellent resources for proactive learning since most of their skills are developed when they struggle and solve problems in practice.
The first lab practical aims to achieve raw read download, read QC, read trimming, read alignment, and read quantification. However, before implementing the entire pipeline, it is necessary to become familiar with several basic bash commands.
To view the output for each command, you can click the rightmost button in the code chunk labeled “Run Current Chunk.” Alternatively, students can enter the bash commands in the Terminal next to the Console. The Terminal is the more standard way of working on Linux/Unix systems.
For this practical, you are required to type the bash command in the code chunk. It is essential to try to understand the code comments (after #) below because knowing bash commands is a fundamental skill for a bioinformatician.
## pwd can retrieve your current working directory
pwd
## /home/Huangtianchi.Zhu19/BIO214/LabPractical1
## ls can list all the files/folders in your current working directory
ls
## LabPractical1.html
## LabPractical1_Model.html
## LabPractical1.Rmd
## SRR1039508_filtered.bam
## SRR1039508.sam
## SRR1039508_sub_1.fastq
## SRR1039508_sub_1.fastq_trimming_report.txt
## SRR1039508_sub_1_val_1.fq
## SRR1039508_sub_2.fastq
## SRR1039508_sub_2_fastqc.html
## SRR1039508_sub_2_fastqc.zip
## SRR1039508_sub_2.fastq_trimming_report.txt
## SRR1039508_sub_2_val_2.fq
## SRR1039508_subset.bam
## SRR1039508_subset.bam.bai
## SRR1039508_unfiltered.bam
## Adding argument after ls can return more file information (file size, date of change, & file access)
ls -lh
## total 39M
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 712K Mar 17 18:15 LabPractical1.html
## -rw-r--rw- 1 root root 685K Mar 2 19:15 LabPractical1_Model.html
## -rw-r--rw- 1 root root 20K Mar 17 22:30 LabPractical1.Rmd
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2.5M Mar 17 18:14 SRR1039508_filtered.bam
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 11M Mar 17 18:14 SRR1039508.sam
## -rw-r--rw- 1 root root 5.1M Mar 2 19:15 SRR1039508_sub_1.fastq
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2.0K Mar 17 18:14 SRR1039508_sub_1.fastq_trimming_report.txt
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 5.1M Mar 17 18:14 SRR1039508_sub_1_val_1.fq
## -rw-r--rw- 1 root root 5.1M Mar 2 19:15 SRR1039508_sub_2.fastq
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 620K Mar 17 18:14 SRR1039508_sub_2_fastqc.html
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 309K Mar 17 18:14 SRR1039508_sub_2_fastqc.zip
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2.3K Mar 17 18:14 SRR1039508_sub_2.fastq_trimming_report.txt
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 5.1M Mar 17 18:14 SRR1039508_sub_2_val_2.fq
## -rw-r--rw- 1 root root 631K Mar 2 19:15 SRR1039508_subset.bam
## -rw-r--rw- 1 root root 6.9K Mar 2 19:15 SRR1039508_subset.bam.bai
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2.7M Mar 17 18:14 SRR1039508_unfiltered.bam
## Create a directory called testfolder with mkdir
mkdir ~/BIO214/LabPractical1/testfolder
## Check the current directory to see the newly created folder
ls
## LabPractical1.html
## LabPractical1_Model.html
## LabPractical1.Rmd
## SRR1039508_filtered.bam
## SRR1039508.sam
## SRR1039508_sub_1.fastq
## SRR1039508_sub_1.fastq_trimming_report.txt
## SRR1039508_sub_1_val_1.fq
## SRR1039508_sub_2.fastq
## SRR1039508_sub_2_fastqc.html
## SRR1039508_sub_2_fastqc.zip
## SRR1039508_sub_2.fastq_trimming_report.txt
## SRR1039508_sub_2_val_2.fq
## SRR1039508_subset.bam
## SRR1039508_subset.bam.bai
## SRR1039508_unfiltered.bam
## testfolder
“~
” at here is used to represent the home directory (/Users/YOUR_USER_NAME/
).
One can also use “.
” to represent the current directory, by doing so the mkdir command is changed to mkdir ./testfolder
. However, it is always safe to type the “full directory” rather than use “.
”, because it can avoid potential mistakes after changing the directory of the script.
## Create a text message with content 'This is my first file' and store it in a file named by 'HelloWorld.txt'
echo This is my first file > HelloWorld.txt
## Check the content of HelloWorld.txt
cat HelloWorld.txt
## This is my first file
## Move the file toward the folder with mv
mv HelloWorld.txt ~/BIO214/LabPractical1/testfolder
## Change the current working directory toward the folder with cd
cd ~/BIO214/LabPractical1/testfolder
## Re-check the existing files in current working directory
ls
## HelloWorld.txt
## Set the directory to ~/BIO214/LabPractical1
cd ~/BIO214/LabPractical1
## Remove the entire test folder with rm
rm -r ~/BIO214/LabPractical1/testfolder
## Re-check the content of the LabPractical1 directory, see that the testfolder has been removed
ls
## LabPractical1.html
## LabPractical1_Model.html
## LabPractical1.Rmd
## SRR1039508_filtered.bam
## SRR1039508.sam
## SRR1039508_sub_1.fastq
## SRR1039508_sub_1.fastq_trimming_report.txt
## SRR1039508_sub_1_val_1.fq
## SRR1039508_sub_2.fastq
## SRR1039508_sub_2_fastqc.html
## SRR1039508_sub_2_fastqc.zip
## SRR1039508_sub_2.fastq_trimming_report.txt
## SRR1039508_sub_2_val_2.fq
## SRR1039508_subset.bam
## SRR1039508_subset.bam.bai
## SRR1039508_unfiltered.bam
Please be careful when running rm
, because everything deleted by rm
is NOT recoverable.
To get more knowledge on linux bash usage, go visit this link.
Here we will download the RNA-Seq sample SRR1039508 from GEO.
# fastq-dump --split-3 SRR1039508
Adding argument --split-3
will ensure the paired-end library being split into 2 fastq files.
The download is time consuming, so we will skip it in this practice. If you accidentally type the download command in terminal, it will likely to take hours to finish. Please then use ‘Ctrl + C’ to cancel the running process in the terminal.
We can use nohup fastq-dump --split-3 SRR1039508 > SRR1039508_dld.out &
to download the fastq file in the backstage. By doing so, multiple download processes can be submitted at once. P.S. nohup command > xxx.out &
can be used to send any bash command to run at the backstage.
The files already presented in your LabPractical1
directory contains the first 20000 read pairs for each of the downloaded fastq files:
SRR1039508_sub_1.fastq
is the forward read.
SRR1039508_sub_2.fastq
is the reverse read.
Please fill the following code chunk to check the first 12 lines of the file SRR1039508_sub_1.fastq
.
head -n 12 SRR1039508_sub_1.fastq
## @SRR1039508.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130 length=63
## CATTGCTGATACCAANNNNNNNNGCATTCCTCAAGGTCTTCCTCCTTCCCTTACGGAATTACA
## +SRR1039508.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130 length=63
## HJJJJJJJJJJJJJJ########00?GHIJJJJJJJIJJJJJJJJJJJJJJJJJHHHFFFFFD
## @SRR1039508.2 HWI-ST177:290:C0TECACXX:1:1101:1311:2131 length=63
## CCCTGGACTGCTTCTTGAAAAGTGCCATCCAAACTCTATCTTTGGGGAGAGTATGATAGAGAT
## +SRR1039508.2 HWI-ST177:290:C0TECACXX:1:1101:1311:2131 length=63
## HJJJJJJJJJJJJJJJJJIIJIGHIJJJJJJJJJJJJJJJJJJJJJJGHHIDHIJJHHHHHHF
## @SRR1039508.3 HWI-ST177:290:C0TECACXX:1:1101:1414:2163 length=63
## TCGATCCATCGATTGGAAGGCACTGATCTGGACTGTCAGGTTGGTGGTCTTATTTGCAAGTCC
## +SRR1039508.3 HWI-ST177:290:C0TECACXX:1:1101:1414:2163 length=63
## HJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJGJJIGHIJJBGIJCGIAHIJHHHHHHHFFFFF
Here we see three reads, with names, sequence, +, then quality score.
We will try to find the position of the third read using a simple BLAST search to the human genome (build hg19 / GRCh37).
On the BLAST search page, under Choose Search Set > Database > choose Genome (GRCh37.p13 reference assembly…)
Question 2.1: The third read aligns uniquely to chromosome 16 of the human genome, falling in the second exon of the gene DHX38, which is on the plus strand (transcribed from left to right). What is the genomic position to which the first basepair of the third read aligns? (Use human genome build: hg19 / GRCh37)
Your Answer: 72,130,180
Next, check the first 12 lines of the file SRR1039508_sub_2.fastq
.
head -n 12 SRR1039508_sub_2.fastq
## @SRR1039508.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130 length=63
## CAGATGAGGCGTGTTGGCCAGAGAGCCATTGTCAACAGCAGAGATGNNNNNNNNNNNNAATCC
## +SRR1039508.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130 length=63
## HJJJJJJJJJJHIIIJJJJJJJJJJJJJJJJJJJJJJJHIJIJHII#################
## @SRR1039508.2 HWI-ST177:290:C0TECACXX:1:1101:1311:2131 length=63
## TACTCCGGAGAACAGATGGGATTCCCTAGGAGACCCTTGAGGGAAAAGGGAGCCCCAATCTCT
## +SRR1039508.2 HWI-ST177:290:C0TECACXX:1:1101:1311:2131 length=63
## FJJJJJJJFHEHJJJHIIJJGGIIJJGIIJGJHJJJJJHGIJJIGIHHHHFFFDDDDDDDDDE
## @SRR1039508.3 HWI-ST177:290:C0TECACXX:1:1101:1414:2163 length=63
## TCGCTCTCTCCGTTTCAGGGAAGCCAGCAAGTCCAGTCCGAGTAATGAAGGGCGGGGAGCAGG
## +SRR1039508.3 HWI-ST177:290:C0TECACXX:1:1101:1414:2163 length=63
## HJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJIJJJJJJFHIJFHJJJJJIHHHFDDDDDDDDD
Question 2.2: The third read in the second file also aligns uniquely to chromosome 16 of the human genome, aligning to the minus strand, also in the second exon of the gene DHX38. What is the genomic position to which the first basepair of the third read in the second file aligns?
Your Answer: 72,130,081
Above we saw the two reads of a paired-end fragment, both aligning within an exon of a human gene.
For a strand-specific RNA-sequencing protocol, we could have seen this kind of alignment to a + strand gene:
[1st read + strand] … [2nd read - strand]
…or we could have seen the 1st read aligning on the right side:
[2nd read + strand] … [1st read - strand]
A strand-specific protocol means that we only observe fragments with the same strand as the gene.
However, many experiments, including the one we are examining are not strand-specific. This means that, for a plus strand gene, we will observe, in addition to the above two kinds of paired-end fragments, two more:
[1st read - strand] … [2nd read + strand]
[2nd read - strand] … [1st read + strand]
Subsequently, we are going to check the qualities of the fastq file SRR1039508_sub_2.fastq
with fastqc.
Please type your code below; if you run it successfully, you can find the output a HTML report in the current directory.
fastqc SRR1039508_sub_2.fastq
## Started analysis of SRR1039508_sub_2.fastq
## Approx 5% complete for SRR1039508_sub_2.fastq
## Approx 10% complete for SRR1039508_sub_2.fastq
## Approx 15% complete for SRR1039508_sub_2.fastq
## Approx 20% complete for SRR1039508_sub_2.fastq
## Approx 25% complete for SRR1039508_sub_2.fastq
## Approx 30% complete for SRR1039508_sub_2.fastq
## Approx 35% complete for SRR1039508_sub_2.fastq
## Approx 40% complete for SRR1039508_sub_2.fastq
## Approx 45% complete for SRR1039508_sub_2.fastq
## Approx 50% complete for SRR1039508_sub_2.fastq
## Approx 55% complete for SRR1039508_sub_2.fastq
## Approx 60% complete for SRR1039508_sub_2.fastq
## Approx 65% complete for SRR1039508_sub_2.fastq
## Approx 70% complete for SRR1039508_sub_2.fastq
## Approx 75% complete for SRR1039508_sub_2.fastq
## Approx 85% complete for SRR1039508_sub_2.fastq
## Approx 90% complete for SRR1039508_sub_2.fastq
## Approx 100% complete for SRR1039508_sub_2.fastq
## Analysis complete for SRR1039508_sub_2.fastq
The directory can be navigated with the “Files” panel on the right. Click and open the HTML report in a new page.
Question 3.1: Which basepair has the lowest 10% quality score?
Your Answer: 63
Question 3.2: What base is most common in the first position of the read?
Your Answer: T
Next, using the code demonstrated in the course slides, apply Trim Galore to cut the paired end reads.
trim_galore --paired SRR1039508_sub_1.fastq SRR1039508_sub_2.fastq
Question 3.3: What percentage of the read-pairs are dropped after running trim_galore? (infer it from trim_galore stdout)
Your Answer: 0.92%
The hisat2 index of hg19 can be directly downloaded from the hisat2 webpage
I store the previously downloaded index under the directory of /data/hg19_idx/genome
.
Next, you need to align the pairs of “trimmed reads” using the code demonstrated in the lecture, name the output SAM file as SRR1039508.sam
hisat2 -x /data/hg19_idx/genome -1 SRR1039508_sub_1_val_1.fq -2 SRR1039508_sub_2_val_2.fq -S SRR1039508.sam
## 19815 reads; of these:
## 19815 (100.00%) were paired; of these:
## 796 (4.02%) aligned concordantly 0 times
## 18162 (91.66%) aligned concordantly exactly 1 time
## 857 (4.33%) aligned concordantly >1 times
## ----
## 796 pairs aligned concordantly 0 times; of these:
## 152 (19.10%) aligned discordantly 1 time
## ----
## 644 pairs aligned 0 times concordantly or discordantly; of these:
## 1288 mates make up the pairs; of these:
## 699 (54.27%) aligned 0 times
## 512 (39.75%) aligned exactly 1 time
## 77 (5.98%) aligned >1 times
## 98.24% overall alignment rate
The mapping efficiency is a crucial statistics for NGS data analysis; a high mapping rate (> 80 %) usually indicate a fine library quality.
It is defined as: \[\text{Mapping efficiency}= \frac{\text{# of uniquely best mapped reads}}{\text{# of total reads}} \]
In case of the alignment of paired-end reads, the numerator of interest is the number of uniquely best concordant aligned reads.
Question 4.1: Please enter the mapping efficiency returned by the hisat2 report.
Your Answer: 98.24%
The SAM format and its compressed version, BAM, contains information about the alignment of sequences to a reference genome.
Section 1.3 describes the header of the SAM file.
Check the head and end of the SAM file generated in last section with head
and tail
:
head SRR1039508.sam
tail SRR1039508.sam
## @HD VN:1.0 SO:unsorted
## @SQ SN:chr1 LN:249250621
## @SQ SN:chr10 LN:135534747
## @SQ SN:chr11 LN:135006516
## @SQ SN:chr11_gl000202_random LN:40103
## @SQ SN:chr12 LN:133851895
## @SQ SN:chr13 LN:115169878
## @SQ SN:chr14 LN:107349540
## @SQ SN:chr15 LN:102531392
## @SQ SN:chr16 LN:90354753
## SRR1039508.19996 99 chr10 45953638 60 62M = 45953676 101 TTCCTTCAGCTCTTCATGGATGTCCAGGAAAATGACAACCCGCACACAATCAGACGTGAATG FE@FFGEGGIIAHDHI>BE@DDEEEBFHGIIIEG@FHGED@FGIGIHGIIIIBACABB?BB@ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:62 YS:i:0 YT:Z:CP NH:i:1
## SRR1039508.19996 147 chr10 45953676 60 63M = 45953638 -101 CCCGCACACAATCAGACGTGAATGATTTCTGCTCCAGTGTCTTCAGGCTCTGTGCAACAAGAA DFHHA>EDCEGGGCFFC==@GB8?HIJHGHGAGFD>GGCEIGGDDGHGF?FAHHGCFCFHHFB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:63 YS:i:0 YT:Z:CP NH:i:1
## SRR1039508.19997 83 chr9 4741096 60 63M = 4740970 -189 ACTGAGGCCCGCACCGCGCGGGTACCAGGGCTTTGGCCTGGCCTGCGCGCTCACCCGCTCGGC ADCBDBBBDBD<5DDDDDDDDC>3DBCDDDDDDDDDDDDDDDDDDDHHIIGH@JJJJJJJJJH AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:63 YS:i:0 YT:Z:CP NH:i:1
## SRR1039508.19997 163 chr9 4740970 60 63M = 4741096 189 CCCCGCTGGAGAGGTGCTTCAGCTCGAAGTGTGTAGTGATGCGCGACGACACGGTGCCCTTGC HJJJJJJJJHIHJJBGHJJJJJJJJJHIJDG@GHIIGIIHHHHFFDDDDDDDDD@DBDDDDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:63 YS:i:0 YT:Z:CP NH:i:1
## SRR1039508.19998 83 chr7 140154370 60 62M = 140154314 -118 CATCTTCAGAGTCTGTTAGTTCGTCGTCCCCACCTGCAGCCAAAAGCATAAGCAACATCTCG ::BBBB@:((<3>AAA>2B?>DB=EA55</0(B?3AD?:<FFFGFFGCGBED@?HFC1E1BF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:62 YS:i:0 YT:Z:CP NH:i:1
## SRR1039508.19998 163 chr7 140154314 60 63M = 140154370 118 TTGCTATAGATCCAAGTCATAAAAATCTTCCAGCTCATCATGAAACAAGTCCCACTCATCTTC :??EFGGGG+AFBHD?9?C>DDFHJID<HIIJAEGIEIFHCF@EBFGCG@DHGGGHGHHHHHH AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:63 YS:i:0 YT:Z:CP NH:i:1
## SRR1039508.19999 83 chr10 64957302 60 22M949N41M = 64957229 -136 GCTCTTTTCCAGGCAATTTTGGCATCCTTTTTCACCCAGGACAAAGCTGTTTTTTCAGATGTT FFFFHHHHHJJJJJJJJJJIJJJJJJJJJJIGGDJJJJJHGJJJJJJJJJJJJJJJJJJJJJH AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:63 YS:i:0 YT:Z:CP XS:A:- NH:i:1
## SRR1039508.19999 163 chr10 64957229 60 63M = 64957302 136 TTTTTGGCAGACCCAGTGAATGTTAAACAATGTTGCTTCACATGCATCACACATCTCCCGGAC HJJJJJJJJJJJJJJJIJJJJJIJJJJJJJJJHIJJJJHIJJJJJJJJJJJJJJJJJJJHHFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:63 YS:i:0 YT:Z:CP NH:i:1
## SRR1039508.20000 99 chr7 128399866 60 41M7603N22M = 128407696 290 TAAGAATGCTGATGGTTTCATTGATCTAGAAGAGTATATTGGTGACATGTACAGCCATGATGG HJJJJJJJJJJIJJJHJJJJJJIJJJJJJIJJJJFHIJJJJJIIJJIJJIIIIJIJJJJIJJJ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:63 YS:i:0 YT:Z:CP XS:A:+ NH:i:1
## SRR1039508.20000 147 chr7 128407696 60 14M1407N49M = 128399866 -290 CAGACCAAAACAAGGATGGCAAGCTTACCAAGGAGGAGATCGTTGACAAGTATGACTTATTTG FHHHHIJJJHJJJJJJJJJJJJIJIJHJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJJJJH AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:63 YS:i:0 YT:Z:CP XS:A:+ NH:i:1
Question 5.1: What is the length of chr10 according to the SAM file?
Answer: 135,534,747
The CIGAR string tells us how the read aligned to the genome, using numbers and letters, M for match, I for insertion, D for deletion, N for skipped bases on the reference genome.
Question 5.2: What is the CIGAR string for the last read? Try to interpret on the presence of a long gap region reflected by the CIGAR string.
Answer: chr11_gl000202_random A CIGAR string is a way of representing how a query sequence aligns to a reference genome1. It uses different operators to indicate matches, mismatches, insertions, deletions and gaps2. A long gap region in a CIGAR string means that there is a large distance between two aligned segments of the query sequence on the reference genome3. This could be due to splicing events or structural variations3.
Bitwise FLAG is an integer which encodes possible YES/NO answers to the choices below. The bitwise FLAG is most easily decoded at this link, Explain SAM Flags, for the PICARD software.
Question 5.3: Refer to the PICARD webpage, explain the SAM FLAG of the last line of read.
Answer: 147 read paired, read mapped in proper pair, read reverse strand, second in pair
SAM can be converted into BAM using samtools
:
samtools view -b SRR1039508.sam -o SRR1039508_unfiltered.bam
A filter for FLAG score can be passed with -F INT
argument, and the output will only include reads with none of the FLAGS in INT present.
Findout the FLAG score that have the following attributes, and then pass it to the following code chunk.
samtools view -F 3852 -b SRR1039508.sam -o SRR1039508_filtered.bam
ls -l SRR1039508_unfiltered.bam
ls -l SRR1039508_filtered.bam
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2777669 Mar 17 22:31 SRR1039508_unfiltered.bam
## -rw-r--r-- 1 Huangtianchi.Zhu19 Huangtianchi.Zhu19 2598575 Mar 17 22:31 SRR1039508_filtered.bam
Question 5.4: Compare the file sizes (use ls -l
) with and without the FLAG filter, what proportion of the reads are dropped?
Answer: 179,094
In this assessment we will use IGV to visualize the BAM file SRR1039508_subset.bam
under your current directory. Make sure to download the BAM and its index (+.bai
) into your own computer. The download will be triggered by selecting the file and clicking More > Export.
IGV can be downloaded from this link.
Once you have started IGV, you may have to click “More…” under genomes to get hg19.
Zoom in to gene SRM which is within the range chr1:11,112,000-11,122,000.
Question 6.1: Which strand is this gene: plus or minus?
Answer: minus
Question 6.2: In the first exon (rightmost) there is an alternate allele (C) to the reference (T). What is the genomic location of this SNP? (The number of the basepair only, don’t include the chromosome.)
Answer: 11,119,978
Look up the region around the SNP: chr1:11,119,891-11,119,910 in the UCSC Genome Browser (hg19).
You can right click and “Show details” for objects in the UCSC Genome Browser.
This appears to be a known and synonymous variant (meaning it does not change the amino acid sequence in the resulting protein because both codons code for the same amino acid: S (TCA) –> S (TCG).
Back in IGV, right click on the BAM file track and choose Squished. We want to count how many reads are in a particular intron. First we will make sure that we see all the possible reads.
Click View > Preferences > Alignments tab. At the top, make sure that “Downsample reads” is NOT checked. Later if you are using IGV for your own work, you may want to check this again, as it speeds up the visualization. But for this question we want to see all the reads.
Question 6.3: How many reads are in the first intron (the rightmost intron)? (Just count by eye the number of reads falling entirely in this intron.)
Answer: 4
In this question, we want to quantify gene expression levels by counting reads from a set of aligned RNA-Seq samples. The codes in the chunk bellow are R codes, as we will use the SummarizedOverlaps
function in R to perform read count.
library(airway) #The airway package is used here since it contains a collection of exemplar RNA-seq BAM files
dir <- system.file("extdata", package="airway", mustWork=TRUE)
list.files(dir) #To see all the example data stored in airway package
## [1] "GSE52778_series_matrix.txt" "Homo_sapiens.GRCh37.75_subset.gtf"
## [3] "quants" "sample_table.csv"
## [5] "SraRunInfo_SRP033351.csv" "SRR1039508_subset.bam"
## [7] "SRR1039509_subset.bam" "SRR1039512_subset.bam"
## [9] "SRR1039513_subset.bam" "SRR1039516_subset.bam"
## [11] "SRR1039517_subset.bam" "SRR1039520_subset.bam"
## [13] "SRR1039521_subset.bam"
csvfile <- file.path(dir,"sample_table.csv")
(sampleTable <- read.csv(csvfile,row.names=1)) #Load and examine the table for the sample information
## SampleName cell dex albut Run avgLength Experiment
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 SRX384358
## Sample BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
## SRR1039520 SRS508579 SAMN02422683
## SRR1039521 SRS508580 SAMN02422677
filenames <- file.path(dir, paste0(sampleTable$Run, "_subset.bam"))
file.exists(filenames)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
After checking the directories of the BAM files on the genome, we will use Rsamtools package to scan all the BAM files and combine them into a BamFileList
object.
library(Rsamtools)
bamfiles <- BamFileList(filenames, yieldSize=2000000) #yieldSize argument can control memory usage
seqinfo(bamfiles[1]) #Checking the sequence info stored in the first BAM files
## Seqinfo object with 84 sequences from an unspecified genome:
## seqnames seqlengths isCircular genome
## 1 249250621 <NA> <NA>
## 10 135534747 <NA> <NA>
## 11 135006516 <NA> <NA>
## 12 133851895 <NA> <NA>
## 13 115169878 <NA> <NA>
## ... ... ... ...
## GL000210.1 27682 <NA> <NA>
## GL000231.1 27386 <NA> <NA>
## GL000229.1 19913 <NA> <NA>
## GL000226.1 15008 <NA> <NA>
## GL000207.1 4262 <NA> <NA>
Next, we will make an annotation object (TxDb) directly from the GTF files using makeTxDbFromGFF()
. Again, the GRangesList
containing the exons grouped by genes is extracted by the function exonsBy()
.
library(GenomicFeatures)
library(GenomicAlignments)
gtffile <- file.path(dir,"Homo_sapiens.GRCh37.75_subset.gtf")
(txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character()))
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: /usr/local/lib/R/site-library/airway/extdata/Homo_sapiens.GRCh37.75_subset.gtf
## # Organism: NA
## # Taxonomy ID: NA
## # miRBase build ID: NA
## # Genome: NA
## # Nb of transcripts: 65
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2023-03-17 22:31:26 +0800 (Fri, 17 Mar 2023)
## # GenomicFeatures version at creation time: 1.41.0
## # RSQLite version at creation time: 2.2.0
## # DBSCHEMAVERSION: 1.2
(exbg <- exonsBy(txdb, by="gene")) #GrangesList object stores the gene features
## GRangesList object of length 20:
## $ENSG00000009724
## GRanges object with 18 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] 1 11086580-11087705 - | 98 ENSE00000818830
## [2] 1 11090233-11090307 - | 99 ENSE00000472123
## [3] 1 11090805-11090939 - | 100 ENSE00000743084
## [4] 1 11094885-11094963 - | 101 ENSE00000743085
## [5] 1 11097750-11097868 - | 102 ENSE00003482788
## ... ... ... ... . ... ...
## [14] 1 11106948-11107176 - | 111 ENSE00003467404
## [15] 1 11106948-11107176 - | 112 ENSE00003489217
## [16] 1 11107260-11107280 - | 113 ENSE00001833377
## [17] 1 11107260-11107284 - | 114 ENSE00001472289
## [18] 1 11107260-11107290 - | 115 ENSE00001881401
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## ...
## <19 more elements>
exbg #Try to understand the structure of this object; each row/ranges is an exon, and $`X` is a gene id.
## GRangesList object of length 20:
## $ENSG00000009724
## GRanges object with 18 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] 1 11086580-11087705 - | 98 ENSE00000818830
## [2] 1 11090233-11090307 - | 99 ENSE00000472123
## [3] 1 11090805-11090939 - | 100 ENSE00000743084
## [4] 1 11094885-11094963 - | 101 ENSE00000743085
## [5] 1 11097750-11097868 - | 102 ENSE00003482788
## ... ... ... ... . ... ...
## [14] 1 11106948-11107176 - | 111 ENSE00003467404
## [15] 1 11106948-11107176 - | 112 ENSE00003489217
## [16] 1 11107260-11107280 - | 113 ENSE00001833377
## [17] 1 11107260-11107284 - | 114 ENSE00001472289
## [18] 1 11107260-11107290 - | 115 ENSE00001881401
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## ...
## <19 more elements>
The BamFileList, together with the GRangesList for gene features is now passed to the summarizedOverlaps()
function. The function is an R equivalent implementation of the HTSeq package. You can type ?summarizedOverlaps()
in the console to see the documentation on its arguments.
se <- summarizeOverlaps(features=exbg, # The gene features of interest, in our case exonic regions of genes.
reads=bamfiles, # The BAM file list for the aligned reads
mode="Union", # describes the overlapping rule applied during read count.
singleEnd=FALSE, # Should the read being treated as single end
ignore.strand=TRUE, # Should the read count ignoring strands (FALSE for strand specific library only)
fragments=TRUE ) # Should read be counted as fragment (TRUE for paired-end only)
The output of the summarizedOverlaps()
is a SummarizedExperiment object. It is a very powerful R object for genomics having a matrix form (associated with dim
). The row features / gene features in SummarizedExperiment can be accessed with rowRanges(se)
, and the count matrix can be accessed with assay(se)
.
dim(se) ## 20 genes, 8 samples / libraries
## [1] 20 8
head(assay(se), 3) ## See the count of the first 3 rows
## SRR1039508_subset.bam SRR1039509_subset.bam
## ENSG00000009724 38 28
## ENSG00000116649 1004 1255
## ENSG00000120942 218 256
## SRR1039512_subset.bam SRR1039513_subset.bam
## ENSG00000009724 66 24
## ENSG00000116649 1122 1313
## ENSG00000120942 233 252
## SRR1039516_subset.bam SRR1039517_subset.bam
## ENSG00000009724 42 41
## ENSG00000116649 1100 1879
## ENSG00000120942 269 465
## SRR1039520_subset.bam SRR1039521_subset.bam
## ENSG00000009724 47 36
## ENSG00000116649 745 1536
## ENSG00000120942 207 400
colSums(assay(se)) ## Calculate sum of reads in each sample / library
## SRR1039508_subset.bam SRR1039509_subset.bam SRR1039512_subset.bam
## 6478 6501 7699
## SRR1039513_subset.bam SRR1039516_subset.bam SRR1039517_subset.bam
## 6801 8009 10849
## SRR1039520_subset.bam SRR1039521_subset.bam
## 5254 9168
rowRanges(se) ## The GRangesList for the row gene features
## GRangesList object of length 20:
## $ENSG00000009724
## GRanges object with 18 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] 1 11086580-11087705 - | 98 ENSE00000818830
## [2] 1 11090233-11090307 - | 99 ENSE00000472123
## [3] 1 11090805-11090939 - | 100 ENSE00000743084
## [4] 1 11094885-11094963 - | 101 ENSE00000743085
## [5] 1 11097750-11097868 - | 102 ENSE00003482788
## ... ... ... ... . ... ...
## [14] 1 11106948-11107176 - | 111 ENSE00003467404
## [15] 1 11106948-11107176 - | 112 ENSE00003489217
## [16] 1 11107260-11107280 - | 113 ENSE00001833377
## [17] 1 11107260-11107284 - | 114 ENSE00001472289
## [18] 1 11107260-11107290 - | 115 ENSE00001881401
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## ...
## <19 more elements>
colData(se) ## The table for the experimental design (column data table)
## DataFrame with 8 rows and 0 columns
(colData(se) <- DataFrame(sampleTable))
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run
## <character> <character> <character> <character> <character>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521
## avgLength Experiment Sample BioSample
## <integer> <character> <character> <character>
## SRR1039508 126 SRX384345 SRS508568 SAMN02422669
## SRR1039509 126 SRX384346 SRS508567 SAMN02422675
## SRR1039512 126 SRX384349 SRS508571 SAMN02422678
## SRR1039513 87 SRX384350 SRS508572 SAMN02422670
## SRR1039516 120 SRX384353 SRS508575 SAMN02422682
## SRR1039517 126 SRX384354 SRS508576 SAMN02422673
## SRR1039520 101 SRX384357 SRS508579 SAMN02422683
## SRR1039521 98 SRX384358 SRS508580 SAMN02422677
The quantified gene expression values are usually normalized in the forms of FPKM and RPKM.
The FPKM (Fragment Per Kb per Million Fragments) value for a gene feature is defined by the following equation: \[\text{FPKM}=\frac{\text{Fragment count of the gene feature} \times 10^3\times10^6}{\text{Total fragment count of the library} \times \text{Length of the gene feature}}\]
Likewise, the RPKM (Read Per Kb per Million Reads) value for a gene feature is described by the equation below: \[\text{RPKM}=\frac{\text{Read count of the gene feature} \times 10^3\times10^6}{\text{Total read count of the library} \times \text{Length of the gene feature}}\]
Using the count assay generated above, compute the matrix of FPKM value, save it in the FPKM
variable. Hint: The length of the gene features (exonic regions of genes) in the GRangeList
of exons grouped by genes can be calculated by sum(width(GRangeList))
.
You can write more than 1 line of code in "___" if you want.
FPKM <- (assay(se)*(10^9))/(sum(width(exbg)) %o% colSums(assay(se)))
head(FPKM)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000009724 1115.4227 818.9826 1630.0707 671.0197 997.16684
## ENSG00000116649 41121.2807 51219.7462 38666.2807 51223.0948 36440.82948
## ENSG00000120942 5094.2116 5961.0289 4581.2399 5609.0485 5084.34974
## ENSG00000120948 49861.2314 37566.1356 51134.3320 27864.0415 51588.66933
## ENSG00000171819 217.0385 2703.3829 867.4351 28063.6906 43.88736
## ENSG00000171824 12345.5127 15218.0357 13328.1799 14221.9858 10847.35866
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000009724 718.6062 1701.0012 746.6632
## ENSG00000116649 45952.6894 37621.8430 44451.9148
## ENSG00000120942 6488.2067 5964.0559 6604.6045
## ENSG00000120948 40216.0441 49610.8032 25485.4217
## ENSG00000171819 323.9873 936.6033 40907.9338
## ENSG00000171824 11918.3714 13102.1277 15960.7325
Modify the parameters used in summarizeOverlaps()
, so that the resulting count matrix can be used to calculate the RPKM. Afterward, calculate the RPKM value and save it in the RPKM
variable.
se2 <- summarizeOverlaps(features=exbg, reads=bamfiles,
mode="Union",
singleEnd=TRUE,
ignore.strand=TRUE,
fragments=FALSE)
RPKM <- (assay(se2)*(10^9))/(sum(width(exbg)) %o% colSums(assay(se2)))
head(RPKM)
## SRR1039508_subset.bam SRR1039509_subset.bam
## ENSG00000009724 1024.7184 804.9903
## ENSG00000116649 40397.4700 50324.6007
## ENSG00000120942 5009.5336 5859.1849
## ENSG00000120948 48961.1245 36817.8068
## ENSG00000171819 213.4308 2657.1957
## ENSG00000171824 13397.6392 16349.4812
## SRR1039512_subset.bam SRR1039513_subset.bam
## ENSG00000009724 1458.2072 589.7619
## ENSG00000116649 37980.7069 50178.5376
## ENSG00000120942 4517.7217 5503.0495
## ENSG00000120948 50204.8351 27312.0640
## ENSG00000171819 853.5766 27533.3468
## ENSG00000171824 14467.9357 15433.5100
## SRR1039516_subset.bam SRR1039517_subset.bam
## ENSG00000009724 910.31222 672.9454
## ENSG00000116649 35825.76608 45179.4393
## ENSG00000120942 4989.24302 6387.5276
## ENSG00000120948 50602.63362 39496.1119
## ENSG00000171819 43.14661 318.9599
## ENSG00000171824 12042.49410 13023.6910
## SRR1039520_subset.bam SRR1039521_subset.bam
## ENSG00000009724 1585.0304 681.2869
## ENSG00000116649 36951.7991 43529.8942
## ENSG00000120942 5869.6544 6467.9492
## ENSG00000120948 48660.5916 24895.1609
## ENSG00000171819 921.7784 40055.2616
## ENSG00000171824 14265.2391 17288.8793
hist(assay(se2)/assay(se)) #Draw a histogram of ratios between 2 count matrixes
hist(RPKM/FPKM)
Question 7: What have you discovered from the ratio of two count matrices (assays)? Please explain the observed pattern. If you divide the RPKM matrix by the FPKM matrix, would you expect to see the same phenomenon?
Answer: The images tend to be normally distributed, this suggests that the two sets of values are similar or exhibit a consistent scaling factor. In other words, the fold-change distribution will be centered around 1. Also, I divide the RPKM matrix by the FPKM matrix and then get the similar phenomenon.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.3 (2020-10-10)
## os Ubuntu 18.04.5 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Asia/Shanghai
## date 2023-03-17
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## airway * 1.10.0 2020-10-29 [2] Bioconductor
## AnnotationDbi * 1.51.0 2020-04-27 [2] Bioconductor
## askpass 1.1 2019-01-13 [2] CRAN (R 4.0.0)
## assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.0.0)
## Biobase * 2.49.0 2020-04-27 [2] Bioconductor
## BiocFileCache 1.13.0 2020-04-27 [2] Bioconductor
## BiocGenerics * 0.35.2 2020-05-06 [2] Bioconductor
## BiocParallel 1.23.0 2020-04-27 [2] Bioconductor
## biomaRt 2.45.0 2020-04-27 [2] Bioconductor
## Biostrings * 2.57.0 2020-04-27 [2] Bioconductor
## bit 1.1-15.2 2020-02-10 [2] CRAN (R 4.0.0)
## bit64 0.9-7 2017-05-08 [2] CRAN (R 4.0.0)
## bitops 1.0-6 2013-08-17 [2] CRAN (R 4.0.0)
## blob 1.2.1 2020-01-20 [2] CRAN (R 4.0.0)
## cli 2.0.2 2020-02-28 [2] CRAN (R 4.0.0)
## crayon 1.3.4 2017-09-16 [2] CRAN (R 4.0.0)
## curl 4.3 2019-12-02 [2] CRAN (R 4.0.0)
## DBI 1.1.0 2019-12-15 [2] CRAN (R 4.0.0)
## dbplyr 1.4.3 2020-04-19 [2] CRAN (R 4.0.0)
## DelayedArray * 0.15.1 2020-05-01 [2] Bioconductor
## digest 0.6.25 2020-02-23 [2] CRAN (R 4.0.0)
## dplyr 0.8.5 2020-03-07 [2] CRAN (R 4.0.0)
## ellipsis 0.3.0 2019-09-20 [2] CRAN (R 4.0.0)
## evaluate 0.14 2019-05-28 [2] CRAN (R 4.0.0)
## fansi 0.4.1 2020-01-08 [2] CRAN (R 4.0.0)
## GenomeInfoDb * 1.25.0 2020-04-27 [2] Bioconductor
## GenomeInfoDbData 1.2.3 2020-05-12 [2] Bioconductor
## GenomicAlignments * 1.25.0 2020-04-27 [2] Bioconductor
## GenomicFeatures * 1.41.0 2020-04-27 [2] Bioconductor
## GenomicRanges * 1.41.1 2020-05-02 [2] Bioconductor
## glue 1.4.0 2020-04-03 [2] CRAN (R 4.0.0)
## hms 0.5.3 2020-01-08 [2] CRAN (R 4.0.0)
## htmltools 0.4.0 2019-10-04 [2] CRAN (R 4.0.0)
## httr 1.4.1 2019-08-05 [2] CRAN (R 4.0.0)
## IRanges * 2.23.4 2020-05-03 [2] Bioconductor
## knitr 1.28 2020-02-06 [2] CRAN (R 4.0.0)
## lattice 0.20-41 2020-04-02 [4] CRAN (R 4.0.0)
## lifecycle 0.2.0 2020-03-06 [2] CRAN (R 4.0.0)
## magrittr 1.5 2014-11-22 [2] CRAN (R 4.0.0)
## Matrix 1.2-18 2019-11-27 [4] CRAN (R 4.0.0)
## matrixStats * 0.56.0 2020-03-13 [2] CRAN (R 4.0.0)
## memoise 1.1.0 2017-04-21 [2] CRAN (R 4.0.0)
## openssl 1.4.1 2019-07-18 [2] CRAN (R 4.0.0)
## pillar 1.4.4 2020-05-05 [2] CRAN (R 4.0.0)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.0.0)
## prettyunits 1.1.1 2020-01-24 [2] CRAN (R 4.0.0)
## progress 1.2.2 2019-05-16 [2] CRAN (R 4.0.0)
## purrr 0.3.4 2020-04-17 [2] CRAN (R 4.0.0)
## R6 2.4.1 2019-11-12 [2] CRAN (R 4.0.0)
## rappdirs 0.3.1 2016-03-28 [2] CRAN (R 4.0.0)
## Rcpp 1.0.4.6 2020-04-09 [2] CRAN (R 4.0.0)
## RCurl 1.98-1.2 2020-04-18 [2] CRAN (R 4.0.0)
## rlang 0.4.6 2020-05-02 [2] CRAN (R 4.0.0)
## rmarkdown 2.1 2020-01-20 [2] CRAN (R 4.0.0)
## Rsamtools * 2.5.0 2020-04-27 [2] Bioconductor
## RSQLite 2.2.0 2020-01-07 [2] CRAN (R 4.0.0)
## rtracklayer 1.49.1 2020-05-02 [2] Bioconductor
## S4Vectors * 0.27.5 2020-05-06 [2] Bioconductor
## sessioninfo 1.1.1 2018-11-05 [2] CRAN (R 4.0.0)
## stringi 1.4.6 2020-02-17 [2] CRAN (R 4.0.0)
## stringr 1.4.0 2019-02-10 [2] CRAN (R 4.0.0)
## SummarizedExperiment * 1.19.2 2020-05-01 [2] Bioconductor
## tibble 3.0.1 2020-04-20 [2] CRAN (R 4.0.0)
## tidyselect 1.1.0 2020-05-11 [2] CRAN (R 4.0.0)
## vctrs 0.3.0 2020-05-11 [2] CRAN (R 4.0.0)
## withr 2.2.0 2020-04-20 [2] CRAN (R 4.0.0)
## xfun 0.13 2020-04-13 [2] CRAN (R 4.0.0)
## XML 3.99-0.3 2020-01-20 [2] CRAN (R 4.0.0)
## XVector * 0.29.0 2020-04-27 [2] Bioconductor
## yaml 2.2.1 2020-02-01 [2] CRAN (R 4.0.0)
## zlibbioc 1.35.0 2020-04-27 [2] Bioconductor
##
## [1] /home/Huangtianchi.Zhu19/R/x86_64-pc-linux-gnu-library/4.0
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library