The Outline for Lecture Practical 1

The major goal of this assignment is to reproduce the core analysis of a paper published in 2016. Next-Generation Sequencing (NGS) is currently the most popular data type in bioinformatics: tens of thousands of NGS datasets are deposited in public repositories every year, capturing the molecular profiles of DNA, RNA, and epigenetic marks in cells. In this project, our goal is to conduct a low-level exploratory data analysis (EDA) of an RNA-Seq dataset generated by 2 sequencing centers. We will identify the major source of technical variation in this dataset, and then apply a statistical tool to correct the technical error.

You can check the paper below for more background information:

Love, M. I., Hogenesch, J. B., & Irizarry, R. A. (2016). Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nature Biotechnology, 34(12), 1287–1291.

1. Check the Experiment Design and Data Directories

Our RNA-Seq dataset contains 4 samples sequenced at 2 centers on the Illumina HiSeq 2000 platform. All of the samples originate from the same cell line and the same biological condition in the GEUVADIS project. The reads were previously aligned to the genome assembly hg19 using STAR, and the resulting BAM files are saved under the working directory of Lecture Practical 1.

The sample information is summarized in the table samples.csv, and we can construct the BAM file paths from this table. Please use the string manipulation function paste0() to construct the names of the sorted BAM files from the run column of the table, then check whether all of the files exist under the current working directory with file.exists().

Code Chunk1

## === Hint code, fill the "___" ============= ##
#samples <- read.csv(___)
#bamfiles <- paste0(___) #Paste the sample id with "_sort.bam"
#file.exists(bamfiles) #Check if the bam files exist or not
## ===== Enter your code below  =============== ##
library(GenomicAlignments) # provides the BAM reading utilities used throughout this practical
samples <- read.csv("samples.csv")
bamfiles <- paste0(samples$run, "_sort.bam")
file.exists(bamfiles)
## [1] TRUE TRUE TRUE TRUE

2. Extract the Longest Transcript of the Gene USF2

In the following analysis, we want to investigate the RNA-Seq fragment coverage on a transcript of the gene USF2. First, please extract the exon regions of the longest transcript of USF2 on hg19. The expected output of this step is a GRangesList object of length 1, which contains all exons of the transcript. Name the resulting variable usf2_longest.

To achieve this, you first need to use the select() function to find the transcript IDs and transcript names associated with the gene USF2. You can use keytypes(Homo.sapiens) and columns(Homo.sapiens) to check all of the ID conversion options available in the human OrganismDb package. The keys are the identifiers that can be converted from, and the columns are the identifiers that can be converted to.

USF2 is a gene symbol, so it should have the keytype of "SY___" in the OrganismDb object, and the corresponding columns should be c("TX___", "TX___"), so that we can retrieve both the transcript ID and the transcript name from the database.
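As a minimal sketch of how to inspect the database before filling in the blanks (these calls only print the available options):

library(Homo.sapiens)
keytypes(Homo.sapiens) # identifiers you can convert from; the gene symbol keytype is among them
columns(Homo.sapiens)  # identifiers you can convert to, including the transcript-related columns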

The output of exonsBy(txdb, by = "tx") is a GRangesList, in which each element corresponds to the exons of one transcript. You can obtain the number of ranges in each list element with elementNROWS().
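For a toy illustration of elementNROWS() (the list below is made up), consider a small IntegerList:

library(IRanges)
il <- IntegerList(a = 1:3, b = 1:5)
elementNROWS(il) # per-element lengths: a = 3, b = 5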

Code Chunk2

## === Hint code, fill the "___" ============= ##
# library(TxDb.___.UCSC.___.knownGene)
# library(Homo.sapiens)
# txdb <- TxDb.___.UCSC.___.knownGene
# g <- list()
# g[["USF2"]] <- select(Homo.sapiens, "USF2", ___ , ___)
# ebt <- exonsBy(txdb, by="tx")
# head(names(ebt)) #Check the structure of exonsByTranscript, the names are TXIDs
# usf2 <- ebt[ g[["USF2"]]$TXID ]
# usf2_longest <- usf2[which.max(___)] #Keep only the longest transcript among the isoforms
## ===== Enter your code below  =============== ##
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(Homo.sapiens)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
g <- list()
g[["USF2"]] <- select(Homo.sapiens, "USF2", c("TXID","TXNAME"), "SYMBOL")
ebt <- exonsBy(txdb, by="tx")
head(names(ebt))
## [1] "1" "2" "3" "4" "5" "6"
usf2 <- ebt[ g[["USF2"]]$TXID ] # subset the exons of the USF2 isoforms by transcript id
usf2_longest <- usf2[which.max(sum(width(usf2)))] # sum(width(.)) returns the total exonic length of each isoform

SAQ1

3. Load the BAM Files as Fragments on Transcript

Next, we want to write a function that takes as inputs a BAM file path and the GRangesList of a transcript, and returns a GRanges of the fragments “on the transcript coordinates” (i.e. with introns removed).

The first step in this function is to read the BAM file using readGAlignmentPairs(). At this step, we need to set the param= argument with ScanBamParam() to define the filters applied when reading the BAM file: the loaded reads should lie within the range of the target transcript, and all reads with mapping quality scores less than 30 will be discarded. Please see ?ScanBamParam on how to set the BAM filtering parameters.

Subsequently, we need to keep only the aligned reads that are compatible with the exons of the transcript. This can be achieved using the function findCompatibleOverlaps(). Finally, because the coordinates of the loaded reads are genomic, we need to convert them to transcript coordinates using the function map___Transcripts().
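As a minimal sketch of the coordinate conversion (the two exons below are made up for illustration), a genomic position inside the second exon maps to a transcript position with the intron removed:

library(GenomicFeatures)
toy_tx <- GRangesList(tx1 = GRanges("chr1", IRanges(c(101, 201), c(150, 250)), strand = "+"))
mapToTranscripts(GRanges("chr1", IRanges(210, 210)), toy_tx)
## position 210 maps to position 60 on tx1: 50 bases of exon 1 plus 10 bases into exon 2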

Please fill in the missing portions of the code chunk below to complete the function readTxFragments().

Code Chunk3

readTxFragments <- function(file, transcript) {
  r <- range(transcript[[1]]) # Get the range of the target transcript on the genome
  r <- keepStandardChromosomes(r) # Keep only the ranges on standard chromosomes
  
  
  # Read the BAM files with required filters, and find the compatible reads
  ## === Hint code, fill the "___" ============= ##
  # suppressWarnings({
  # gap <- readGAlignmentPairs(file, param=ScanBamParam(which = ___, mapqFilter = ___))
  # }) # suppress warnings about alignments with ambiguous pairing
  # fco <- findCompatibleOverlaps(___, ___)
  ## ===== Enter your code below  =============== ##
  suppressWarnings({
    gap <- readGAlignmentPairs(file, param=ScanBamParam(which=r, mapqFilter=30))
  })
  fco <- findCompatibleOverlaps(gap, transcript)
  ## ===== Your code is finished  =============== ##
  idx <- queryHits(fco) # get index for the compatible reads
  gr <- as(gap[idx],"GRanges") # subset the reads with index and convert it into GRanges
  strand(gr) <- "*" # set the strand to unstranded (*); this fills the insert and merges each read pair into 1 range.
  
  # Convert the paired-end reads into fragments on the exons by mapping the two ends of each pair to transcript coordinates.
  ## === Hint code, fill the "___" ============= ##
  # m2tx.start <- map___Transcripts(resize(___, width=1, fix="start"), ___)
  # m2tx.end <- map___Transcripts(resize(___, width=1, fix="end"), ___)
  ## ===== Enter your code below  =============== ##
  m2tx.start <- mapToTranscripts(resize(gr, width=1, fix="start"),
                                 transcript)
  m2tx.end <- mapToTranscripts(resize(gr, width=1, fix="end"),
                               transcript)
  ## ===== Your code is finished  =============== ##
  #Flip the start & end if the transcript is on the negative strand.
  tx.strand <- as.character(strand(transcript)[[1]][1])
  if (tx.strand == "+") {
    m2tx <- GRanges(seqnames(m2tx.start),
                    IRanges(start(m2tx.start),start(m2tx.end)))
  } else {
    m2tx <- GRanges(seqnames(m2tx.start),
                    IRanges(start(m2tx.end),start(m2tx.start)))
  }
  return(m2tx)
}

Fragment_list <- lapply(bamfiles, readTxFragments, transcript = usf2_longest)

SAQ2

4. Plot the Fragment Length Distribution

After obtaining the aligned fragments on the transcript, we want to compare the lengths of the fragments across samples. In the following code chunk, please manipulate Fragment_list with another lapply() call to calculate the fragment lengths for each sample. Then, run the following code to plot the fragment length distributions.

Code Chunk4

## === Hint code, fill the "___" ============= ##
#Fragment_len <- lapply(Fragment_list, ___)
## ===== Enter your code below  ============== ##
Fragment_len <- lapply(Fragment_list, width)
## ===== Your code is finished  =============== ##
plot_df <- data.frame( fragment_length = unlist( Fragment_len ),
                   sample = rep(samples$run, elementNROWS(Fragment_list)),
                   batch = rep(paste0("center ",samples$center), elementNROWS(Fragment_list)) )

library(ggplot2)
library(ggsci) #A package for good looking colours

ggplot(plot_df, aes(x = fragment_length, color = sample, linetype = batch)) + geom_density() +
  theme_classic() + scale_color_npg() + labs(x = "Fragment Lengths")

Next, please calculate the average fragment length for each sample (BAM file), and store the resulting vector in a variable named average_frag_len.

Code Chunk5, Mark: 3%

## === Hint code, fill the "___" ============= ##
# average_frag_len <- sapply(Fragment_len, ___ )
## ===== Enter your code below  ============== ##
average_frag_len <- sapply(Fragment_len, mean) # average the per-fragment widths computed above

SAQ3, Mark: 8%

P.S. Please see this link to clarify the differences between fragment, read, insert, and inner distance.

The definition of a fragment in our project actually corresponds to the ‘insert’ region mentioned in the link above. The terms insert and fragment are often used interchangeably, because the adapters must be removed before the reads can be aligned to the genome. However, we should keep in mind that the “full” fragments in NGS do carry adapter sequences at their ends.

5. Check the Fragment Coverage on the Transcript

In the next step, we want to visualize the “coverage” of the sequencing fragments on the target transcript. Coverage can be understood as the height obtained by stacking all of the features along the coordinate axis. In Bioconductor, we can compute the coverage of a GRanges object with the function coverage().
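As a toy sketch of coverage() on three overlapping ranges (the ranges below are made up):

library(GenomicRanges)
toy <- GRanges("chr1", IRanges(start = c(1, 3, 5), end = c(6, 8, 10)))
coverage(toy)
## an RleList; on chr1 the run values are 1 2 3 2 1 with run lengths 2 2 2 2 2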

Again, for each list element of Fragment_list, compute the coverage and convert it into an integer vector of counts. Store the coverage list in a variable named coverage_lst.

Code Chunk6

## === Hint code, fill the "___" ============= ##
# coverage_lst <- lapply(___, function(x) as.vector(___[[1]]))
## === Enter your code below ================= ##
coverage_lst <- lapply(Fragment_list, function(x) as.vector(coverage(x)[[1]]))
## === Your code is finished  ================ ##
plot_df <- data.frame( tx_position = unlist(lapply(coverage_lst,seq_along)),
                       coverage = unlist( coverage_lst ),
                       sample = rep(samples$run, elementNROWS(coverage_lst)),
                       batch = rep(paste0("center ",samples$center), elementNROWS(coverage_lst)) )

ggplot(plot_df, aes(x = tx_position, y = coverage, color = sample, linetype = batch)) + geom_line() +
  theme_classic() + scale_color_npg() + labs(x = "transcript coordinate")

SAQ4

6. Explore Fragment-Level GC Content Distributions

Using the Fragment_list generated before, extract the fragment GC contents on the longest transcript for the 4 samples. The sequence of a transcript can be retrieved with extractTranscriptSeqs(). Then, apply Views() to extract the sequences underneath the fragment ranges. Store the GC contents of each list element in a variable named GC_list.
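As a toy sketch of the sequence extraction (the 10-mer below is made up): Views() slices a sequence without copying it, and letterFrequency() then computes the GC proportion of each view.

library(Biostrings)
toy_seq <- DNAString("ACGTACGTAC")
v <- Views(toy_seq, start = c(1, 5), end = c(4, 8)) # views "ACGT" and "ACGT"
letterFrequency(DNAStringSet(v), "GC", as.prob = TRUE) # 0.5 for both views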

Code Chunk7

## === Hint code, fill the "___" ============= ##
#library(BSgenome.___.UCSC.___)
#usf2.seq <- extractTranscriptSeqs(___,___)[[1]]
#GC_list <- lapply(Fragment_list,function(x) ___(___(Views(___, ranges(x)) ), ___, as.prob=___))
## === Enter your code below ================= ##
library(BSgenome.Hsapiens.UCSC.hg19)
usf2.seq <- extractTranscriptSeqs(Hsapiens, usf2_longest)[[1]]
GC_list <- lapply(Fragment_list, function(x) letterFrequency(DNAStringSet( Views(usf2.seq, ranges(x)) ), "GC", as.prob=TRUE))
## === Your code is finished  ================ ##
plot_df <- data.frame( GC_content = unlist( GC_list ),
                       sample = rep(samples$run, elementNROWS(GC_list)),
                       batch = rep(paste0("center ",samples$center), elementNROWS(GC_list)) )

ggplot(plot_df, aes(x = GC_content, color = sample, linetype = batch)) + geom_density() +
  theme_classic() + scale_color_npg() + labs(x = "Fragment GC content")

SAQ5, Mark: 8%

7. Fit the Linear Relationships Between GC and Fragment Coverages

In the previous question, we explored the “marginal distributions” of fragment GC content among samples. Next, we will investigate the “joint distribution” between GC content and coverage. If the model of the joint distribution is good enough, we can estimate “offsets” for the coverage, which can then be used to correct the GC content bias.

To model the relationship between GC content and coverage, we first need to identify all of the potential positions of the sequencing fragments, because many fragments that could potentially be generated have 0 counts in our data. The following code chunk will help us generate the coverage of the potential fragments.

You are required to fill in the central step of this procedure, which is to use the function slidingWindows() to generate the potential fragments on the coordinates of the target transcript. Set the window size to the average fragment length calculated for each sample, and set the step to 1L (move one base at a time).
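A toy sketch of slidingWindows() on a made-up 6-bp range: with width 3 and step 1L, the windows advance one base at a time.

library(GenomicRanges)
toy_rng <- GRanges("tx1", IRanges(1, 6))
slidingWindows(toy_rng, width = 3L, step = 1L)[[1]]
## four windows: 1-3, 2-4, 3-5, 4-6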

Code Chunk8

#Convert each fragment into its two end positions (POS) and compute the coverage of the ends
POS <- lapply(Fragment_list, function(x) c(resize(x,1,fix = "start"),
                                           resize(x,1,fix = "end")) )

POS_coverage <- lapply(POS, function(x) as.numeric(coverage(x)[[1]]))

#Get the range of the transcript on its own coordinate
usf2_tx <- range(mapToTranscripts(
                     unlist(usf2_longest),
                            usf2_longest)) 

#Find potential fragments for each sample 

## === Hint code, fill the "___" ============= ##
#PotentialFragments <- lapply( round(average_frag_len),
#                             function(x) slidingWindows(___,___,___)[[1]] )
## === Enter your code below ================= ##
PotentialFragments <- lapply( round(average_frag_len),
                             function(x) slidingWindows(usf2_tx,x,1L)[[1]] )
## === Your code is finished  ================ ##

#Count both ends of the potential fragments
Fragment_counts <- mapply(function(x,y) y[start(x)]+y[end(x)],
                             PotentialFragments,
                             POS_coverage) 

#Calculate GC on potential fragments
GC_listp <- lapply(PotentialFragments,
                   function(x) letterFrequency(DNAStringSet( 
                                  Views(usf2.seq, ranges(x)) ), 
                                        "GC", as.prob=TRUE))

#Build a data.frame for plotting
plot_df <- data.frame( GC_content = unlist( GC_listp ),
                       Fragment_count = unlist(Fragment_counts),
                       sample = rep(samples$run, elementNROWS(GC_listp)),
                       batch = rep(paste0("center ",samples$center),
                                   elementNROWS(GC_listp)) )

Then, we will fit an ordinary linear regression model using natural cubic splines with 5 degrees of freedom. The x and y variables are the GC content and the end coverage of the potential fragments, respectively.
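To see what the spline term does, here is a small sketch (the x values are made up): splines::ns(x, 5) expands x into a 5-column natural cubic spline basis, so lm() can fit a smooth curve while remaining a linear model.

library(splines)
x <- seq(0, 1, length.out = 10)
dim(ns(x, df = 5)) # a 10 x 5 basis matrix; lm() estimates one coefficient per column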

Gaussian Regression:

## define the smooth function for ordinary linear regression
lm_smooth <- function(...) {
  geom_smooth(method = "lm", ...)
}

ggplot(plot_df, aes(x = GC_content, y = Fragment_count,
                    color = sample, linetype = batch)) + 
        lm_smooth(formula = y ~ splines::ns(x, 5)) +
        theme_classic() + scale_color_npg() + 
        labs(x = "Fragment GC content", 
             y = "Fragment count",
             title = "Gaussian Regression")

We have now obtained the fitted curves between GC content and coverage for the different samples. Subsequently, we will try an alternative model called Poisson regression. Please fill out the function poisson_smooth() so that ggplot2 can fit a Poisson regression on our data.

P.S. Poisson regression is a type of generalized linear model (GLM). In R, a Poisson GLM can be fitted with the function glm() and family = “poisson”. The difference between Poisson regression and ordinary linear regression is that Poisson regression treats the noise as Poisson distributed, while ordinary linear regression treats the noise as zero-centered normal distributions.
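As a toy sketch of a Poisson GLM on simulated counts (the data below are made up, not from our experiment):

set.seed(1)
toy <- data.frame(x = runif(100), y = rpois(100, lambda = 5))
fit <- glm(y ~ x, data = toy, family = poisson)
coef(fit) # the coefficients are on the log (link) scale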

Poisson Regression:

Code Chunk9

poisson_smooth <- function(...) {
## === Hint code, fill the "___" ============= ##
#geom_smooth(method = ___, method.args = list(family = ___), ...)
## === Enter your code below ================= ##
geom_smooth(method = "glm", method.args = list(family = "poisson"), ...)
## === Your code is finished  ================ ##  
}

ggplot(plot_df, aes(x = GC_content, y = Fragment_count,
                    color = sample, linetype = batch)) +  
        poisson_smooth(formula = y ~ splines::ns(x, 5)) +
        theme_classic() + scale_color_npg() + 
        labs(x = "Fragment GC content",
             y = "Fragment count",
             title = "Poisson Regression")

SAQ6

  • SAQ6: The grey areas around the curves indicate the confidence intervals for the estimates of the conditional expectations (the regression curves). In your opinion, which model fits better, and why? Can you give some reasons behind the model choice? (Please think about the nature of the read coverage data type.)

  • Answer: The Poisson regression fits better because its parameter estimates have smaller standard errors (narrower confidence intervals). We choose the Poisson model because the response variable in our model is the read coverage in NGS. Read coverages are by nature small count values that exhibit Poisson-distributed shot noise.

8. Correct the Fragment Coverage by Removing the Linear GC Content Effects

Using your selected model, please complete the following code to obtain the predicted values of read coverage given fragment GC content. The fitted values will then be used as offsets to normalize the sequencing data. Hopefully, this measure can remove the technical errors induced by GC content.
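One detail worth remembering: predict() on a Poisson GLM returns values on the log (link) scale by default, which is why the offsets are exponentiated below. A quick sketch with simulated data (the toy data frame is made up):

set.seed(1)
toy <- data.frame(x = 1:20, y = rpois(20, lambda = 10))
fit <- glm(y ~ x, data = toy, family = poisson)
head(predict(fit))                     # log (link) scale, the scale of our offsets
head(predict(fit, type = "response"))  # count scale, equal to exp() of the above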

Code Chunk10

## === Hint code, fill the "___" ============= ##
#offsets <- mapply(function(x,y){predict(___(y~splines::ns(x,5), 
#                  ___ = ___),data.frame(x))},
#                  GC_listp, Fragment_counts) #mapply() is an iterative functional with 2 inputs, see ?mapply
## === Enter your code below ================= ##
offsets <- mapply(function(x,y){predict(glm(y~splines::ns(x,5), 
                  family = poisson),data.frame(x))},
                  GC_listp, Fragment_counts) 
## === Your code is finished  ================ ##

#The read coverage is normalized by dividing the original fragment counts by the exponentiated offsets; the exponentiation is necessary because the predictions of the Poisson GLM are made on the log (link) scale.
normalized_counts <- mapply(function(x,y) x/exp(y), Fragment_counts, offsets)


library(zoo)


#The normalized coverage is calculated by the rolling sums (rollsum) of the normalized counts, while the ends are computed by cumulative sums (cumsum).
normalized_coverages <- mapply(function(x,y) 
                                c(cumsum(x[1:y]),rollsum(x,y),
                                  rev(cumsum(x[length(x):(length(x)-y)])))/2,
                               normalized_counts,
                               round(average_frag_len)) # rollsum() expects an integer window size
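# (Aside, a toy sketch of the rolling sum used above, with made-up numbers:
#  rollsum(1:6, 3) returns 6 9 12 15, i.e. the sums of 1:3, 2:4, 3:5, and 4:6.)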

#Finally, plot the normalized read coverage with ggplot2.
plot_df <- data.frame( tx_position = unlist(lapply(normalized_coverages,seq_along)),
                       coverage = unlist( normalized_coverages  ),
                       sample = rep(samples$run, elementNROWS(normalized_coverages)),
                       batch = rep(paste0("center ",samples$center), elementNROWS(normalized_coverages)) )

ggplot(plot_df, aes(x = tx_position, y = coverage, color = sample, linetype = batch)) + geom_line() +
  theme_classic() + scale_color_npg() + labs(x = "transcript coordinate", y = "normalized coverage")

SAQ7

Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] zoo_1.8-9                              
##  [2] BSgenome.Hsapiens.UCSC.hg19_1.4.3      
##  [3] BSgenome_1.57.0                        
##  [4] rtracklayer_1.49.1                     
##  [5] ggsci_2.9                              
##  [6] ggplot2_3.3.0                          
##  [7] Homo.sapiens_1.3.1                     
##  [8] org.Hs.eg.db_3.11.1                    
##  [9] GO.db_3.11.1                           
## [10] OrganismDbi_1.31.0                     
## [11] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [12] GenomicFeatures_1.41.0                 
## [13] AnnotationDbi_1.51.0                   
## [14] GenomicAlignments_1.25.0               
## [15] Rsamtools_2.5.0                        
## [16] Biostrings_2.57.0                      
## [17] XVector_0.29.0                         
## [18] SummarizedExperiment_1.19.2            
## [19] DelayedArray_0.15.1                    
## [20] matrixStats_0.56.0                     
## [21] Biobase_2.49.0                         
## [22] GenomicRanges_1.41.1                   
## [23] GenomeInfoDb_1.25.0                    
## [24] IRanges_2.23.4                         
## [25] S4Vectors_0.27.5                       
## [26] BiocGenerics_0.35.2                    
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-149           bitops_1.0-6           bit64_0.9-7           
##  [4] progress_1.2.2         httr_1.4.1             tools_4.0.3           
##  [7] R6_2.4.1               DBI_1.1.0              mgcv_1.8-33           
## [10] colorspace_1.4-1       withr_2.2.0            tidyselect_1.1.0      
## [13] prettyunits_1.1.1      bit_1.1-15.2           curl_4.3              
## [16] compiler_4.0.3         graph_1.67.0           cli_2.0.2             
## [19] labeling_0.3           scales_1.1.1           RBGL_1.65.0           
## [22] askpass_1.1            rappdirs_0.3.1         stringr_1.4.0         
## [25] digest_0.6.25          rmarkdown_2.1          pkgconfig_2.0.3       
## [28] htmltools_0.4.0        dbplyr_1.4.3           rlang_1.0.5           
## [31] rstudioapi_0.11        RSQLite_2.2.0          farver_2.0.3          
## [34] BiocParallel_1.23.0    dplyr_0.8.5            RCurl_1.98-1.2        
## [37] magrittr_1.5           GenomeInfoDbData_1.2.3 Matrix_1.4-1          
## [40] Rcpp_1.0.9             munsell_0.5.0          fansi_0.4.1           
## [43] lifecycle_0.2.0        stringi_1.4.6          yaml_2.2.1            
## [46] zlibbioc_1.35.0        BiocFileCache_1.13.0   grid_4.0.3            
## [49] blob_1.2.1             crayon_1.3.4           lattice_0.20-41       
## [52] splines_4.0.3          hms_0.5.3              knitr_1.28            
## [55] pillar_1.4.4           biomaRt_2.45.0         XML_3.99-0.3          
## [58] glue_1.4.0             evaluate_0.14          BiocManager_1.30.10   
## [61] vctrs_0.3.0            gtable_0.3.0           openssl_1.4.1         
## [64] purrr_0.3.4            assertthat_0.2.1       xfun_0.13             
## [67] tibble_3.0.1           memoise_1.1.0          ellipsis_0.3.0