The Outline for lab practical 2

In this practical, we will conduct a classical functional genomic analysis on a gene expression dataset. Our analysis will involve hierarchical clustering of samples to identify potential patterns, and we’ll use feature selection methods to improve tissue labeling prediction. Additionally, we’ll cluster genes to determine the number of gene modules, using silhouette analysis to guide our selection. Finally, we’ll explore the gene clusters using gene set enrichment analysis in DAVID, providing interpretable insights into the underlying biological processes.

1. Construct SummarizedExperiment Object using Expression Matrix and Experimental Design

The data set is stored in a matrix containing the expression levels obtained from a set of microarray experiments, while the rows of the matrix are genes/probes, and the columns of the matrix are samples. The expression matrix is stored in a tabular file named expression.csv under the directory of lab practical 2. Please load the CSV file into R with the function read.csv(). Remember to set the argument row.names = "probeID" to render the first column being the row names of the imported data.frame.

Then, we need to read in another important file called tissue.csv. This file is a table with rows corresponding to the tissue information of each sample. Please read in this file into R with read.csv().

Next, store the expression matrix and the tissue information into a single summarizedExperiment object named by SE. The expression matrix should input at the assays= argument, and the tissue information should input at the colData= argument.

Also, we should notice that the colData slot in the summarizedExperiment object must be a DataFrame object (not data.frame), so make sure to convert the tissue into DataFrame with the function DataFrame().

The reason to construct a summarizedExperiment object is to tidy things up. Compared to simultaneously managing multiple variables (expression assays, column designs, row’s metadata, and row’s GRanges), it is much easier to code with only one variable that includes all of the data.

.

To access and assign different parts of a SummarizedExperiment object, one should use the 4 functions, assays(), colData(), rowRanges(), and metaData(), as illustrated on the diagram above.

It is hard to find the exact genomic regions for the rows of the microarray data set, so that we will skip the specifying rowRanges in the current case. However, a typical NGS assay should be coupled with a set of genome intervals by which the quantification is based on.

Task 1.1, Mark: 5%

tissue <- read.csv("tissue.csv")
e <- read.csv("expression.csv", row.names = "probeID")
library(SummarizedExperiment)
SE <- SummarizedExperiment(assays = e, colData = DataFrame(tissue)) #Represent the data into a summarizedExperiment

Examine the summarizedExperiment object as the following:

SE
## class: SummarizedExperiment 
## dim: 22215 189 
## metadata(0):
## assays(1): ''
## rownames(22215): 1007_s_at 1053_at ... 91920_at 91952_at
## rowData names(0):
## colnames(189): GSM11805.CEL.gz GSM11814.CEL.gz ... GSM307640.CEL.gz
##   GSM307641.CEL.gz
## colData names(1): Tissue
table(SE$Tissue)
## 
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Question 1.2, Mark: 7%

2. Hierarchical Clustering by Samples

Next, we will perform hierarchical clustering by the columns of the expression array. To perform hierarchical clustering, we have to compute distance metrics between the objects in advance.

The distance metric can be calculated in R using the function dist(). Be aware that, by default, dist() is computing distances between rows of the matrix input. So, if we want to cluster by columns, we need to transform the matrix with the function t() to flip the row and columns.

Then, you could use the hclust() function to build a hierarchical clustering dendrogram with on the distance metric. In this question, please run hierarchical clustering on Euclidean distance between columns of the expression assay, and store the resulting hclust object in the variable hc.

Task 2.1, Mark: 8%

d <- dist( t(assay(SE)) ) #Use assay() to extract the expression matrix in SE
hc <- hclust(d)
library(rafalib)
myplclust(hc, labels=SE$Tissue, lab.col=as.fumeric(as.character(SE$Tissue)),cex=0.5)

After drawing the dendrogram, you should able to see a tree structure representing the arrangement of clusters of samples. We could observe that the splitting of the tree branches has a good correlation with the tissue labels.

Next, we will retrieve a “partition” based on the hierachical clustering dendrogram. To achieve this, we need to apply the function cutree(). Set the number of clusters (also called cardinality or K) being the number of tissues.

Then, tabularize the 2 factors for cluster partition and the true tissue labeling. Display the contingency table. Evaluate the contingency table with chisq.test().

Then, calculate the empirical mutual information between the 2 factors with DescTools::MutInf().

Mutual information is a measure of the mutual dependence between two random variables. If the 2 variables are independent, their mutual information will be 0. MI raises if more information gain can be obtained when modeling the dependency between the 2 random variables (compared with the independence modeling).

Task 2.2, Mark: 8%

hclusters <- cutree(hc, k=7)
tb <- table(true = SE$Tissue, cluster = hclusters)
tb
##              cluster
## true           1  2  3  4  5  6  7
##   cerebellum   0  0 36  0  0  2  0
##   colon        0  0  0 34  0  0  0
##   endometrium 15  0  0  0  0  0  0
##   hippocampus  0 12 19  0  0  0  0
##   kidney      37  0  0  0  0  2  0
##   liver        0  0  0  0 24  2  0
##   placenta     0  0  0  0  0  0  6
chisq.test(tb)
## Warning in chisq.test(tb): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  tb
## X-squared = 787.33, df = 36, p-value < 2.2e-16
DescTools::MutInf(SE$Tissue, hclusters)
## [1] 2.090514

Question 2.3, Mark: 7%

Change the clustering method into “ward.D2” by setting the method argument of the hclust() function. Repeat the above analysis, see if the mutual information is improved or not.

Task 2.4, Mark: 5%

hc <- hclust(dist(t(assay(SE))), method = "ward.D2")
hclusters <- cutree(hc, k=7)
tb <- table(true = SE$Tissue, cluster = hclusters)
tb
##              cluster
## true           1  2  3  4  5  6  7
##   cerebellum   0  5 31  0  0  2  0
##   colon        0  0  0 34  0  0  0
##   endometrium 15  0  0  0  0  0  0
##   hippocampus  0 31  0  0  0  0  0
##   kidney      37  0  0  0  0  2  0
##   liver        0  0  0  0 24  2  0
##   placenta     0  0  0  0  0  0  6
DescTools::MutInf(SE$Tissue, hclusters)
## [1] 2.250404

3. Feature Selection and Heatmap

Then, we will use a simple but most popular feature extraction technique: keeping only a small number of genes that have the highest variance. This time, we will keep the 85 rows with the highest row variances.

You could use the function rowVar() defined in the package geneFilter to calculate the variances for genes. Save the index of the selected rows into a variable called idx. Using the method ward.D2, conduct hierarchical clustering on columns with the rows subset by idx. Please draw the dendrogram and evaluate the mutual information again.

Subsequently, run the prepared code below to plot a heatmap.

Task 3.1, Mark: 8%

library(genefilter)
rv <- rowVars(e)
idx <- order(-rv)[1:85]
d <- dist(t(assay(SE)))
hc <- hclust(d, method = "ward.D2")
hclusters <- cutree(hc, k=7)
tb <- table(true = SE$Tissue, cluster = hclusters)
tb
##              cluster
## true           1  2  3  4  5  6  7
##   cerebellum   0  5 31  0  0  2  0
##   colon        0  0  0 34  0  0  0
##   endometrium 15  0  0  0  0  0  0
##   hippocampus  0 31  0  0  0  0  0
##   kidney      37  0  0  0  0  2  0
##   liver        0  0  0  0 24  2  0
##   placenta     0  0  0  0  0  0  6
DescTools::MutInf(SE$Tissue, hclusters)
## [1] 2.250404
## ===== Your code is finished  =============== ##
library(pheatmap)
library(RColorBrewer)

annotation_col = data.frame(
  Tissue = SE$Tissue
  )
rownames(annotation_col) = colnames(SE)

hmcol <- colorRampPalette(rev(brewer.pal(n = 9, name = "GnBu")))(100)
cols <- palette(brewer.pal(8, "Dark2"))[as.fumeric(as.character(unique(SE$Tissue)))]
names(cols) <- unique(SE$Tissue)

pheatmap(assay(SE)[idx,],
         color = hmcol,
         show_rownames=FALSE,
         show_colnames=FALSE,
         annotation_col=annotation_col,
         annotation_colors = list(Tissue = cols),
         scale = "none",
         clustering_method="ward.D2",
         clustering_distance_cols="euclidean")

Question 3.2, Mark: 7%

Question 3.3, Mark: 7%

4. Determine the Number of Gene Clusters

The quality of the clustering partition can be determined by the average Silhouette values for each cluster.

The silhouette values quantify how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

In the following steps, we need to calculate the Silhouette values of gene clustering. We will use the pam() function defined in cluster package to perform this analysis. The function will run K-medoids clustering on rows of the input matrix. Set k = 2 so that we will have 2 gene clusters at first. Please use the same 85 rows that have been selected in the last question. Save the output of pam() into a variable named by pamclu.

Task 4.1, Mark: 8%

set.seed(101)
library(cluster)
pamclu = pam(assay(SE)[idx,], k = 2) #Set k = 2, make sure to subset with idx first!
## ===== Your code is finished  =============== ##
plot(silhouette(pamclu), main=NULL)

Then, we will try to compute the average silhouette values using different ks. Fill in the middle of the code chunk below to calculate the summarized silhouette statistics for each k. Remember that we need a FOR loop to do this kind of procedure.

Ks = sapply(2:7, function(i) summary(silhouette(pam(assay(SE)[idx,], k = i)))$avg.width)
## ===== Your code is finished  =============== ##
plot(2:7,Ks,xlab="k",ylab="av.silhouette",type="b",pch=19)

Question 4.2, Mark: 7%

5. Functional Characterization of the Gene Modules

Run hierachical clustering on the 85 genes using method ward.D2, cut the tree using the optimized k derived from the previous step. Store the output of cutree() in a variable named by genecluster.

Task 5.1, Mark: 8%

d <- dist(assay(SE)[idx,]) #make sure to subset with idx first!
hc <- hclust(d, method = "ward.D2")
genecluster <- cutree(hc, k=3)
## ===== Your code is finished  =============== ##
sapply(unique(genecluster), function(i) writeLines(names(genecluster)[genecluster == i], paste0("List_",i,".txt")) )
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
writeLines(names(genecluster), "Background.txt")

After running the last 3 lines of code, you will obtain 3 sets of probe ids saved under the practical directory. Next, go to the website of DAVID.

Do the following steps:

  1. Click Start Analysis >> Upload >> Copy & Paste genes in List_i.txt >> Select Gene List >> Submit list

  2. Upload >> Copy & Paste genes in Background.txt >> select Background >> Submit

  3. click List >> Select List_i >> Use >> Click Background >> Select Background_1 >> Use

  4. Clear all >> Pathways >> KEGG_PATHWAY

  5. Functional annotation chart >> Functional annotation chart

  6. Repeat step 3, 4, and 5 for all the List_is

You can find the explaination for the functional enrichment results at here

Question 5.2, Mark: 15%

Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] cluster_2.1.0               RColorBrewer_1.1-2         
##  [3] pheatmap_1.0.12             genefilter_1.71.0          
##  [5] rafalib_1.0.0               SummarizedExperiment_1.19.2
##  [7] DelayedArray_0.15.1         matrixStats_0.56.0         
##  [9] Biobase_2.49.0              GenomicRanges_1.41.1       
## [11] GenomeInfoDb_1.25.0         IRanges_2.23.4             
## [13] S4Vectors_0.27.5            BiocGenerics_0.35.2        
## 
## loaded via a namespace (and not attached):
##  [1] xfun_0.13              splines_4.0.3          lattice_0.20-41       
##  [4] colorspace_1.4-1       vctrs_0.3.0            expm_0.999-6          
##  [7] htmltools_0.4.0        yaml_2.2.1             blob_1.2.1            
## [10] XML_3.99-0.3           survival_3.2-7         rlang_0.4.6           
## [13] e1071_1.7-3            DBI_1.1.0              bit64_0.9-7           
## [16] lifecycle_0.2.0        GenomeInfoDbData_1.2.3 rootSolve_1.8.2.1     
## [19] stringr_1.4.0          zlibbioc_1.35.0        munsell_0.5.0         
## [22] gtable_0.3.0           mvtnorm_1.1-0          evaluate_0.14         
## [25] memoise_1.1.0          knitr_1.28             lmom_2.8              
## [28] class_7.3-17           AnnotationDbi_1.51.0   Rcpp_1.0.4.6          
## [31] xtable_1.8-4           scales_1.1.1           annotate_1.67.0       
## [34] XVector_0.29.0         farver_2.0.3           bit_1.1-15.2          
## [37] gld_2.6.2              Exact_2.1              digest_0.6.25         
## [40] stringi_1.4.6          grid_4.0.3             tools_4.0.3           
## [43] bitops_1.0-6           magrittr_1.5           DescTools_0.99.40     
## [46] RCurl_1.98-1.2         RSQLite_2.2.0          MASS_7.3-53           
## [49] Matrix_1.2-18          rmarkdown_2.1          rstudioapi_0.11       
## [52] R6_2.4.1               boot_1.3-25            compiler_4.0.3