In this practical, we will conduct a classical functional genomic analysis on a gene expression dataset. Our analysis will involve hierarchical clustering of samples to identify potential patterns, and we’ll use feature selection methods to improve tissue labeling prediction. Additionally, we’ll cluster genes to determine the number of gene modules, using silhouette analysis to guide our selection. Finally, we’ll explore the gene clusters using gene set enrichment analysis in DAVID, providing interpretable insights into the underlying biological processes.
The data set is stored in a matrix containing the expression levels obtained from a set of microarray experiments, while the rows of the matrix are genes/probes, and the columns of the matrix are samples. The expression matrix is stored in a tabular file named expression.csv
under the directory of lab practical 2. Please load the CSV file into R with the function read.csv()
. Remember to set the argument row.names = "probeID"
to render the first column being the row names of the imported data.frame.
Then, we need to read in another important file called tissue.csv
. This file is a table with rows corresponding to the tissue information of each sample. Please read in this file into R with read.csv()
.
Next, store the expression matrix and the tissue information into a single summarizedExperiment
object named by SE
. The expression matrix should input at the assays=
argument, and the tissue information should input at the colData=
argument.
Also, we should notice that the colData
slot in the summarizedExperiment object must be a DataFrame
object (not data.frame
), so make sure to convert the tissue into DataFrame with the function DataFrame()
.
The reason to construct a summarizedExperiment
object is to tidy things up. Compared to simultaneously managing multiple variables (expression assays, column designs, row’s metadata, and row’s GRanges), it is much easier to code with only one variable that includes all of the data.
To access and assign different parts of a SummarizedExperiment
object, one should use the 4 functions, assays()
, colData()
, rowRanges()
, and metaData()
, as illustrated on the diagram above.
It is hard to find the exact genomic regions for the rows of the microarray data set, so that we will skip the specifying rowRanges
in the current case. However, a typical NGS assay should be coupled with a set of genome intervals by which the quantification is based on.
Task 1.1, Mark: 5%
tissue <- read.csv("tissue.csv")
e <- read.csv("expression.csv", row.names = "probeID")
library(SummarizedExperiment)
SE <- SummarizedExperiment(assays = e, colData = DataFrame(tissue)) #Represent the data into a summarizedExperiment
Examine the summarizedExperiment object as the following:
SE
## class: SummarizedExperiment
## dim: 22215 189
## metadata(0):
## assays(1): ''
## rownames(22215): 1007_s_at 1053_at ... 91920_at 91952_at
## rowData names(0):
## colnames(189): GSM11805.CEL.gz GSM11814.CEL.gz ... GSM307640.CEL.gz
## GSM307641.CEL.gz
## colData names(1): Tissue
table(SE$Tissue)
##
## cerebellum colon endometrium hippocampus kidney liver
## 38 34 15 31 39 26
## placenta
## 6
Question 1.2, Mark: 7%
Question 1.2: How many genes or probes are included in this set of the experiment? How many samples are there in total? How many unique tissues are there in total? What are the contents of the rownames
and colnames
of the SummarizedExperiment object?
Answer:
Genes or Probes: 22215
Samples: 189
Tissues: 7
rownames
contents: 1007_s_at 1053_at … 91920_at 91952_at
colnames
contents: GSM11805.CEL.gz GSM11814.CEL.gz … GSM307640.CEL.gz GSM307641.CEL.gz
Next, we will perform hierarchical clustering by the columns of the expression array. To perform hierarchical clustering, we have to compute distance metrics between the objects in advance.
The distance metric can be calculated in R using the function dist()
. Be aware that, by default, dist()
is computing distances between rows of the matrix input. So, if we want to cluster by columns, we need to transform the matrix with the function t()
to flip the row and columns.
Then, you could use the hclust()
function to build a hierarchical clustering dendrogram with on the distance metric. In this question, please run hierarchical clustering on Euclidean distance between columns of the expression assay, and store the resulting hclust
object in the variable hc
.
Task 2.1, Mark: 8%
d <- dist( t(assay(SE)) ) #Use assay() to extract the expression matrix in SE
hc <- hclust(d)
library(rafalib)
myplclust(hc, labels=SE$Tissue, lab.col=as.fumeric(as.character(SE$Tissue)),cex=0.5)
After drawing the dendrogram, you should able to see a tree structure representing the arrangement of clusters of samples. We could observe that the splitting of the tree branches has a good correlation with the tissue labels.
Next, we will retrieve a “partition” based on the hierachical clustering dendrogram. To achieve this, we need to apply the function cutree()
. Set the number of clusters (also called cardinality or K) being the number of tissues.
Then, tabularize the 2 factors for cluster partition and the true tissue labeling. Display the contingency table. Evaluate the contingency table with chisq.test()
.
Then, calculate the empirical mutual information between the 2 factors with DescTools::MutInf()
.
Mutual information is a measure of the mutual dependence between two random variables. If the 2 variables are independent, their mutual information will be 0. MI raises if more information gain can be obtained when modeling the dependency between the 2 random variables (compared with the independence modeling).
Task 2.2, Mark: 8%
hclusters <- cutree(hc, k=7)
tb <- table(true = SE$Tissue, cluster = hclusters)
tb
## cluster
## true 1 2 3 4 5 6 7
## cerebellum 0 0 36 0 0 2 0
## colon 0 0 0 34 0 0 0
## endometrium 15 0 0 0 0 0 0
## hippocampus 0 12 19 0 0 0 0
## kidney 37 0 0 0 0 2 0
## liver 0 0 0 0 24 2 0
## placenta 0 0 0 0 0 0 6
chisq.test(tb)
## Warning in chisq.test(tb): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tb
## X-squared = 787.33, df = 36, p-value < 2.2e-16
DescTools::MutInf(SE$Tissue, hclusters)
## [1] 2.090514
Question 2.3, Mark: 7%
Question 2.3: What are the calculated Chi-squared statistics and mutual information? Is the cluster partition significantly associate with the tissue labeling? Is the prediction on tissues perfect or not? Please explain your rationales.
Answer:
I get those informations: X-squared = 787.33 and df = 36 and p-value < 2.2e-16 and the empirical mutual information is 2.090514.
Yes, obviously they have strong significant level because p-value is much more smaller than 0.05.
From the Pearson’s Chi-squared test and the empirical mutual results the prediction on tissues is not very perfect.
Change the clustering method into “ward.D2” by setting the method
argument of the hclust()
function. Repeat the above analysis, see if the mutual information is improved or not.
Task 2.4, Mark: 5%
hc <- hclust(dist(t(assay(SE))), method = "ward.D2")
hclusters <- cutree(hc, k=7)
tb <- table(true = SE$Tissue, cluster = hclusters)
tb
## cluster
## true 1 2 3 4 5 6 7
## cerebellum 0 5 31 0 0 2 0
## colon 0 0 0 34 0 0 0
## endometrium 15 0 0 0 0 0 0
## hippocampus 0 31 0 0 0 0 0
## kidney 37 0 0 0 0 2 0
## liver 0 0 0 0 24 2 0
## placenta 0 0 0 0 0 0 6
DescTools::MutInf(SE$Tissue, hclusters)
## [1] 2.250404
Then, we will use a simple but most popular feature extraction technique: keeping only a small number of genes that have the highest variance. This time, we will keep the 85 rows with the highest row variances.
You could use the function rowVar()
defined in the package geneFilter
to calculate the variances for genes. Save the index of the selected rows into a variable called idx
. Using the method ward.D2, conduct hierarchical clustering on columns with the rows subset by idx
. Please draw the dendrogram and evaluate the mutual information again.
Subsequently, run the prepared code below to plot a heatmap.
Task 3.1, Mark: 8%
library(genefilter)
rv <- rowVars(e)
idx <- order(-rv)[1:85]
d <- dist(t(assay(SE)))
hc <- hclust(d, method = "ward.D2")
hclusters <- cutree(hc, k=7)
tb <- table(true = SE$Tissue, cluster = hclusters)
tb
## cluster
## true 1 2 3 4 5 6 7
## cerebellum 0 5 31 0 0 2 0
## colon 0 0 0 34 0 0 0
## endometrium 15 0 0 0 0 0 0
## hippocampus 0 31 0 0 0 0 0
## kidney 37 0 0 0 0 2 0
## liver 0 0 0 0 24 2 0
## placenta 0 0 0 0 0 0 6
DescTools::MutInf(SE$Tissue, hclusters)
## [1] 2.250404
## ===== Your code is finished =============== ##
library(pheatmap)
library(RColorBrewer)
annotation_col = data.frame(
Tissue = SE$Tissue
)
rownames(annotation_col) = colnames(SE)
hmcol <- colorRampPalette(rev(brewer.pal(n = 9, name = "GnBu")))(100)
cols <- palette(brewer.pal(8, "Dark2"))[as.fumeric(as.character(unique(SE$Tissue)))]
names(cols) <- unique(SE$Tissue)
pheatmap(assay(SE)[idx,],
color = hmcol,
show_rownames=FALSE,
show_colnames=FALSE,
annotation_col=annotation_col,
annotation_colors = list(Tissue = cols),
scale = "none",
clustering_method="ward.D2",
clustering_distance_cols="euclidean")
Question 3.2, Mark: 7%
Question 3.2: Based on the mutual information and the heatmap, for this time, could we better classify the tissues with only 85 genes? Try to explain why we can obtain the observed performance using less than 0.5% of the total genes?
Answer:
Yes, we could better classify the tissues with only 85 genes. Because this time we calculate the 85 rows with the highest row variances and then analysis it.
Question 3.3, Mark: 7%
Question 3.3: Will we overestimate the performance if we exhaustively search all possible subsets of genes? Please provide your reasons. If the answer is yes, what strategy could be used to deal with this issue?
Answer:
Yes, there might be some batch effect in the whole dataset. What we can do to solve this is that we can do some batch effect correction like Hormony.
The quality of the clustering partition can be determined by the average Silhouette values for each cluster.
The silhouette values quantify how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
In the following steps, we need to calculate the Silhouette values of gene clustering. We will use the pam()
function defined in cluster
package to perform this analysis. The function will run K-medoids clustering on rows of the input matrix. Set k = 2 so that we will have 2 gene clusters at first. Please use the same 85 rows that have been selected in the last question. Save the output of pam()
into a variable named by pamclu
.
Task 4.1, Mark: 8%
set.seed(101)
library(cluster)
pamclu = pam(assay(SE)[idx,], k = 2) #Set k = 2, make sure to subset with idx first!
## ===== Your code is finished =============== ##
plot(silhouette(pamclu), main=NULL)
Then, we will try to compute the average silhouette values using different ks. Fill in the middle of the code chunk below to calculate the summarized silhouette statistics for each k. Remember that we need a FOR loop to do this kind of procedure.
Ks = sapply(2:7, function(i) summary(silhouette(pam(assay(SE)[idx,], k = i)))$avg.width)
## ===== Your code is finished =============== ##
plot(2:7,Ks,xlab="k",ylab="av.silhouette",type="b",pch=19)
Question 4.2, Mark: 7%
Question 4.2: When k = 2, is every single gene well justified in their own clusteitrs? Within our tested ks, what is the optimal number of gene modules? What is the least likely number? Please explain the reason.
Answer:
No, it can’t well justify. K=3 is the best option. K=5 is the least likely number. Because the av.silhouette is lowest and according to the theory of silhouette values(a high value indicates that the object is well matched to its own cluster, vise versa), all shows that K=5 is the least likely number in 2~7.
Run hierachical clustering on the 85 genes using method ward.D2, cut the tree using the optimized k derived from the previous step. Store the output of cutree()
in a variable named by genecluster
.
Task 5.1, Mark: 8%
d <- dist(assay(SE)[idx,]) #make sure to subset with idx first!
hc <- hclust(d, method = "ward.D2")
genecluster <- cutree(hc, k=3)
## ===== Your code is finished =============== ##
sapply(unique(genecluster), function(i) writeLines(names(genecluster)[genecluster == i], paste0("List_",i,".txt")) )
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
writeLines(names(genecluster), "Background.txt")
After running the last 3 lines of code, you will obtain 3 sets of probe ids saved under the practical directory. Next, go to the website of DAVID.
Do the following steps:
Click Start Analysis >> Upload >> Copy & Paste genes in List_i.txt >> Select Gene List
>> Submit list
Upload >> Copy & Paste genes in Background.txt >> select Background
>> Submit
click List >> Select List_i >> Use >> Click Background >> Select Background_1 >> Use
Clear all >> Pathways >> KEGG_PATHWAY
Functional annotation chart >> Functional annotation chart
Repeat step 3, 4, and 5 for all the List_is
You can find the explaination for the functional enrichment results at here
Question 5.2, Mark: 15%
Question 5.2: Report the top five KEGG terms enriched (if existed) for each cluster. Then, explain what are the columns of the P-value
and Benjamini
means, and how could we judge each term are statistically significant or not. In combination with the heatmap and the tissue information, please try to present some biological explanations for the enriched terms.
Answer:
List_1: Complement and coagulation cascades, Metabolic pathways
List_2: NA
List_3: Protein digestion and absorption, Amoebiasis, Relaxin signaling pathway, AGE-RAGE signaling pathway in diabetic complications
P-value
means the probability of obtaining a test statistic equal or more extreme than the observed one, given that the null hypothesis is true.
Benjamini
means the false discovery rate (FDR) correction which is the expected proportion of false positives among all significant results.
Kegg pathway anlysis is considered significant if its P-value is below the threshold (0.05) and its FDR-corrected P-value (Benjamini measure) is also below the threshold (0.05).
In combination with the heatmap and the tissue information, I found that the AGE-RAGE signaling pathway was significantly associated with the kidney and liver. AGEs mediated by RAGE can induce oxidative stress and chronic inflammation in diabetic and aging kidneys. Related studies have also shown that the binding strength of RAGE ligand in the kidney and liver increases with age.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] cluster_2.1.0 RColorBrewer_1.1-2
## [3] pheatmap_1.0.12 genefilter_1.71.0
## [5] rafalib_1.0.0 SummarizedExperiment_1.19.2
## [7] DelayedArray_0.15.1 matrixStats_0.56.0
## [9] Biobase_2.49.0 GenomicRanges_1.41.1
## [11] GenomeInfoDb_1.25.0 IRanges_2.23.4
## [13] S4Vectors_0.27.5 BiocGenerics_0.35.2
##
## loaded via a namespace (and not attached):
## [1] xfun_0.13 splines_4.0.3 lattice_0.20-41
## [4] colorspace_1.4-1 vctrs_0.3.0 expm_0.999-6
## [7] htmltools_0.4.0 yaml_2.2.1 blob_1.2.1
## [10] XML_3.99-0.3 survival_3.2-7 rlang_0.4.6
## [13] e1071_1.7-3 DBI_1.1.0 bit64_0.9-7
## [16] lifecycle_0.2.0 GenomeInfoDbData_1.2.3 rootSolve_1.8.2.1
## [19] stringr_1.4.0 zlibbioc_1.35.0 munsell_0.5.0
## [22] gtable_0.3.0 mvtnorm_1.1-0 evaluate_0.14
## [25] memoise_1.1.0 knitr_1.28 lmom_2.8
## [28] class_7.3-17 AnnotationDbi_1.51.0 Rcpp_1.0.4.6
## [31] xtable_1.8-4 scales_1.1.1 annotate_1.67.0
## [34] XVector_0.29.0 farver_2.0.3 bit_1.1-15.2
## [37] gld_2.6.2 Exact_2.1 digest_0.6.25
## [40] stringi_1.4.6 grid_4.0.3 tools_4.0.3
## [43] bitops_1.0-6 magrittr_1.5 DescTools_0.99.40
## [46] RCurl_1.98-1.2 RSQLite_2.2.0 MASS_7.3-53
## [49] Matrix_1.2-18 rmarkdown_2.1 rstudioapi_0.11
## [52] R6_2.4.1 boot_1.3-25 compiler_4.0.3