Introduction to section 1

In the first section of this practical, we will implement various dimensional reduction methods on a scRNA-Seq data set. The sequencing data comes from cells of mouse neocortex, which is a brain region governing higher-level cognitive and perceptual functions. The data set is originally produced by the Allen Institute.

The gene expression assay is saved in the RDS file of scRNA_se.rds under the current directory.

The code chunk below read the RDS into R and examine its contents. We could see that the data is a SummarizedExperiment object having 45768 genes and 511 samples (cells). The expression array can be accessed by assays(scRNA_se)[["RPKM"]], and the ground truth label of the cell types can be retrieved by scRNA_se$cell_type.

library(SummarizedExperiment)
scRNA_se <- readRDS("scRNA_se.rds")
scRNA_se 
## class: SummarizedExperiment 
## dim: 45768 511 
## metadata(0):
## assays(1): RPKM
## rownames: NULL
## rowData names(1): gene_id
## colnames: NULL
## colData names(1): cell_type
table(scRNA_se$cell_type) #Biological conditions (experimental design)
## 
## CellType_1 CellType_2 CellType_3 CellType_4 CellType_5 
##         68         68        108        215         52
str(rowData(scRNA_se)$gene_id) #Gene ids
##  chr [1:45768] "0610005C13Rik" "0610006L08Rik" "0610007P14Rik" ...
str(assays(scRNA_se)[["RPKM"]]) #Expression array (in RPKM)
##  num [1:45768, 1:511] 0 0 134 165 0 ...
table(colMeans(assays(scRNA_se)[["RPKM"]])) #Check the column averages 
## 
## 21.8493270407271 
##              511

1-1. Visualizing columns with principal component analysis

We will first visualize the cellular gene expression profiles by plotting the 1st and 2nd PCs of the data.

Task 1.1.1, Mark: 10%: Transform the expression values by log(RPKM+1) and calculate the PCs using R functions cov() and eigen(). Save the results of eigen() into an object named by decomp.

Hint: cov() can calculate the empirical covariance matrix between rows of a matrix input. eigen() can compute the eigen-decomposition of a matrix. The output of eigen() contains 2 parts: the eigen vector matrix (columns are PCs/factors) and the eigen values (variance explained along the PCs). The eigen vector matrix can be accessed by $vectors, and the eigen values can be accessed by $values.

decomp = eigen(cov(log(assays(scRNA_se)[["RPKM"]]+1))) #Implement eigen-decomposition on covariance matrix of log(RPKM+1)
dim(decomp$vectors) #Check dimensionalities of W
## [1] 511 511
PC12_df <- data.frame(PC1=decomp$vectors[,1],
                      PC2=decomp$vectors[,2]) #Make a data.frame of PC1 and PC2
ggplot(cbind(PC12_df,label=scRNA_se$cell_type),
         aes(x=PC1,y=PC2,colour=label))+
         geom_point()+
         theme_classic()+
         scale_color_brewer(palette = "Spectral")+
         labs(title = "PCA of log(RPKM+1)")

Note that the empirical covariance matrix along is sufficient to calculate all the principal components. Hence, PCA can be understood as an algorithm to summarize the covariance structure between data dimensions using a linear model.

cov_matrix <- cov(log(assays(scRNA_se)[["RPKM"]]+1))
sample_ids <- seq_len(ncol(scRNA_se))
image(cov_matrix, x = sample_ids, y = sample_ids) #The covariance structure between columns (samples)

Next, we will try to define performance measures for the dimensional reduction outcome.

Task 1.1.2, Mark: 5% First, run K means clustering (K=5) on the matrix of the top 2 PCs (already stored in the variable PC12_df) with function kmeans(). Subsequently, as we have done in the lab practical 2, compare the clustering result with the ground truth label in scRNA_se$cell_type using metrics of mutual information and chi-squared test statistics.

set.seed(123)
partition <- kmeans(PC12_df, centers = 5)$cluster #Save the clustering results
table(partition, scRNA_se$cell_type) #Contingency table
##          
## partition CellType_1 CellType_2 CellType_3 CellType_4 CellType_5
##         1          0          0         22        182          0
##         2          4         67          0          0          0
##         3          0          0         86         33          0
##         4          0          0          0          0         50
##         5         64          1          0          0          2
DescTools::MutInf(partition, scRNA_se$cell_type) #Mutual information
## [1] 1.630551
chisq.test(partition, scRNA_se$cell_type) #Chi-squared statistics
## 
##  Pearson's Chi-squared test
## 
## data:  partition and scRNA_se$cell_type
## X-squared = 1630.2, df = 16, p-value < 2.2e-16

The obtained clustering partitions are highly correlated with the true cell categories. However, the cell type 3 & 4 are still not successfully identified from the PCs using K-means.

1-2. The effects of normalization and transformation on PCA results

In this section, we will examine the effect of scaling on PCA.

The log transformation was previously applied, as it is a natural transformation for count data.

Task 1.2.1, Mark: 5% Run PCA without log transformation and compute the mutual information on the K-means clustering result as we have done in Task 1.3.

decomp2 = eigen(cov(assays(scRNA_se)[["RPKM"]])) #Run PCA directly on RPKM
PC12_df <- data.frame(PC1=decomp2$vectors[,1],
                      PC2=decomp2$vectors[,2]) 
ggplot(cbind(PC12_df,label=scRNA_se$cell_type),
         aes(x=PC1,y=PC2,colour=label))+
         geom_point()+
         theme_classic()+
         scale_color_brewer(palette = "Spectral")+
         labs(title = "PCA of RPKM")

set.seed(123)
partition <- kmeans(PC12_df, centers = 5)$cluster
DescTools::MutInf(partition, scRNA_se$cell_type)
## [1] 0.5451725

Without log transformation, the dimensional reduction outcome is compromised immediately.

Task 1.2.2, Mark: 5% Run PCA on log(RPKM+1), and replace the empirical covariance matrix with the matrix of Pearson correlation (called by R function cor()). Again, compute the mutual information on the clustering result as we have done previously.

decomp3 = eigen(cor(log(assays(scRNA_se)[["RPKM"]]+1))) #Run normalized PCA on log(RPKM+1)
PC12_df <- data.frame(PC1=decomp3$vectors[,1],
                      PC2=decomp3$vectors[,2]) 
ggplot(cbind(PC12_df,label=scRNA_se$cell_type),
         aes(x=PC1,y=PC2,colour=label))+
         geom_point()+
         theme_classic()+
         scale_color_brewer(palette = "Spectral")+
         labs(title = "PCA of log(RPKM+1) using correlation")

set.seed(123)
partition <- kmeans(PC12_df, centers = 5)$cluster
DescTools::MutInf(partition, scRNA_se$cell_type)
## [1] 1.676422

The PCA using Pearson correlation matrix is equivalent to a regular PCA after re-scaling the columns into unit variances (by dividing column standard deviations), such normalization is important when the columns are in different scales since the dimensions with very high variances can dominate the pattern in the empirical covariance matrix.

In our case of scRNA-Seq, the column standardization is less significant, since the samples are in the same scale (all RPKMs). Also, the variances of columns may contain biological information related to the cell lines, which should be estimated as the factors.

Similarly, the PCA can be performed on the row centered data (with X-rowMeans(X)), such normalization approach can estimate the factors which are independent of the average of the observations. In general, one should always be aware of the interpretation of the factors/PCs when designing the normalization procedures before running the PCA algorithm.

1-3. Dimensional reduction by tSNE and UMAP

There are non-linear dimensional reduction techniques that generally work better than PCA for non UMI(unique molecular indexed) scRNA-Seq. We will try two of the most commonly used non-linear embedding methods: tSNE and UMAP.

Task 1.3.1, Mark: 5% Run tSNE on log(RPKM+1) and compute the mutual information as it is done by the previous steps.

Hint: call tSNE with tsne() in the bioconductor package M3C, the input for this function is the expression assay. Save the output of tsne() in tsne_results.

library(M3C)
set.seed(123)
tsne_results <- tsne(log(assays(scRNA_se)[["RPKM"]]+1))
ggplot(cbind(tsne_results$data,
             label=scRNA_se$cell_type),
       aes(x=X1,y=X2,colour=label))+
       geom_point()+
       theme_classic()+
       scale_color_brewer(palette="Spectral")+
       labs(title = "tSNE on log(RPKM+1)")

set.seed(123)
partition <- kmeans(tsne_results$data, centers = 5)$cluster
DescTools::MutInf(partition, scRNA_se$cell_type)
## [1] 1.864341

Task 1.3.2, Mark: 5% Run UMAP on log(RPKM+1) and compute the mutual information as it is done by the previous steps.

Hint: use umap() from package M3C.

set.seed(123)
umap_results <- umap(log(assays(scRNA_se)[["RPKM"]]+1))
ggplot(cbind(umap_results$data,
             label=scRNA_se$cell_type),
       aes(x=X1,y=X2,colour=label))+
       geom_point()+
       theme_classic()+
       scale_color_brewer(palette="Spectral")+
       labs(title = "UMAP on log(RPKM+1)")

set.seed(123)
partition <- kmeans(umap_results$data, centers = 5)$cluster
DescTools::MutInf(partition, scRNA_se$cell_type)
## [1] 1.986195

We could see that both of the non-linear embedding methods can successfully differentiate most of the cell types defined by the ground truth labels.

Introduction to section 2

The objective of the 2nd section is to motivate and illustrate the usage of DESeq2, a powerful software for differential gene expression analysis (DGEA). We will first explore the statistical properties of NGS count data using an RNA-Seq data set that has 48 biological replicates under 2 biological conditions sequenced by the following study:

Schurch, N. J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A., Singh, V., … & Barton, G. J. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?. Rna, 22(6), 839-851.

The purpose for designing such data set in the paper is to investigate the relationship between the # of biological replicates and the performances of different testing methods for DGEA. The results of their analysis supports the usage of DESeq2 and DESeq in general situations.

The data set is stored in the ERX425102_se.rds file under the practical folder, which is a SummarizedExperiment object containing the count matrix and the experimental design.

library(SummarizedExperiment)
ERX425102_se <- readRDS("ERX425102_se.rds")
colData(ERX425102_se) #Sample information, 2 biological conditions with 48 biological replicates for each condition
## DataFrame with 96 rows and 2 columns
##     condition   Replicate
##      <factor> <character>
## 1        Snf2       Rep_1
## 2        Snf2       Rep_2
## 3        Snf2       Rep_3
## 4        Snf2       Rep_4
## 5        Snf2       Rep_5
## ...       ...         ...
## 92         WT      Rep_44
## 93         WT      Rep_45
## 94         WT      Rep_46
## 95         WT      Rep_47
## 96         WT      Rep_48
ERX425102_se #Gene expression count over 7131 genes (rows) and 96 samples (columns)
## class: SummarizedExperiment 
## dim: 7131 96 
## metadata(0):
## assays(1): ''
## rownames(7131): 15S_rRNA 21S_rRNA ... not_aligned alignment_not_unique
## rowData names(0):
## colnames: NULL
## colData names(2): condition Replicate

2-1. Data distribution over biological replicates

Since the observations generated by biological replication can be viewed as i.i.d. samples coming from a particular distribution. In this part, we will try to fit the data with 2 commonly used distribution families for counts.

The count values for gene LSR1 under WT condition are extracted and stored in the variable x.

indx_WT <- colData(ERX425102_se)$condition == "WT" 
x <- assay(ERX425102_se)["LSR1",indx_WT]
x #The count for gene LSR1 over 48 biological replicates of condition WT
##  [1]  60 163 233 163 193 375 194  84 211 158  65 138 199 131  99 314 206 411 116
## [20] 465 194 424 226 139 277 401 201 209 153 177 265  80 264 263 223 223 100  55
## [39]  66 385 232 149 114  81 109 132 243 128

Function fitdistr() in package MASS can fit MLE for commonly used univariate parametric distributions. For example, the code chunk bellow fits the MLE on x using Poisson distribution. The log likelihood of the fit can be accessed by logLik().

library(MASS)
set.seed(123)
ff <- fitdistr(x, "Poisson") #MLE of Poisson
ff
##     lambda  
##   197.72917 
##  (  2.02962)
logLik(ff) #log likelihood of Poisson fit
## 'log Lik.' -1382.263 (df=1)
  • Task 2.1.1, Mark: 5%: Within the code chunk bellow, fit the univariate negative binomial model on x; access its MLE and log likelihood as it is done in the Poisson case. Save your distribution fit into the variable name of ff2.
ff2 <- fitdistr(x, "negative binomial")
ff2
##       size           mu     
##     3.9015361   197.7291667 
##  (  0.7821531) ( 14.5906488)
logLik(ff2)
## 'log Lik.' -285.2799 (df=2)
  • Question 2.1.2, Mark: 5%: According to the likelihood scores of the two models, which distribution is more preferred?
  • Your answer: The univariate negative binomial model is better.
values <- 0:600
plot_df <- data.frame(prob = c(dnbinom(values,size=ff2$estimate[1],mu = ff2$estimate[2]),dpois(values,lambda=ff$estimate)),
                      values = rep(values,2),
                      model_fit = rep(c("NB","Poisson"),each=601))

library(ggplot2)
ggplot(plot_df) + geom_line(aes(x=values,y=prob,color=model_fit)) + theme_classic() + 
  scale_color_manual(values=c("blue","red")) + ylab("p(x)") + xlab("x") +
  geom_point(aes(x=obs,y=y), data=data.frame(obs=x,y=0), shape = 3)

Comment: The over-dispersion properties of NB can effectively capture both the biological and technical variations between replicates, while Poisson distribution can only account for the fragment sampling noise of the NGS library (shoot noise).

2-2. Two sample t-test on the difference bewteen two group means

In the following exercises, we will examine the difference between the mean expression levels of the two biological conditions.

After drawing a boxplot for the expression levels of gene “YGR107W” under the two conditions, we could see that their means are very likely to be significantly different.

count_YGR107W <- data.frame(count = assay(ERX425102_se)["YGR107W",],
                            condition = ERX425102_se$condition)
count_YGR107W$condition <- factor(count_YGR107W$condition, levels = c("WT","Snf2")) #Set "WT" to the reference level

ggplot(count_YGR107W, aes(x=condition,y=log(count+1)))+geom_boxplot(outlier.alpha = 0, width = 0.5)+
  geom_jitter(aes(x=condition,y=log(count+1),colour=condition),width = 0.1, alpha=0.8)+theme_bw()+
  scale_colour_brewer(palette = "Set2")

  • Task 2.2.1, Mark: 5% Within the code chunk bellow, compute the sample mean of log(count + 1) under condition WT, then compute the difference between the sample means of Snf2 and WT.
WT <- log(assay(ERX425102_se)["YGR107W",][ERX425102_se$condition=="WT"] + 1)
Snf2 <- log(assay(ERX425102_se)["YGR107W",][ERX425102_se$condition=="Snf2"] + 1)
mean(WT)
## [1] 2.057573
mean(Snf2) - mean(WT)
## [1] 1.533114

Next, we will check the statistical significance of the difference using t-test.

  • Task 2.2.2, Mark: 5% Perform a 2 sample t-test with the function t.test() on the difference between 2 sample means.
t.test(WT, Snf2, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  WT and Snf2
## t = -21.475, df = 91.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.674908 -1.391321
## sample estimates:
## mean of x mean of y 
##  2.057573  3.590687
  • Question 2.2.3, Mark: 5%: Judging from the p-value, is the difference in sample means significant? What is the effect size of the difference?
  • Your answer: Yes, P-value is smaller than 0.05, so the difference in sample means is obviously significant. The effect size of the difference is -0.1753747.

2-3. Linear modeling on the difference bewteen two group means

Next, we will use an alternative approach than t-test to evaluate differential expression. We can fit a linear regression model on log(count + 1) for gene YGR107W; the covariate should be a dummy variable indicating the conditions, with the level “WT” set as the reference group (=0).

\[\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{96} \\ \end{pmatrix} =\begin{pmatrix} 1 & 0 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 1 \\ \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \end{pmatrix} +\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{96} \\ \end{pmatrix}=\beta_0\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \\ \end{pmatrix}\] As it is shown in the matrix notation above, the response variable (y) is the vector of log(count+1) for all the replicates of the gene YGR107W. The covariates (X) is a design matrix for testing the difference between 2 sample means, the first column of the design matrix is a vector of 1 for intercept, and the second column of the design matrix is an indicator variable for biological condition (1 if condition == “Snf2”, and 0 otherwise).

  • Task 2.3.1, Mark 10%: With the linear regression model matrix stored in count_YGR107W, fit a linear regression with lm() using the design described as above. Store the fitted regression model (output of lm()) intolm_fit.
#Use a linear regression model
lm_fit <- lm(formula = log(count + 1) ~ condition, data = count_YGR107W)
lm_fit #See the fitted model coefficients
## 
## Call:
## lm(formula = log(count + 1) ~ condition, data = count_YGR107W)
## 
## Coefficients:
##   (Intercept)  conditionSnf2  
##         2.058          1.533

We could see that the fitted coefficients for \(\beta_0\) and \(\beta_1\) are 2.058 and 1.533, respectively.

As it is demonstrated in the diagram above, under this regression design, \(\hat \beta_0\) captures the average expression level of WT (the reference), and \(\hat \beta_1\) captures the difference between the average gene expression levels.

  • Task 2.3.2, Mark 5% Perform linear model testing with the function summary().
summary(lm_fit)
## 
## Call:
## lm(formula = log(count + 1) ~ condition, data = count_YGR107W)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9590 -0.2234  0.0469  0.2450  0.6860 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.05757    0.05048   40.76   <2e-16 ***
## conditionSnf2  1.53311    0.07139   21.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3497 on 94 degrees of freedom
## Multiple R-squared:  0.8307, Adjusted R-squared:  0.8289 
## F-statistic: 461.2 on 1 and 94 DF,  p-value: < 2.2e-16
  • Question 2.3.3, Mark 5%: The p-value on which coefficient can be interpreted like the two sample t-test?
  • Your answer: p-value < 2.2e-16

Sequencing count data can be modeled using a Poisson generalized linear model (GLM) without the need for log transformation.

  • Task 2.3.4, Mark 5%:

In the following code chunk, you will fit a Poisson GLM using glm() to model the count data directly, using the same design matrix as in task 2.3.1. The response variable (y) will now be the count data without log transformation. After fitting the model, compute the log fold change (LFC) between the means of the two groups. In the end, check the relationship between the estimated coefficient of the grouping variable (\(\hat{\beta_1}\)) and the log fold change (LFC).

glm_fit <- glm(formula = count ~ condition, family = "poisson", data = count_YGR107W) #Fit a Poisson GLM with log link
summary(glm_fit) #Checking GLM coefficients and statistics
## 
## Call:
## glm(formula = count ~ condition, family = "poisson", data = count_YGR107W)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6913  -1.0250  -0.0103   0.6753   4.9432  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.99527    0.05322   37.49   <2e-16 ***
## conditionSnf2  1.61734    0.05827   27.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1273.99  on 95  degrees of freedom
## Residual deviance:  232.07  on 94  degrees of freedom
## AIC: 676.55
## 
## Number of Fisher Scoring iterations: 4
log(mean(assay(ERX425102_se)["YGR107W",][ERX425102_se$condition=="Snf2"]))-log(mean(assay(ERX425102_se)["YGR107W",][ERX425102_se$condition=="WT"])) #Calculate the log fold change of 2 group means
## [1] 1.617339

2-4. Fitting many NB GLMs at once using DESeq2

Performing differential expression analysis for many genes using R can be challenging due to two main issues related to the t.test() and glm() functions.

The t.test() function performs well only when there are at least 30 replicates for each group (for CLT to work out) or the mean counts of the gene are over 20 in all replicates (where log transformed NB distributed variables converges to Gaussian variables). However, meeting these conditions is difficult in practice. Additionally, running t.test() using a for loop over thousands of genes is time-consuming.

The glm() function also has several issues. The Poisson GLM cannot account for the over-dispersed nature of biological replicates, and estimating over-dispersion using NB GLM is inaccurate with limited replicates. Furthermore, running glm() using a for loop over thousands of genes is also slow.

To overcome these challenges, the DESeq2 R package provides an excellent solution. It has a vectorized function that can fit multiple NB GLMs at once using its core code written in C++. Additionally, DESeq2 implements a regularized method for estimating dispersion parameters by sharing information over all genes, which generally improves the accuracy of tests.

  • Task 2.4.1, Mark: 10% Conduct DGEA for all genes using the standard analysis pipeline implemented in DESeq2. Display the resulting test statistics.

Hint: see ?DESeq2::DESeq for the usage of its core function, and use browseVignettes("DESeq2") to look at the user’s guide.

library(DESeq2)
dds <- DESeqDataSet(se = ERX425102_se, design = ~ condition)
## renaming the first element in assays to 'counts'
dds <- DESeq(dds) #Fit n NB GLMs, where n = the # of genes
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 7 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
res <- results(dds) #Access to the results of the analysis
res
## log2 fold change (MLE): condition Snf2 vs WT 
## Wald test p-value: condition Snf2 vs WT 
## DataFrame with 7131 rows and 6 columns
##                       baseMean log2FoldChange     lfcSE      stat      pvalue
##                      <numeric>      <numeric> <numeric> <numeric>   <numeric>
## 15S_rRNA              18.33566       0.167494 0.3358680  0.498689 6.17998e-01
## 21S_rRNA             107.32079      -0.093113 0.2781711 -0.334733 7.37827e-01
## HRA1                   2.52619      -0.792982 0.2226176 -3.562082 3.67926e-04
## ICR1                 141.57494       0.215399 0.0370318  5.816605 6.00550e-09
## LSR1                 207.52279      -0.136647 0.1581332 -0.864123 3.87520e-01
## ...                        ...            ...       ...       ...         ...
## no_feature              681521      -0.186217 0.0470641  -3.95667 7.60021e-05
## ambiguous               899512      -0.389648 0.1455694  -2.67672 7.43475e-03
## too_low_aQual                0             NA        NA        NA          NA
## not_aligned                  0             NA        NA        NA          NA
## alignment_not_unique         0             NA        NA        NA          NA
##                             padj
##                        <numeric>
## 15S_rRNA             6.73744e-01
## 21S_rRNA             7.82652e-01
## HRA1                 6.67187e-04
## ICR1                 1.62434e-08
## LSR1                 4.49962e-01
## ...                          ...
## no_feature           0.000146702
## ambiguous            0.011574683
## too_low_aQual                 NA
## not_aligned                   NA
## alignment_not_unique          NA
  • Question 2.3.2, Mark: 5% How many genes are significantly deferentially expressed between the 2 conditions under the criteria of FDR < 0.05? Which additional column in the output can be used to select a subset of more biologically significant DEGs among the list of statistically significant candidates?

  • Answer: There are 2174 genes with an adjusted p-value less than 0.05, which can be considered as significant DEGs. padj column can be used to select a subset of more biologically significant DEGs among the list of statistically significant candidates

Hint: When controlling FDR, directly use the column padj in the DESeq2 output, which is the adjusted p-value using the Benjamini Hochberg approach.

Session info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] DESeq2_1.29.0               MASS_7.3-53                
##  [3] M3C_1.12.0                  SummarizedExperiment_1.19.2
##  [5] DelayedArray_0.15.1         matrixStats_0.56.0         
##  [7] Biobase_2.49.0              GenomicRanges_1.41.1       
##  [9] GenomeInfoDb_1.25.0         IRanges_2.23.4             
## [11] S4Vectors_0.27.5            BiocGenerics_0.35.2        
## [13] ggplot2_3.3.0              
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6           bit64_0.9-7            doParallel_1.0.16     
##  [4] RColorBrewer_1.1-2     tools_4.0.3            R6_2.4.1              
##  [7] DBI_1.1.0              colorspace_1.4-1       withr_2.2.0           
## [10] tidyselect_1.1.0       Exact_2.1              bit_1.1-15.2          
## [13] compiler_4.0.3         expm_0.999-6           labeling_0.3          
## [16] scales_1.1.1           mvtnorm_1.1-0          genefilter_1.71.0     
## [19] askpass_1.1            stringr_1.4.0          digest_0.6.25         
## [22] rmarkdown_2.1          XVector_0.29.0         pkgconfig_2.0.3       
## [25] htmltools_0.4.0        umap_0.2.7.0           rlang_0.4.6           
## [28] rstudioapi_0.11        RSQLite_2.2.0          farver_2.0.3          
## [31] jsonlite_1.6.1         BiocParallel_1.23.0    dplyr_0.8.5           
## [34] RCurl_1.98-1.2         magrittr_1.5           GenomeInfoDbData_1.2.3
## [37] Matrix_1.2-18          Rcpp_1.0.4.6           DescTools_0.99.40     
## [40] munsell_0.5.0          reticulate_1.20-9000   lifecycle_0.2.0       
## [43] stringi_1.4.6          yaml_2.2.1             rootSolve_1.8.2.1     
## [46] zlibbioc_1.35.0        Rtsne_0.15             matrixcalc_1.0-3      
## [49] grid_4.0.3             blob_1.2.1             crayon_1.3.4          
## [52] doSNOW_1.0.19          lmom_2.8               lattice_0.20-41       
## [55] splines_4.0.3          annotate_1.67.0        locfit_1.5-9.4        
## [58] knitr_1.28             pillar_1.4.4           boot_1.3-25           
## [61] gld_2.6.2              corpcor_1.6.9          geneplotter_1.67.0    
## [64] codetools_0.2-16       XML_3.99-0.3           glue_1.4.0            
## [67] evaluate_0.14          vctrs_0.3.0            png_0.1-7             
## [70] foreach_1.5.0          gtable_0.3.0           openssl_1.4.1         
## [73] purrr_0.3.4            assertthat_0.2.1       xfun_0.13             
## [76] xtable_1.8-4           e1071_1.7-3            RSpectra_0.16-0       
## [79] survival_3.2-7         class_7.3-17           tibble_3.0.1          
## [82] snow_0.4-3             iterators_1.0.12       memoise_1.1.0         
## [85] AnnotationDbi_1.51.0   cluster_2.1.0          ellipsis_0.3.0