R:integrating and analysis single-cell datasets

Introduction

In this notebook I will go over several integration techniques for single-cell -omics data. These techniques are suitable for a variety of purposes, such as batch correction, cross-species analysis, and multi-modal analysis. Here are some brief explanations of the methods that I will be covering.

  • SCTransform
    • A normalization technique that uses regularized negative binomial regression to construct a corrected counts matrix whereby technical variation has been removed.
  • Harmony
    • Constructs a shared embedding across datasets using an iterative approach in which cells are grouped using soft k-means clustering, and cluster centroids are computed in order to apply dataset correction factors to each cluster.
  • Liger
    • Uses integrative Non-negative Matrix Factorization (i-NMF) to simultaneously integrate datasets and construct a low-dimensional data space. Additionally the Liger package offers factor specific marker gene analysis.
  • Batchelor
    • Uses a mutual-nearest-neighbors approach to apply dataset specific corrections to a low-dimensional data space such as PCA.
  • BBKNN
    • Similar to Harmony and Batchelor, Batch-balanced k-nearest neighbors (BBKNN) algorithm also applies dataset specific corrections to a low-dimensional data space such as PCA.

Load dataset

Before we do anything, we need to load our dataset. I am going to load the outputs of cellranger count rather than cellranger aggr, and combine into one Seurat object. We will then inspect quality metrics and filter out low-quality cells.

library(Seurat)
library(tidyverse)

in_data_dir <- "/home/sam/AD_NucSeq_2019/cellranger_count_pre-mrna/"
fig_dir <- "/home/sam/AD_NucSeq_2019/batch_correction/liger/non-aggregated/figures/"
out_data_dir <- "/home/sam/AD_NucSeq_2019/batch_correction/liger/non-aggregated/data/"

samples <- dir(in_data_dir)

# create seurat object:
seurat_list <- lapply(samples, function(sample){
  cur_data <- Read10X(paste0(in_data_dir,sample,'/outs/filtered_feature_bc_matrix/'))
  cur_seurat <- CreateSeuratObject(
    counts = cur_data,
    min.cells=3,
    min.features=200,
    project='AD2020'
  )
  cur_seurat$SampleID <- sample
  return(cur_seurat)
})

# merge seurat object
seurat_obj <- merge(x=seurat_list[[1]], y=seurat_list[2:length(seurat_list)])

# clean up
rm(seurat_list)
gc()

# add metadata
sample_metadata <- read.csv(file = "/home/sam/AD_NucSeq_2019/batch_correction/liger/non-aggregated/data/metaData_snRNAseq_fix.csv")
rownames(sample_metadata) <- sample_metadata$Sample.ID
cell_metadata <- sample_metadata[seurat_obj$SampleID,]
for(meta in names(cell_metadata)){
  seurat_obj[[meta]] <- cell_metadata[[meta]]
}

Plotting settings:

library(cowplot)
theme_set(theme_cowplot())
library(RColorBrewer)
library(viridis)

umap_theme <- theme(
  axis.line=element_blank(),
  axis.text.x=element_blank(),
  axis.text.y=element_blank(),
  axis.ticks=element_blank(),
  axis.title.x=element_blank(),
  axis.title.y=element_blank(),
  panel.background=element_blank(),
  panel.border=element_blank(),
  panel.grid.major=element_blank(),
  panel.grid.minor=element_blank(),
  plot.background=element_blank()
)

Check quality metrics and remove low-quality cells

Next, we inspect cell quality metrics such as the percentage of reads mapping to mitochondrial genes, the number of genes detected, and the total number of UMIs detected. Low quality cells are removed based on these criteria. Here you have some freedom to be strict or loose with your filtering criteria, and you have to be careful at this step since these criteria are highly dependent on your particular experiment, which is why we always inspect these distributions before deciding on quality cutoffs.

seurat_obj[["percent.mt"]] <- PercentageFeatureSet(object = seurat_obj, pattern = "^MT-")

pdf(paste0(fig_dir, "qc_violin_plot.pdf"), width=10, height=10)
# png(paste0(fig_dir, "pngs/qc_violin_plot.png"), width=10, height=10, res=250, units='in')
VlnPlot(seurat_obj, group.by="SampleID", features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 2, pt.size=0, )
dev.off()

seurat_obj <- subset(seurat_obj, nFeature_RNA > 200 & nFeature_RNA < 10000 & percent.mt < 10)
seurat_obj <- subset(seurat_obj, SampleID != 'Sample-101')

SCTransform

The first integration technique that I will demonstrate is SCTransform. Out of all of the integration algorithms that I will cover, this is the only one that augments the actual UMI counts matrix, while the others all operate using low-dimensional representations. In practice, SCTransform on its own does not always completely eliminate batch effects, but it works nicely in conjunction with the some of the other integration methods. This analysis is outlined in the following steps:

  • Run SCTransform to normalize UMI counts
  • Reduce deminsionality using PCA & UMAP
  • Clustering

As a warning, I found that SCTransform uses a tremendous amount of memory. On our lab server Zeus, I used over 100 GB of memory to run SCTransform on our 60k cell dataset.

Normalize data using SCTransform

The following block of code shows how to use SCTransform to correct the UMI matrix based on percent.mt and Library.Batch variables.

library(sctransform)

seurat_obj <- SCTransform(
  seurat_obj,
  vars.to.regress = c('percent.mt', 'Library.Batch'),
  verbose=TRUE
)
gc()

saveRDS(seurat_obj, file=paste0(out_data_dir, 'seurat_object_sct.rds'))

Data processing

In the following block of code, we process the normalized counts matrix using PCA, UMAP, and clustering. We will visualize the top genes in the first 12 PCs to demonstrate their ability to separate cells in the data. Additionally we will visualize the standard deviation of the top 20 PCs.

# compute PCA:
seurat_obj <- RunPCA(seurat_obj)

# plot pca heatmap
png(paste0(fig_dir, "pngs/pca_heatmap.png"), width=10, height=10, res=300, units='in')
DimHeatmap(seurat_obj, dims = 1:12, cells = 500, balanced = TRUE)
dev.off()

# plot variance explained by PC:
png(paste0(fig_dir, "pngs/pca_elbow_plot.png"), width=5, height=3, res=300, units='in')
ElbowPlot(seurat_obj)
dev.off()

# UMAP and clustering with top PCs
seurat_obj <- RunUMAP(seurat_obj, reduction='pca', dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, reduction='pca')
seurat_obj <- FindClusters(seurat_obj, resolution = 0.3)