cellranger是处理从 10X Genomics 平台生成的单细胞 RNA-seq 数据的命令行工具。cellranger 由 10X Genomics 构建,专门用于处理从其平台生成的数据,然而有几种开源工具可以与 cellranger 竞争执行这些处理任务,例如 kallisto-bustools。
Background and raw data format
single-cell RNA-seq experiments with 10X Genomics
Here I will very briefly go over the basics of performing a single-cell RNA-seq experiment with 10X Genomics. You can watch a great video from 10X about how it all works here.
Before running the 10X Genomics Chromium machine, we must first create a single cell suspension, essentially a tiny tube filled with nothing but cells. We start with a tissue of interest, in our lab’s case the tissue samples come from the brain. The brain is composed of many different cell types, including neurons, astrocytes, oligodendrocytes, microglia, and several others.
For various reasons, single-nucleus genomic assays are more commonly used to study brain tissue rather than single-cell assays. We can lyse (burst open) all cells in a tissue, remove any extra cellular debris, gather up the nuclei, and perform a successful 10X experiment. DNA lives in the nucleus, which is also the part of the cell where the process of transcription (conversion from DNA to RNA) occurs. RNA splicing occurs in the nucleus as well, which occurs before mature mRNA molecules are trafficked out of the nucleus and into other parts of the cell. Using single-nucleus RNA-seq, or snRNA-seq, requires some additional considerations, especially with regards to analysis pipelines that make assumptions about splicing, such as RNA velocity analysis.
The following figure summarizes the steps in the 10X Genomics experiment following the single-cell or single-nuclei suspension.
The 10X protocol relies on performing micro-reactions inside of tiny bubbles called Gel Bead-in EMulsions (GEMs). Within these droplets, specialized cellular barcodes are ligated (attached) to the cDNA (RNA converted into DNA), so we can bioinformatically resolve which molecule comes from each cell. Additional barcodes called Unique Molecular Identifiers (UMIs) are attached to each cDNA molecule, which are useful in accurately quantifying gene expression. Once we have run the samples through the 10X machine, we end up with barcoded cDNA, which will be used for sequencing library preparation followed by Illumina sequencing.
Illumina Sequencing
Illumina sequencing relies on reading millions of short DNA sequences. If you are interested in exactly how Illumina sequencing works, check out this video. It turns out that from a technical standpoint, it is challenging to accurately read long nucleotide sequences, and thus Illumina sequening is a short-read technology, capturing around 150 nucleotides in a single read. Illumina flow cells have 8 lanes, meaning 8 different samples can be sequenced at the same time. Sequencing samples together in this manner is referred to as sample multiplexing.
Following Illumina sequencing, you end up with a mountain of these short nucleotide sequences, without any sense for where these sequences fit in the genome. At this point we must use sequence alignment algorithms, such as the Burrows-Wheeler transform or the Smith-Waterman algorithm, in order to map these reads to the reference genome. Fortunately for us, there are robust and mature software tools for performing this critical alignment step. There are actually alignment tools tailored specifically for RNA-seq data, such as STAR.
Additionally, Paired-end sequencing is a common technique that increases the accuracy of mapping to the reference genome, utilizing a fixed-length gap in between the two paired reads (summarized in the figure below). Finally, if you are working in a less studied organism that does not have a reference genome, you will have to perform de novo genome assembly.
.fastq
文件格式
Illumina provides a detailed explanation of .fastq
files from Illumina sequencing data here, and I will go through some of the details of this file format below. Let’s look at the first few lines of one of the .fastq
files from one of the snRNA-seq samples using zcat
and head
.
zcat data/AD_NucSeq_2019/SWA_13952_B01_NAN_Lane/Sample-22-GTCGATGC_S56_L003_R2_001.fastq.gz | head
@A00217:98:HJY25DSXX:3:1101:6460:1000 2:N:0:NTCGATGC
CATAAATCAATGGGCTGAAGTTGGAGGAGAGGTTGATTACACAGGAGCATGGGTAATTTTTAGTGTGATGGAATTACCTTAATTTTTGTGATGATTACCC
+
FFFFFFFFFF,FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00217:98:HJY25DSXX:3:1101:8214:1000 2:N:0:NTCGATGC
AGTAGCTGTGTACTTTCTGATCGGAGCTGTTTAAGAGTTACGGCTGGTCCCACCTCACTGGATCTTGATGTTAATGTTCAGAGGACTCTAGACAACAACT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00217:98:HJY25DSXX:3:1101:9733:1000 2:N:0:NTCGATGC
GATTTAACAAATTGTTCAGCATAACACTTCTAATTAAGTTTATCAAGTTGTACTGTATTAGATAATCAGCAGTGTATCTGGAGTATGTTTAAAGAGAACA
In .fastq
format, there are four lines for each read.
- A sequence identifier about the sequencing run and the cluster.
- The actual nucleotide sequence, can be A, C, T, or G to represent the four DNA nucleotides. Additionally, the letter N can appear in the sequence if the base could not be determined.
- A separator, which is just a +.
- Base call quality scores in Phred format, used for quality control software such as fastqc.
Next we will go over how to use cellranger
to perform alignment and gene quantification steps.
Sequence alignment and gene quantification with cellranger count
In this section I will briefly go over how to use cellranger count
to perform sequence alignment and gene quantification, as well as how to interpret the output of this software. Here is a link to the documentation for cellranger count
. Fortunately, this program is pretty easy to use, it just might take a while to run for each sample. I generally run this program as a job array on the HPC, where one job is for each sample.
Running cellranger-count
To load cellranger on the HPC we must first do the following:
module load cellranger
The following is a basic command that I used for cellranger count
. To do this efficiently, you can write a script to submit each sample as a job on the HPC. We must provide a reference transcriptome for cellranger
to map our reads, and this can be downloaded directly from 10X’s website at this link. Take note of whether you need the human reference genome GRCh38
or the mouse reference genome mm10
.
JOBID="Sample-100"
SAMPLE_IDS="Sample-100-GGTCAATA,Sample-100-ATCTTTAG,Sample-100-CAGAGGCC,Sample-100-TCAGCCGT"
TRANSCRIPTOME="/dfs3/swaruplab/smorabit/pipelines/sn-rna-seq/GRCh38.p12.mrna"
FASTQS="/dfs3/swaruplab/smorabit/data/AD_NucSeq_2019/SWA_13952_B01_NAN_Lane"
cellranger count --id=$JOBID \
--transcriptome=$TRANSCRIPTOME \
--fastqs=$FASTQS \
--sample=$SAMPLE_IDS \
--localcores=16 \
--localmem=64 \
--expect-cells=10000
cellranger-count
输出
If everything ran correctly and did not cause any errors, it is time to check on the output files. All output files are dumped into the outs
directory. The following files and directories are found in the outs
directory:
analysis
cloupe.cloupe
filtered_feature_bc_matrix
filtered_feature_bc_matrix.h5
metrics_summary.csv
molecule_info.h5
possorted_genome_bam.bam
possorted_genome_bam.bam.bai
raw_feature_bc_matrix
raw_feature_bc_matrix.h5
web_summary.html
The aligned sequences are stored in the possorted_genome_bam.bam
file, indexed by the possorted_genome_bam.bam.bai
file. .bam
is a standard format for storing aligned nucleotide sequences. It is a compressed version of the .sam
sequence alignment map format. If you are interested in further understanding how aligned sequence data is formatted, look at this comprehensive document.
Next I will dive deeper into some of these output files that will be more directly relevant for downstream analysis. Check the 10X Genomics website for more details about these output files.
分析总结报告解读
The first file that we want to check on is the web_summary.html
file, which provides an overview of both sample quality and a basic analysis. Below I have included a screenshot of this web summary output.