cellranger是处理从 10X Genomics 平台生成的单细胞 RNA-seq 数据的命令行工具。cellranger 由 10X Genomics 构建,专门用于处理从其平台生成的数据,然而有几种开源工具可以与 cellranger 竞争执行这些处理任务,例如 kallisto-bustools。
Background and raw data format
single-cell RNA-seq experiments with 10X Genomics
Here I will very briefly go over the basics of performing a single-cell RNA-seq experiment with 10X Genomics. You can watch a great video from 10X about how it all works here.
Before running the 10X Genomics Chromium machine, we must first create a single cell suspension, essentially a tiny tube filled with nothing but cells. We start with a tissue of interest, in our lab’s case the tissue samples come from the brain. The brain is composed of many different cell types, including neurons, astrocytes, oligodendrocytes, microglia, and several others.
neuron cartoon from Wikipedia
For various reasons, single-nucleus genomic assays are more commonly used to study brain tissue rather than single-cell assays. We can lyse (burst open) all cells in a tissue, remove any extra cellular debris, gather up the nuclei, and perform a successful 10X experiment. DNA lives in the nucleus, which is also the part of the cell where the process of transcription (conversion from DNA to RNA) occurs. RNA splicing occurs in the nucleus as well, which occurs before mature mRNA molecules are trafficked out of the nucleus and into other parts of the cell. Using single-nucleus RNA-seq, or snRNA-seq, requires some additional considerations, especially with regards to analysis pipelines that make assumptions about splicing, such as RNA velocity analysis.
The following figure summarizes the steps in the 10X Genomics experiment following the single-cell or single-nuclei suspension.
The 10X protocol relies on performing micro-reactions inside of tiny bubbles called Gel Bead-in EMulsions (GEMs). Within these droplets, specialized cellular barcodes are ligated (attached) to the cDNA (RNA converted into DNA), so we can bioinformatically resolve which molecule comes from each cell. Additional barcodes called Unique Molecular Identifiers (UMIs) are attached to each cDNA molecule, which are useful in accurately quantifying gene expression. Once we have run the samples through the 10X machine, we end up with barcoded cDNA, which will be used for sequencing library preparation followed by Illumina sequencing.
Illumina Sequencing
Illumina sequencing relies on reading millions of short DNA sequences. If you are interested in exactly how Illumina sequencing works, check out this video. It turns out that from a technical standpoint, it is challenging to accurately read long nucleotide sequences, and thus Illumina sequening is a short-read technology, capturing around 150 nucleotides in a single read. Illumina flow cells have 8 lanes, meaning 8 different samples can be sequenced at the same time. Sequencing samples together in this manner is referred to as sample multiplexing.
Following Illumina sequencing, you end up with a mountain of these short nucleotide sequences, without any sense for where these sequences fit in the genome. At this point we must use sequence alignment algorithms, such as the Burrows-Wheeler transform or the Smith-Waterman algorithm, in order to map these reads to the reference genome. Fortunately for us, there are robust and mature software tools for performing this critical alignment step. There are actually alignment tools tailored specifically for RNA-seq data, such as STAR.
Additionally, Paired-end sequencing is a common technique that increases the accuracy of mapping to the reference genome, utilizing a fixed-length gap in between the two paired reads (summarized in the figure below). Finally, if you are working in a less studied organism that does not have a reference genome, you will have to perform de novo genome assembly.
Source: Illumina
.fastq
文件格式
Illumina provides a detailed explanation of .fastq
files from Illumina sequencing data here, and I will go through some of the details of this file format below. Let’s look at the first few lines of one of the .fastq
files from one of the snRNA-seq samples using zcat
and head
.
zcat data/AD_NucSeq_2019/SWA_13952_B01_NAN_Lane/Sample-22-GTCGATGC_S56_L003_R2_001.fastq.gz | head
@A00217:98:HJY25DSXX:3:1101:6460:1000 2:N:0:NTCGATGC
CATAAATCAATGGGCTGAAGTTGGAGGAGAGGTTGATTACACAGGAGCATGGGTAATTTTTAGTGTGATGGAATTACCTTAATTTTTGTGATGATTACCC
+
FFFFFFFFFF,FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00217:98:HJY25DSXX:3:1101:8214:1000 2:N:0:NTCGATGC
AGTAGCTGTGTACTTTCTGATCGGAGCTGTTTAAGAGTTACGGCTGGTCCCACCTCACTGGATCTTGATGTTAATGTTCAGAGGACTCTAGACAACAACT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00217:98:HJY25DSXX:3:1101:9733:1000 2:N:0:NTCGATGC
GATTTAACAAATTGTTCAGCATAACACTTCTAATTAAGTTTATCAAGTTGTACTGTATTAGATAATCAGCAGTGTATCTGGAGTATGTTTAAAGAGAACA
In .fastq
format, there are four lines for each read.
- A sequence identifier about the sequencing run and the cluster.
- The actual nucleotide sequence, can be A, C, T, or G to represent the four DNA nucleotides. Additionally, the letter N can appear in the sequence if the base could not be determined.
- A separator, which is just a +.
- Base call quality scores in Phred format, used for quality control software such as fastqc.
Next we will go over how to use cellranger
to perform alignment and gene quantification steps.
Sequence alignment and gene quantification with cellranger count
In this section I will briefly go over how to use cellranger count
to perform sequence alignment and gene quantification, as well as how to interpret the output of this software. Here is a link to the documentation for cellranger count
. Fortunately, this program is pretty easy to use, it just might take a while to run for each sample. I generally run this program as a job array on the HPC, where one job is for each sample.
Running cellranger-count
To load cellranger on the HPC we must first do the following:
module load cellranger
The following is a basic command that I used for cellranger count
. To do this efficiently, you can write a script to submit each sample as a job on the HPC. We must provide a reference transcriptome for cellranger
to map our reads, and this can be downloaded directly from 10X’s website at this link. Take note of whether you need the human reference genome GRCh38
or the mouse reference genome mm10
.
JOBID="Sample-100"
SAMPLE_IDS="Sample-100-GGTCAATA,Sample-100-ATCTTTAG,Sample-100-CAGAGGCC,Sample-100-TCAGCCGT"
TRANSCRIPTOME="/dfs3/swaruplab/smorabit/pipelines/sn-rna-seq/GRCh38.p12.mrna"
FASTQS="/dfs3/swaruplab/smorabit/data/AD_NucSeq_2019/SWA_13952_B01_NAN_Lane"
cellranger count --id=$JOBID \
--transcriptome=$TRANSCRIPTOME \
--fastqs=$FASTQS \
--sample=$SAMPLE_IDS \
--localcores=16 \
--localmem=64 \
--expect-cells=10000
cellranger-count
输出
If everything ran correctly and did not cause any errors, it is time to check on the output files. All output files are dumped into the outs
directory. The following files and directories are found in the outs
directory:
analysis
cloupe.cloupe
filtered_feature_bc_matrix
filtered_feature_bc_matrix.h5
metrics_summary.csv
molecule_info.h5
possorted_genome_bam.bam
possorted_genome_bam.bam.bai
raw_feature_bc_matrix
raw_feature_bc_matrix.h5
web_summary.html
The aligned sequences are stored in the possorted_genome_bam.bam
file, indexed by the possorted_genome_bam.bam.bai
file. .bam
is a standard format for storing aligned nucleotide sequences. It is a compressed version of the .sam
sequence alignment map format. If you are interested in further understanding how aligned sequence data is formatted, look at this comprehensive document.
Next I will dive deeper into some of these output files that will be more directly relevant for downstream analysis. Check the 10X Genomics website for more details about these output files.
分析总结报告解读
The first file that we want to check on is the web_summary.html
file, which provides an overview of both sample quality and a basic analysis. Below I have included a screenshot of this web summary output.
The summary page tells you important statistics like the number of cells detected, number of reads per cell, as well as several other sequencing and mapping statistics. Additionally, error and warning messages may appear on this summary page if cellranger
detected any issues with the sequencing experiment itself, for example extremely low read-depth.
Next let’s look at the Analysis page.
Here we can see a basic data analysis that cellranger
has done for us, including a t-SNE dimensionality reduction, clustering analysis, and differential gene expression analysis. This is not the analysis that you would want to use for a publication, but it is useful in that it gives us a basic idea of how many different cell types we see in the data, and some of the genes that are highly expressed in each cluster.
filtered_feature_bc_matrix
directory
This is a directory that contains all of the necessary data required for downstream analysis. This data has already beed filtered according to cellranger
’s quality control. You can also fild the unfiltered data under the directory raw_feature_bc_matrix
. The following files are found in the filtered_feature_bc_matrix
directory:
barcodes.tsv
features.tsv
matrix.mtx
The genes-by-cells expression matrix matrix.mtx
is stored in MatrixMarket format. This is a format for efficiently storing sparse matrices, where many entries have a zero value. The barcodes.tsv
file contains a list of nucleotide barcodes that serve as unique identifiers for each cell. During the 10X snRNA-seq library preparation, each molecule belonging to a single cell was tagged with one of these unique identifiers, which are referred to as barcodes. Finally, the features.tsv
file contains a table of gene names and gene IDs. Using these three files we can begin to analyze our snRNA-seq experiment using a programming language such as R, Python, Julia, MATLAB, etc.
Next I will go over an optional step for cellranger to aggregate multiple experiments into a single expression matrix.
Aggregation of single-cell experiments using cellranger aggr
In most cases, we are going to analyze more than one single-cell experiment simultaneously. While our goal is to perform a data analysis that will allow us to learn about the underlying biology of our samples, there are inevitably going to be technical biases introduced throughout the experimental process. For example, since 10X Genomics Chromium controller only allows for 8 samples to run at a time, the experimentalists are going to have to perform some experiments a few days apart.
There are many computational methods for ‘data correction’ to remove some of these technical biases. One simple way of doing so is to sub-sample reads from each separate experiment in order to have the same read depth across all samples. This is done using a program called cellranger aggr
, simultaneously performing this sampling process and providing us with an aggregated gene expression matrix.
Running cellranger aggr
The following is a basic command to run cellranger aggr
. It is very important to create a .csv
containing the paths to the output of cellranger count
for each sample that you wish to aggregate.
JOBID="aggr-samples"
CSV="files.csv"
cellranger aggr --id=$JOBID \
--csv=$CSV \
--localcores=8 \
--localmem=64
The format for the .csv
is the following:
library_id | molecule_h5 |
---|---|
Sample-19 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-19/outs/molecule_info.h5 |
Sample-43 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-43/outs/molecule_info.h5 |
Sample-50 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-50/outs/molecule_info.h5 |
Sample-22 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-22/outs/molecule_info.h5 |
Sample-27 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-27/outs/molecule_info.h5 |
Sample-37 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-37/outs/molecule_info.h5 |
Sample-17 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-17/outs/molecule_info.h5 |
Sample-52 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-52/outs/molecule_info.h5 |
Sample-96 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-96/outs/molecule_info.h5 |
Sample-33 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-33/outs/molecule_info.h5 |
Sample-46 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-46/outs/molecule_info.h5 |
Sample-90 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-90/outs/molecule_info.h5 |
Sample-58 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-58/outs/molecule_info.h5 |
Sample-82 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-82/outs/molecule_info.h5 |
Sample-45 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-45/outs/molecule_info.h5 |
Sample-100 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-100/outs/molecule_info.h5 |
Sample-66 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-66/outs/molecule_info.h5 |
Sample-47 | /dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-47/outs/molecule_info.h5 |
Output of cellranger aggr
I won’t go into much details about this output, because overall it is quite similar to the output from cellranger aggr
. The filtered_feature_bc_matrix
directory is still where all of the relevant files for data analysis are going to be dumped by the program. The following image is the summary page from the cellranger aggr
web report.
Conclusion
We covered how to take raw sequences from a 10X Genomics single-cell RNA-seq experiment and process them using the cellranger
software, yielding a genes-by-cells expression matrix that we can use for downstream data analysis. Here we can see that there are new statistics specifically for the aggregation process, such as the fragment of reads retained.
如若转载,请注明出处:https://www.ouq.net/3011.html