cellranger是处理从 10X Genomics 平台生成的单细胞 RNA-seq 数据的命令行工具。cellranger 由 10X Genomics 构建，专门用于处理从其平台生成的数据，然而有几种开源工具可以与 cellranger 竞争执行这些处理任务，例如 kallisto-bustools。

Background and raw data format

single-cell RNA-seq experiments with 10X Genomics

Here I will very briefly go over the basics of performing a single-cell RNA-seq experiment with 10X Genomics. You can watch a great video from 10X about how it all works here.

Before running the 10X Genomics Chromium machine, we must first create a single cell suspension, essentially a tiny tube filled with nothing but cells. We start with a tissue of interest, in our lab’s case the tissue samples come from the brain. The brain is composed of many different cell types, including neurons, astrocytes, oligodendrocytes, microglia, and several others.

neuron cartoon from Wikipedia

For various reasons, single-nucleus genomic assays are more commonly used to study brain tissue rather than single-cell assays. We can lyse (burst open) all cells in a tissue, remove any extra cellular debris, gather up the nuclei, and perform a successful 10X experiment. DNA lives in the nucleus, which is also the part of the cell where the process of transcription (conversion from DNA to RNA) occurs. RNA splicing occurs in the nucleus as well, which occurs before mature mRNA molecules are trafficked out of the nucleus and into other parts of the cell. Using single-nucleus RNA-seq, or snRNA-seq, requires some additional considerations, especially with regards to analysis pipelines that make assumptions about splicing, such as RNA velocity analysis.

The following figure summarizes the steps in the 10X Genomics experiment following the single-cell or single-nuclei suspension.

The 10X protocol relies on performing micro-reactions inside of tiny bubbles called Gel Bead-in EMulsions (GEMs). Within these droplets, specialized cellular barcodes are ligated (attached) to the cDNA (RNA converted into DNA), so we can bioinformatically resolve which molecule comes from each cell. Additional barcodes called Unique Molecular Identifiers (UMIs) are attached to each cDNA molecule, which are useful in accurately quantifying gene expression. Once we have run the samples through the 10X machine, we end up with barcoded cDNA, which will be used for sequencing library preparation followed by Illumina sequencing.

Illumina Sequencing

Illumina sequencing relies on reading millions of short DNA sequences. If you are interested in exactly how Illumina sequencing works, check out this video. It turns out that from a technical standpoint, it is challenging to accurately read long nucleotide sequences, and thus Illumina sequening is a short-read technology, capturing around 150 nucleotides in a single read. Illumina flow cells have 8 lanes, meaning 8 different samples can be sequenced at the same time. Sequencing samples together in this manner is referred to as sample multiplexing.

Following Illumina sequencing, you end up with a mountain of these short nucleotide sequences, without any sense for where these sequences fit in the genome. At this point we must use sequence alignment algorithms, such as the Burrows-Wheeler transform or the Smith-Waterman algorithm, in order to map these reads to the reference genome. Fortunately for us, there are robust and mature software tools for performing this critical alignment step. There are actually alignment tools tailored specifically for RNA-seq data, such as STAR.

Additionally, Paired-end sequencing is a common technique that increases the accuracy of mapping to the reference genome, utilizing a fixed-length gap in between the two paired reads (summarized in the figure below). Finally, if you are working in a less studied organism that does not have a reference genome, you will have to perform de novo genome assembly.

Source: Illumina

`.fastq` 文件格式

Illumina provides a detailed explanation of .fastq files from Illumina sequencing data here, and I will go through some of the details of this file format below. Let’s look at the first few lines of one of the .fastq files from one of the snRNA-seq samples using zcat and head.

zcat data/AD_NucSeq_2019/SWA_13952_B01_NAN_Lane/Sample-22-GTCGATGC_S56_L003_R2_001.fastq.gz | head

@A00217:98:HJY25DSXX:3:1101:6460:1000 2:N:0:NTCGATGC
CATAAATCAATGGGCTGAAGTTGGAGGAGAGGTTGATTACACAGGAGCATGGGTAATTTTTAGTGTGATGGAATTACCTTAATTTTTGTGATGATTACCC
+
FFFFFFFFFF,FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00217:98:HJY25DSXX:3:1101:8214:1000 2:N:0:NTCGATGC
AGTAGCTGTGTACTTTCTGATCGGAGCTGTTTAAGAGTTACGGCTGGTCCCACCTCACTGGATCTTGATGTTAATGTTCAGAGGACTCTAGACAACAACT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00217:98:HJY25DSXX:3:1101:9733:1000 2:N:0:NTCGATGC
GATTTAACAAATTGTTCAGCATAACACTTCTAATTAAGTTTATCAAGTTGTACTGTATTAGATAATCAGCAGTGTATCTGGAGTATGTTTAAAGAGAACA

In .fastq format, there are four lines for each read.

A sequence identifier about the sequencing run and the cluster.
The actual nucleotide sequence, can be A, C, T, or G to represent the four DNA nucleotides. Additionally, the letter N can appear in the sequence if the base could not be determined.
A separator, which is just a +.
Base call quality scores in Phred format, used for quality control software such as fastqc.

Next we will go over how to use cellranger to perform alignment and gene quantification steps.

Sequence alignment and gene quantification with `cellranger count`

In this section I will briefly go over how to use cellranger count to perform sequence alignment and gene quantification, as well as how to interpret the output of this software. Here is a link to the documentation for cellranger count. Fortunately, this program is pretty easy to use, it just might take a while to run for each sample. I generally run this program as a job array on the HPC, where one job is for each sample.

Running `cellranger-count`

To load cellranger on the HPC we must first do the following:

module load cellranger

The following is a basic command that I used for cellranger count. To do this efficiently, you can write a script to submit each sample as a job on the HPC. We must provide a reference transcriptome for cellranger to map our reads, and this can be downloaded directly from 10X’s website at this link. Take note of whether you need the human reference genome GRCh38 or the mouse reference genome mm10.

JOBID="Sample-100"
SAMPLE_IDS="Sample-100-GGTCAATA,Sample-100-ATCTTTAG,Sample-100-CAGAGGCC,Sample-100-TCAGCCGT"
TRANSCRIPTOME="/dfs3/swaruplab/smorabit/pipelines/sn-rna-seq/GRCh38.p12.mrna"
FASTQS="/dfs3/swaruplab/smorabit/data/AD_NucSeq_2019/SWA_13952_B01_NAN_Lane"

cellranger count --id=$JOBID \
                 --transcriptome=$TRANSCRIPTOME \
                 --fastqs=$FASTQS \
                 --sample=$SAMPLE_IDS \
                 --localcores=16 \
                 --localmem=64 \
                 --expect-cells=10000

`cellranger-count`输出

If everything ran correctly and did not cause any errors, it is time to check on the output files. All output files are dumped into the outs directory. The following files and directories are found in the outs directory:

analysis
cloupe.cloupe
filtered_feature_bc_matrix
filtered_feature_bc_matrix.h5
metrics_summary.csv
molecule_info.h5
possorted_genome_bam.bam
possorted_genome_bam.bam.bai
raw_feature_bc_matrix
raw_feature_bc_matrix.h5
web_summary.html

The aligned sequences are stored in the possorted_genome_bam.bam file, indexed by the possorted_genome_bam.bam.bai file. .bam is a standard format for storing aligned nucleotide sequences. It is a compressed version of the .sam sequence alignment map format. If you are interested in further understanding how aligned sequence data is formatted, look at this comprehensive document.

Next I will dive deeper into some of these output files that will be more directly relevant for downstream analysis. Check the 10X Genomics website for more details about these output files.

分析总结报告解读

The first file that we want to check on is the web_summary.html file, which provides an overview of both sample quality and a basic analysis. Below I have included a screenshot of this web summary output.

The summary page tells you important statistics like the number of cells detected, number of reads per cell, as well as several other sequencing and mapping statistics. Additionally, error and warning messages may appear on this summary page if cellranger detected any issues with the sequencing experiment itself, for example extremely low read-depth.

Next let’s look at the Analysis page.

Here we can see a basic data analysis that cellranger has done for us, including a t-SNE dimensionality reduction, clustering analysis, and differential gene expression analysis. This is not the analysis that you would want to use for a publication, but it is useful in that it gives us a basic idea of how many different cell types we see in the data, and some of the genes that are highly expressed in each cluster.

`filtered_feature_bc_matrix` directory

This is a directory that contains all of the necessary data required for downstream analysis. This data has already beed filtered according to cellranger’s quality control. You can also fild the unfiltered data under the directory raw_feature_bc_matrix. The following files are found in the filtered_feature_bc_matrix directory:

barcodes.tsv
features.tsv
matrix.mtx

The genes-by-cells expression matrix matrix.mtx is stored in MatrixMarket format. This is a format for efficiently storing sparse matrices, where many entries have a zero value. The barcodes.tsv file contains a list of nucleotide barcodes that serve as unique identifiers for each cell. During the 10X snRNA-seq library preparation, each molecule belonging to a single cell was tagged with one of these unique identifiers, which are referred to as barcodes. Finally, the features.tsv file contains a table of gene names and gene IDs. Using these three files we can begin to analyze our snRNA-seq experiment using a programming language such as R, Python, Julia, MATLAB, etc.

Next I will go over an optional step for cellranger to aggregate multiple experiments into a single expression matrix.

Aggregation of single-cell experiments using `cellranger aggr`

In most cases, we are going to analyze more than one single-cell experiment simultaneously. While our goal is to perform a data analysis that will allow us to learn about the underlying biology of our samples, there are inevitably going to be technical biases introduced throughout the experimental process. For example, since 10X Genomics Chromium controller only allows for 8 samples to run at a time, the experimentalists are going to have to perform some experiments a few days apart.

There are many computational methods for ‘data correction’ to remove some of these technical biases. One simple way of doing so is to sub-sample reads from each separate experiment in order to have the same read depth across all samples. This is done using a program called cellranger aggr, simultaneously performing this sampling process and providing us with an aggregated gene expression matrix.

Running `cellranger aggr`

The following is a basic command to run cellranger aggr. It is very important to create a .csv containing the paths to the output of cellranger count for each sample that you wish to aggregate.

JOBID="aggr-samples"
CSV="files.csv"
cellranger aggr --id=$JOBID \
                --csv=$CSV \
                --localcores=8 \
                --localmem=64

The format for the .csv is the following:

library_id	molecule_h5
Sample-19	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-19/outs/molecule_info.h5
Sample-43	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-43/outs/molecule_info.h5
Sample-50	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-50/outs/molecule_info.h5
Sample-22	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-22/outs/molecule_info.h5
Sample-27	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-27/outs/molecule_info.h5
Sample-37	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-37/outs/molecule_info.h5
Sample-17	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-17/outs/molecule_info.h5
Sample-52	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-52/outs/molecule_info.h5
Sample-96	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-96/outs/molecule_info.h5
Sample-33	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-33/outs/molecule_info.h5
Sample-46	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-46/outs/molecule_info.h5
Sample-90	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-90/outs/molecule_info.h5
Sample-58	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-58/outs/molecule_info.h5
Sample-82	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-82/outs/molecule_info.h5
Sample-45	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-45/outs/molecule_info.h5
Sample-100	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-100/outs/molecule_info.h5
Sample-66	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-66/outs/molecule_info.h5
Sample-47	/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/cellranger_count_mrna/redo/Sample-47/outs/molecule_info.h5

Output of `cellranger aggr`

I won’t go into much details about this output, because overall it is quite similar to the output from cellranger aggr. The filtered_feature_bc_matrix directory is still where all of the relevant files for data analysis are going to be dumped by the program. The following image is the summary page from the cellranger aggr web report.

Conclusion

We covered how to take raw sequences from a 10X Genomics single-cell RNA-seq experiment and process them using the cellranger software, yielding a genes-by-cells expression matrix that we can use for downstream data analysis. Here we can see that there are new statistics specifically for the aggregation process, such as the fragment of reads retained.

本站原创，如若转载，请注明出处：https://www.ouq.net/3011.html

Cellranger分析10X Genomics single-cell RNA-seq data单细胞原始数据

Background and raw data format

single-cell RNA-seq experiments with 10X Genomics

Illumina Sequencing

.fastq 文件格式

Sequence alignment and gene quantification with cellranger count

Running cellranger-count

cellranger-count输出

分析总结报告解读

filtered_feature_bc_matrix directory

Aggregation of single-cell experiments using cellranger aggr

Running cellranger aggr

Output of cellranger aggr

Conclusion

相关推荐

10X Genomics 单细胞 RNA 测序（scRNA-seq）文库制备协议: 10XV2 和 10XV3

single-cell data analysis software packages

R:integrating and analysis single-cell datasets

R: Load in data from 10X

VISION：Functional Interpretation for scRNA-seq Data