Star Protocol: 从宏基因组中提取 CRISPR 靶向序列

基于同源性的搜索通常用于从宏基因组中发现移动遗传元件 (MGE),但它在很大程度上依赖于数据库中的参考基因组。 该方法可以在不使用参考数据库的情况下从组装的人宏基因组序列中提取 CRISPR 靶向序列。 我们描述了宏基因组重叠群的组装、CRISPR 直接重复序列和间隔区的提取、原型间隔区的发现以及使用基于图的方法提取富含原型间隔区的区域,可以提取许多特征/非特征的 MGE。

Star Protocol: 从宏基因组中提取 CRISPR 靶向序列

Preprocess raw paired FASTQ files

This process is quality control of the input sequences.

Alternatives: A pipeline script for preprocessing, assembling, and spacer collection is provided in the repository (pipeline_scripts/assembly_pipeline.sh).
  1. Trim poor-quality bases, adapters, and PhiX spikes.

$ bbduk.sh -Xmx100g t=8 in1=r1.fastq.gz in2=r2.fastq.gz out=trimmed.fastq.gz ftm=5 qtrim=rl trimq=20 ftl=15 ref=adapters,phix ktrim=r k=23 mink=11 hdist=1 tpe tbo maq=20

  1. Remove human-derived reads.

$ bbmap.sh -Xmx100g t=8 ref={path to the human reference FASTA file} in=trimmed.fastq.gz outu=decontaminated.fastq.gz interleaved=t minratio=0.9 maxindel=3 bwr=0.16 bw=12 fast=t minhits=2 qtrim=r trimq=10 untrim=t idtag=t printunmappedcount=t kfilter=25 maxsites=1 k=14

Note: We used the human genome reference FASTA file posted by the author of BBtools.
  1. Correct errors in the reads.

$ tadpole.sh threads=8 -Xmx100g interleaved=t in=decontaminated.fastq.gz out=ecc.fastq.gz mode=correct

Critical: This output FASTQ file is analysis-ready and used for assembly and spacer extraction.

Metagenome assembly

TIMING: 1–12 h (Varies strongly depending on library)

Assemble metagenome contigs from the preprocessed FASTQ files.

  1. Assemble preprocessed FASTQ file using SPAdes.

$ spades.py -t 8 -m 100 --meta --only-assembler --12 ecc.fastq.gz -o {path to spades output directory}

Note: We skipped the error-correction step in the SPAdes program to avoid the occasional endlessly running problem. Please refer to the troubleshooting section for detail.
  1. Assign unique identifiers to all assembled contigs.

$ cd {path to spades output directory}

$ id=$(openssl rand -hex 12)

awk -v id=${id} '/ˆ>/ {n++; print ">contig_"id"_"n;} !/ˆ>/{print}' scaffolds.fasta > scaffolds.renamed.fasta

Critical: Assigning unique identifiers is important to avoid conflicts in the later analyses. Here, we used randomly generated strings; however, any unique readable identifier, such as sample IDs, could be used instead.
Optional: We recommend removing contigs shorter than 1k bases to reduce the file size.
Note: The pre and assembling processes are independently conducted for each paired FASTQ file.
Optional: We advise that the system has sufficient disk space to store all preprocessed FASTQ and assembled FASTA files. The rest of the files, including intermediate and temporary files, can be deleted to save the disk space.

Extraction of CRISPR direct repeats from the assembled contigs

TIMING: 1–2 h

From the assembled contigs, discover CRISPR consensus direct repeats.

  1. Run CRISPRDetect to discover consensus CRISPR direct repeats.

$ CRISPRDetect.pl -q 0 -f scaffolds.renamed.fasta -o crispr_detect_output -array_quality_score_cutoff 2

  1. From the output gff file, extract direct repeats, and convert them into a FASTA file.

$ grep 'repeat_region' crispr_detect_output.gff | cut -f 9 | tr ';' '∖n' | egrep 'ˆNote=' | cut -f 2 -d '=' | awk '{n++; printf(">DR_%i∖n%s∖n", n, $0)}' > crispr_dr.fasta

Note: Again, these direct repeat extractions are independently conducted for each assembled FASTA file.
  1. After discovering all direct repeats from each assembled FASTA file, we merge them into a single file, assign unique identifiers, and remove redundancy.

$ cat {all crispr dr fasta files} > merged_crispr_dr.fasta

$ awk '/ˆ>/ {n++; print ">dr_"n; } !/ˆ>/ {print}' merged_crispr_dr.fasta > merged_crispr_dr.renamed.fasta #giving new unique identifiers

$ cd-hit-est -S 1 -c 1 -i merged_crispr_dr.renamed.fasta -o merged_crispr_dr.repr.fasta

Optional: We provided the direct repeat sequences extracted from the human gut metagenomes in the previous study. The sequences are stored in a FASTA formatted file located in the supplementary folder (supplementary_data/crispr/drs/all_dr.clustered.fasta).

Extraction of CRISPR spacers from the preprocessed reads

TIMING: 1 h
Alternatives: A pipeline script for the spacer collection is provided in the repository (pipeline_scripts/collect_spacers.sh).

Here, we return to the preprocessed FASTQ files. From them, we extract CRISPR spacers from reads containing CRISPR direct repeats.

  1. Extract direct repeat-containing reads. This initial filtering step significantly reduces the number of reads, effectively speeding up the following spacer extraction processes.

$ bbduk.sh in=ecc.fastq.gz ref=merged_crispr_dr.repr.fasta outm=dr_containing_reads.fastq.gz interleaved=t k=21 hdist=1 rename=t

  1. Second filter to mask the direct repeats in the reads.

$ bbduk.sh in=dr_containing_reads.fastq.gz ref=merged_crispr_dr.repr.fasta kmask=R k=19 hdist=1 out=dr_masked.fastq mink=15

  1. Extract spacers from the masked reads using our python script.

$ {path to the virome_scripts directory}/pipeline_scripts/extract_spacers.py -s {sample name} dr_masked.fastq > spacers.fasta

Note: The python script used above is available from our repository (pipeline_scripts/extract_spacers.py).
Optional: The python program used above does not output short (<20 bp) and long (>50 bp) sequences by default. These parameters can be changed by options (-s and -l).
Note: Spacer extraction is conducted for each preprocessed FASTQ file.
  1. Merge all discovered spacers, assign unique identifiers, and remove redundancy.

$ cat {all spacer fasta files} > all_spacers.fasta

$ awk '/ˆ>/ {sub(/ˆ[ˆ[:space:]]+[[:space:]]/, ""); n++; printf(“>spacer_"n"∖t"$0);} !/ˆ>/ {print}' all_spacers.fasta > all_spacers.renamed.fasta #giving new unique identifiers

$ cd-hit-est -T 8 -M 10000 -d 0 -sf 1 -s 0.9 -c 0.98 -i all_spacers.renamed.fasta -o all_spacers.repr.fasta

Optional: It is highly advisable to make each spacer and direct repeat trackable to the original samples, contigs, and/or associated direct repeats. This can be achieved using FASTA descriptions in each record.
Optional: We provided the spacer sequences extracted from the human gut metagenomes in the previous study. The sequences are stored in a FASTA formatted file located in the supplementary folder (supplementary_data/crispr/spacers/all_spacers.clustered.removed_short.fasta).

Protospacer discovery

TIMING: 1–2 h

Here, we discover protospacers by aligning the spacer sequences to CRISPR-masked contigs.

  1. Search CRISPR direct repeats from the contigs.

$ makeblastdb -dbtype nucl -in scaffolds.renamed.fasta

$ blastn -evalue 1e-5 -task ‘blastn-short’ -outfmt 6 -num_threads 8 -query merged_crispr_dr.repr.fasta -db scaffolds.renamed.fasta > dr.blastn

  1. Mask sequences around direct repeat aligned regions.

$ samtools faidx scaffolds.renamed.fasta

$ awk ‘BEGIN{FS="∖t"; OFS="∖t"} $10<$9{t=$9; $9=$10; $10=t} {print $2,$9–1,$10}' dr.blastn | sort -k1,1 -k2,2n | bedtools merge -i /dev/stdin -d 60 | bedtools slop -b 60 -i /dev/stdin -g scaffolds.renamed.fasta.fai | bedtools maskfasta -bed /dev/stdin -fi scaffolds.renamed.fasta -fo crispr_masked.fasta

  1. Align spacers to the masked contigs.

$ makeblastdb -dbtype nucl -in crispr_masked.fasta

$ blastn -evalue 1e-5 -perc_identity 93 -task ‘blastn-short’ -outfmt 6 -num_threads 8 -query all_spacers.repr.fasta -db crispr_masked.fasta > spacers.blastn

Note: Protospacer search is conducted for each assembled FASTA file.
  1. Merge all discovered protospacers into a single BED-formatted file.

$ cat {all spacer blastn output files} | awk ‘BEGIN{FS="∖t"; OFS="∖t"} $10<$9{t=$9; $9=$10; $10=t} {print $2,$9–1,$10,$1}' | sort -k1,1 -k2,2n > all_protospacers.bed

Optional: We provided the protospacers discovered from the assembled human gut metagenome contigs in the previous study. The protospacers are stored in a BED-formatted file located in the supplementary folder (supplementary_data/targeted_sequences/protospacers/all_protospacers.bed).

Spacer clustering based on co-occurrence

TIMING: 2–4 h

Here, we cluster spacers using a co-occurrence network calculated from the spacer alignment result. From this network, we discover the graph communities, representing subsets of spacers that corresponding protospacers co-occur together across the assembled contigs. We extract the protospacer-enriched regions using these graph communities, i.e., spacer clusters. This process effectively reduces the false-positive ratio by excluding lone or randomly scattered protospacers.

  1. Initial clustering is based on distance.

$ bedtools cluster -d 50000 -i all_protospacers.bed | awk ‘BEGIN{FS="∖t"; OFS="∖t"} {print $1"_cluster_"$5,$2,$3,$4}' > initial_cluster.bed

  1. Calculate a weighted graph from the spacer co-occurrence and generate an abc formatted file.

$ {path to the virome_scripts directory}/graph_clustering/generate_abc_edgefile.py initial_cluster.bed > protospacers.abc

Note: The python script used above is available in our repository (graph_clustering/generate_abc_edgefile.py).
  1. Convert the abc file into an mci format.

$ mcxload -abc protospacers.abc --stream-mirror -write-tab data.tab -o protospacers.mci

  1. Run an mcl program to discover graph communities.

$ mcl protospacers.mci -I 4 -pi 0.4 -te 8

  1. Convert the mcl output to a tabular format.

$ mcxdump -icl out.protospacers.mci.I40pi04 -tabr data.tab > protospacers_cluster.tab

Critical: Each tab-delimited line within this output file consists of a spacer cluster.
  1. Mark CRISPR-targeted regions using the clustering result and the merged protospacer bed file.

$ {path to the virome_scripts directory}/graph_clustering/mark_crispr_targeted.py all_protospacers.bed protospacers_cluster.tab > crispr_targeted.bed

Note: The python script used above is available in our repository (graph_clustering/mark_crispr_targeted.py).
Note: The fourth column in the output file is formatted as {cluster id}:{protospacer count in the region}.
  1. Merge the regions to rescue the fragmented clusters.

$ sort -k1,1 -k2,2n crispr_targeted.bed > crispr_targeted.sorted.bed

$ bedtools merge -d 1000 -i crispr_targeted.sorted.bed -o collapse -c 4 > crispr_targeted.merged.bed

  1. Extract CRISPR-targeted regions from the assembled contigs.

$ samtools faidx scaffolds.renamed.fasta

$ cut -f 1,2 scaffolds.renamed.fasta.fai | sort -k 1,1 | join -t $'∖t' - <(sort -k 1,1 crispr_targeted.merged.bed) | awk 'BEGIN{OFS="∖t"; FS="∖t"} $4>$2{$4=$2} {print $1,$3,$4,$5}' | bedtools getfasta -fi scaffolds.renamed.fasta -bed /dev/stdin -fo crispr_targeted.fasta

Note: The extraction of CRISPR-targeted sequences is conducted for each assembled FASTA file.
Optional: We provided the CRISPR-targeted terminally redundant sequences extracted from the assembled human gut metagenome contigs in the previous study. The sequences are stored in a FASTA formatted file located in the supplementary folder (supplementary_data/targeted_sequences/tr/tr_sequences.fasta).
脚本下载:https://zenodo.org/record/6621424#.Yv4-cuxBzaU

如若转载,请注明出处:https://www.ouq.net/2124.html

(0)
打赏 微信打赏,为服务器增加50M流量 微信打赏,为服务器增加50M流量 支付宝打赏,为服务器增加50M流量 支付宝打赏,为服务器增加50M流量
上一篇 08/18/2022
下一篇 08/19/2022

相关推荐