RT-PCRやメタバーコーディング用のプライマー設計などの用途で、NGSデータから手っ取り早く遺伝子の断片配列を得たいことがある。これは非モデル生物において特に重要だ。TrinityやSPAdes、MEGAHITといったde novoアセンブラは高性能だが、そのぶん計算に時間がかかる。
簡易的なde novoアセンブラのTadpoleを用いることで、RT-PCRのinternal controlに用いる ハウスキーピング遺伝子や、メタバーコーディングプライマー設計に用いる16S rDNAやcytochrome oxidase Iといった遺伝子の断片配列をNGSデータから迅速に得られると期待される。
Tadpole
Tadpoleが同梱されているBBToolsは、米Joint Genome Institute (JGI)が開発しているNGSデータ用ツールキット。Macでインフォマティクスで日本語の詳しい解説がみられる。SourceForgeまたはconda経由でインストールするとよい。
BBToolsに含まれるTadpoleは、NGSデータからk-merを抽出して (=データを圧縮して) de novo アセンブルしたり、リードのエラーを修正するプログラム。他のショートリードアセンブラと大まかには同じ仕組みで動いているが、メモリを節約しつつ高速で計算できる。
$ tadpole.sh
Written by Brian Bushnell
Last modified June 27, 2017
Description: Uses kmer counts to assemble contigs, extend sequences,
or error-correct reads. Tadpole has no upper bound for kmer length,
but some values are not supported. Specifically, it allows 1-31,
multiples of 2 from 32-62, multiples of 3 from 63-93, etc.
Please read bbmap/docs/guides/TadpoleGuide.txt for more information.
Usage:
Assembly: tadpole.sh in=<reads> out=<contigs>
Extension: tadpole.sh in=<reads> out=<extended> mode=extend
Correction: tadpole.sh in=<reads> out=<corrected> mode=correct
Extension and correction may be done simultaneously. Error correction on
multiple files may be done like this:
tadpole.sh in=libA_r1.fq,libA_merged.fq in2=libA_r2.fq,null extra=libB_r1.fq out=ecc_libA_r1.fq,ecc_libA_merged.fq out2=ecc_libA_r2.fq,null mode=correct
Extending contigs with reads could be done like this:
tadpole.sh in=contigs.fa out=extended.fa el=100 er=100 mode=extend extra=reads.fq k=62
Input parameters:
in=<file> Primary input file for reads to use as kmer data.
in2=<file> Second input file for paired data.
extra=<file> Extra files for use as kmer data, but not for error-
correction or extension.
reads=-1 Only process this number of reads, then quit (-1 means all).
NOTE: in, in2, and extra may also be comma-delimited lists of files.
Output parameters:
out=<file> Write contigs (in contig mode) or corrected/extended
reads (in other modes).
outd=<file> Write discarded reads, if using junk-removal flags.
dump=<file> Write kmers and their counts.
fastadump=t Write kmers and counts as fasta versus 2-column tsv.
mincounttodump=1 Only dump kmers with at least this depth.
showstats=t Print assembly statistics after writing contigs.
Prefiltering parameters:
prefilter=0 If set to a positive integer, use a countmin sketch
to ignore kmers with depth of that value or lower.
prehashes=2 Number of hashes for prefilter.
prefiltersize=0.2 (pff) Fraction of memory to use for prefilter.
minprobprefilter=t (mpp) Use minprob for the prefilter.
prepasses=1 Use this many prefiltering passes; higher be more thorough
if the filter is very full. Set to 'auto' to iteratively
prefilter until the remaining kmers will fit in memory.
onepass=f If true, prefilter will be generated in same pass as kmer
counts. Much faster but counts will be lower, by up to
prefilter's depth limit.
Hashing parameters:
k=31 Kmer length (1 to infinity). Memory use increases with K.
prealloc=t Pre-allocate memory rather than dynamically growing;
faster and more memory-efficient. A float fraction (0-1)
may be specified; default is 1.
minprob=0.5 Ignore kmers with overall probability of correctness below this.
minprobmain=t (mpm) Use minprob for the primary kmer counts.
threads=X Spawn X hashing threads (default is number of logical processors).
rcomp=t Store and count each kmer together and its reverse-complement.
Assembly parameters:
mincountseed=3 (mcs) Minimum kmer count to seed a new contig or begin extension.
mincountextend=2 (mce) Minimum kmer count continue extension of a read or contig.
mincountretain=0 (mincr) Discard kmers with count below this.
maxcountretain=INF (maxcr) Discard kmers with count above this.
branchmult1=20 (bm1) Min ratio of 1st to 2nd-greatest path depth at high depth.
branchmult2=3 (bm2) Min ratio of 1st to 2nd-greatest path depth at low depth.
branchlower=3 (blc) Max value of 2nd-greatest path depth to be considered low.
minextension=1 (mine) Do not keep contigs that did not extend at least this much.
mincontig=auto (minc) Do not write contigs shorter than this.
mincoverage=1 (mincov) Do not write contigs with average coverage below this.
trimends=0 (trim) Trim contig ends by this much. Trimming by K/2
may yield more accurate genome size estimation.
contigpasses=16 Build contigs with decreasing seed depth for this many iterations.
contigpassmult=1.7 Ratio between seed depth of two iterations.
ownership=auto For concurrency; do not touch.
Processing modes:
mode=contig contig: Make contigs from kmers.
extend: Extend sequences to be longer, and optionally
perform error correction.
correct: Error correct only.
insert: Measure insert sizes.
Extension parameters:
extendleft=100 (el) Extend to the left by at most this many bases.
extendright=100 (er) Extend to the right by at most this many bases.
ibb=t (ignorebackbranches) Do not stop at backward branches.
extendrollback=3 Trim a random number of bases, up to this many, on reads
that extend only partially. This prevents the creation
of sharp coverage discontinuities at branches.
Error-correction parameters:
ecc=f Error correct via kmer counts.
reassemble=t If ecc is enabled, use the reassemble algorithm.
pincer=f If ecc is enabled, use the pincer algorithm.
tail=f If ecc is enabled, use the tail algorithm.
eccfull=f If ecc is enabled, use tail over the entire read.
aggressive=f (aecc) Use aggressive error correction settings.
Overrides some other flags like errormult1 and deadzone.
conservative=f (cecc) Use conservative error correction settings.
Overrides some other flags like errormult1 and deadzone.
rollback=t Undo changes to reads that have lower coverage for
any kmer after correction.
markbadbases=0 (mbb) Any base fully covered by kmers with count below
this will have its quality reduced.
markdeltaonly=t (mdo) Only mark bad bases adjacent to good bases.
meo=t (markerrorreadsonly) Only mark bad bases in reads
containing errors.
markquality=0 (mq) Set quality scores for marked bases to this.
A level of 0 will also convert the base to an N.
errormult1=16 (em1) Min ratio between kmer depths to call an error.
errormult2=2.6 (em2) Alternate ratio between low-depth kmers.
errorlowerconst=3 (elc) Use mult2 when the lower kmer is at most this deep.
mincountcorrect=3 (mcc) Don't correct to kmers with count under this.
pathsimilarityfraction=0.45(psf) Max difference ratio considered similar.
Controls whether a path appears to be continuous.
pathsimilarityconstant=3 (psc) Absolute differences below this are ignored.
errorextensionreassemble=5 (eer) Verify this many kmers before the error as
having similar depth, for reassemble.
errorextensionpincer=5 (eep) Verify this many additional bases after the
error as matching current bases, for pincer.
errorextensiontail=9 (eet) Verify additional bases before and after
the error as matching current bases, for tail.
deadzone=0 (dz) Do not try to correct bases within this distance of
read ends.
window=12 (w) Length of window to use in reassemble mode.
windowcount=6 (wc) If more than this many errors are found within a
a window, halt correction in that direction.
qualsum=80 (qs) If the sum of the qualities of corrected bases within
a window exceeds this, halt correction in that direction.
rbi=t (requirebidirectional) Require agreement from both
directions when correcting errors in the middle part of
the read using the reassemble algorithm.
errorpath=1 (ep) For debugging purposes.
Junk-removal parameters:
tossjunk=f Remove reads that cannot be used for assembly.
This means they have no kmers above depth 1 (2 for paired
reads) and the outermost kmers cannot be extended.
Pairs are removed only if both reads fail.
tossdepth=-1 Remove reads containing kmers at or below this depth.
Pairs are removed if either read fails.
lowdepthfraction=0 (ldf) Require at least this fraction of kmers to be
low-depth to discard a read; range 0-1. 0 still
requires at least 1 low-depth kmer.
requirebothbad=f (rbb) Only discard pairs if both reads are low-depth.
tossuncorrectable (tu) Discard reads containing uncorrectable errors.
Requires error-correction to be enabled.
Shaving parameters:
shave=f Remove dead ends (aka hair).
rinse=f Remove bubbles.
maxshavedepth=1 (msd) Shave or rinse kmers at most this deep.
maxshavedepth=1 (msd) Shave or rinse kmers at most this deep.
exploredist=100 (sed) Quit after exploring this far.
discardlength=150 (sdl) Discard shavings up to this long.
Overlap parameters (for overlapping paired-end reads only):
merge=f Attempt to merge reads before counting kmers.
ecco=f Error correct via overlap, but do not merge reads.
Java Parameters:
-Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da Disable assertions.
使用例
淡水魚に寄生する甲殻類であるチョウモドキ (Argulus coregoni) のRNA-seqデータを使用する。約5.4 million reads, 1.6 Gbのデータセット。クオリティトリムのステップは省いた。
seqfu stats ERR3672040_1.fastq.gz ERR3672040_2.fastq.gz
File #Seq Total bp Avg N50 N75 N90 auN Min Max
ERR3672040_1.fastq.gz 5405433 810814950 150.00 150 150 150 0.00 150 150
ERR3672040_2.fastq.gz 5405433 810814950 150.00 150 150 150 0.00 150 150
Tadpole.sh
tadpole.sh mode=contig
でアセンブルする。2>
は標準エラーをファイルにリダイレクト (出力) するおまじない。
tadpole.sh mode=contig in=ERR3672040_1.fastq.gz in2=ERR3672040_2.fastq.gz out=ERR3672040.tadpole.fa 2>ERR3672040.tadpole.err
16 threads, 32GB RAMのPC上でWSL2を用いて解析したところ、30秒程度で完了した。
java -ea -Xmx9360m -Xms9360m -cp /home/kawato/miniconda3/bbtools/lib/current/ assemble.Tadpole mode=contig in=ERR3672040_1.fastq.gz in2=ERR3672040_2.fastq.gz out=ERR3672040.tadpole.fa
Executing assemble.Tadpole1 [mode=contig, in=ERR3672040_1.fastq.gz, in2=ERR3672040_2.fastq.gz, out=ERR3672040.tadpole.fa]
Tadpole version 37.62
Using 16 threads.
Executing kmer.KmerTableSet [mode=contig, in=ERR3672040_1.fastq.gz, in2=ERR3672040_2.fastq.gz, out=ERR3672040.tadpole.fa]
Initial:
Ways=41, initialSize=128000, prefilter=f, prealloc=f
Memory: max=9405m, free=9258m, used=147m
Initialization Time: 0.045 seconds.
Loading kmers.
Estimated kmer capacity: 255962721
After table allocation:
Memory: max=9405m, free=9160m, used=245m
After loading:
Memory: max=8724m, free=4405m, used=4319m
Input: 10810866 reads 1621629900 bases.
Unique Kmers: 117624409
Load Time: 20.703 seconds.
Reads Processed: 10810k 522.18k reads/sec
Bases Processed: 1621m 78.33m bases/sec
Building contigs.
Seeding with min count = 7900, 4646, 2733, 1607, 945, 556, 327, 192, 113, 66, 38, 22, 13, 7, 4, 3
After building contigs:
Memory: max=8724m, free=3659m, used=5065m
Bases generated: 22033332
Contigs generated: 70901
Longest contig: 4060
Contig-building time: 8.265 seconds.
Total Time: 29.027 seconds.
COX1配列の探索
メタバーコーディングで広く用いられているcytochrome oxidase I (COI/COX1) 遺伝子の配列を探索する。
アセンブリに対する相同性検索
クエリ配列は次の通り。
>BCL84741.1 cytochrome c oxidase subunit I (mitochondrion) [Argulus japonicus]
MPSRWLFSTNHKDIGTLYFIFGAASGMLGASLSMMIRFELFQPGSVFNNSQFYNSVVTAHAFIMIFFMVM
PIMIGGFGNWLVPLMLETPDMAFPRLNNMSFWLLPPSLSLLLMGAAVESGAGTGWTVYPPLSSIPAHSGP
SVDLSIFSLHLAGVSSILGAINFIATIFNMRAPNIELSMIPLFVWSVLITAFLLLFSLPVLAGAITMLLT
DRNLNTSFFDPTGGGDPILFQHLFWFFGHPEVYILILPGFGIVSHVIAHSSNKKEVFSKVSMGYAMAVIG
VLGFAVWAHHMFTVGLDTDTRAFFTAATMIIAIPTGIKVFSWLMTVFGSHITYTPDVLWSLGFIFLFTVG
GLTGVVLSNSSIDVALHDTYYVVAHFHYVLSMGAVFAILAGFVHWFPLLTGLTMKDSWLKIQFYMMFAGV
NITFFPQHFLGMMGMPRRYSDYPDAYYSWNMVSSVGSMISFCSTLLLIFTIMEAFYAKRKVVAPLIYSNS
MEWFHSVPPISHTYSEIPMILV
TBLASTNで検索する。
tblastn -query BCL84741.faa -subject ERR3672040.tadpole.fa -outfmt 7 -out BCL84741.tblastn.out
結果は以下の通り。
# TBLASTN 2.13.0+
# Query: BCL84741.1 cytochrome c oxidase subunit I (mitochondrion) [Argulus japonicus]
# Database: User specified sequence set (Input: ERR3672040.tadpole.fa)
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 10 hits found
BCL84741.1 contig_13,length=1444,cov=6178.2,gc=0.352 87.708 480 59 0 18 497 1442 3 0.0 737
BCL84741.1 contig_29781,length=804,cov=11.3,gc=0.430 63.786 243 88 0 264 506 3 731 9.78e-85 262
BCL84741.1 contig_32173,length=545,cov=7.0,gc=0.455 66.298 181 61 0 9 189 2 544 1.93e-55 183
BCL84741.1 contig_44226,length=231,cov=6.8,gc=0.450 83.051 59 10 0 204 262 179 3 5.82e-28 105
BCL84741.1 contig_12340,length=293,cov=31.3,gc=0.334 41.026 39 21 1 183 221 111 1 1.6 28.9
BCL84741.1 contig_13058,length=2772,cov=29.1,gc=0.413 50.000 32 15 1 228 258 2588 2683 1.9 30.4
BCL84741.1 contig_18911,length=378,cov=20.0,gc=0.418 53.571 28 11 1 246 271 274 357 3.4 28.5
BCL84741.1 contig_61023,length=130,cov=3.9,gc=0.369 55.556 18 8 0 267 284 77 24 3.5 26.6
BCL84741.1 contig_8893,length=215,cov=92.4,gc=0.270 45.455 44 17 2 159 197 26 151 6.6 26.6
BCL84741.1 contig_16860,length=218,cov=42.4,gc=0.454 45.455 22 12 0 443 464 208 143 7.4 26.2
# BLAST processed 1 queries
当該配列の抽出
seqfu grep
で当該配列を抽出する。
seqfu grep -n "contig_13," ERR3672040.tadpole.fa >contig_13.fa
結果は以下の通り。
>contig_13,length=1444,cov=6178.2,gc=0.352
GGAACTCTGTGGAATCATTCTATAGAATTTCTATAAATTAGGGGTGCAACAATTTTTCGTTTAGTGTAGAAGGCTTCTATAATGGTAAAGATTAGAAGCAGAGTAGAGCAAAATGAAATTAGAGACCCCACAGAAGAAACTATATTTCATGAATAGTAAGCATCAGGGTAATCTGAATAACGTCGAGGCATTCCTATTATTCCTAAAAAGTGTTGAGGAAAAAATGTAAGATTTACTCCTGTAAATATTAAAGTAAATTGAATTTTTAGTCATGAGTCTTTTATAGTTAATCCTGTTAGCAATGGGAACCAGTGTACAAATCCTGCTAAAATGGCGAATACGGCTCCTATAGATAAGACATAGTGAAAATGTGCTACTACATAGTAAGTGTCATGAAGGGCTACATCAATAGATGAGTTAGAAAGAACAACACCTGTTAATCCTCCTACTGTGAATAAAAAAATAAAACCTAAGGATCAAAGAACATCTGGGGTATAAGTGATATGTCTACCAAATACAGTTATTAGTCATCTAAAAACTTTAATTCCTGTGGGGATAGCAATAATTATAGTAGCAGCAGTAAAGAAAGCTCGGGTGTCTGTGTCTAAACCTACAGTGAATATATGGTGAGCTCATACAGCAAATCCTAATACTCCAATAACTGCTATAGCGTACCCTATTCTTACTTTTCTAAACACTTCTTTTTTATTAGATCTATGAGCAATTACATGAGATACAATTCCAAAGCCGGGAAGAATTAAAATATAAACTTCGGGATGACCGAAAAATCAAAATAAATGTTGAAACAAGATTGGGTCTCCTCCACCTGTGGGATCAAAGAAAGATGTATTTAGGTTTCGATCTGTAAGAAGCATTGTAATGGCCCCTGCTAATACTGGTAAAGAGAACAAAAGAAGGAAAGCTGTAATTAAAACTGATCATACAAAAAGTGGGATTATTGATAGTTCAATATTAGGGGCTCGTATGTTGAAAATTGTGGCAATAAAATTAATTGCCCCTAAAATTGATGAAATTCCTGCTAAGTGGAGAGAGAAAATTCTGAGATCTACTCTAGGGCCTCTATGGGCTGGGATCCTTCTTAATGGAGGGTAAACAGTTCATCCTGTTCCTGCTCCTGATTCTACTGCAGCTCCTATTAGAAGAAGACTGAGCGAAGGAGGAAGAAGTCAAAATCTTATATTATTAAGTCGTGGGAAAGCTATATCGGGAGTTTCTAATATTAAGGGTACTAGTCAATTACCAAATCCCCCAATTATAATAGGTATAACTATGAAAAAAATTATAATGAATGCGTGGGCAGTTACTACTCTGTTGTAGAATTGACTGTTATTAAAAACAGATCCGGGTTGAAAAAGTTCAAACCGAATTATCATTCTTAGCGAAGCTCCTAGTATTCCTGATGCTGCACCAAAGATGAAGTAAA
NCBI BLASTXで確認
NCBI BLASTXで検索し、COX1配列が得られたことを確認した。
注意点
あくまで簡易的な方法のため、キメラ配列などのエラーが多く含まれる可能性がある。データの解釈には注意が必要と思われる。またペアエンドリードの情報を活用したrepeat resolutionやscaffoldingを行わないため、得られる配列は必然的に断片的となる。全長配列を得たい場合はTrinityやSPAdesなどのアセンブラを用いるべきであろう。
References