1
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 1 year has passed since last update.

TadpoleでNGSデータを簡易的にde novo アセンブルし遺伝子の断片配列を得る

Posted at

RT-PCRやメタバーコーディング用のプライマー設計などの用途で、NGSデータから手っ取り早く遺伝子の断片配列を得たいことがある。これは非モデル生物において特に重要だ。TrinityやSPAdes、MEGAHITといったde novoアセンブラは高性能だが、そのぶん計算に時間がかかる。

簡易的なde novoアセンブラのTadpoleを用いることで、RT-PCRのinternal controlに用いる ハウスキーピング遺伝子や、メタバーコーディングプライマー設計に用いる16S rDNAやcytochrome oxidase Iといった遺伝子の断片配列をNGSデータから迅速に得られると期待される。

Tadpole

Tadpoleが同梱されているBBToolsは、米Joint Genome Institute (JGI)が開発しているNGSデータ用ツールキット。Macでインフォマティクスで日本語の詳しい解説がみられる。SourceForgeまたはconda経由でインストールするとよい。

BBToolsに含まれるTadpoleは、NGSデータからk-merを抽出して (=データを圧縮して) de novo アセンブルしたり、リードのエラーを修正するプログラム。他のショートリードアセンブラと大まかには同じ仕組みで動いているが、メモリを節約しつつ高速で計算できる。

$ tadpole.sh

Written by Brian Bushnell
Last modified June 27, 2017

Description:  Uses kmer counts to assemble contigs, extend sequences,
or error-correct reads.  Tadpole has no upper bound for kmer length,
but some values are not supported.  Specifically, it allows 1-31,
multiples of 2 from 32-62, multiples of 3 from 63-93, etc.
Please read bbmap/docs/guides/TadpoleGuide.txt for more information.

Usage:
Assembly:     tadpole.sh in=<reads> out=<contigs>
Extension:    tadpole.sh in=<reads> out=<extended> mode=extend
Correction:   tadpole.sh in=<reads> out=<corrected> mode=correct

Extension and correction may be done simultaneously.  Error correction on
multiple files may be done like this:

tadpole.sh in=libA_r1.fq,libA_merged.fq in2=libA_r2.fq,null extra=libB_r1.fq out=ecc_libA_r1.fq,ecc_libA_merged.fq out2=ecc_libA_r2.fq,null mode=correct

Extending contigs with reads could be done like this:

tadpole.sh in=contigs.fa out=extended.fa el=100 er=100 mode=extend extra=reads.fq k=62


Input parameters:
in=<file>           Primary input file for reads to use as kmer data.
in2=<file>          Second input file for paired data.
extra=<file>        Extra files for use as kmer data, but not for error-
                    correction or extension.
reads=-1            Only process this number of reads, then quit (-1 means all).
NOTE: in, in2, and extra may also be comma-delimited lists of files.

Output parameters:
out=<file>          Write contigs (in contig mode) or corrected/extended
                    reads (in other modes).
outd=<file>         Write discarded reads, if using junk-removal flags.
dump=<file>         Write kmers and their counts.
fastadump=t         Write kmers and counts as fasta versus 2-column tsv.
mincounttodump=1    Only dump kmers with at least this depth.
showstats=t         Print assembly statistics after writing contigs.

Prefiltering parameters:
prefilter=0         If set to a positive integer, use a countmin sketch
                    to ignore kmers with depth of that value or lower.
prehashes=2         Number of hashes for prefilter.
prefiltersize=0.2   (pff) Fraction of memory to use for prefilter.
minprobprefilter=t  (mpp) Use minprob for the prefilter.
prepasses=1         Use this many prefiltering passes; higher be more thorough
                    if the filter is very full.  Set to 'auto' to iteratively
                    prefilter until the remaining kmers will fit in memory.
onepass=f           If true, prefilter will be generated in same pass as kmer
                    counts.  Much faster but counts will be lower, by up to
                    prefilter's depth limit.

Hashing parameters:
k=31                Kmer length (1 to infinity).  Memory use increases with K.
prealloc=t          Pre-allocate memory rather than dynamically growing;
                    faster and more memory-efficient.  A float fraction (0-1)
                    may be specified; default is 1.
minprob=0.5         Ignore kmers with overall probability of correctness below this.
minprobmain=t       (mpm) Use minprob for the primary kmer counts.
threads=X           Spawn X hashing threads (default is number of logical processors).
rcomp=t             Store and count each kmer together and its reverse-complement.

Assembly parameters:
mincountseed=3      (mcs) Minimum kmer count to seed a new contig or begin extension.
mincountextend=2    (mce) Minimum kmer count continue extension of a read or contig.
mincountretain=0    (mincr) Discard kmers with count below this.
maxcountretain=INF  (maxcr) Discard kmers with count above this.
branchmult1=20      (bm1) Min ratio of 1st to 2nd-greatest path depth at high depth.
branchmult2=3       (bm2) Min ratio of 1st to 2nd-greatest path depth at low depth.
branchlower=3       (blc) Max value of 2nd-greatest path depth to be considered low.
minextension=1      (mine) Do not keep contigs that did not extend at least this much.
mincontig=auto      (minc) Do not write contigs shorter than this.
mincoverage=1       (mincov) Do not write contigs with average coverage below this.
trimends=0          (trim) Trim contig ends by this much.  Trimming by K/2
                    may yield more accurate genome size estimation.
contigpasses=16     Build contigs with decreasing seed depth for this many iterations.
contigpassmult=1.7  Ratio between seed depth of two iterations.
ownership=auto      For concurrency; do not touch.

Processing modes:
mode=contig         contig: Make contigs from kmers.
                    extend: Extend sequences to be longer, and optionally
                            perform error correction.
                    correct: Error correct only.
                    insert: Measure insert sizes.

Extension parameters:
extendleft=100      (el) Extend to the left by at most this many bases.
extendright=100     (er) Extend to the right by at most this many bases.
ibb=t               (ignorebackbranches) Do not stop at backward branches.
extendrollback=3    Trim a random number of bases, up to this many, on reads
                    that extend only partially.  This prevents the creation
                    of sharp coverage discontinuities at branches.

Error-correction parameters:
ecc=f               Error correct via kmer counts.
reassemble=t        If ecc is enabled, use the reassemble algorithm.
pincer=f            If ecc is enabled, use the pincer algorithm.
tail=f              If ecc is enabled, use the tail algorithm.
eccfull=f           If ecc is enabled, use tail over the entire read.
aggressive=f        (aecc) Use aggressive error correction settings.
                    Overrides some other flags like errormult1 and deadzone.
conservative=f      (cecc) Use conservative error correction settings.
                    Overrides some other flags like errormult1 and deadzone.
rollback=t          Undo changes to reads that have lower coverage for
                    any kmer after correction.
markbadbases=0      (mbb) Any base fully covered by kmers with count below
                    this will have its quality reduced.
markdeltaonly=t     (mdo) Only mark bad bases adjacent to good bases.
meo=t               (markerrorreadsonly) Only mark bad bases in reads
                    containing errors.
markquality=0       (mq) Set quality scores for marked bases to this.
                    A level of 0 will also convert the base to an N.
errormult1=16       (em1) Min ratio between kmer depths to call an error.
errormult2=2.6      (em2) Alternate ratio between low-depth kmers.
errorlowerconst=3   (elc) Use mult2 when the lower kmer is at most this deep.
mincountcorrect=3   (mcc) Don't correct to kmers with count under this.
pathsimilarityfraction=0.45(psf) Max difference ratio considered similar.
                           Controls whether a path appears to be continuous.
pathsimilarityconstant=3   (psc) Absolute differences below this are ignored.
errorextensionreassemble=5 (eer) Verify this many kmers before the error as
                           having similar depth, for reassemble.
errorextensionpincer=5     (eep) Verify this many additional bases after the
                           error as matching current bases, for pincer.
errorextensiontail=9       (eet) Verify additional bases before and after
                           the error as matching current bases, for tail.
deadzone=0          (dz) Do not try to correct bases within this distance of
                    read ends.
window=12           (w) Length of window to use in reassemble mode.
windowcount=6       (wc) If more than this many errors are found within a
                    a window, halt correction in that direction.
qualsum=80          (qs) If the sum of the qualities of corrected bases within
                    a window exceeds this, halt correction in that direction.
rbi=t               (requirebidirectional) Require agreement from both
                    directions when correcting errors in the middle part of
                    the read using the reassemble algorithm.
errorpath=1         (ep) For debugging purposes.

Junk-removal parameters:
tossjunk=f          Remove reads that cannot be used for assembly.
                    This means they have no kmers above depth 1 (2 for paired
                    reads) and the outermost kmers cannot be extended.
                    Pairs are removed only if both reads fail.
tossdepth=-1        Remove reads containing kmers at or below this depth.
                    Pairs are removed if either read fails.
lowdepthfraction=0  (ldf) Require at least this fraction of kmers to be
                    low-depth to discard a read; range 0-1. 0 still
                    requires at least 1 low-depth kmer.
requirebothbad=f    (rbb) Only discard pairs if both reads are low-depth.
tossuncorrectable   (tu) Discard reads containing uncorrectable errors.
                    Requires error-correction to be enabled.

Shaving parameters:
shave=f             Remove dead ends (aka hair).
rinse=f             Remove bubbles.
maxshavedepth=1     (msd) Shave or rinse kmers at most this deep.
maxshavedepth=1     (msd) Shave or rinse kmers at most this deep.
exploredist=100     (sed) Quit after exploring this far.
discardlength=150   (sdl) Discard shavings up to this long.


Overlap parameters (for overlapping paired-end reads only):
merge=f             Attempt to merge reads before counting kmers.
ecco=f              Error correct via overlap, but do not merge reads.

Java Parameters:
-Xmx                This will be passed to Java to set memory usage, overriding the program's automatic memory detection.
                    -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.
-eoom               This flag will cause the process to exit if an out-of-memory exception occurs.  Requires Java 8u92+.
-da                 Disable assertions.

使用例

淡水魚に寄生する甲殻類であるチョウモドキ (Argulus coregoni)RNA-seqデータを使用する。約5.4 million reads, 1.6 Gbのデータセット。クオリティトリムのステップは省いた。

seqfu stats ERR3672040_1.fastq.gz ERR3672040_2.fastq.gz
File    #Seq    Total bp        Avg     N50     N75     N90     auN     Min     Max
ERR3672040_1.fastq.gz   5405433 810814950       150.00  150     150     150     0.00    150     150
ERR3672040_2.fastq.gz   5405433 810814950       150.00  150     150     150     0.00    150     150

Tadpole.sh

tadpole.sh mode=contigでアセンブルする。2>標準エラーをファイルにリダイレクト (出力) するおまじない。

tadpole.sh mode=contig in=ERR3672040_1.fastq.gz in2=ERR3672040_2.fastq.gz out=ERR3672040.tadpole.fa 2>ERR3672040.tadpole.err

16 threads, 32GB RAMのPC上でWSL2を用いて解析したところ、30秒程度で完了した。

ERR3672040.tadpole.err
java -ea -Xmx9360m -Xms9360m -cp /home/kawato/miniconda3/bbtools/lib/current/ assemble.Tadpole mode=contig in=ERR3672040_1.fastq.gz in2=ERR3672040_2.fastq.gz out=ERR3672040.tadpole.fa
Executing assemble.Tadpole1 [mode=contig, in=ERR3672040_1.fastq.gz, in2=ERR3672040_2.fastq.gz, out=ERR3672040.tadpole.fa]

Tadpole version 37.62
Using 16 threads.
Executing kmer.KmerTableSet [mode=contig, in=ERR3672040_1.fastq.gz, in2=ERR3672040_2.fastq.gz, out=ERR3672040.tadpole.fa]

Initial:
Ways=41, initialSize=128000, prefilter=f, prealloc=f
Memory: max=9405m, free=9258m, used=147m

Initialization Time:            0.045 seconds.

Loading kmers.

Estimated kmer capacity:        255962721
After table allocation:
Memory: max=9405m, free=9160m, used=245m

After loading:
Memory: max=8724m, free=4405m, used=4319m

Input:                          10810866 reads          1621629900 bases.
Unique Kmers:                   117624409
Load Time:                      20.703 seconds.

Reads Processed:      10810k    522.18k reads/sec
Bases Processed:       1621m    78.33m bases/sec

Building contigs.

Seeding with min count = 7900, 4646, 2733, 1607, 945, 556, 327, 192, 113, 66, 38, 22, 13, 7, 4, 3

After building contigs:
Memory: max=8724m, free=3659m, used=5065m

Bases generated:                22033332
Contigs generated:              70901
Longest contig:                 4060
Contig-building time:           8.265 seconds.

Total Time:                     29.027 seconds.

COX1配列の探索

メタバーコーディングで広く用いられているcytochrome oxidase I (COI/COX1) 遺伝子の配列を探索する。

アセンブリに対する相同性検索

クエリ配列は次の通り。

BCL84741.faa
>BCL84741.1 cytochrome c oxidase subunit I (mitochondrion) [Argulus japonicus]
MPSRWLFSTNHKDIGTLYFIFGAASGMLGASLSMMIRFELFQPGSVFNNSQFYNSVVTAHAFIMIFFMVM
PIMIGGFGNWLVPLMLETPDMAFPRLNNMSFWLLPPSLSLLLMGAAVESGAGTGWTVYPPLSSIPAHSGP
SVDLSIFSLHLAGVSSILGAINFIATIFNMRAPNIELSMIPLFVWSVLITAFLLLFSLPVLAGAITMLLT
DRNLNTSFFDPTGGGDPILFQHLFWFFGHPEVYILILPGFGIVSHVIAHSSNKKEVFSKVSMGYAMAVIG
VLGFAVWAHHMFTVGLDTDTRAFFTAATMIIAIPTGIKVFSWLMTVFGSHITYTPDVLWSLGFIFLFTVG
GLTGVVLSNSSIDVALHDTYYVVAHFHYVLSMGAVFAILAGFVHWFPLLTGLTMKDSWLKIQFYMMFAGV
NITFFPQHFLGMMGMPRRYSDYPDAYYSWNMVSSVGSMISFCSTLLLIFTIMEAFYAKRKVVAPLIYSNS
MEWFHSVPPISHTYSEIPMILV

TBLASTNで検索する。

tblastn -query BCL84741.faa -subject ERR3672040.tadpole.fa  -outfmt 7 -out  BCL84741.tblastn.out

結果は以下の通り。

BCL84741.tblastn.out
# TBLASTN 2.13.0+
# Query: BCL84741.1 cytochrome c oxidase subunit I (mitochondrion) [Argulus japonicus]
# Database: User specified sequence set (Input: ERR3672040.tadpole.fa)
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 10 hits found
BCL84741.1      contig_13,length=1444,cov=6178.2,gc=0.352       87.708  480     59      0       18      497     1442    3       0.0     737
BCL84741.1      contig_29781,length=804,cov=11.3,gc=0.430       63.786  243     88      0       264     506     3       731     9.78e-85        262
BCL84741.1      contig_32173,length=545,cov=7.0,gc=0.455        66.298  181     61      0       9       189     2       544     1.93e-55        183
BCL84741.1      contig_44226,length=231,cov=6.8,gc=0.450        83.051  59      10      0       204     262     179     3       5.82e-28        105
BCL84741.1      contig_12340,length=293,cov=31.3,gc=0.334       41.026  39      21      1       183     221     111     1       1.6     28.9
BCL84741.1      contig_13058,length=2772,cov=29.1,gc=0.413      50.000  32      15      1       228     258     2588    2683    1.9     30.4
BCL84741.1      contig_18911,length=378,cov=20.0,gc=0.418       53.571  28      11      1       246     271     274     357     3.4     28.5
BCL84741.1      contig_61023,length=130,cov=3.9,gc=0.369        55.556  18      8       0       267     284     77      24      3.5     26.6
BCL84741.1      contig_8893,length=215,cov=92.4,gc=0.270        45.455  44      17      2       159     197     26      151     6.6     26.6
BCL84741.1      contig_16860,length=218,cov=42.4,gc=0.454       45.455  22      12      0       443     464     208     143     7.4     26.2
# BLAST processed 1 queries

当該配列の抽出

seqfu grepで当該配列を抽出する。

seqfu grep -n "contig_13," ERR3672040.tadpole.fa >contig_13.fa

結果は以下の通り。

contig_13.fa
>contig_13,length=1444,cov=6178.2,gc=0.352
GGAACTCTGTGGAATCATTCTATAGAATTTCTATAAATTAGGGGTGCAACAATTTTTCGTTTAGTGTAGAAGGCTTCTATAATGGTAAAGATTAGAAGCAGAGTAGAGCAAAATGAAATTAGAGACCCCACAGAAGAAACTATATTTCATGAATAGTAAGCATCAGGGTAATCTGAATAACGTCGAGGCATTCCTATTATTCCTAAAAAGTGTTGAGGAAAAAATGTAAGATTTACTCCTGTAAATATTAAAGTAAATTGAATTTTTAGTCATGAGTCTTTTATAGTTAATCCTGTTAGCAATGGGAACCAGTGTACAAATCCTGCTAAAATGGCGAATACGGCTCCTATAGATAAGACATAGTGAAAATGTGCTACTACATAGTAAGTGTCATGAAGGGCTACATCAATAGATGAGTTAGAAAGAACAACACCTGTTAATCCTCCTACTGTGAATAAAAAAATAAAACCTAAGGATCAAAGAACATCTGGGGTATAAGTGATATGTCTACCAAATACAGTTATTAGTCATCTAAAAACTTTAATTCCTGTGGGGATAGCAATAATTATAGTAGCAGCAGTAAAGAAAGCTCGGGTGTCTGTGTCTAAACCTACAGTGAATATATGGTGAGCTCATACAGCAAATCCTAATACTCCAATAACTGCTATAGCGTACCCTATTCTTACTTTTCTAAACACTTCTTTTTTATTAGATCTATGAGCAATTACATGAGATACAATTCCAAAGCCGGGAAGAATTAAAATATAAACTTCGGGATGACCGAAAAATCAAAATAAATGTTGAAACAAGATTGGGTCTCCTCCACCTGTGGGATCAAAGAAAGATGTATTTAGGTTTCGATCTGTAAGAAGCATTGTAATGGCCCCTGCTAATACTGGTAAAGAGAACAAAAGAAGGAAAGCTGTAATTAAAACTGATCATACAAAAAGTGGGATTATTGATAGTTCAATATTAGGGGCTCGTATGTTGAAAATTGTGGCAATAAAATTAATTGCCCCTAAAATTGATGAAATTCCTGCTAAGTGGAGAGAGAAAATTCTGAGATCTACTCTAGGGCCTCTATGGGCTGGGATCCTTCTTAATGGAGGGTAAACAGTTCATCCTGTTCCTGCTCCTGATTCTACTGCAGCTCCTATTAGAAGAAGACTGAGCGAAGGAGGAAGAAGTCAAAATCTTATATTATTAAGTCGTGGGAAAGCTATATCGGGAGTTTCTAATATTAAGGGTACTAGTCAATTACCAAATCCCCCAATTATAATAGGTATAACTATGAAAAAAATTATAATGAATGCGTGGGCAGTTACTACTCTGTTGTAGAATTGACTGTTATTAAAAACAGATCCGGGTTGAAAAAGTTCAAACCGAATTATCATTCTTAGCGAAGCTCCTAGTATTCCTGATGCTGCACCAAAGATGAAGTAAA

NCBI BLASTXで確認

NCBI BLASTXで検索し、COX1配列が得られたことを確認した。
image.png

注意点

あくまで簡易的な方法のため、キメラ配列などのエラーが多く含まれる可能性がある。データの解釈には注意が必要と思われる。またペアエンドリードの情報を活用したrepeat resolutionやscaffoldingを行わないため、得られる配列は必然的に断片的となる。全長配列を得たい場合はTrinitySPAdesなどのアセンブラを用いるべきであろう。

References

1
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
1
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?