7

More than 5 years have passed since last update.

@antiplastics(Koki Tsuyuzaki)

k-mer countingに関するメモ

bioinformatics

Last updated at 2014-12-31Posted at 2014-12-31

次世代シーケンサー（NGS）のデータ解析に欠く事ができない基盤技術である、k-mer counting法のまとめ

k-merとは、NGSが出力したリードの塩基配列を決まった文字数切り出したもの http://en.wikipedia.org/wiki/K-mer

ゲノム・トランスクリプトームアセンブル時のエラー補正
ゲノムサイズの推定
マルチプルアライメント（MUSCLE）
実験のQuality Control
RNAの定量化（Sailfish、RNA-Skim）

...等々、様々な箇所で利用されている。

データ内にどのk-merがどの程度あるのか数える事をk-mer countingという。k-mer countingを行うソフトウェアは以下のようなものがある。
ざっと見て、DSKやStream Algorithmのように省メモリタイプのものと、Jellyfishのような高速なタイプがある。

khmer : Bloom-filterベース、比較論文になっているから必読

Count-min Hashを利用している
http://en.wikipedia.org/wiki/Count–min_sketch
http://en.wikipedia.org/wiki/MinHash
http://www.slideshare.net/iwiwi/minhash?related=2
These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure, PLOS ONE, 2014

MSPKmerCounter : 論文が無いがDSKに似ているらしい

KmerStream : Stream Algorithmベース

KmerStream: Streaming algorithms for k-mer abundance estimation, Bioinformatics, 2014

KAnalyze : アルゴリズムは不明、Javaで実装されている、Jellyfishよりも速いらしい

KAnalyze: a fast versatile pipelined K-mer toolkit, Bioinformatics, 2014

KMC 2 : DSKと似たような感じらしい

KMC 2: Fast and resource-frugal k-mer counting, Bioinformatics, 2014

KMC : DSKと似たような感じらしい

Disk-based k-mer counting on a PC, BMC Bioinformatics, 2013

DSK : Disk-Streaming Algorithmベース、省メモリ

DSK: k-mer counting with very low memory usage, Bioinformatics, 2013

Turtle : Pattern-blocked Bloom filterベース

Turtle: Identifying frequent k-mers with cache-efficient algorithms, Bioinformatics, 2013

BF Couter : Bloom-filterベース

Efficient counting of k-mers in DNA sequences using a bloom filter, BMC Bioinformatics, 2011

Jellyfish : Hashベース、Lock-free、マルチスレッド対応、速い

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 2011

GkArray : Generalized Suffix Arrayベース

Querying large read collections in main memory: a versatile data structure, BMC Bioinformatics, 2011

Tallymer : Suffix Arrayベース、トウモロコシゲノムを読むために作られた

A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes, BMC Genomics ,2008

Meryl : Celeraアセンブラの内部で利用されている

Aggressive assembly of pyrosequencing reads with mates, Bioinformatics, 2008

参考

OMIC tools
Homolog.us – Bioinformatics

7

Register as a new user and use Qiita more conveniently

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

7