
Converting CD-HIT .clstr output (and BLAST output) from sequence clustering to BED format

Posted at 2019-11-29, last updated at 2019-11-29

What is CD-HIT?

・A sequence clustering program (http://weizhongli-lab.org/cd-hit/)

Input file format

・The sequences you want to cluster, supplied as a FASTA file

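Output file formats

Two kinds:
・A FASTA file of the sequences left after clustering (the representative of each cluster)
・A .clstr file listing the members of each numbered cluster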

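Problem

The .clstr file is awkward to work with. The same goes for the output files of BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi ).
→ I want to convert them to BED format so I can make full use of BEDTools (https://bedtools.readthedocs.io/en/latest/ ).

In other words, take a file like this one, produced by splitting the .clstr output into one file per cluster with the csplit command (https://linuxcommand.net/csplit/ ) or similar: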

hogehoge_clstr0.clstr

>Cluster 0
0	570nt, >chr1:7662318-7662888... at -/95.96%
1	178nt, >chr1:45648847-45649025... at +/99.44%
2	623nt, >chr1:51329409-51330032... at -/97.11%
3	187nt, >chr1:58841900-58842087... at -/98.40%
4	180nt, >chr1:60684077-60684257... at -/98.89%
5	629nt, >chr1:61108647-61109276... at +/96.98%
6	218nt, >chr1:61241171-61241389... at -/95.41%
7	748nt, >chr1:65918300-65919048... at -/96.79%
8	393nt, >chr1:67547311-67547704... at -/96.95%

and turn it into this:

hogehoge_clstr0.bed

chr1	7662318	7662888
chr1	45648847	45649025
chr1	51329409	51330032
chr1	58841900	58842087
chr1	60684077	60684257
chr1	61108647	61109276
chr1	61241171	61241389
chr1	65918300	65919048
chr1	67547311	67547704

That is the goal.
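For the per-cluster split mentioned above, a minimal sketch with GNU csplit might look like the following. The input name hogehoge.clstr and its two tiny clusters are made up for illustration, and the -z and -b options assume the GNU coreutils version of csplit.

```shell
# Hypothetical input: a tiny two-cluster .clstr file standing in for real CD-HIT output
cat > hogehoge.clstr <<'EOF'
>Cluster 0
0	570nt, >chr1:7662318-7662888... *
1	178nt, >chr1:45648847-45649025... at +/99.44%
>Cluster 1
0	623nt, >chr1:51329409-51330032... *
EOF

# Split before every ">Cluster" header line (GNU csplit):
#   -s     do not print the size of each piece
#   -z     do not create the empty piece that precedes the first header
#   -f     prefix of the output file names
#   -b     printf-style suffix, giving hogehoge_clstr0.clstr, hogehoge_clstr1.clstr, ...
#   '{*}'  repeat the pattern for as long as it keeps matching
csplit -s -z -f hogehoge_clstr -b '%d.clstr' hogehoge.clstr '/^>Cluster/' '{*}'

# List the per-cluster files that were written
ls hogehoge_clstr*.clstr
```

Each resulting file starts with its own >Cluster header, which is the shape the conversion in the next section expects.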

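Solution

This time I assume CD-HIT output restricted to human autosomes (the regular expression captures only the chromosome number, and "chr" is added back at the end), but if you adjust the grep regular expression to taste, the same approach works for any species and for BLAST output too.

grep -o extracts only the part of each line that matches the regular expression. That is what I take advantage of here.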
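To see that behaviour in isolation, here is a small sketch that runs the extraction on a single record copied from the example file above. The variable names are only for illustration.

```shell
# One record copied from the example .clstr file
line='0	570nt, >chr1:7662318-7662888... at -/95.96%'

# -o prints only the matched text, not the whole line
matched=$(printf '%s\n' "$line" | grep -o '[0-9]\+:[0-9]\+-[0-9]\+')
echo "$matched"   # prints 1:7662318-7662888

# Turning ':' and '-' into tabs gives the three columns (still without the "chr" prefix)
fields=$(printf '%s\n' "$matched" | tr ':-' '\t\t')
echo "$fields"
```

Note that the match begins at the digit before the colon, so the "chr" prefix is dropped at this stage and has to be added back later.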

bash

grep -o '[0-9]\+:[0-9]\+-[0-9]\+' hogehoge_clstr0.clstr | tr ':-' '\t\t' | sort -k1,1n -k2,2n | awk '{print "chr"$0}' > hogehoge_clstr0.bed
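As one possible way of adjusting the regular expression for non-autosomal names, the sketch below captures the whole name from "chr" up to the colon, so chrX or chrY also match and nothing has to be prepended afterwards. It assumes that every sequence name starts with "chr" and contains no ':' or '-' of its own. The file sample.clstr and its records are made up for illustration.

```shell
# Hypothetical input: made-up records that include a sex chromosome
cat > sample.clstr <<'EOF'
>Cluster 0
0	570nt, >chrX:7662318-7662888... *
1	178nt, >chr2:45648847-45649025... at +/99.44%
2	623nt, >chr2:1329409-1330032... at -/97.11%
EOF

# Capture "chr<name>:<start>-<end>" as a whole, then split it into three columns.
# Column 1 is sorted as text (C locale for a reproducible order), column 2 numerically.
grep -o 'chr[^:]*:[0-9]\+-[0-9]\+' sample.clstr \
  | tr ':-' '\t\t' \
  | LC_ALL=C sort -k1,1 -k2,2n > sample.bed

cat sample.bed
```

Sorting column 1 as text rather than numerically means chr10 comes before chr2. That is the lexicographic chromosome order that `sort -k1,1 -k2,2n` produces in general.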

The shell is mighty.
That's all.
