Annotating genes by integrating multiple BLAST results with AHRD

Last updated at 2017-04-24, posted at 2017-04-24

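What is AHRD?

Automated Assignment of Human Readable Descriptions (AHRD) takes BLAST results against several different DBs, selects the best hits, picks the description that is easiest for a human to read, and assigns it as the annotation.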

Software used

blastp (rmblast-2.2.28)
AHRD v3.3.3

Running BLAST

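Many DBs can be used, but the following three are the easiest to work with because example parameter settings are available for them:

TrEMBL (automatically annotated, not reviewed)
Swiss-Prot (manually annotated and reviewed)
closely related genome (choose a well-annotated genome of a closely related species yourself)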

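The output format must be tab-separated (-outfmt 6).
blast_formatter is handy if you need to convert results saved in BLAST archive format (-outfmt 11) into this format.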
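
To make the tab-separated format concrete, here is a minimal Python sketch that reads such a file. It assumes the default 12 columns of -outfmt 6 (no custom field list); the function name read_outfmt6 and the sample line are made up for illustration and are not part of BLAST or AHRD.

```python
import csv
import io

# Default column order of `-outfmt 6` when no custom field list is given.
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_outfmt6(handle):
    """Yield one dict per hit from a tab-separated BLAST result."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        if len(row) != len(OUTFMT6_COLUMNS):
            raise ValueError(f"expected 12 columns, got {len(row)}: {row}")
        yield dict(zip(OUTFMT6_COLUMNS, row))


# Made-up example line, not real BLAST output.
sample = "gene1\tsp|P00001|TEST_HUMAN\t85.5\t200\t29\t0\t1\t200\t5\t204\t1e-100\t350\n"
hits = list(read_outfmt6(io.StringIO(sample)))
print(hits[0]["sseqid"], hits[0]["bitscore"])
```

A quick check like this catches files that were accidentally written in another output format before they are handed to AHRD.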

blast_run.sh

#!/bin/sh

#$ -S /bin/bash
#$ -cwd
#$ -v PATH

blastp -outfmt 6 -query ../prot.fa \
-db swissprot -evalue 0.0001 \
-max_target_seqs 20 \
-out blast_swiss.fmt6.out
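
The script above covers only Swiss-Prot, while AHRD is fed one result file per DB. As a rough sketch, the same command line can be generated for all three DBs from Python. The DB names trembl and aplysia passed to -db are hypothetical placeholders, and blastp_command is a helper invented for this example; only flags already used in the script above appear.

```python
import shlex


def blastp_command(query, db, out, evalue="0.0001", max_target_seqs=20):
    """Build the argument list for one blastp run with tab-separated output."""
    return [
        "blastp", "-outfmt", "6",
        "-query", query,
        "-db", db,
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_target_seqs),
        "-out", out,
    ]


# Hypothetical DB names mapped to output files; adjust to your own setup.
jobs = {
    "swissprot": "blast_swiss.fmt6.out",
    "trembl": "blast.trembl.fmt6.txt",
    "aplysia": "blast.aply.fmt6.out",
}

commands = [blastp_command("../prot.fa", db, out) for db, out in jobs.items()]
for cmd in commands:
    # Only prints the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```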

Creating the YAML configuration file

ahrd_input.yml


proteins_fasta: ./prot.fa    # full path to the protein FASTA file used as the BLAST query
token_score_bit_score_weight: 0.468 # how much weight the BLAST bit score gets (no need to change)
token_score_database_score_weight: 0.2098 # likewise, no need to change
token_score_overlap_score_weight: 0.3221 # likewise, no need to change
output: ./ahrd_output.csv # name of the output file
blast_dbs:
  swissprot: # name of the DB; choose any name you like
    weight: 653 # weight of this DB (no need to change)
    description_score_bit_score_weight: 2.717061 # likewise, no need to change
    file: ./blast_swiss.fmt6.out # BLAST result file written in tab-separated format
    database: /blast/db/swissprot # location of the DB file (FASTA) used for BLAST
    fasta_header_regex: "^>(?<accession>.+?) .+?Full=(?<description>.+?)($|;.+)" # Java regular expression that says which part of each FASTA header in the DB file is the accession and which part is the description
    blacklist: ./blacklist_descline.txt # hits whose description matches a pattern in the blacklist file are ignored
    filter: ./filter_descline_sprot.txt # text matching a pattern in the filter file is removed from the description
    token_blacklist: ./blacklist_token.txt # words matching token_blacklist are ignored when scoring words, but if the hit is selected on the strength of its other words, they still appear in the output description

  trembl:
    weight: 904
    description_score_bit_score_weight: 2.590211
    file:  ./blast.trembl.fmt6.txt
    database: /blast/db/trembl
    blacklist: ./blacklist_descline.txt
    filter: ./filter_descline_trembl.txt
    token_blacklist: ./blacklist_token.txt
    fasta_header_regex: "^>(?<accession>.+?) +.*?Full=(?<description>.+?)($|;.+| \\{ECO.+)"

    
  aplysia:
    weight: 854
    description_score_bit_score_weight: 2.917405
    file: ./blast.aply.fmt6.out
    database: /blast_Aplysia/GCF_000002075.1_AplCal3.0_protein.faa
    blacklist: ./blacklist_descline.txt
    filter: ./filter_descline_aply.txt
    token_blacklist: ./blacklist_token.txt
    fasta_header_regex: "^>(?<accession>.+?) +(?<description>.+?)( \\[Aplysia.+)"
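
The fasta_header_regex entries are the part most likely to need adjusting, so it helps to try them on a header first. The sketch below does this in Python rather than Java, which is an assumption of this example: Python spells named groups (?P<name>...), so a small hypothetical helper, to_python_regex, converts the Java form. The patterns are written as they look after YAML unescaping (\\{ becomes \{), and both headers are made up.

```python
import re


def to_python_regex(java_regex):
    """Convert Java named groups (?<name>...) to Python's (?P<name>...)."""
    return re.sub(r"\(\?<([A-Za-z][A-Za-z0-9]*)>", r"(?P<\1>", java_regex)


SWISSPROT = to_python_regex(
    r"^>(?<accession>.+?) .+?Full=(?<description>.+?)($|;.+)"
)
TREMBL = to_python_regex(
    r"^>(?<accession>.+?) +.*?Full=(?<description>.+?)($|;.+| \{ECO.+)"
)


def parse_header(pattern, header):
    """Return (accession, description), or None if the header does not match."""
    m = re.match(pattern, header)
    if m is None:
        return None
    return m.group("accession"), m.group("description")


# Made-up headers for illustration only.
sp = ">sp|P00001|TEST_HUMAN RecName: Full=Example kinase 1; AltName: Full=Demo protein"
tr = ">tr|A0A000|A0A000_APLCA SubName: Full=Example transporter {ECO:0000313|EMBL:XXX}"

print(parse_header(SWISSPROT, sp))
print(parse_header(TREMBL, tr))
```

If parse_header returns None for headers taken from your own DB file, the regex needs to be adapted to that header layout before running AHRD.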

Example of blacklist_descline.txt

blacklist_descline.txt

(?i)^similar\s+to
(?i)^probable 
(?i)^putative 
(?i)putative$
(?i)^predicted 
(?i)^uncharacterized
(?i)^uncharacterised
(?i)^TSA:
(?i)^unknown
(?i)^hypothetical
(?i)^unnamed
(?i)whole\s+genome\s+shotgun\s+sequence
(?i)^clone
(?i)[0-9][0-9][0-9][0-9][0-9]
(?i)genomic scaffold
(?i)genomic contig
(?i)genome sequencing data

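(?i) makes a pattern case-insensitive.
[0-9][0-9][0-9][0-9][0-9] is meant to drop entries that are mostly numbers; strictly speaking, it matches any description that contains five consecutive digits.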
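
To see what these patterns do, the following sketch applies a few of them to made-up descriptions. It assumes the patterns are searched anywhere in the description (find semantics) and uses Python's re module, which accepts these particular patterns unchanged; is_blacklisted is a name invented here, not an AHRD function.

```python
import re

# A few of the patterns listed above.
BLACKLIST = [
    r"(?i)^putative ",
    r"(?i)^uncharacterized",
    r"(?i)^hypothetical",
    r"(?i)[0-9][0-9][0-9][0-9][0-9]",
]


def is_blacklisted(description, patterns=BLACKLIST):
    """True if any pattern is found anywhere in the description."""
    return any(re.search(p, description) for p in patterns)


# Made-up descriptions for illustration only.
examples = [
    "Putative kinase",
    "Hexokinase-2",
    "LOC101234 homolog",
    "Zinc finger protein 1234",
]
for desc in examples:
    print(desc, "->", "ignored" if is_blacklisted(desc) else "kept")
```

Note that a description with a four-digit number survives, while any run of five or more digits causes the whole hit to be ignored.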

Example of filter_descline_aply.txt

filter_descline_aply.txt

\sOS=.*$
(?i)OS.*[.].*protein
(?i)^H0.*protein
(?i)contains.*
IPR.*
\w{2,}\d{1,2}(g|G)\d+(\.\d)*\s+
\b\[.*
\b\S+\|\S+\|\S+
\(\s*Fragment\s*\)
^(\s|/|\(|\)|-|\+|\*|,|;|\.|\:|\||\d)+$
PREDICTED:
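
Unlike the blacklist, a filter only removes the matching text and keeps the rest of the description. A minimal sketch of that behaviour with three of the patterns above follows. Applying the patterns in file order and trimming the leftover whitespace are assumptions of this example, and apply_filters is a hypothetical helper.

```python
import re

# Three of the filter patterns listed above.
FILTERS = [
    r"\sOS=.*$",
    r"\(\s*Fragment\s*\)",
    r"PREDICTED:",
]


def apply_filters(description, patterns=FILTERS):
    """Delete every match of every pattern, then trim surrounding whitespace."""
    for p in patterns:
        description = re.sub(p, "", description)
    return description.strip()


# Made-up description for illustration only.
raw = "PREDICTED: hexokinase-2 (Fragment) OS=Aplysia californica"
print(apply_filters(raw))
```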

Example of blacklist_token.txt

blacklist_token.txt

(?i)\bunknown\b
(?i)\bmember\b
(?i)\blike\b
(?i)\bassociated\b
(?i)\bcontaining\b
(?i)\bactivated\b
(?i)\bfamily\b
(?i)\bsubfamily\b
(?i)\binteracting\b
(?i)\bactivity\b
(?i)\bsimilar\b
(?i)\bproduct\b
(?i)\bexpressed\b
(?i)\bpredicted\b
(?i)\bputative\b
(?i)\buncharacterized\b
(?i)\bprobable\b
(?i)\bprotein\b
(?i)\bgene\b
(?i)\btair\b
(?i)\bfragment\b
(?i)\bhomolog\b
(?i)\bcontig\b
(?i)\brelated\b
(?i)\bremark\b
(?i)\b\w?orf(\w?|\d+)\b
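
The token blacklist works on single words rather than whole descriptions. The sketch below shows the idea under simplifying assumptions: tokens are obtained by lowercasing and splitting on whitespace, which may differ from AHRD's own tokenizer, and informative_tokens is a name made up for this example.

```python
import re

# A few of the token patterns listed above.
TOKEN_BLACKLIST = [
    r"(?i)\bfamily\b",
    r"(?i)\bprotein\b",
    r"(?i)\bputative\b",
    r"(?i)\blike\b",
]


def informative_tokens(description, patterns=TOKEN_BLACKLIST):
    """Return the lowercased words that would still count towards scoring."""
    tokens = description.lower().split()
    return [t for t in tokens if not any(re.search(p, t) for p in patterns)]


# Made-up description for illustration only.
desc = "Hexokinase family protein"
print(informative_tokens(desc))  # words left for scoring
print(desc)                      # the description itself is not modified
```

Only the scoring input shrinks here; if this hit wins, the full description including the blacklisted words is what ends up in the output.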

Running AHRD

run.AHRD.sh

#!/bin/sh

#$ -S /bin/bash
#$ -cwd
#$ -v PATH

java -Xmx200g -jar \
~/opt/AHRD/dist/ahrd.jar \
./ahrd_input.yml
