More than 1 year has passed since last update.

Google Colaboratoryを利用してPfamドメイン探索を行う。

Last updated at 2023-04-15Posted at 2023-04-15

Pfamドメイン探索は最も有名なタンパク質のドメイン予想を行うためのプログラムである。
http://pfam.xfam.org/

今回、タンパク質のアミノ酸配列からPfamドメイン探索をpythonを用いて実行する方法を簡単に記載する。

必要なソフトのインストール

まず、HMMERパッケージをインストールする。
http://hmmer.org/documentation.html

condaでもinstall可能であるが、sudo aptを利用した方が簡便である。

!sudo apt-get install hmmer

データベースのダウンロードと解答

以下の方法でデータベースをダウンロードする。
さらに、hmmerの機能を利用して解凍する。

!wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/Pfam-A.hmm.gz
!gunzip Pfam-A.hmm.gz

!hmmpress Pfam-A.hmm

これにより、あらたに4つのファイルが出力される。
無事に出力されれば準備完了である。

解析の実行

解析に用いたファイルをfasta形式で用意し、以下のコマンドにより実行することができる。

!hmmscan --tblout out.txt Pfam-A.hmm sequence.fasta

out.txtファイルに結果が出力される。

subprocessを用いる方法

上記の方法でも実行可能だが、subprocessを用いる方法もある。

import subprocess
from subprocess import Popen

cmd = "hmmscan --tblout out.txt Pfam-A.hmm sequence.fasta"
test = subprocess.Popen(cmd, shell=True, text=True,stdout=subprocess.PIPE, stderr=subprocess.PIPE)
outs, errs = test.communicate()

この、方法でも同じ結果が出力される。

参考

hmm scanのヘルプの出力

!hmmscan -h

# hmmscan :: search sequence(s) against a profile database
# HMMER 3.3 (Nov 2019); http://hmmer.org/
# Copyright (C) 2019 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmscan [-options] <hmmdb> <seqfile>

Basic options:
  -h : show brief help on version and usage

Options controlling output:
  -o <f>           : direct output to file <f>, not stdout
  --tblout <f>     : save parseable table of per-sequence hits to file <f>
  --domtblout <f>  : save parseable table of per-domain hits to file <f>
  --pfamtblout <f> : save table of hits and domains to file, in Pfam format <f>
  --acc            : prefer accessions over names in output
  --noali          : don't output alignments, so output is smaller
  --notextw        : unlimit ASCII text output line width
  --textw <n>      : set max width of ASCII text output lines  [120]  (n>=120)

Options controlling reporting thresholds:
  -E <x>     : report models <= this E-value threshold in output  [10.0]  (x>0)
  -T <x>     : report models >= this score threshold in output
  --domE <x> : report domains <= this E-value threshold in output  [10.0]  (x>0)
  --domT <x> : report domains >= this score cutoff in output

Options controlling inclusion (significance) thresholds:
  --incE <x>    : consider models <= this E-value threshold as significant
  --incT <x>    : consider models >= this score threshold as significant
  --incdomE <x> : consider domains <= this E-value threshold as significant
  --incdomT <x> : consider domains >= this score threshold as significant

Options for model-specific thresholding:
  --cut_ga : use profile's GA gathering cutoffs to set all thresholding
  --cut_nc : use profile's NC noise cutoffs to set all thresholding
  --cut_tc : use profile's TC trusted cutoffs to set all thresholding

Options controlling acceleration heuristics:
  --max    : Turn all heuristic filters off (less speed, more power)
  --F1 <x> : MSV threshold: promote hits w/ P <= F1  [0.02]
  --F2 <x> : Vit threshold: promote hits w/ P <= F2  [1e-3]
  --F3 <x> : Fwd threshold: promote hits w/ P <= F3  [1e-5]
  --nobias : turn off composition bias filter

Other expert options:
  --nonull2     : turn off biased composition score corrections
  -Z <x>        : set # of comparisons done, for E-value calculation
  --domZ <x>    : set # of significant seqs, for domain E-value calculation
  --seed <n>    : set RNG seed to <n> (if 0: one-time arbitrary seed)  [42]
  --qformat <s> : assert input <seqfile> is in format <s>: no autodetection
  --cpu <n>     : number of parallel CPU workers to use for multithreads  [2]

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up