
Using antiSMASH ver. 7 on Google Colaboratory

Last updated at 2024-02-06, posted at 2023-05-10

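antiSMASH is a tool for comprehensive analysis of the secondary metabolite biosynthetic genes carried by an organism.
It was developed mainly by groups at the Technical University of Denmark, Wageningen University, and Leiden University.
With the release of antiSMASH version 7, I have rewritten my instructions for using antiSMASH on Google Colaboratory.
For details on antiSMASH itself, see the pages below.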

Reference
https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad344/7151336?login=true

antiSMASH home page (the web version; if it meets your needs, using it instead is perfectly fine.)
https://antismash.secondarymetabolites.org/

The usage of version 6 is explained at the following link.
https://qiita.com/ykguitar1002000/items/b4630064b7713df6ef7a

The basic workflow is almost unchanged from what that page describes.

Installing antiSMASH version 7

Install antiSMASH in the first cell.

%%bash
wget https://dl.secondarymetabolites.org/releases/7.0.0/antismash-7.0.0.tar.gz
tar -zxf antismash-7.0.0.tar.gz
pip install ./antismash-7.0.0

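Installing the programs antiSMASH depends on

Next, install the programs that antiSMASH needs at run time.
Version 6 required pinning biopython to version 1.79, but this version worked without pinning. (Addendum: the required biopython version appears to be installed automatically during the antiSMASH installation, so installing biopython separately is unnecessary.)
In the long run, pinning a version may become necessary.
For version 6 I described installing with conda, but sudo apt-get is faster and simpler, so that is the method used here.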

%%bash
sudo apt-get install -y hmmer2 hmmer diamond-aligner fasttree prodigal ncbi-blast+ muscle #glimmerhmm
#pip install biopython # Appears to be unnecessary.

MEME Suite is not installed in this setup, so the programs that depend on it cannot be run.
*I am considering adding notes on MEME at a later date.
*glimmerhmm does not appear to have been installed.
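
Before moving on, it can help to see which external programs are actually reachable from the notebook. The following is a minimal sketch, not part of antiSMASH: the executable names in TOOLS are assumptions derived from the packages installed above (binary names can differ between distributions), and missing_tools is a hypothetical helper.

```python
import shutil

# Executable names are assumptions based on the packages installed above;
# adjust them if the binaries are named differently on your system.
TOOLS = [
    "hmmsearch", "hmmpfam2", "diamond", "fasttree", "prodigal",
    "blastp", "muscle", "meme", "fimo", "glimmerhmm",
]


def missing_tools(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


print("Not found on PATH:", missing_tools(TOOLS))
```

With the setup on this page, meme, fimo, and glimmerhmm would be expected in the list, since they were not installed.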

Downloading the antiSMASH databases

Next, download the antiSMASH databases with the following command.
This takes a fair amount of time.

%%bash
download-antismash-databases

Checking that everything required is installed

Use the following command to check that all the required software is in place.

!antismash --check-prereqs

It prints the following output.

ERROR    10/05 03:06:48   antismash.detection.cassis: preqrequisite failure: Failed to locate executable for 'meme'
ERROR    10/05 03:06:48   antismash.detection.cassis: preqrequisite failure: Failed to locate executable for 'fimo'
Some module prerequisites not satisfied

Because MEME is not installed, these items are reported as errors.
Features that do not use MEME remain available.
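
When the check prints many lines, the missing executables can be pulled out programmatically. The sketch below assumes the log format shown above; missing_executables is a hypothetical helper, and the pattern deliberately accepts any single word before "failure" so that it does not depend on the exact spelling in the log.

```python
import re

# Matches lines such as:
#   antismash.detection.cassis: preqrequisite failure: Failed to locate executable for 'meme'
PATTERN = re.compile(
    r"(\S+): \w+ failure: Failed to locate executable for '([^']+)'"
)


def missing_executables(log_text):
    """Map each missing executable to the module that reported it."""
    return {exe: module for module, exe in PATTERN.findall(log_text)}


sample = """\
ERROR    10/05 03:06:48   antismash.detection.cassis: preqrequisite failure: Failed to locate executable for 'meme'
ERROR    10/05 03:06:48   antismash.detection.cassis: preqrequisite failure: Failed to locate executable for 'fimo'
Some module prerequisites not satisfied
"""
print(missing_executables(sample))
```

To apply this to a real run, the text printed by `antismash --check-prereqs` can be captured (for example with subprocess.run and capture_output=True) and passed to the function.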

Checking the help

The help lists the available command-line options.

!antismash -h

Output like the following is printed.
It lists the options you can use.
Using CASSIS and RODEO requires MEME to be installed.
With the setup on this page they are not available.

########### antiSMASH 7.0.0 #############

usage: antismash [-h] [options ..] sequence

arguments:
  SEQUENCE  GenBank/EMBL/FASTA file(s) containing DNA.

--------
Options
--------
Basic analysis options:

  --taxon {bacteria,fungi}
    Taxonomic classification of input sequence.(default: bacteria)

Additional analysis:

  --fullhmmer
    Run a whole-genome HMMer analysis.
  --cassis
    Motif based prediction of SM gene cluster regions.
  --clusterhmmer
    Run a cluster-limited HMMer analysis.
  --tigrfam
    Annotate clusters using TIGRFam profiles.
  --asf
    Run active site finder analysis.
  --cc-mibig
    Run a comparison against the MIBiG dataset
  --cb-general
    Compare identified clusters against a database of antiSMASH-predicted clusters.
  --cb-subclusters
    Compare identified clusters against known subclusters responsible for synthesising precursors.
  --cb-knownclusters
    Compare identified clusters against known gene clusters from the MIBiG database.
  --pfam2go
    Run Pfam to Gene Ontology mapping module.
  --rre
    Run RREFinder precision mode on all RiPP gene clusters.
  --smcog-trees
    Generate phylogenetic trees of sec. met. cluster orthologous groups.
  --tfbs
    Run TFBS finder on all gene clusters.
  --tta-threshold TTA_THRESHOLD
    Lowest GC content to annotate TTA codons at (default:0.65).

Output options:

  --output-dir OUTPUT_DIR
    Directory to write results to.
  --output-basename OUTPUT_BASENAME
    Base filename to use for output files within the output directory.
  --html-title HTML_TITLE
    Custom title for the HTML output page (default is input filename).
  --html-description HTML_DESCRIPTION
    Custom description to add to the output.
  --html-start-compact
    Use compact view by default for overview page.

Gene finding options (ignored when ORFs are annotated):

  --genefinding-tool {glimmerhmm,prodigal,prodigal-m,none,error}
    Specify algorithm used for gene finding: GlimmerHMM, Prodigal, Prodigal Metagenomic/Anonymous mode, or none. The 'error' option will raise an error if genefinding is attempted. The 'none' option will not run genefinding.(default: error).
  --genefinding-gff3 GFF3_FILE
    Specify GFF3 file to extract features from.

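Running antiSMASH

The basic analysis can be run with the following command.
Replace xxxx with the input file name and yyyy with the name of the output directory.
Upload the file before running the command.

# Run antiSMASH.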
!antismash xxxx.gb --output-dir yyyy

The output folder can be compressed for download as follows.

import shutil 
shutil.make_archive('yyyy','zip',root_dir='yyyy')
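
The archive step can be wrapped so that a mistyped folder name fails early instead of producing an empty zip. This is a minimal sketch; zip_results is a hypothetical helper, and the demo runs on a throwaway directory rather than on real antiSMASH output.

```python
import shutil
import tempfile
from pathlib import Path


def zip_results(output_dir):
    """Zip output_dir into <output_dir>.zip and return the archive path."""
    output_dir = Path(output_dir)
    if not output_dir.is_dir():
        raise FileNotFoundError(f"no such output directory: {output_dir}")
    # base_name has no extension; make_archive appends ".zip" itself.
    return shutil.make_archive(str(output_dir), "zip", root_dir=str(output_dir))


# Demo on a temporary stand-in for the antiSMASH output folder.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "yyyy"
    demo.mkdir()
    (demo / "index.html").write_text("<html></html>")
    print(zip_results(demo))
```

In Colab, the returned path can then be handed to `files.download` from `google.colab` to save the archive locally.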

To run the optional analyses as well, use the following command.
Because MEME is not installed, some options do not work.

# Run with the optional analyses included
!antismash xxxx.gb --output-dir yyyy --fullhmmer --clusterhmmer --tigrfam --smcog-trees --cb-general --cb-subclusters --cb-knownclusters --asf --pfam2go --cc-mibig
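
Long option lists are easier to maintain when the command is assembled from a list. The sketch below only builds and prints the command; build_command and OPTIONAL_ANALYSES are hypothetical names, while the flags themselves are the ones used in the command above.

```python
import shlex

# The optional analyses used in the command above.
OPTIONAL_ANALYSES = [
    "fullhmmer", "clusterhmmer", "tigrfam", "smcog-trees", "cb-general",
    "cb-subclusters", "cb-knownclusters", "asf", "pfam2go", "cc-mibig",
]


def build_command(sequence, output_dir, analyses=OPTIONAL_ANALYSES):
    """Return the antiSMASH invocation as an argument list."""
    flags = [f"--{name}" for name in analyses]
    return ["antismash", sequence, "--output-dir", output_dir, *flags]


cmd = build_command("xxxx.gb", "yyyy", ["fullhmmer", "cc-mibig"])
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) runs it without going through a shell, so file names with spaces need no extra quoting.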

Files stored on Google Drive can also be processed in a batch.
The method is basically the same as for version 6.
It is reproduced here for convenience.

# Import the required modules
from google.colab import drive
import subprocess
from subprocess import Popen
import os

# Mount the drive; Google will ask for permission
drive.mount('/content/drive')

# Replace yyyy with the folder that holds your data.
DRIVE = 'MyDrive/yyyy'

# Append a trailing "/" to DRIVE if it is missing
if not DRIVE.endswith('/'):
  DRIVE = DRIVE + '/'
DRIVE_2 = '/content/drive/' + DRIVE

# Define a function that runs antiSMASH with the full set of options. To run without them, set "anti_smash_opt = ''".
def antiS(fn,drive):
  anti_smash_opt = ' --fullhmmer --clusterhmmer --tigrfam --smcog-trees --cb-general --cb-subclusters --cb-knownclusters --asf --pfam2go --cc-mibig'
  # Make sure the folder path ends with "/" before building paths from it
  drive_2 = drive if drive.endswith('/') else drive + '/'
  out = os.path.splitext(fn)[0]  # file name without its extension
  # Quote the paths so that names containing spaces survive the shell
  cmd = 'antismash "' + drive_2 + fn + '" --output-dir "' + drive_2 + out + '"' + anti_smash_opt
  print(cmd)  # show the command being run
  test = subprocess.Popen(cmd, shell=True, text=True,stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  outs, errs = test.communicate()
  # Append stdout and stderr to log.txt as a record
  with open(drive_2 + 'log.txt','a') as f:
    f.write(outs)
    f.write('\n')
    f.write(errs)
    f.write('\n')

# List the files in the data folder. Only .gb and .gbk files are kept.
files = os.listdir(DRIVE_2)
files2 = []  # holds the names of the files to analyse
# Keep only names that end with one of the target extensions
for i in files:
    if i.endswith(('.gb', '.gbk')):
      files2.append(i)
# Print the file list as a check.
print(files2)

# Run antiSMASH
for i in files2:
  antiS(i,DRIVE_2)
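
The pairing of each input file with its output folder can be checked without touching the drive. This is a sketch with hypothetical names (plan_jobs, GENBANK_SUFFIXES); it uses pathlib to compare extensions exactly, so a name such as c.gb.json is not mistaken for a GenBank file.

```python
from pathlib import PurePosixPath

GENBANK_SUFFIXES = {".gb", ".gbk"}


def plan_jobs(file_names, drive_dir):
    """Return (input_path, output_dir) pairs for the GenBank files only."""
    jobs = []
    for name in sorted(file_names):
        path = PurePosixPath(drive_dir) / name
        if path.suffix.lower() in GENBANK_SUFFIXES:
            # The output folder is the input path without its extension.
            jobs.append((str(path), str(path.with_suffix(""))))
    return jobs


names = ["a.gb", "b.gbk", "c.gb.json", "log.txt", "my genome.gbk"]
for job in plan_jobs(names, "/content/drive/MyDrive/yyyy/"):
    print(job)
```

In the notebook, the list returned by os.listdir(DRIVE_2) would take the place of the sample names.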

Other features for fine-grained settings

Beyond the commands above, antiSMASH allows fine-grained configuration.
All options can be listed as follows.

!antismash --help-showall

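Some practical commands are introduced below.

The following option sets the strictness of HMM-based cluster detection. Three levels are available (strict, relaxed, and loose), and the default is relaxed.

# To loosen the threshold.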
!antismash xxxx.gb --output-dir yyyy --hmmdetection-strictness loose

# To tighten the threshold.
!antismash xxxx.gb --output-dir yyyy --hmmdetection-strictness strict

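Running additional analyses on a previous run

Instead of supplying a sequence, pointing antiSMASH at the JSON file in the output folder appears to work.

# To rerun an analysis on top of a previous result. yyyy is the folder name and xxxx is the sequence name.
!antismash --reuse-results yyyy/xxxx.json --output-dir yyyy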
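
If the sequence name is not known in advance, the JSON file can be looked up in the output folder. The sketch below assumes the results file is the only top-level .json file in that folder, which may not hold for every run; find_result_json and reuse_command are hypothetical helpers, and the demo uses a temporary folder.

```python
import tempfile
from pathlib import Path


def find_result_json(output_dir):
    """Return the top-level .json files in output_dir, sorted by name."""
    return sorted(Path(output_dir).glob("*.json"))


def reuse_command(output_dir):
    """Build the --reuse-results command, requiring exactly one JSON file."""
    candidates = find_result_json(output_dir)
    if len(candidates) != 1:
        raise ValueError(f"expected one .json file, found {len(candidates)}")
    return ["antismash", "--reuse-results", str(candidates[0]),
            "--output-dir", str(output_dir)]


# Demo on a temporary stand-in for a previous output folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "xxxx.json").write_text("{}")
    (Path(tmp) / "index.html").write_text("")
    print(reuse_command(tmp))
```

Extra analysis flags can be appended to the returned list before it is run.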

Help output including all the minor options

The help that includes the less common options can be printed as follows.

!antismash --help-showall

The output is shown below. You may rarely need it, but it is included for reference.

########### antiSMASH 7.0.0 #############

usage: antismash [--taxon {bacteria,fungi}] [--output-dir OUTPUT_DIR]
                 [--output-basename OUTPUT_BASENAME] [--reuse-results PATH] [--limit LIMIT]
                 [--minlength MINLENGTH] [--start START] [--end END] [--databases PATH]
                 [--write-config-file PATH] [--without-fimo]
                 [--executable-paths EXECUTABLE=PATH,EXECUTABLE2=PATH2,...]
                 [--allow-long-headers | --no-allow-long-headers] [-v] [-d] [--logfile PATH]
                 [--list-plugins] [--check-prereqs] [--limit-to-record RECORD_ID] [-V]
                 [--profiling] [--skip-sanitisation] [--skip-zip-file] [--minimal]
                 [--enable-genefunctions] [--enable-lanthipeptides] [--enable-lassopeptides]
                 [--enable-nrps-pks] [--enable-sactipeptides] [--enable-t2pks]
                 [--enable-thiopeptides] [--enable-tta] [--enable-html] [--fullhmmer]
                 [--fullhmmer-pfamdb-version FULLHMMER_PFAMDB_VERSION]
                 [--hmmdetection-strictness {strict,relaxed,loose}]
                 [--hmmdetection-fungal-cutoff-multiplier HMMDETECTION_FUNGAL_CUTOFF_MULTIPLIER]
                 [--hmmdetection-fungal-neighbourhood-multiplier HMMDETECTION_FUNGAL_NEIGHBOURHOOD_MULTIPLIER]
                 [--sideload JSON] [--sideload-simple ACCESSION:START-END]
                 [--sideload-by-cds LOCUS1,LOCUS2,...] [--sideload-size-by-cds NUCLEOTIDES]
                 [--cassis] [--clusterhmmer]
                 [--clusterhmmer-pfamdb-version CLUSTERHMMER_PFAMDB_VERSION] [--tigrfam] [--asf]
                 [--cc-mibig] [--cc-custom-dbs FILE1,FILE2,...] [--cb-general] [--cb-subclusters]
                 [--cb-knownclusters] [--cb-nclusters count] [--cb-min-homology-scale LIMIT]
                 [--pfam2go] [--rre] [--rre-cutoff RRE_CUTOFF] [--rre-minlength RRE_MIN_LENGTH]
                 [--smcog-trees] [--tfbs] [--tfbs-pvalue TFBS_PVALUE] [--tfbs-range TFBS_RANGE]
                 [--tta-threshold TTA_THRESHOLD] [--html-title HTML_TITLE]
                 [--html-description HTML_DESCRIPTION] [--html-start-compact]
                 [--genefinding-tool {glimmerhmm,prodigal,prodigal-m,none,error}]
                 [--genefinding-gff3 GFF3_FILE] [-h] [--help-showall] [-c CPUS]
                 [SEQUENCE ...]


arguments:
  SEQUENCE  GenBank/EMBL/FASTA file(s) containing DNA.

--------
Options
--------
options:

  -h, --help            Show this help text.
  --help-showall        Show full lists of arguments on this help text.
  -c CPUS, --cpus CPUS  How many CPUs to use in parallel. (default: 2)

Basic analysis options:

  --taxon {bacteria,fungi}
                        Taxonomic classification of input sequence. (default: bacteria)

Additional analysis:

  --fullhmmer           Run a whole-genome HMMer analysis.
  --cassis              Motif based prediction of SM gene cluster regions.
  --clusterhmmer        Run a cluster-limited HMMer analysis.
  --tigrfam             Annotate clusters using TIGRFam profiles.
  --asf                 Run active site finder analysis.
  --cc-mibig            Run a comparison against the MIBiG dataset
  --cb-general          Compare identified clusters against a database of antiSMASH-predicted
                        clusters.
  --cb-subclusters      Compare identified clusters against known subclusters responsible for
                        synthesising precursors.
  --cb-knownclusters    Compare identified clusters against known gene clusters from the MIBiG
                        database.
  --pfam2go             Run Pfam to Gene Ontology mapping module.
  --rre                 Run RREFinder precision mode on all RiPP gene clusters.
  --smcog-trees         Generate phylogenetic trees of sec. met. cluster orthologous groups.
  --tfbs                Run TFBS finder on all gene clusters.
  --tta-threshold TTA_THRESHOLD
                        Lowest GC content to annotate TTA codons at (default: 0.65).

Output options:

  --output-dir OUTPUT_DIR
                        Directory to write results to.
  --output-basename OUTPUT_BASENAME
                        Base filename to use for output files within the output directory.
  --html-title HTML_TITLE
                        Custom title for the HTML output page (default is input filename).
  --html-description HTML_DESCRIPTION
                        Custom description to add to the output.
  --html-start-compact  Use compact view by default for overview page.

Advanced options:

  --reuse-results PATH  Use the previous results from the specified json datafile
  --limit LIMIT         Only process the largest <limit> records (default: -1). -1 to disable
  --minlength MINLENGTH
                        Only process sequences larger than <minlength> (default: 1000).
  --start START         Start analysis at nucleotide specified.
  --end END             End analysis at nucleotide specified
  --databases PATH      Root directory of the databases (default: /usr/local/lib/python3.10/dist-
                        packages/antismash/databases).
  --write-config-file PATH
                        Write a config file to the supplied path
  --without-fimo        Run without FIMO (lowers accuracy of RiPP precursor predictions)
  --executable-paths EXECUTABLE=PATH,EXECUTABLE2=PATH2,...
                        A comma separated list of executable name->path pairs to override any on
                        the system path.E.g. diamond=/alternate/path/to/diamond,hmmpfam2=hmm2pfam
  --allow-long-headers, --no-allow-long-headers
                        Should sequence identifiers longer than 16 characters be allowed (default:
                        True)
  --hmmdetection-strictness {strict,relaxed,loose}
                        Defines which level of strictness to use for HMM-based cluster detection,
                        (default: relaxed).
  --hmmdetection-fungal-cutoff-multiplier HMMDETECTION_FUNGAL_CUTOFF_MULTIPLIER
                        Sets the multiplier for rule cutoffs in fungal inputs (default: 1.0).
  --hmmdetection-fungal-neighbourhood-multiplier HMMDETECTION_FUNGAL_NEIGHBOURHOOD_MULTIPLIER
                        Sets the multiplier for rule neighbourhoods in fungal inputs (default:
                        1.5).
  --sideload JSON       Sideload annotations from the JSON file in the given paths. Multiple files
                        can be provided, separated by a comma.
  --sideload-simple ACCESSION:START-END
                        Sideload a single subregion in record ACCESSION from START to END.
                        Positions are expected to be 0-indexed, with START inclusive and END
                        exclusive.
  --sideload-by-cds LOCUS1,LOCUS2,...
                        Sideload a subregion around each CDS with the given locus tags.
  --sideload-size-by-cds NUCLEOTIDES
                        Additional padding, in nucleotides, of subregions to create for sideloaded
                        subregions by CDS. (default: 20000)

Debugging & Logging options:

  -v, --verbose         Print verbose status information to stderr.
  -d, --debug           Print debugging information to stderr.
  --logfile PATH        Also write logging output to a file.
  --list-plugins        List all available sec. met. detection modules.
  --check-prereqs, --prepare-data
                        Check if all prerequisites are met, preparing data files where possible.
  --limit-to-record RECORD_ID
                        Limit analysis to the record with ID record_id
  -V, --version         Display the version number and exit.
  --profiling           Generate a profiling report, disables multiprocess python.
  --skip-sanitisation   Skip input record sanitisation. Use with care.
  --skip-zip-file       Do not create a zip of the output

Debugging options for cluster-specific analyses:

  --minimal             Only run core detection modules, no analysis modules unless explicitly
                        enabled
  --enable-genefunctions
                        Enable Gene function annotations (default: enabled, unless --minimal is
                        specified)
  --enable-lanthipeptides
                        Enable Lanthipeptides (default: enabled, unless --minimal is specified)
  --enable-lassopeptides
                        Enable lassopeptide precursor prediction (default: enabled, unless
                        --minimal is specified)
  --enable-nrps-pks     Enable NRPS/PKS analysis (default: enabled, unless --minimal is specified)
  --enable-sactipeptides
                        Enable sactipeptide detection (default: enabled, unless --minimal is
                        specified)
  --enable-t2pks        Enable type II PKS analysis (default: enabled, unless --minimal is
                        specified)
  --enable-thiopeptides
                        Enable Thiopeptides (default: enabled, unless --minimal is specified)
  --enable-tta          Enable TTA detection (default: enabled, unless --minimal is specified)
  --enable-html         Enable HTML output (default: enabled, unless --minimal is specified)

Full HMMer options:

  --fullhmmer-pfamdb-version FULLHMMER_PFAMDB_VERSION
                        PFAM database version number (e.g. 27.0) (default: latest).

Cluster HMMer options:

  --clusterhmmer-pfamdb-version CLUSTERHMMER_PFAMDB_VERSION
                        PFAM database version number (e.g. 27.0) (default: latest).

TIGRFam options:

ClusterCompare options:

  --cc-custom-dbs FILE1,FILE2,...
                        A comma separated list of database config files to run with

ClusterBlast options:

  --cb-nclusters count  Number of clusters from ClusterBlast to display, cannot be greater than
                        50. (default: 10)
  --cb-min-homology-scale LIMIT
                        A minimum scaling factor for the query BGC in ClusterBlast results. Valid
                        range: 0.0 - 1.0. Warning: some homologous genes may no longer be visible!
                        (default: 0.0)

NRPS/PKS options:

RREfinder options:

  --rre-cutoff RRE_CUTOFF
                        Bitscore cutoff for RRE pHMM detection (default: 25.0).
  --rre-minlength RRE_MIN_LENGTH
                        Minimum amino acid length of RRE domains (default: 50).

Transcription Factor Binding Site options:

  --tfbs-pvalue TFBS_PVALUE
                        P-value for TFBS threshold setting (default: 1e-05).
  --tfbs-range TFBS_RANGE
                        The allowable overlap with gene start positions for TFBSs in coding
                        regions (default: 50).

Gene finding options (ignored when ORFs are annotated):

  --genefinding-tool {glimmerhmm,prodigal,prodigal-m,none,error}
                        Specify algorithm used for gene finding: GlimmerHMM, Prodigal, Prodigal
                        Metagenomic/Anonymous mode, or none. The 'error' option will raise an
                        error if genefinding is attempted. The 'none' option will not run
                        genefinding. (default: error).
  --genefinding-gff3 GFF3_FILE
                        Specify GFF3 file to extract features from.
