More than 3 years have passed since last update.

Accession No.からメタゲノムアセンブリを取得してLocal BLAST用のデータベースを作成する

Last updated at 2021-05-10Posted at 2021-04-26

目的とか

先行研究で使用されているデータセットを使いたい
サプリ等に解析に使用したデータセットが用意されてない
配列を手動でポチポチ保存するのがめんどくさい
Linux環境はWSL2上のUbuntuを使用する
ディレクトリの移動などCLIの基本操作は省略

配列の取得

Entrezのassembly dbではaccが使えないのでnucleotide db用のコードを流用してアクセッションナンバーだけ直してもbad requestと返ってくる。EsearchのTermにaccを入れて検索するとIdListにGbUidが入った結果を返してくれるのでこれをさらにEsummaryに投げるとFtpPath等が入ったDocumentSummaryを返してくれるので適当な方法¹でダウンロードする。

Accession No.の用意

先行研究の論文を取得しFigureの系統樹からOTU部分を画像編集ソフトで切り出し
Google documentのOCR機能で読み取る
体裁を整えてcsvにする
Entrezを利用して配列を取得する(使用したコードは以下に記述)

Entrezを利用してアセンブリ配列のFtpPathを取得する

Accession No.をtermに入れてEsearchに投げると検索の結果(dict型)が返ってくる
GbUidが入っているIdListを抽出してEsummaryに投げる

Python

import pandas as pd
from Bio import Entrez
Entrez.email = "foo@bar"

# EsearchでAssembly Accession No.からGbUidを取得
entries = pd.read_csv('entries.csv', header=None)[0].tolist() #手持ちのデータ形式に合わせて適宜変更
results = []
for query in entries:
    handle = Entrez.esearch(db="assembly", term=query, retmax="1") #retmax(検索結果の上限)のdefaultは20
    record = Entrez.read(handle) #クラスBio.Entrez.Parser.DictionaryElementのデータが入る
    handle.close()
    result = [record['QueryTranslation'],record['IdList']] #dict型から要素を抽出してlist型にする
    results.append(result)
df_results = pd.DataFrame(results,columns=['Query','GbUid']) #df_resultsを見ればAccNoとGbUidの対応がわかる

# df_resultsからGbUidのリストを抽出
IdList = df_results["GbUid"]

# IdListからからDocumentSummaryを取得
df_summaries = pd.DataFrame()
for Id in IdList:
    handle = Entrez.esummary(db="assembly", id=Id)
    record = Entrez.read(handle)
    handle.close()
    summary = pd.DataFrame(record['DocumentSummarySet']['DocumentSummary']) #ネストされているので取り出してDataFrameに変換
    summary['DbBuild'] = record['DocumentSummarySet']['DbBuild']
    df_summaries = pd.concat([df_summaries, summary])
df_summaries = df_summaries.reset_index(drop=True)

# 結果をsummaries.csvに保存
df_summaries.to_csv('./summaries.csv', index=False)

適当なTerm(分類群とか)で検索してヒットしたGbUidを集めたいならEsearchとGbUid作成の部分を以下のものに変更する。TermをAccession No.で検索する場合はGbUidが1つだけ返ってくるが、複数ヒットするTermを投げた場合要素にlist型の入ったDataFrameになってしまうので抽出がちょっと面倒くさい。検索したい単語セットをリスト化してまとめて投げる予定も今のところないのでfor文は使わずにリスト化する。

# Esearchで任意のTermを検索して得られたUbUidをリスト化する
query = "Any Term"
handle = Entrez.esearch(db="assembly", term=query, retmax="100") #最大100まで
record = Entrez.read(handle)
handle.close()
df_result = pd.DataFrame(record.values(), index=record.keys()).T #dict型として返ってきた結果をDataFrameに格納

# df_resultからGbUidのリストを抽出
IdList = df_result.at[0,'IdList']
IdList

取得したDocumentSummaryの中にGenBankFtpPathという列があるのでここからアセンブリデータを取得する。

アセンブリデータのダウンロード

公式FAQ¹によるとrsyncを利用する場合PATHの「ftp://」を「rsync://」に置換してする必要があるそうなので適当にリストを作成する。また、「rsync://.../...zyx」となっている末尾に「/」を追加するとディレクトリの中身だけカレントにダウンロードされる。「/」をつけないとディレクトリがそのまま個々のフォルダとして保存される。

# df_summariesからアセンブリデータのFtpPathを抽出しrsync://に置換
df_paths = df_summaries['FtpPath_GenBank'].str.replace('ftp://','rsync://')

# 結果をpaths.csvとしてカレントに保存
df_paths.to_csv('./paths.csv',header=False,index=False)

カレントにxargs²の-aオプションと-Iオプションを利用してrsyncにPATHを渡すコマンド使う
保存したい場所に移動しpaths.csvを置いて以下のコマンドを実行してテスト

xargs -a paths.csv -t -I @ rsync --dry-run --copy-links --recursive --times --verbose @ ./ &>rsynclog.txt 2>errlog.txt

テスト結果がrsynclog.txtに、エラーログはerrlog.txt保存されるので上手くいったらrsync³の--dry-runオプションを外して実行する。

ダウンロードした配列をBLAST DBに変換する

Ubuntuであればsudo apt install ncbi-blast+でBLAST+をインストールできる。
その中にmakeblastdbコマンドが含まれているのでこれを使用してデータベースを作成する。

配列データの結合

makeblastdbはデータベースを作る際に配列データを1つだけ読み込むのでゲノムが複数ある場合は1つのファイルにまとめる必要がある。シェルコマンドのcatは複数のファイルを結合・表示するコマンドで、出力をリダイレクトすることで複数ファイルを結合することができる。テキストファイルであれば読み込んだ順番に追記されるのでFasta形式等はcatで結合するだけで一つにまとめることができる。今回はgzip形式のファイルを扱うzcatコマンドを使用する。

先ほどダウンロードしたゲノムデータはXXXX_YYYY_ZZZZ.fna.gzというファイル名になっているのでファイル名をワイルドカード指定してまとめて結合する。

zcat *.fna.gz >genome-db.fna #未圧縮のファイルが出力される
zcat *.fna.gz | gzip >genome-db.fna #結合したファイル圧縮する場合はパイプラインでgzipに渡す

makedbを使用してLocal BLAST用のデータベースを作成

makeblastdbのUsageは以下の通り。最低限データベースの種類-dbtypeと読み込むファイル-inを指定する必用がある。

$ makeblastdb -help
USAGE
  makeblastdb [-h] [-help] [-in input_file] [-input_type type]
    -dbtype molecule_type [-title database_title] [-parse_seqids]
    [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
    [-mask_desc mask_algo_descriptions] [-gi_mask]
    [-gi_mask_name gi_based_mask_names] [-out database_name]
    [-blastdb_version version] [-max_file_sz number_of_bytes]
    [-logfile File_Name] [-taxid TaxID] [-taxid_map TaxIDMapFile] [-version]

DESCRIPTION
   Application to create BLAST databases, version 2.9.0+

REQUIRED ARGUMENTS
 -dbtype <String, 'nucl', 'prot'>
   Molecule type of target db

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -version
   Print version number;  ignore other arguments

 *** Input options
 -in <File_In>
   Input file/database name
   Default = '-'
 -input_type <String, 'asn1_bin', 'asn1_txt', 'blastdb', 'fasta'>
   Type of the data specified in input_file
   Default = 'fasta'

 *** Configuration options
 -title <String>
   Title for BLAST database
   Default = input file name provided to -in argument
 -parse_seqids
   Option to parse seqid for FASTA input if set, for all other input types
   seqids are parsed automatically
 -hash_index
   Create index of sequence hash values.

 *** Sequence masking options
 -mask_data <String>
   Comma-separated list of input files containing masking data as produced by
   NCBI masking applications (e.g. dustmasker, segmasker, windowmasker)
 -mask_id <String>
   Comma-separated list of strings to uniquely identify the masking algorithm
    * Requires:  mask_data
    * Incompatible with:  gi_mask
 -mask_desc <String>
   Comma-separated list of free form strings to describe the masking algorithm
   details
    * Requires:  mask_id
 -gi_mask
   Create GI indexed masking data.
    * Requires:  parse_seqids
    * Incompatible with:  mask_id
 -gi_mask_name <String>
   Comma-separated list of masking data output files.
    * Requires:  mask_data, gi_mask

 *** Output options
 -out <String>
   Name of BLAST database to be created
   Default = input file name provided to -in argumentRequired if multiple
   file(s)/database(s) are provided as input
 -blastdb_version <Integer, 4..5>
   Version of BLAST database to be created
   Default = '4'
 -max_file_sz <String>
   Maximum file size for BLAST database files
   Default = '1GB'
 -logfile <File_Out>
   File to which the program log should be redirected

 *** Taxonomy options
 -taxid <Integer, >=0>
   Taxonomy ID to assign to all sequences
    * Incompatible with:  taxid_map
 -taxid_map <File_In>
   Text file mapping sequence IDs to taxonomy IDs.
   Format:<SequenceId> <TaxonomyId><newline>
    * Requires:  parse_seqids
    * Incompatible with:  taxid

今回は-dbtype -in -parse_seqidsを指定した。データベース名は指定しなければファイル名が使用される。

$ makeblastdb -dbtype nucl -in genome-db.fna -parse_seqids

コマンドを実行するとgenome-db.fna.nhr genome-db.fna.nin genome-db.fna.nog genome-db.fna.nsd genome-db.fna.nsi genome-db.fna.nsqという6つのファイルが生成されるのでLocal BLASTを行うときはこれを指定する。

NCBIはrsyncを推奨している。 ↩ ↩²
xargsについてはここ。 ↩
rsyncについてはここ。 ↩

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up