An introduction to datasets, NCBI's command-line tool for searching and downloading biological sequence data.
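Note:
At the time of writing (2020-10-05) the tool is still under development,
so the details described in this article are likely to change later.
For the latest information, see the official page or the tool's built-in help.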
https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start/
Installation
How to install on Linux
For other platforms, see the official page.
1. Download the tool
curl -o datasets 'https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
2. Make it executable
chmod +x datasets
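That completes the installation. The commands below assume the datasets binary is on your PATH; otherwise run it as ./datasets from the download directory.
Checking the version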
datasets version
9.0.0
Checking the help
datasets help
datasets is a command line application to query and download biological sequence data
across all domains of life from NCBI databases.
To interactively browse available datasets,
visit https://www.ncbi.nlm.nih.gov/datasets/.
To write your own application that uses the same web-services,
visit https://api.ncbi.nlm.nih.gov/datasets/v1alpha/.
For detailed documentation, visit https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start/
Usage:
datasets [command]
Available Commands:
assembly-descriptors Retrieve descriptions of available genome assembly datasets
download Download your data in a zip archive (default filename is ncbi_dataset.zip)
gene-descriptors Retrieve metadata of available gene datasets
help Help about any command
rehydrate Rehydrate files from a zip archive (default filename is ncbi_dataset.zip)
version Print the application version and exit
virus-summary Retrieve summary of available virus genome datasets
Flags:
--config string config file (default is $HOME/.Download.yaml)
--debug Emit debugging info
-h, --help help for datasets
--proxy string API endpoint proxy
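Features
| command | description |
|---|---|
| assembly-descriptors | Retrieve descriptions of genome assembly datasets |
| download | Download data as a zip archive |
| gene-descriptors | Retrieve metadata of gene datasets |
| help | Show help for a command |
| rehydrate | Re-fetch the data files for a zip archive |
| version | Print the version |
| virus-summary | Retrieve a summary of virus genome datasets |
Data can also be downloaded through the web tool, but rehydrate is available only in this command-line tool.
Because the results are returned as JSON that can be piped into other tools for analysis, the command-line tool is also more flexible than the web tool.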
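What does it mean to rehydrate data?
"Rehydrate" literally means to replenish water or to restore moisture.
The tool seems to use it in the sense of preparing a minimal package first (the dehydrated bag) and later downloading the large data files into it (adding the water back). A nice touch; think of dried wakame seaweed that expands when soaked.
What are the benefits of this feature?
According to the official page: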
Why use dehydrated bags and rehydration?
The dehydrated bag is one solution to the challenge of sharing, downloading and storing large datasets of genomic sequence and annotation. Because of the small file size of a dehydrated bag, downloads are fast and sharing data with collaborators is easy. For example, a dehydrated bag representing the human reference genome assembly, GRCh38, is less than 10 kb, making it small enough to easily attach to an email (or send in a text!) to a collaborator. When it's time for analysis, the recipient can rehydrate the bag to get the full dataset.
The use of dehydrated bags can also help with other common challenges such as network connectivity problems or disk space limitations. It's easy to download a dehydrated bag even on a slow internet connection or from a computer, tablet or phone with limited disk space. When you have access to a better internet connection or a larger disk on your laptop or desktop computer, rehydrate the bag to get the full dataset.
https://www.ncbi.nlm.nih.gov/datasets/docs/rehydrate/
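In short, the benefit is that sequence data can be managed as a small file.
Being able to attach the small file to an email to share data, or to grab it while away and download the full data after getting home, does seem convenient.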
virus-summary
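In response to COVID-19, a feature was added for searching SARS-CoV-2 genome and protein datasets.
As of 2020-10-05, virus genome data is limited to the family Coronaviridae (taxonomy ID: 11118), which includes SARS-CoV-2 (SARS2, taxonomy ID: 2697049).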
Trying it out (target: human genome)
Target
Homo sapiens GRCh38.p13
accession GCF_000001405.39
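taxon 9605 (the genus Homo, used for the taxon search below; Homo sapiens itself is taxon 9606)
Searching for data
- Search by accession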
datasets assembly-descriptors accession GCF_000001405.39
{"assemblies":[{"assembly":{"annotation_metadata":{"file":[{"estimated_size":"49900325","type":"GENOME_GFF"},{"estimated_size":"1315360259","type":"GENOME_GBFF"},{"estimated_size":"118242932","type":"RNA_FASTA"},{"estimated_size":"26280470","type":"PROT_FASTA"},{"estimated_size":"41033486","type":"GENOME_GTF"}],"name":"NCBI Annotation Release 109.20200815","release_date":"Aug 15, 2020","release_number":"109.20200815","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/109.20200815/","source":"NCBI"},"assembly_accession":"GCF_000001405.39","assembly_category":"reference genome","assembly_level":"Chromosome","chromosomes":["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y","Un","MT"],"contig_n50":57879411,"display_name":"GRCh38.p13","estimated_size":"2383175849","org":{"assembly_counts":{"node":129,"subtree":129},"common_name":"human","key":"9606","parent_tax_id":"9605","rank":"SPECIES","sci_name":"Homo sapiens","tax_id":"9606","title":"human"},"seq_length":"3099706404","submission_date":"2019-02-28"}}],"total_count":1}
The result comes back as JSON.
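The estimated_size values in this JSON are plain byte counts, which are hard to read at a glance. The following is a minimal sketch for printing them per file type in decimal units (kB, MB, GB). The function names human_size and annotation_sizes are hypothetical, and the grep pattern assumes the compact key order shown in the output above; with jq installed, a proper JSON query would be more robust.

```shell
# Convert a byte count to decimal units (1 kB = 1000 B).
human_size() {
    LC_ALL=C awk -v bytes="$1" 'BEGIN {
        split("B kB MB GB TB", unit, " ")
        i = 1
        while (bytes >= 1000 && i < 5) { bytes /= 1000; i++ }
        fmt = (i == 1) ? "%d %s\n" : "%.2f %s\n"
        printf fmt, bytes, unit[i]
    }' </dev/null
}

# Read assembly-descriptors JSON on stdin and print "<TYPE> <size>" per file.
# Assumption: each entry is serialized as {"estimated_size":"N","type":"T"}.
annotation_sizes() {
    grep -o '"estimated_size":"[0-9]*","type":"[A-Z_]*"' |
    sed 's/"estimated_size":"\([0-9]*\)","type":"\([A-Z_]*\)"/\2 \1/' |
    while read -r kind bytes; do
        printf '%s %s\n' "$kind" "$(human_size "$bytes")"
    done
}

# Usage (requires network access and the datasets binary):
#   datasets assembly-descriptors accession GCF_000001405.39 | annotation_sizes
```

Only entries that carry a type field are printed, so the top-level estimated_size of the assembly is skipped.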
- Search by taxon ID (limiting the output to 2 records)
datasets assembly-descriptors taxon 9605 -l 2
{"assemblies":[{"assembly":{"annotation_metadata":{"file":[{"estimated_size":"49900325","type":"GENOME_GFF"},{"estimated_size":"1315360259","type":"GENOME_GBFF"},{"estimated_size":"118242932","type":"RNA_FASTA"},{"estimated_size":"26280470","type":"PROT_FASTA"},{"estimated_size":"41033486","type":"GENOME_GTF"}],"name":"NCBI Annotation Release 109.20200815","release_date":"Aug 15, 2020","release_number":"109.20200815","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/109.20200815/","source":"NCBI"},"assembly_accession":"GCF_000001405.39","assembly_category":"reference genome","assembly_level":"Chromosome","chromosomes":["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y","Un","MT"],"contig_n50":57879411,"display_name":"GRCh38.p13","estimated_size":"2383175849","org":{"assembly_counts":{"node":129,"subtree":129},"common_name":"human","key":"9606","parent_tax_id":"9605","rank":"SPECIES","sci_name":"Homo sapiens","tax_id":"9606","title":"human"},"seq_length":"3099706404","submission_date":"2019-02-28"}},{"assembly":{"annotation_metadata":{},"assembly_accession":"GCA_000001405.28","assembly_category":"reference genome","assembly_level":"Chromosome","chromosomes":["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y","Un","MT"],"contig_n50":57879411,"display_name":"GRCh38.p13","estimated_size":"832364165","org":{"assembly_counts":{"node":129,"subtree":129},"common_name":"human","key":"9606","parent_tax_id":"9605","rank":"SPECIES","sci_name":"Homo sapiens","tax_id":"9606","title":"human"},"seq_length":"3099734149","submission_date":"2019-02-28"}}],"total_count":132}
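With -l 2 only two records are printed, while total_count reports 132 matching assemblies. As a rough sketch, the accessions and the total can be pulled out of the JSON with standard text tools. The function names are hypothetical and the patterns assume the compact JSON shown above; the jq expressions in the comments are the sturdier equivalent.

```shell
# Print one assembly accession per line from assembly-descriptors JSON on stdin.
# jq equivalent: jq -r '.assemblies[].assembly.assembly_accession'
list_accessions() {
    grep -o '"assembly_accession":"[^"]*"' | cut -d '"' -f 4
}

# Print the total number of matching assemblies reported by the service.
# jq equivalent: jq -r '.total_count'
total_count() {
    grep -o '"total_count":[0-9]*' | cut -d ':' -f 2
}

# Usage (requires network access and the datasets binary):
#   datasets assembly-descriptors taxon 9605 -l 2 | list_accessions
```

For the output above this would list GCF_000001405.39 and GCA_000001405.28, the RefSeq and GenBank records for GRCh38.p13.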
Downloading data
Check the options in the help for the download assembly subcommand
-c, --chromosomes string Comma-delimited list of chromosomes to download (default "all")
--dehydrated Download minimal package that includes data report and locations of data files. Use the rehydrate command to retrieve data files when needed.
-g, --exclude-gff3 Exclude gff3 annotation file
-p, --exclude-protein Exclude protein sequence file
-r, --exclude-rna Exclude RNA sequence data
-s, --exclude-seq Exclude genomic sequence
-f, --filename string Name of output file (default "ncbi_dataset.zip")
-h, --help help for assembly
-b, --include-gbff Include gbff annotation file, if available
-e, --include-gtf Include gtf annotation file, if available
-i, --inputfile string file to read list of assembly accessions
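This time, fetch only the GBFF file: -g, -p, -r and -s exclude the GFF3, protein, RNA and genomic sequence files, and -b adds the GBFF file.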
datasets download assembly GCF_000001405.39 -g -p -r -s -b
Downloading: ncbi_dataset.zip 1.32GB done
Check the contents
unzip -l ncbi_dataset.zip
Archive: ncbi_dataset.zip
Length Date Time Name
--------- ---------- ----- ----
661 10-05-2020 05:10 README.md
289987574 10-05-2020 05:10 ncbi_dataset/data/GCF_000001405.39/genomic.gbff
2016 10-05-2020 05:12 ncbi_dataset/data/GCF_000001405.39/data_report.yaml
46647 10-05-2020 05:12 ncbi_dataset/data/GCF_000001405.39/sequence_report.yaml
384 10-05-2020 05:12 ncbi_dataset/data/dataset_catalog.json
--------- -------
290037282 5 files
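As intended, the GBFF file is the only sequence data in the archive (the remaining entries are the README and the report files).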
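The same check can be scripted instead of read by eye. The sketch below parses the text printed by unzip -l; the function names are hypothetical, and it assumes the column layout shown above (size, date, time, name) and entry names without spaces.

```shell
# Sum the Length column of `unzip -l` output read from stdin.
# Data rows are recognized by a numeric size followed by a date field,
# which skips the header, the separators and the totals row.
listing_total_bytes() {
    LC_ALL=C awk '
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+-[0-9]+-[0-9]+$/ { total += $1 }
        END { printf "%.0f\n", total }
    '
}

# Print the names of entries that end with the given suffix, e.g. ".gbff".
listing_names_with_suffix() {
    LC_ALL=C awk -v suffix="$1" '
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+-[0-9]+-[0-9]+$/ {
            name = $4
            if (substr(name, length(name) - length(suffix) + 1) == suffix) print name
        }
    '
}

# Usage:
#   unzip -l ncbi_dataset.zip | listing_total_bytes
#   unzip -l ncbi_dataset.zip | listing_names_with_suffix .gbff
```

For the listing above the total comes to 290037282 bytes, matching the last line of the unzip output.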
The rehydrate feature
1. Download with --dehydrated
datasets download assembly GCF_000001405.39 -g -p -r -s -b --dehydrated
Downloading: ncbi_dataset.zip 2.46kB done
unzip -l ncbi_dataset.zip
Archive: ncbi_dataset.zip
Length Date Time Name
--------- ---------- ----- ----
661 10-05-2020 05:27 README.md
2016 10-05-2020 05:27 ncbi_dataset/data/GCF_000001405.39/data_report.yaml
384 10-05-2020 05:27 ncbi_dataset/data/dataset_catalog.json
408 10-05-2020 05:27 ncbi_dataset/fetch.txt
--------- -------
3469 4 files
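At this point only the minimal set of files has been downloaded: genomic.gbff and sequence_report.yaml are absent, and a fetch.txt has been added instead.
2. Unzip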
unzip ncbi_dataset.zip
Archive: ncbi_dataset.zip
inflating: README.md
inflating: ncbi_dataset/data/GCF_000001405.39/data_report.yaml
inflating: ncbi_dataset/data/dataset_catalog.json
inflating: ncbi_dataset/fetch.txt
3. Run rehydrate on the extracted directory
datasets rehydrate -f .
Found 2 files for rehydration
Completed 1 of 2 [=======================>------------------------] 50%
Downloading: ncbi_dataset/data/GCF_000001405.39/genomic.gbff 4.58GB error
Downloading: ncbi_dataset/data/GCF_000001405.39/sequence_report.yaml 46.6kB done
2020/10/05 18:41:20 http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug="timeout"
For some reason the connection timed out, so I ran it once more.
datasets rehydrate -f .
Found 2 files for rehydration
Completed 2 of 2 [================================================] 100%
Downloading: ncbi_dataset/data/GCF_000001405.39/genomic.gbff 4.58GB done
Downloading: ncbi_dataset/data/GCF_000001405.39/sequence_report.yaml 46.6kB done
This time the download completed normally.
The tool is still in alpha, so it may simply be unstable.
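Since a plain rerun was enough to recover from the timeout, the retry can be automated. The following is a generic sketch, not a feature of the tool: retry is a hypothetical helper, and it only helps if datasets rehydrate exits with a non-zero status when a download fails, which was not checked here.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay_seconds}s" >&2
        sleep "$delay_seconds"
        attempt=$((attempt + 1))
    done
}

# Usage (assumption: a failed rehydrate returns a non-zero exit status):
#   retry 3 30 datasets rehydrate -f .
```

A fixed delay keeps the sketch short; for a flaky connection a longer pause between attempts may be more polite to the server.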
Trying it out (target: virus genomes)
Searching for data
Search for sars2
datasets virus-summary taxon sars2
{"assembly_count":25402,"dehydrated":{},"hydrated":{"cli_download_command_line":"datasets download virus genome taxon sars2","estimated_file_size_mb":2794,"url":"https://api.ncbi.nlm.nih.gov/datasets/v1alpha/virus/taxon/sars2/genome/download"},"record_count":25402,"resource_updated_on":"2020-10-01T22:40:27Z"}
Get the number of SARS-CoV-2 genomes released since June 1, 2020
datasets virus-summary taxon SARS-CoV-2 --released-since 06/01/2020 | jq '.assembly_count' -r
20426
More than 20,000 genomes were released between 2020-06-01 and 2020-10-05.
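The example passes the date as 06/01/2020, that is, month/day/year. When dates are kept in ISO form elsewhere, a small conversion helper avoids mixing up the two orders. This is a sketch: to_us_date and count_released_since are hypothetical names, and whether --released-since accepts other formats was not tested, so the MM/DD/YYYY form from the example is assumed.

```shell
# Convert YYYY-MM-DD to MM/DD/YYYY; print an error and fail on other input.
to_us_date() {
    iso=$1
    case $iso in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "expected YYYY-MM-DD, got: $iso" >&2; return 1 ;;
    esac
    year=${iso%%-*}
    rest=${iso#*-}
    month=${rest%-*}
    day=${rest#*-}
    printf '%s/%s/%s\n' "$month" "$day" "$year"
}

# Hypothetical wrapper around the command shown above (needs datasets and jq).
count_released_since() {
    taxon=$1
    since=$(to_us_date "$2") || return 1
    datasets virus-summary taxon "$taxon" --released-since "$since" |
        jq -r '.assembly_count'
}

# Usage:
#   count_released_since SARS-CoV-2 2020-06-01
```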
Downloading data
Fetch the SARS-CoV-2 genome dataset
datasets download virus genome taxon SARS-CoV-2 --filename SARS2-all.zip
Downloading: SARS2-all.zip 1.08GB done
Check the contents
unzip -l SARS2-all.zip
Archive: SARS2-all.zip
Length Date Time Name
--------- ---------- ----- ----
661 10-05-2020 06:22 README.md
1513514454 10-05-2020 06:22 ncbi_dataset/data/cds.fna
755490056 10-05-2020 06:24 ncbi_dataset/data/genomic.fna
699035685 10-05-2020 06:25 ncbi_dataset/data/protein.faa
959423928 10-05-2020 06:26 ncbi_dataset/data/protein.gpff
720498839 10-05-2020 06:27 ncbi_dataset/data/data_report.yaml
2398 10-05-2020 06:28 ncbi_dataset/data/virus_dataset.md
2088828 10-05-2020 06:28 ncbi_dataset/data/pdb/6VYB.pdb
758727 10-05-2020 06:28 ncbi_dataset/data/pdb/6VYO.pdb
66582 10-05-2020 06:28 ncbi_dataset/data/pdb/6W37.pdb
675378 10-05-2020 06:28 ncbi_dataset/data/pdb/6W4H.pdb
1258092 10-05-2020 06:28 ncbi_dataset/data/pdb/6W9C.pdb
182574 10-05-2020 06:28 ncbi_dataset/data/pdb/6W9Q.pdb
436995 10-05-2020 06:28 ncbi_dataset/data/pdb/6WEY.pdb
983583 10-05-2020 06:28 ncbi_dataset/data/pdb/6WJI.pdb
1054296 10-05-2020 06:28 ncbi_dataset/data/pdb/6WLC.pdb
448173 10-05-2020 06:28 ncbi_dataset/data/pdb/7BQY.pdb
772983 10-05-2020 06:28 ncbi_dataset/data/pdb/7BV2.pdb
1375 10-05-2020 06:28 ncbi_dataset/data/dataset_catalog.json
--------- -------
4656693607 19 files
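The archive contained genome sequences, CDS nucleotide sequences, protein sequences (FASTA and GenPept), PDB structure files, and more.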
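One quick sanity check on such a large archive is to count the FASTA records without extracting anything to disk. The sketch below counts header lines (lines starting with ">"); count_fasta_records is a hypothetical helper. Whether the count should equal the assembly_count reported by virus-summary depends on the dataset having one record per genome, which is an assumption here.

```shell
# Count FASTA records on stdin by counting header lines.
# grep -c exits with status 1 when the count is 0, so "|| true" keeps an
# empty input from being reported as a failure.
count_fasta_records() {
    grep -c '^>' || true
}

# Usage: stream one member of the zip to stdout with `unzip -p`.
#   unzip -p SARS2-all.zip ncbi_dataset/data/genomic.fna | count_fasta_records
```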
datasets download virus genome taxon SARS-CoV-2 --filename SARS2-all.zip --dehydrated
unknown flag: --dehydrated
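--dehydrated could not be used with download virus genome.
The SARS-CoV-2 pages were quite comprehensive, so I expect the functionality to be expanded further.
Conclusion
From a quick trial, I found the tool very easy to use even though it is still in alpha.
The rehydration feature is especially handy.
Because the tool is still under development, keep in mind that its features and commands are likely to change.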
datasets is currently in alpha and will be updated frequently to add new features, fix bugs, and enhance usability. Command syntax is subject to frequent changes. Please check this page often for updates.
https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start/