
Using NCBI Datasets from the command line

bioinformatics

Last updated at 2020-10-05. Posted at 2020-10-05.

This article introduces the command-line tool provided by NCBI for searching and downloading biological sequence data.

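Note:
At the time of writing (2020-10-05) this tool is still under development,
so what is described in this article is likely to change later.
For the latest information, refer to the official page or the tool's help.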
https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start/

Installation

How to install on Linux.
For other platforms, see the official page.

1. Download the tool

curl -o datasets 'https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'

2. Make it executable

chmod +x datasets

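Installation is now complete.
The commands below call the tool as `datasets`, which assumes the directory containing the downloaded file is on your PATH; otherwise run it as `./datasets`.

Check the version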

datasets version
9.0.0

Check the help

datasets help

datasets is a command line application to query and download biological sequence data
across all domains of life from NCBI databases.

To interactively browse available datasets,
visit https://www.ncbi.nlm.nih.gov/datasets/.

To write your own application that uses the same web-services,
visit https://api.ncbi.nlm.nih.gov/datasets/v1alpha/.

For detailed documentation, visit https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start/

Usage:
  datasets [command]

Available Commands:
  assembly-descriptors Retrieve descriptions of available genome assembly datasets
  download             Download your data in a zip archive (default filename is ncbi_dataset.zip)
  gene-descriptors     Retrieve metadata of available gene datasets
  help                 Help about any command
  rehydrate            Rehydrate files from a zip archive (default filename is ncbi_dataset.zip)
  version              Print the application version and exit
  virus-summary        Retrieve summary of available virus genome datasets

Flags:
      --config string   config file (default is $HOME/.Download.yaml)
      --debug           Emit debugging info
  -h, --help            help for datasets
      --proxy string    API endpoint proxy

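Features

command	description
assembly-descriptors	Retrieve descriptions of genome assembly datasets
download	Download data as a zip archive
gene-descriptors	Retrieve descriptions of gene datasets
help	Show help for a command
rehydrate	Fetch the data files for a previously downloaded (dehydrated) zip archive
version	Show the version
virus-summary	Retrieve a summary of virus genome data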

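Data can also be downloaded with the web tool, but rehydrate is available only in this command-line tool.
Because the command-line output can be piped into other tools for analysis, it also allows a wider range of uses than the web tool.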

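What does it mean to rehydrate data?

"Rehydrate" means to replenish water, or to restore something by adding water back.
The term appears to be used in the sense of preparing the minimum necessary data first (the dehydrated bag) and then downloading the large data files into it later (adding water to restore it). A nice touch. ~~Like instant dried wakame.~~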

What is the benefit of this feature?

According to the official page:

Why use dehydrated bags and rehydration?
The dehydrated bag is one solution to the challenge of sharing, downloading and storing large datasets of genomic sequence and annotation.

Because of the small file size of a dehydrated bag, downloads are fast and sharing data with collaborators is easy. For example, a dehydrated bag representing the human reference genome assembly, GRCh38, is less than 10 kb, making it small enough to easily attach to an email (or send in a text!) to a collaborator. When it's time for analysis, the recipient can rehydrate the bag to get the full dataset.

The use of dehydrated bags can also help with other common challenges such as network connectivity problems or disk space limitations. It's easy to download a dehydrated bag even on a slow internet connection or from a computer, tablet or phone with limited disk space. When you have access to a better internet connection or a larger disk on your laptop or desktop computer, rehydrate the bag to get the full dataset.
https://www.ncbi.nlm.nih.gov/datasets/docs/rehydrate/

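In short, the benefit is that sequence data can be managed as a small file.
Being able to share data by attaching the small file to an email, or to grab the small file while away and download the rest after getting home, does seem convenient.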

virus-summary

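In response to COVID-19, a feature for searching SARS-CoV-2 genome and protein datasets was added.

As of 2020-10-05, virus genome data is limited to the family Coronaviridae (taxonomy ID: 11118), which includes SARS-CoV-2 (SARS2, taxonomy ID: 2697049).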

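Trying it out (target: human genome)

Target
Homo sapiens GRCh38.p13
accession GCF_000001405.39
taxon 9606 (Homo sapiens; the taxon search below uses 9605, its parent genus Homo, as the output's parent_tax_id shows)

Searching for data

Search by accession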

datasets assembly-descriptors accession GCF_000001405.39

{"assemblies":[{"assembly":{"annotation_metadata":{"file":[{"estimated_size":"49900325","type":"GENOME_GFF"},{"estimated_size":"1315360259","type":"GENOME_GBFF"},{"estimated_size":"118242932","type":"RNA_FASTA"},{"estimated_size":"26280470","type":"PROT_FASTA"},{"estimated_size":"41033486","type":"GENOME_GTF"}],"name":"NCBI Annotation Release 109.20200815","release_date":"Aug 15, 2020","release_number":"109.20200815","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/109.20200815/","source":"NCBI"},"assembly_accession":"GCF_000001405.39","assembly_category":"reference genome","assembly_level":"Chromosome","chromosomes":["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y","Un","MT"],"contig_n50":57879411,"display_name":"GRCh38.p13","estimated_size":"2383175849","org":{"assembly_counts":{"node":129,"subtree":129},"common_name":"human","key":"9606","parent_tax_id":"9605","rank":"SPECIES","sci_name":"Homo sapiens","tax_id":"9606","title":"human"},"seq_length":"3099706404","submission_date":"2019-02-28"}}],"total_count":1}

The result is returned as JSON.
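
Because the response is JSON, it is easy to post-process. The following is a minimal Python sketch (not part of the original walkthrough) that reads a trimmed copy of the response above and reports the estimated size of each annotation file. The field names (`assemblies`, `annotation_metadata`, `file`, `estimated_size`, `type`) are taken from that output; the function names are made up for illustration, and the schema of this alpha API may change.

```python
import json

# Trimmed copy of the response shown above (only the fields used here).
RESPONSE = """
{"assemblies":[{"assembly":{"annotation_metadata":{"file":[
  {"estimated_size":"49900325","type":"GENOME_GFF"},
  {"estimated_size":"1315360259","type":"GENOME_GBFF"},
  {"estimated_size":"118242932","type":"RNA_FASTA"},
  {"estimated_size":"26280470","type":"PROT_FASTA"},
  {"estimated_size":"41033486","type":"GENOME_GTF"}]},
  "assembly_accession":"GCF_000001405.39","display_name":"GRCh38.p13"}}],
 "total_count":1}
"""


def annotation_file_sizes(response_text):
    """Map accession -> {file type: estimated size in bytes}."""
    sizes = {}
    for entry in json.loads(response_text)["assemblies"]:
        assembly = entry["assembly"]
        files = assembly.get("annotation_metadata", {}).get("file", [])
        # Sizes arrive as strings, so convert them to integers.
        sizes[assembly["assembly_accession"]] = {
            f["type"]: int(f["estimated_size"]) for f in files
        }
    return sizes


def human_readable(num_bytes):
    """Format a byte count with decimal (SI) units."""
    value = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        if value < 1000 or unit == "GB":
            return f"{value:.2f} {unit}"
        value /= 1000


if __name__ == "__main__":
    for accession, files in annotation_file_sizes(RESPONSE).items():
        # Largest files first.
        for file_type, size in sorted(files.items(), key=lambda kv: -kv[1]):
            print(accession, file_type, human_readable(size))
```

To use it on live output, the same function could be fed the text captured from `datasets assembly-descriptors accession ...` instead of the embedded sample.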

Search by taxon ID 9605 (genus Homo), limiting the results to 2

datasets assembly-descriptors taxon 9605 -l 2

{"assemblies":[{"assembly":{"annotation_metadata":{"file":[{"estimated_size":"49900325","type":"GENOME_GFF"},{"estimated_size":"1315360259","type":"GENOME_GBFF"},{"estimated_size":"118242932","type":"RNA_FASTA"},{"estimated_size":"26280470","type":"PROT_FASTA"},{"estimated_size":"41033486","type":"GENOME_GTF"}],"name":"NCBI Annotation Release 109.20200815","release_date":"Aug 15, 2020","release_number":"109.20200815","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/109.20200815/","source":"NCBI"},"assembly_accession":"GCF_000001405.39","assembly_category":"reference genome","assembly_level":"Chromosome","chromosomes":["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y","Un","MT"],"contig_n50":57879411,"display_name":"GRCh38.p13","estimated_size":"2383175849","org":{"assembly_counts":{"node":129,"subtree":129},"common_name":"human","key":"9606","parent_tax_id":"9605","rank":"SPECIES","sci_name":"Homo sapiens","tax_id":"9606","title":"human"},"seq_length":"3099706404","submission_date":"2019-02-28"}},{"assembly":{"annotation_metadata":{},"assembly_accession":"GCA_000001405.28","assembly_category":"reference genome","assembly_level":"Chromosome","chromosomes":["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y","Un","MT"],"contig_n50":57879411,"display_name":"GRCh38.p13","estimated_size":"832364165","org":{"assembly_counts":{"node":129,"subtree":129},"common_name":"human","key":"9606","parent_tax_id":"9605","rank":"SPECIES","sci_name":"Homo sapiens","tax_id":"9606","title":"human"},"seq_length":"3099734149","submission_date":"2019-02-28"}}],"total_count":132}
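
The response above contains only two entries even though `total_count` is 132, and the two entries are the RefSeq (GCF_) and GenBank (GCA_) records for the same GRCh38.p13 display name. As a rough sketch of how that could be checked programmatically, the Python below summarizes a trimmed copy of the response. The field names come from the output above; `summarize` and `SOURCES` are illustrative names, not part of the tool.

```python
import json

# Trimmed copy of the `taxon 9605 -l 2` response shown above.
RESPONSE = """
{"assemblies":[
  {"assembly":{"assembly_accession":"GCF_000001405.39",
               "display_name":"GRCh38.p13","seq_length":"3099706404"}},
  {"assembly":{"assembly_accession":"GCA_000001405.28",
               "display_name":"GRCh38.p13","seq_length":"3099734149"}}],
 "total_count":132}
"""

# Accession prefixes: GCF_ is a RefSeq assembly, GCA_ a GenBank assembly.
SOURCES = {"GCF": "RefSeq", "GCA": "GenBank"}


def summarize(response_text):
    """Return (rows, number of matches not included in this response)."""
    data = json.loads(response_text)
    rows = []
    for entry in data.get("assemblies", []):
        assembly = entry["assembly"]
        accession = assembly["assembly_accession"]
        prefix = accession.split("_", 1)[0]
        rows.append((accession, SOURCES.get(prefix, "unknown"),
                     assembly["display_name"], int(assembly["seq_length"])))
    return rows, data["total_count"] - len(rows)


if __name__ == "__main__":
    rows, missing = summarize(RESPONSE)
    for row in rows:
        print(*row, sep="\t")
    print(f"{missing} more assemblies were not returned because of the limit")
```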

Downloading data

Check the options with help

-c, --chromosomes string   Comma-delimited list of chromosomes to download (default "all")
    --dehydrated           Download minimal package that includes data report and locations of data files. Use the rehydrate command to retrieve data files when needed.
-g, --exclude-gff3         Exclude gff3 annotation file
-p, --exclude-protein      Exclude protein sequence file
-r, --exclude-rna          Exclude RNA sequence data
-s, --exclude-seq          Exclude genomic sequence
-f, --filename string      Name of output file (default "ncbi_dataset.zip")
-h, --help                 help for assembly
-b, --include-gbff         Include gbff annotation file, if available
-e, --include-gtf          Include gtf annotation file, if available
-i, --inputfile string     file to read list of assembly accessions

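This time, fetch only the gbff file: exclude everything that is included by default (-g -p -r -s) and opt in to gbff (-b).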

datasets download assembly  GCF_000001405.39 -g -p -r -s -b

Downloading: ncbi_dataset.zip    1.32GB done
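
The flag combination above follows a simple rule: file kinds that are included by default need an exclude flag when they are not wanted, and opt-in kinds need an include flag when they are. A small Python sketch of that rule is shown below, assuming only the short flags listed in the help output above; the function name and the `keep` vocabulary (`gff3`, `protein`, `rna`, `seq`, `gbff`, `gtf`) are invented for this example.

```python
import shlex

# Short flags copied from the help text above.
EXCLUDE_FLAGS = {"gff3": "-g", "protein": "-p", "rna": "-r", "seq": "-s"}
INCLUDE_FLAGS = {"gbff": "-b", "gtf": "-e"}


def download_assembly_args(accession, keep, dehydrated=False):
    """Build the argument list for `datasets download assembly`.

    `keep` is the set of file kinds wanted in the archive.
    """
    unknown = set(keep) - set(EXCLUDE_FLAGS) - set(INCLUDE_FLAGS)
    if unknown:
        raise ValueError(f"unknown file kinds: {sorted(unknown)}")
    args = ["datasets", "download", "assembly", accession]
    # Default-on kinds are excluded unless explicitly kept.
    args += [flag for kind, flag in EXCLUDE_FLAGS.items() if kind not in keep]
    # Opt-in kinds are included only when requested.
    args += [flag for kind, flag in INCLUDE_FLAGS.items() if kind in keep]
    if dehydrated:
        args.append("--dehydrated")
    return args


if __name__ == "__main__":
    args = download_assembly_args("GCF_000001405.39", keep={"gbff"})
    print(shlex.join(args))
```

The returned list can be passed to `subprocess.run(args, check=True)` to run the download, provided `datasets` is on the PATH.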

Check the contents

unzip -l ncbi_dataset.zip 

Archive:  ncbi_dataset.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      661  10-05-2020 05:10   README.md
289987574  10-05-2020 05:10   ncbi_dataset/data/GCF_000001405.39/genomic.gbff
     2016  10-05-2020 05:12   ncbi_dataset/data/GCF_000001405.39/data_report.yaml
    46647  10-05-2020 05:12   ncbi_dataset/data/GCF_000001405.39/sequence_report.yaml
      384  10-05-2020 05:12   ncbi_dataset/data/dataset_catalog.json
---------                     -------
290037282                     5 files

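As intended, gbff is the only sequence or annotation file in the archive, alongside the README, report, and catalog files.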
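
The same check can be scripted. Below is a minimal Python sketch using the standard `zipfile` module; it builds a tiny in-memory archive that mimics the layout listed above so that it runs without a real download. The suffix list is limited to file extensions that appear in this article, and `data_files` and `make_demo_archive` are illustrative names.

```python
import io
import zipfile


def data_files(archive, suffixes=(".gbff", ".gpff", ".fna", ".faa")):
    """Return {name: uncompressed size} for sequence/annotation files."""
    with zipfile.ZipFile(archive) as zf:
        return {info.filename: info.file_size
                for info in zf.infolist()
                if info.filename.endswith(suffixes)}


def make_demo_archive():
    """Build a tiny in-memory archive that mimics the layout listed above."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("README.md", "demo")
        zf.writestr("ncbi_dataset/data/GCF_000001405.39/genomic.gbff",
                    "LOCUS demo\n")
        zf.writestr("ncbi_dataset/data/GCF_000001405.39/data_report.yaml", "{}")
        zf.writestr("ncbi_dataset/data/dataset_catalog.json", "{}")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Pass "ncbi_dataset.zip" instead to inspect a real download.
    for name, size in data_files(make_demo_archive()).items():
        print(f"{size:>12}  {name}")
```

One thing worth cross-checking on a real archive: the Length that `unzip -l` shows for genomic.gbff (289,987,574) plus 2^32 (4,294,967,296) equals 4,584,954,870, which matches the 4.58GB reported for the same file in the rehydrate step later in this article. That suggests the listed size may have wrapped around a 32-bit field, although this is only a guess and was not verified.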

The rehydrate feature

1. Download with --dehydrated

datasets download assembly  GCF_000001405.39 -g -p -r -s -b --dehydrated
Downloading: ncbi_dataset.zip    2.46kB done

unzip -l ncbi_dataset.zip 
Archive:  ncbi_dataset.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      661  10-05-2020 05:27   README.md
     2016  10-05-2020 05:27   ncbi_dataset/data/GCF_000001405.39/data_report.yaml
      384  10-05-2020 05:27   ncbi_dataset/data/dataset_catalog.json
      408  10-05-2020 05:27   ncbi_dataset/fetch.txt
---------                     -------
     3469                     4 files

At this point only the minimal set of files has been downloaded.

2. Unzip
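
The article does not show the contents of `ncbi_dataset/fetch.txt`. Assuming it follows the BagIt convention of one `URL LENGTH PATH` entry per line (an assumption, not confirmed here), a list of what rehydration will fetch could be read with a sketch like the one below. The sample URLs and paths are hypothetical placeholders; only the 46647-byte size is borrowed from the earlier listing of sequence_report.yaml.

```python
# Hypothetical sample in the assumed BagIt fetch.txt layout:
# URL, length in bytes (or "-" when unknown), then the destination path.
SAMPLE = (
    "https://example.org/hypothetical/genomic.gbff - "
    "data/GCF_000001405.39/genomic.gbff\n"
    "https://example.org/hypothetical/sequence_report.yaml 46647 "
    "data/GCF_000001405.39/sequence_report.yaml\n"
)


def parse_fetch(text):
    """Parse BagIt-style fetch lines into dictionaries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split into at most three fields so paths may contain spaces.
        url, length, path = line.split(None, 2)
        entries.append({
            "url": url,
            "size": None if length == "-" else int(length),
            "path": path,
        })
    return entries


if __name__ == "__main__":
    entries = parse_fetch(SAMPLE)
    known = sum(e["size"] for e in entries if e["size"] is not None)
    print(f"{len(entries)} files to fetch, {known} bytes with a known size")
    for entry in entries:
        print(entry["path"], entry["size"])
```

If the real file uses a different layout, only `parse_fetch` would need to change.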

unzip ncbi_dataset.zip 
Archive:  ncbi_dataset.zip
  inflating: README.md               
  inflating: ncbi_dataset/data/GCF_000001405.39/data_report.yaml  
  inflating: ncbi_dataset/data/dataset_catalog.json  
  inflating: ncbi_dataset/fetch.txt

3. Run rehydrate on the extracted directory

datasets rehydrate -f .
Found 2 files for rehydration
Completed 1 of 2 [=======================>------------------------]  50%
Downloading: ncbi_dataset/data/GCF_000001405.39/genomic.gbff    4.58GB error
Downloading: ncbi_dataset/data/GCF_000001405.39/sequence_report.yaml    46.6kB done
2020/10/05 18:41:20  http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug="timeout"

For some reason the connection timed out, so I ran the same command again.

datasets rehydrate -f .
Found 2 files for rehydration
Completed 2 of 2 [================================================] 100%
Downloading: ncbi_dataset/data/GCF_000001405.39/genomic.gbff    4.58GB done
Downloading: ncbi_dataset/data/GCF_000001405.39/sequence_report.yaml    46.6kB done

This time the download completed normally.
The tool is still an alpha release, so its behavior may be unstable.
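
Since simply rerunning the command was enough here, the retry can be automated. The following Python sketch retries a command a few times with a pause in between. It assumes that the command exits with a non-zero status when a download fails, which the log above does not show, so treat it as a starting point. The demo at the bottom runs a harmless Python one-liner; replace it with `["datasets", "rehydrate", "-f", "."]` to retry the step above.

```python
import subprocess
import sys
import time


def run_with_retries(command, attempts=3, delay_seconds=5.0):
    """Run `command` until it exits with status 0 or attempts run out.

    Returns the number of attempts used; raises RuntimeError on failure.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} failed with exit code {result.returncode}",
              file=sys.stderr)
        if attempt < attempts:
            time.sleep(delay_seconds)  # wait before trying again
    raise RuntimeError(f"{command!r} failed after {attempts} attempts")


if __name__ == "__main__":
    # Stand-in command that always succeeds, so the sketch runs anywhere.
    demo = [sys.executable, "-c", "print('ok')"]
    used = run_with_retries(demo)
    print(f"succeeded after {used} attempt(s)")
```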

Trying it out (target: virus genomes)

Searching for data

Search with sars2

datasets virus-summary taxon sars2
{"assembly_count":25402,"dehydrated":{},"hydrated":{"cli_download_command_line":"datasets download virus genome taxon sars2","estimated_file_size_mb":2794,"url":"https://api.ncbi.nlm.nih.gov/datasets/v1alpha/virus/taxon/sars2/genome/download"},"record_count":25402,"resource_updated_on":"2020-10-01T22:40:27Z"}

Get the number of SARS-CoV-2 genomes released since June 1, 2020

datasets virus-summary taxon SARS-CoV-2 --released-since 06/01/2020 | jq '.assembly_count' -r

20426

Between 2020-06-01 and 2020-10-05, more than 20,000 genome records were released.

Downloading data

Get the SARS-CoV-2 genome dataset
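
To put that number in context, it can be compared with the total returned by the unfiltered query. The short Python sketch below does the same extraction as the `jq` call and computes the share; the JSON is copied from the `virus-summary taxon sars2` output above, 20426 is the value printed by the filtered query, and the function names are illustrative.

```python
import json

# Response of `datasets virus-summary taxon sars2`, copied from above.
ALL_GENOMES = (
    '{"assembly_count":25402,"dehydrated":{},"hydrated":'
    '{"cli_download_command_line":"datasets download virus genome taxon sars2",'
    '"estimated_file_size_mb":2794,'
    '"url":"https://api.ncbi.nlm.nih.gov/datasets/v1alpha/virus/taxon/sars2/genome/download"},'
    '"record_count":25402,"resource_updated_on":"2020-10-01T22:40:27Z"}'
)

# Value printed by the --released-since 06/01/2020 query above.
RELEASED_SINCE_JUNE = 20426


def assembly_count(summary_text):
    """Equivalent of `jq '.assembly_count'` on a virus-summary response."""
    return json.loads(summary_text)["assembly_count"]


def percent_share(part, total):
    """Percentage of `total` made up by `part`."""
    return 100.0 * part / total


if __name__ == "__main__":
    total = assembly_count(ALL_GENOMES)
    share = percent_share(RELEASED_SINCE_JUNE, total)
    print(f"{RELEASED_SINCE_JUNE} of {total} genomes ({share:.1f}%)")
```

In other words, roughly four out of five SARS-CoV-2 genomes available at that time had been released in the preceding four months.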

datasets download virus genome taxon SARS-CoV-2 --filename SARS2-all.zip
Downloading: SARS2-all.zip    1.08GB done

Check the contents

unzip -l SARS2-all.zip 
Archive:  SARS2-all.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      661  10-05-2020 06:22   README.md
1513514454  10-05-2020 06:22   ncbi_dataset/data/cds.fna
755490056  10-05-2020 06:24   ncbi_dataset/data/genomic.fna
699035685  10-05-2020 06:25   ncbi_dataset/data/protein.faa
959423928  10-05-2020 06:26   ncbi_dataset/data/protein.gpff
720498839  10-05-2020 06:27   ncbi_dataset/data/data_report.yaml
     2398  10-05-2020 06:28   ncbi_dataset/data/virus_dataset.md
  2088828  10-05-2020 06:28   ncbi_dataset/data/pdb/6VYB.pdb
   758727  10-05-2020 06:28   ncbi_dataset/data/pdb/6VYO.pdb
    66582  10-05-2020 06:28   ncbi_dataset/data/pdb/6W37.pdb
   675378  10-05-2020 06:28   ncbi_dataset/data/pdb/6W4H.pdb
  1258092  10-05-2020 06:28   ncbi_dataset/data/pdb/6W9C.pdb
   182574  10-05-2020 06:28   ncbi_dataset/data/pdb/6W9Q.pdb
   436995  10-05-2020 06:28   ncbi_dataset/data/pdb/6WEY.pdb
   983583  10-05-2020 06:28   ncbi_dataset/data/pdb/6WJI.pdb
  1054296  10-05-2020 06:28   ncbi_dataset/data/pdb/6WLC.pdb
   448173  10-05-2020 06:28   ncbi_dataset/data/pdb/7BQY.pdb
   772983  10-05-2020 06:28   ncbi_dataset/data/pdb/7BV2.pdb
     1375  10-05-2020 06:28   ncbi_dataset/data/dataset_catalog.json
---------                     -------
4656693607                     19 files

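The archive contained genomic sequences, CDS nucleotide sequences, protein sequences (FASTA and GenPept), PDB structure files, and more.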

datasets download virus genome taxon SARS-CoV-2 --filename SARS2-all.zip --dehydrated
unknown flag: --dehydrated

--dehydrated could not be used with the virus download.

The pages about SARS-CoV-2 were quite extensive, so I expect the features to be expanded further.

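Conclusion

From this brief trial, I found it a very easy-to-use tool even though it is an alpha release.
The rehydration feature is especially convenient.

Because the tool is still under development, note that its features and commands are likely to change.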

datasets is currently in alpha and will be updated frequently to add new features, fix bugs, and enhance usability. Command syntax is subject to frequent changes. Please check this page often for updates.
https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start/
