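Introduction
I had long thought a tool like this would be handy, but writing one myself seemed like a chore and a little beyond my skills. Then this excellent tool appeared.
It is simple enough that it hardly needs explaining, but here is a brief walkthrough.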
Installation
ffq is a Python tool, so install it with pip.
pip install ffq
Help
ffq -h
None of the options are particularly difficult.
ffq 0.0.2: Fetch run information from the European Nucleotide Archive (ENA).
positional arguments:
IDs Can be a SRA / ENA Run Accessions or Study Accessions, GEO Study
Accessions, DOIs or paper titles.
optional arguments:
-h, --help Show this help message and exit
-o OUT Path to JSON file to write run information. If `--split` is
used, path to directory in which to place JSON files. (default:
standard out)
-t TYPE The type of term used to query data. Can be one of SRR, ERR,
DRR, SRP, ERP, DRP, GSE, DOI (default: SRR)
--split Split runs into their own files.
--verbose Print debugging information
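As a rough illustration of how the options in the help text fit together, here is a small Python sketch that assembles an ffq command line. The helper `build_ffq_command` and its parameter names are hypothetical and not part of ffq. Only the flags `-t`, `-o`, `--split` and `--verbose` come from the ffq 0.0.2 help output above, and passing several IDs at once is an assumption based on the plural "IDs" in that text. Other versions of ffq may behave differently.

```python
import shlex

# Query types listed in the ffq 0.0.2 help text above.
VALID_TYPES = {"SRR", "ERR", "DRR", "SRP", "ERP", "DRP", "GSE", "DOI"}


def build_ffq_command(ids, query_type="SRR", out=None, split=False, verbose=False):
    """Assemble an ffq argument list (hypothetical helper, not part of ffq)."""
    if query_type not in VALID_TYPES:
        raise ValueError(f"unsupported query type: {query_type}")
    if not ids:
        raise ValueError("at least one ID is required")
    cmd = ["ffq", "-t", query_type]
    if out is not None:
        # Per the help text, this is a directory when --split is used.
        cmd += ["-o", out]
    if split:
        cmd.append("--split")
    if verbose:
        cmd.append("--verbose")
    return cmd + list(ids)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without ffq installed.
    print(shlex.join(build_ffq_command(["SRP056282"], query_type="SRP", out="study.json")))
```

The resulting list can be handed to `subprocess.run` once ffq is installed. Building it as a list rather than a single string avoids shell quoting problems.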
Searching for an SRR
ffq SRR1000000
The output is in JSON format.
{
"SRR1000000": {
"accession": "SRR1000000",
"experiment": {
"accession": "SRX357886",
"title": "Illumina HiSeq 2000 paired end sequencing",
"platform": "ILLUMINA",
"instrument": "Illumina HiSeq 2000"
},
"study": {
"accession": "SRP056282",
"title": "Allelic Spectrum in Common Disease:Sequence from participants in the FUSION study",
"abstract": "This study is part of a re-sequencing project to identify variants associated with metabolic syndrome traits in a Finnish cohort. Metabolic syndrome (MetS) increases the risk of cardiovascular disease and diabetes, and prevalence is estimated to be as high as 25% in the United States. MetS is characterized via measure of triglycerides (TG), high-density lipoprotein cholesterol (HDL-C), systolic blood pressure (SBP), diastolic blood pressure (DBP), fasting plasma glucose (FG), body mass index (BMI) and waist-to-hip ratio (WHR). Closely related traits include low-density lipoprotein cholesterol (LDL-C), total cholesterol (TC), fasting plasma insulin (FI) and height (HT). Seventeen loci associated with TG, HDL-C, LDL-C, TC, FG, and FI (Kathiresan et al. 2008, Willer et al. 2008, Sabatti et al. 2009, Dupuis et al. 2010, Teslovich et al. 2010) were prioritized for sequencing. At each locus, protein-coding regions and 5'' and 3'' untranslated regions of genes... (for more see dbGaP study page.)"
},
"sample": {
"accession": "SRS485766",
"title": "DNA sample from a human female participant in the dbGaP study \"Sequence Data From Participants in the FUSION Study\"",
"organism": "Homo sapiens",
"attributes": {
"gap_accession": "phs000702",
"gap_parent_phs": "phs000867",
"submitter handle": "NIDDK",
"biospecimen repository": "NIDDK",
"study name": "Sequence Data From Participants in the FUSION Study",
"study design": "Case-Control",
"biospecimen repository sample id": "FU04357",
"submitted sample id": "FU04357",
"submitted subject id": "656230",
"gap_sample_id": "894937",
"gap_subject_id": "237778",
"sex": "female",
"analyte type": "DNA",
"gap_consent_code": "1",
"gap_consent_short_name": "GRU-IRB",
"ENA-FIRST-PUBLIC": "2013-10-01",
"ENA-LAST-UPDATE": "2018-04-12"
}
},
"title": "Illumina HiSeq 2000 paired end sequencing",
"files": []
}
}
If the raw JSON carries too much information to read comfortably, the jq command is a good way to pull out just the fields you need.
ffq SRR1000000 | jq '.SRR1000000.study.title'
# "Allelic Spectrum in Common Disease:Sequence from participants in the FUSION study"
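For readers who would rather stay in Python, the same lookup can be sketched with the standard `json` module. The helper name `get_path` is hypothetical, and the record below is a trimmed copy of the output shown above rather than the result of a live query.

```python
import json

# Trimmed copy of the ffq output shown above (not fetched live).
RAW = """
{
  "SRR1000000": {
    "accession": "SRR1000000",
    "experiment": {"accession": "SRX357886", "platform": "ILLUMINA"},
    "study": {
      "accession": "SRP056282",
      "title": "Allelic Spectrum in Common Disease:Sequence from participants in the FUSION study"
    },
    "sample": {"accession": "SRS485766", "organism": "Homo sapiens"},
    "files": []
  }
}
"""


def get_path(data, dotted_path):
    """Walk nested dicts along a dot-separated path, like a simple jq filter."""
    node = data
    for key in dotted_path.split("."):
        node = node[key]
    return node


record = json.loads(RAW)

# Rough equivalent of: jq '.SRR1000000.study.title'
print(get_path(record, "SRR1000000.study.title"))

# Collect the accession of each nested section into one flat dict.
run = record["SRR1000000"]
summary = {name: run[name]["accession"] for name in ("experiment", "study", "sample")}
print(summary)
```

To work with real output, one option is to save it with the `-o` option from the help text and read that file with `json.load` instead of the embedded string.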
That is all for this article.
If you found it useful, please introduce your own favorite tool in a Qiita article.