Help us understand the problem. What is going on with this article?

cromwell でjoint-discovery-gatk4-localを実行する

cwltools でjoint-discovery-gatk4-localを実行する

https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4-local.wdl を実行する

これは、下記の一連のステップの最終段階としてjoint-discovery-gatk4-localを処理するものである
cromwell を手元のOS Xにインストールする

cromwell でGATKのワークフローgatk4-data-processing-1.0.1/processing-for-variant-discovery-gatk4.wdlを実行する
cromwell を利用してWorkflows for germline short variant discovery with GATK4を実行する

ただし、この処理には100GBのメモリを要求するプロセスが存在するため、今回は下記のような構成のシステムを利用する

OS: Red Hat Enterprise Linux Server release 7.1
CPU: 40 cores
Memory: 1TB

CPU コア数とメモリ容量の確認

Cores
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.1 (Maipo)
$ grep processor /proc/cpuinfo | wc -l
40
$ grep 'MemTotal' /proc/meminfo
MemTotal:       1056749896 kB

手順概要

  • gatk4-germline-snps-indelsをダウンロード
    • 以前にダウンロードしていればこの手順は不要
  • 必要なファイルをgoogle storageから手元にダウンロード
    • 下記の2つのファイルからユーティリティーdownload_gs.py(V1.1.1以上)を利用してファイルを取得する
    • gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4.hg38.wgs.inputs.json に記載のgs://から始まるファイル
    • gatk-test-data/joint_discovery/NA12878.sample_map に記載のファイル
    • gs://gatk-test-data/joint_discovery/NA12878.g.vcf.gz.tbi というインデックスファイル
  • 入力ファイルgatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs_local.jsonを作成
    • gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs.jsonを元に下記の点を変更
    • フォルダーの構成をダウンロードした構成に合致するように変更

必要なファイルをgoogle storageから手元にダウンロード

計算に必要なファイルをgoogle storageからダウンロードするために、download_gs.pyを利用する

この際に利用するのはgatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4.hg38.wgs.inputs.json

$ python3 download_gs.py gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4.hg38.wgs.inputs.json

上記で取得したgatk-test-data/joint_discovery/NA12878.sample_map中のデータファイルを取得する

$ python3 download_gs.py gatk-test-data/joint_discovery/NA12878.sample_map --non_json

gs://gatk-test-data/joint_discovery/NA12878.g.vcf.gz.tbi のダウンロード

更にNA12878.sample_mapに含まれるNA12878.g.vcf.gzのインデックスファイルNA12878.g.vcf.gz.tbiを取得しておく

$ echo "gs://gatk-test-data/joint_discovery/NA12878.g.vcf.gz.tbi" > index.txt
$ python3 download_gs.py index.txt --non_json

gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs_local.jsonの作成

gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs.jsonを元に処理を行うが、先程入手したテストデータのフォルダー構造とは異なっているので、詳細部分の調整を行う必要がある
ファイルのフォルダ指定を調整したファイルgatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs_local.jsonを作成した

変更点が多いため結果のファイルの全体を示す

gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs_local.json
{
  "##_COMMENT1": "INPUT GVCFs & COHORT -- DATASET-SPECIFC, MUST BE ADAPTED",
  "JointGenotyping.sample_names": ["NA12878"],
  "JointGenotyping.input_gvcfs": ["gatk-test-data/joint_discovery/NA12878.g.vcf.gz"],
  "JointGenotyping.input_gvcfs_indices": ["gatk-test-data/joint_discovery/NA12878.g.vcf.gz.tbi"],
  "JointGenotyping.callset_name": "NA12878",

  "##_COMMENT2": "REFERENCE FILES",
  "JointGenotyping.ref_fasta": "broad-references/hg38/v0/Homo_sapiens_assembly38.fasta",
  "JointGenotyping.ref_fasta_index": "broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",
  "JointGenotyping.ref_dict": "broad-references/hg38/v0/Homo_sapiens_assembly38.dict",

  "##_COMMENT3": "INTERVALS",
  "JointGenotyping.eval_interval_list": "broad-references/hg38/v0/wgs_evaluation_regions.hg38.interval_list",
  "JointGenotyping.unpadded_intervals_file": "gatk-test-data/intervals/hg38.even.handcurated.20k.intervals",

  "##_COMMENT4": "RESOURCE FILES",
  "JointGenotyping.dbsnp_vcf": "broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
  "JointGenotyping.dbsnp_vcf_index": "broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
  "JointGenotyping.one_thousand_genomes_resource_vcf": "broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz",
  "JointGenotyping.one_thousand_genomes_resource_vcf_index": "broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi",
  "JointGenotyping.omni_resource_vcf": "broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz",
  "JointGenotyping.omni_resource_vcf_index": "broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz.tbi",
  "JointGenotyping.mills_resource_vcf": "broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
  "JointGenotyping.mills_resource_vcf_index": "broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
  "JointGenotyping.axiomPoly_resource_vcf": "broad-references/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz",
  "JointGenotyping.axiomPoly_resource_vcf_index": "broad-references/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz.tbi",
  "JointGenotyping.hapmap_resource_vcf": "broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz",
  "JointGenotyping.hapmap_resource_vcf_index": "broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz.tbi",

  "##_COMMENT5": "VQSR PARAMETERS",
  "JointGenotyping.SNP_VQSR_downsampleFactor": 10,
  "JointGenotyping.snp_filter_level": 99.7,
  "JointGenotyping.indel_filter_level": 99.7,
  "JointGenotyping.indel_recalibration_annotation_values": ["FS", "ReadPosRankSum", "MQRankSum", "QD", "SOR", "DP"],
  "JointGenotyping.indel_recalibration_tranche_values": ["100.0", "99.95", "99.9", "99.5", "99.0", "97.0", "96.0", "95.0", "94.0", "93.5", "93.0", "92.0", "91.0", "90.0"],
  "JointGenotyping.snp_recalibration_tranche_values": ["100.0", "99.95", "99.9", "99.8", "99.6", "99.5", "99.4", "99.3", "99.0", "98.0", "97.0", "90.0" ],
  "JointGenotyping.snp_recalibration_annotation_values": ["QD", "MQRankSum", "ReadPosRankSum", "FS", "MQ", "SOR", "DP"],

  "##_COMMENT4": "DOCKERS",
  "JointGenotyping.python_docker": "python:2.7",
  "JointGenotyping.gatk_docker": "broadinstitute/gatk:4.0.6.0",

  "##_COMMENT5": "PATHS",
  "JointGenotyping.gatk_path": "/gatk/gatk",

  "##_COMMENT8": "DISK SIZE ALLOCATION",
  "JointGenotyping.small_disk": 100,
  "JointGenotyping.medium_disk": 200,
  "JointGenotyping.large_disk": 300,
  "JointGenotyping.huge_disk": 400
}

実行にあたっての注意

インターバルの定義ファイルが大きすぎるためエラーが発生するので、cromwell起動時に次のオプションを利用する
-Dsystem.input-read-limits.lines=500000
この指定に関する詳細は次を参照のこと Default limit on file input to read_lines is too small #2768

joint-discovery-gatk4.wdl はGoogle Cloudの外では動作しない
詳細は次を参照のことImportGVCFs error message "Failed to create reader from file"

gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.wdlの実行

システムに40 core存在するが、他のサービスの動作を妨げないため同時実行可能な処理の上限を30として実行する事とした
"genomicsdb"の登録部分では並列化の恩恵が大きい、しかしワークフローの他の部分ではシーケンシャル処理が多く、同時実行可能処理を増加させても劇的な速度向上は見込めない

$ java -Dbackend.providers.Local.config.concurrent-job-limit=30 -Dsystem.input-read-limits.lines=500000 -jar cromwell.jar run gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.wdl -i gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs_local.json

処理結果

下記の通り結果が出力された(処理時間2:44)


{ 
  "outputs": {
    "JointGenotyping.DynamicallyCombineIntervals.output_intervals": "/data/workdir/xxx/cromwell/cromwell-executions/JointGenotyping/9bb3e39d-df36-4479-9628-481e0ac12bad/call-DynamicallyCombineIntervals/execution/out.intervals",
    "JointGenotyping.CollectMetricsOnFullVcf.summary_metrics_file": "/data/workdir/xxx/cromwell/cromwell-executions/JointGenotyping/9bb3e39d-df36-4479-9628-481e0ac12bad/call-CollectMetricsOnFullVcf/execution/NA12878.variant_calling_summary_metrics",
    "JointGenotyping.GatherMetrics.detail_metrics_file": null,
    "JointGenotyping.GatherMetrics.summary_metrics_file": null,
    "JointGenotyping.CollectMetricsOnFullVcf.detail_metrics_file": "/data/workdir/xxx/cromwell/cromwell-executions/JointGenotyping/9bb3e39d-df36-4479-9628-481e0ac12bad/call-CollectMetricsOnFullVcf/execution/NA12878.variant_calling_detail_metrics",
    "JointGenotyping.FinalGatherVcf.output_vcf": "/data/workdir/xxx/cromwell/cromwell-executions/JointGenotyping/9bb3e39d-df36-4479-9628-481e0ac12bad/call-FinalGatherVcf/execution/NA12878.vcf.gz",
    "JointGenotyping.FinalGatherVcf.output_vcf_index": "/data/workdir/xxx/cromwell/cromwell-executions/JointGenotyping/9bb3e39d-df36-4479-9628-481e0ac12bad/call-FinalGatherVcf/execution/NA12878.vcf.gz.tbi"
  },
  "id": "9bb3e39d-df36-4479-9628-481e0ac12bad"
}

今回はここまで:smile:

Why do not you register as a user and use Qiita more conveniently?
  1. We will deliver articles that match you
    By following users and tags, you can catch up information on technical fields that you are interested in as a whole
  2. you can read useful information later efficiently
    By "stocking" the articles you like, you can search right away
Comments
Sign up for free and join this conversation.
If you already have a Qiita account
Why do not you register as a user and use Qiita more conveniently?
You need to log in to use this function. Qiita can be used more conveniently after logging in.
You seem to be reading articles frequently this month. Qiita can be used more conveniently after logging in.
  1. We will deliver articles that match you
    By following users and tags, you can catch up information on technical fields that you are interested in as a whole
  2. you can read useful information later efficiently
    By "stocking" the articles you like, you can search right away