More than 5 years have passed since last update.

cromwell でjoint-discovery-gatk4-localを実行する

Last updated at 2019-02-05Posted at 2019-01-31

cwltools でjoint-discovery-gatk4-localを実行する

これは、下記の一連のステップの最終段階としてjoint-discovery-gatk4-localを処理するものである
cromwell を手元のOS Xにインストールする

 cromwell でGATKのワークフローgatk4-data-processing-1.0.1/processing-for-variant-discovery-gatk4.wdlを実行する
 cromwell を利用してWorkflows for germline short variant discovery with GATK4を実行する

ただし、この処理には100GBのメモリを要求するプロセスが存在するため、今回は下記のような構成のシステムを利用する

OS: Red Hat Enterprise Linux Server release 7.1
CPU: 40 cores
Memory: 1TB

CPU コア数とメモリ容量の確認

Cores

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.1 (Maipo)
$ grep processor /proc/cpuinfo | wc -l
40
$ grep 'MemTotal' /proc/meminfo
MemTotal:       1056749896 kB

手順概要

gatk4-germline-snps-indelsをダウンロード
- 以前にダウンロードしていればこの手順は不要
必要なファイルをgoogle storageから手元にダウンロード
- 下記の2つのファイルからユーティリティーdownload_gs.py**(V1.1.1以上)**を利用してファイルを取得する
- gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4.hg38.wgs.inputs.json に記載のgs://から始まるファイル
- gatk-test-data/joint_discovery/NA12878.sample_map　に記載のファイル
- gs://gatk-test-data/joint_discovery/NA12878.g.vcf.gz.tbi というインデックスファイル
入力ファイルgatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs_local.jsonを作成
- gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs.jsonを元に下記の点を変更
- フォルダーの構成をダウンロードした構成に合致するように変更

必要なファイルをgoogle storageから手元にダウンロード

計算に必要なファイルをgoogle storageからダウンロードするために、download_gs.pyを利用する

この際に利用するのはgatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4.hg38.wgs.inputs.json

$ python3 download_gs.py gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4.hg38.wgs.inputs.json

上記で取得したgatk-test-data/joint_discovery/NA12878.sample_map中のデータファイルを取得する

$ python3 download_gs.py gatk-test-data/joint_discovery/NA12878.sample_map --non_json

gs://gatk-test-data/joint_discovery/NA12878.g.vcf.gz.tbi のダウンロード

更にNA12878.sample_mapに含まれるNA12878.g.vcf.gzのインデックスファイルNA12878.g.vcf.gz.tbiを取得しておく

$ echo "gs://gatk-test-data/joint_discovery/NA12878.g.vcf.gz.tbi" > index.txt
$ python3 download_gs.py index.txt --non_json

gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs_local.jsonの作成

gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs.jsonを元に処理を行うが、先程入手したテストデータのフォルダー構造とは異なっているので、詳細部分の調整を行う必要がある
ファイルのフォルダ指定を調整したファイルgatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs_local.jsonを作成した

変更点が多いため結果のファイルの全体を示す

gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs_local.json

{
  "##_COMMENT1": "INPUT GVCFs & COHORT -- DATASET-SPECIFC, MUST BE ADAPTED",
  "JointGenotyping.sample_names": ["NA12878"],
  "JointGenotyping.input_gvcfs": ["gatk-test-data/joint_discovery/NA12878.g.vcf.gz"],
  "JointGenotyping.input_gvcfs_indices": ["gatk-test-data/joint_discovery/NA12878.g.vcf.gz.tbi"],
  "JointGenotyping.callset_name": "NA12878",

  "##_COMMENT2": "REFERENCE FILES",
  "JointGenotyping.ref_fasta": "broad-references/hg38/v0/Homo_sapiens_assembly38.fasta",
  "JointGenotyping.ref_fasta_index": "broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",
  "JointGenotyping.ref_dict": "broad-references/hg38/v0/Homo_sapiens_assembly38.dict",

  "##_COMMENT3": "INTERVALS",
  "JointGenotyping.eval_interval_list": "broad-references/hg38/v0/wgs_evaluation_regions.hg38.interval_list",
  "JointGenotyping.unpadded_intervals_file": "gatk-test-data/intervals/hg38.even.handcurated.20k.intervals",

  "##_COMMENT4": "RESOURCE FILES",
  "JointGenotyping.dbsnp_vcf": "broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
  "JointGenotyping.dbsnp_vcf_index": "broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
  "JointGenotyping.one_thousand_genomes_resource_vcf": "broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz",
  "JointGenotyping.one_thousand_genomes_resource_vcf_index": "broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi",
  "JointGenotyping.omni_resource_vcf": "broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz",
  "JointGenotyping.omni_resource_vcf_index": "broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz.tbi",
  "JointGenotyping.mills_resource_vcf": "broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
  "JointGenotyping.mills_resource_vcf_index": "broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
  "JointGenotyping.axiomPoly_resource_vcf": "broad-references/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz",
  "JointGenotyping.axiomPoly_resource_vcf_index": "broad-references/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz.tbi",
  "JointGenotyping.hapmap_resource_vcf": "broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz",
  "JointGenotyping.hapmap_resource_vcf_index": "broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz.tbi",

  "##_COMMENT5": "VQSR PARAMETERS",
  "JointGenotyping.SNP_VQSR_downsampleFactor": 10,
  "JointGenotyping.snp_filter_level": 99.7,
  "JointGenotyping.indel_filter_level": 99.7,
  "JointGenotyping.indel_recalibration_annotation_values": ["FS", "ReadPosRankSum", "MQRankSum", "QD", "SOR", "DP"],
  "JointGenotyping.indel_recalibration_tranche_values": ["100.0", "99.95", "99.9", "99.5", "99.0", "97.0", "96.0", "95.0", "94.0", "93.5", "93.0", "92.0", "91.0", "90.0"],
  "JointGenotyping.snp_recalibration_tranche_values": ["100.0", "99.95", "99.9", "99.8", "99.6", "99.5", "99.4", "99.3", "99.0", "98.0", "97.0", "90.0" ],
  "JointGenotyping.snp_recalibration_annotation_values": ["QD", "MQRankSum", "ReadPosRankSum", "FS", "MQ", "SOR", "DP"],

  "##_COMMENT4": "DOCKERS",
  "JointGenotyping.python_docker": "python:2.7",
  "JointGenotyping.gatk_docker": "broadinstitute/gatk:4.0.6.0",

  "##_COMMENT5": "PATHS",
  "JointGenotyping.gatk_path": "/gatk/gatk",

  "##_COMMENT8": "DISK SIZE ALLOCATION",
  "JointGenotyping.small_disk": 100,
  "JointGenotyping.medium_disk": 200,
  "JointGenotyping.large_disk": 300,
  "JointGenotyping.huge_disk": 400
}

実行にあたっての注意

インターバルの定義ファイルが大きすぎるためエラーが発生するので、cromwell起動時に次のオプションを利用する
-Dsystem.input-read-limits.lines=500000
この指定に関する詳細は次を参照のこと Default limit on file input to read_lines is too small #2768

joint-discovery-gatk4.wdl はGoogle Cloudの外では動作しない
詳細は次を参照のことImportGVCFs error message "Failed to create reader from file"

gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.wdlの実行

システムに40 core存在するが、他のサービスの動作を妨げないため同時実行可能な処理の上限を30として実行する事とした
"genomicsdb"の登録部分では並列化の恩恵が大きい、しかしワークフローの他の部分ではシーケンシャル処理が多く、同時実行可能処理を増加させても劇的な速度向上は見込めない

$ java -Dbackend.providers.Local.config.concurrent-job-limit=30 -Dsystem.input-read-limits.lines=500000 -jar cromwell.jar run gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.wdl -i gatk4-germline-snps-indels-1.0.4/joint-discovery-gatk4-local.hg38.wgs.inputs_local.json

処理結果

下記の通り結果が出力された(処理時間2:44)


{ 
  "outputs": {
    "JointGenotyping.DynamicallyCombineIntervals.output_intervals": "/data/workdir/xxx/cromwell/cromwell-executions/JointGenotyping/9bb3e39d-df36-4479-9628-481e0ac12bad/call-DynamicallyCombineIntervals/execution/out.intervals",
    "JointGenotyping.CollectMetricsOnFullVcf.summary_metrics_file": "/data/workdir/xxx/cromwell/cromwell-executions/JointGenotyping/9bb3e39d-df36-4479-9628-481e0ac12bad/call-CollectMetricsOnFullVcf/execution/NA12878.variant_calling_summary_metrics",
    "JointGenotyping.GatherMetrics.detail_metrics_file": null,
    "JointGenotyping.GatherMetrics.summary_metrics_file": null,
    "JointGenotyping.CollectMetricsOnFullVcf.detail_metrics_file": "/data/workdir/xxx/cromwell/cromwell-executions/JointGenotyping/9bb3e39d-df36-4479-9628-481e0ac12bad/call-CollectMetricsOnFullVcf/execution/NA12878.variant_calling_detail_metrics",
    "JointGenotyping.FinalGatherVcf.output_vcf": "/data/workdir/xxx/cromwell/cromwell-executions/JointGenotyping/9bb3e39d-df36-4479-9628-481e0ac12bad/call-FinalGatherVcf/execution/NA12878.vcf.gz",
    "JointGenotyping.FinalGatherVcf.output_vcf_index": "/data/workdir/xxx/cromwell/cromwell-executions/JointGenotyping/9bb3e39d-df36-4479-9628-481e0ac12bad/call-FinalGatherVcf/execution/NA12878.vcf.gz.tbi"
  },
  "id": "9bb3e39d-df36-4479-9628-481e0ac12bad"
}

今回はここまで

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up