More than 5 years have passed since last update.

cromwell を利用してWorkflows for germline short variant discovery with GATK4を実行する

Last updated at 2019-02-05Posted at 2019-01-31

Cromwellを利用してgatk4-germline-snps-indelsを実行する

https://github.com/gatk-workflows/gatk4-germline-snps-indels
を実行する

このワークフローを実行している環境は

OS: macOS Mojave Version 10.14.2
CPU: Core i5 (4 Core)
Memory: 32GB

下記の記述内容の続き
cromwell でGATKのワークフローgatk4-data-processing-1.0.1/processing-for-variant-discovery-gatk4.wdlを実行する

手順概要

gatk4-germline-snps-indelsをダウンロード
必要なファイルをgoogle storageから手元にダウンロード
- 下記の2つのファイルからユーティリティーdownload_gs.pyを利用してファイルを取得する
- haplotypecaller-gvcf-gatk4.hg38.wgs.inputs.json に記載のgs://から始まるファイル
- gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt　に記載のファイル
入力ファイルを修正
- gs://部分を加工
- haplotypecaller-gvcf-gatk4.hg38.wgs.inputs.json
  - 手元で実行するため、小さい方のbamファイルを選択する
  - gvcfを作成する
ワークフローの修正
- 実行時にreferenceファイルのサイズから動的にディスク容量を計算する部分をコメントアウト

gatk4-germline-snps-indelsをダウンロード

https://github.com/gatk-workflows/gatk4-germline-snps-indels/releases
Source Code (zip)をダブルクリックしてダウンロード

gatk4-germline-snps-indels-1.0.4

$ mv ~/Downloads/gatk4-germline-snps-indels-1.0.4 .

必要なファイルをgoogle storageから手元にダウンロード

$ python3 download_gs.py gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4.hg38.wgs.inputs.json

ダウンロードしたgatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txtには解析領域を示す一連のファイルが記載されているので、このファイルに記載されている一連のファイルもダウンロードする

$ python3 download_gs.py gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt --non_json

入力ファイルを修正

gs://を次のファイルから取り除いて
gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4.hg38.wgs.inputs.json

gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4.hg38.wgs.inputs_local.json
を作成する

$ cat gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4.hg38.wgs.inputs.json | sed 's/gs:\/\///g' > gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4.hg38.wgs.inputs_local.json

gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txtの内部に記載されているファイルについても同様

元のファイルをgatk-test-data/intervals/hg38_wgs_scattered_calling_intervals_bak.txtにリネームして、ファイルの記述からgs://を除いたファイルをgatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txtとして作成する

$ mv gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt  gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals_bak.txt 
$ cat gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals_bak.txt | sed 's/gs:\/\///g' > gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt

入力用jsonファイルの変更

gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4.hg38.wgs.inputs_local.jsonファイルのサンプルbamはお試し用には大きすぎるので

  "##_COMMENT1": "INPUT BAM",
  "#HaplotypeCallerGvcf_GATK4.input_bam": "gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam",
  "#HaplotypeCallerGvcf_GATK4.input_bam_index": "gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bai",
  "HaplotypeCallerGvcf_GATK4.input_bam": "broad-public-datasets/NA12878/NA12878.cram",
  "HaplotypeCallerGvcf_GATK4.input_bam_index": "broad-public-datasets/NA12878/NA12878.cram.crai",

小さい方のサンプルを利用するために、前2行のコメントを外し、後2行をコメントアウトする

  "##_COMMENT1": "INPUT BAM",
  "HaplotypeCallerGvcf_GATK4.input_bam": "gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam",
  "HaplotypeCallerGvcf_GATK4.input_bam_index": "gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bai",
  "#HaplotypeCallerGvcf_GATK4.input_bam": "broad-public-datasets/NA12878/NA12878.cram",
  "#HaplotypeCallerGvcf_GATK4.input_bam_index": "broad-public-datasets/NA12878/NA12878.cram.crai",

次の#HaplotypeCallerGvcf_GATK4.HaplotypeCaller.make_gvcfのコメントを外す

  "##_COMMENT4": "MISCELLANEOUS PARAMETERS",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.make_gvcf": "True",

java -Dbackend.providers.Local.config.concurrent-job-limit=2 -jar cromwell.jar run  gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4.wdl -i gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4.hg38.wgs.inputs_local.json

しかし、このままでは動作しない

 [error] WorkflowManagerActor Workflow 30e9fce5-4ef2-4974-abf7-fa7496cb7022 failed (during ExecutingWorkflowState): cromwell.engine.workflow.lifecycle.execution.job.preparation.JobPreparationActor$$anonfun$1$$anon$1: Call input and runtime attributes evaluation failed for HaplotypeCaller:
Failed to evaluate input 'disk_size' (reason 1 of 1): [Attempted 1 time(s)] - NoSuchFileException: /Users/xxxx/Documents/cromwell/cromwell-executions/HaplotypeCallerGvcf_GATK4/30e9fce5-4ef2-4974-abf7-fa7496cb7022/call-HaplotypeCaller/shard-14/execution/gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam
Failed to evaluate input 'ref_size' (reason 1 of 1): [Attempted 1 time(s)] - NoSuchFileException: /Users/xxxx/Documents/cromwell/cromwell-executions/HaplotypeCallerGvcf_GATK4/30e9fce5-4ef2-4974-abf7-fa7496cb7022/call-HaplotypeCaller/shard-14/execution/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
	at cromwell.engine.workflow.lifecycle.execution.job.preparation.JobPreparationActor$$anonfun$1.applyOrElse(JobPreparationActor.scala:66)
	at cromwell.engine.workflow.lifecycle.execution.job.preparation.JobPreparationActor$$anonfun$1.applyOrElse(JobPreparationActor.scala:62)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
	at akka.actor.FSM.processEvent(FSM.scala:684)
	at akka.actor.FSM.processEvent$(FSM.scala:681)
	at cromwell.engine.workflow.lifecycle.execution.job.preparation.JobPreparationActor.processEvent(JobPreparationActor.scala:39)
	at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:678)
	at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:672)
	at akka.actor.Actor.aroundReceive(Actor.scala:517)
	at akka.actor.Actor.aroundReceive$(Actor.scala:515)
	at cromwell.engine.workflow.lifecycle.execution.job.preparation.JobPreparationActor.aroundReceive(JobPreparationActor.scala:39)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
	at akka.actor.ActorCell.invoke(ActorCell.scala:557)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
	at akka.dispatch.Mailbox.run(Mailbox.scala:225)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

ワークフローの修正

wdl内部で必要なディスク容量を計算するためにリファレンスファイルのサイズを求めているが、inputの指定の段階では該当のファイルがコピーされていないためにNoSuchFileException:が発生する
結局wdlファイルに手を入れない限り、このワークフローを動作させられない
そこで、該当領域をコメントアウトしたwdlを作成する

変更箇所について

gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4.wdlを修正してgatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4_local.wdlを作成する

103行目の
# TASK DEFINITIONS 以降の一部をコメントアウトしている
オリジナルのwdlの該当行以降を下記で置き換えてgatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4_local.wdlを作成

gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4_local.wdlの一部

# TASK DEFINITIONS

task CramToBamTask {
  # Command parameters
  File ref_fasta
  File ref_fasta_index
  File ref_dict
  File input_cram
  String sample_name

  # Runtime parameters
  String docker
  Int? machine_mem_gb
  Int? disk_space_gb
  Boolean use_ssd = false
  Int? preemptible_attempts

  #Float output_bam_size = size(input_cram, "GB") / 0.60
  #Float ref_size = size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB")
  #Int disk_size = ceil(size(input_cram, "GB") + output_bam_size + ref_size) + 20

  command {
    set -e
    set -o pipefail

    samtools view -h -T ${ref_fasta} ${input_cram} |
    samtools view -b -o ${sample_name}.bam -
    samtools index -b ${sample_name}.bam
    mv ${sample_name}.bam.bai ${sample_name}.bai
  }
  runtime {
    docker: docker
    #memory: select_first([machine_mem_gb, 15]) + " GB"
    #disks: "local-disk " + select_first([disk_space_gb, disk_size]) + if use_ssd then " SSD" else " HDD"
    preemptibe: preemptible_attempts
 }
  output {
    File output_bam = "${sample_name}.bam"
    File output_bai = "${sample_name}.bai"
  }
}

# HaplotypeCaller per-sample in GVCF mode
task HaplotypeCaller {
  File input_bam
  File input_bam_index
  File interval_list
  String output_filename
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  Float? contamination
  Boolean make_gvcf

  String gatk_path
  String? java_options
  String java_opt = select_first([java_options, "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10"])

  # Runtime parameters
  String docker
  Int? mem_gb
  Int? disk_space_gb
  Boolean use_ssd = false
  Int? preemptible_attempts

  Int machine_mem_gb = select_first([mem_gb, 7])
  Int command_mem_gb = machine_mem_gb - 1

  #Float ref_size = size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB")
  #Int disk_size = ceil(size(input_bam, "GB") + ref_size) + 20

  command <<<
  set -e
  
    ${gatk_path} --java-options "-Xmx${command_mem_gb}G ${java_opt}" \
      HaplotypeCaller \
      -R ${ref_fasta} \
      -I ${input_bam} \
      -L ${interval_list} \
      -O ${output_filename} \
      -contamination ${default=0 contamination} ${true="-ERC GVCF" false="" make_gvcf}
  >>>

  runtime {
    docker: docker
    #memory: machine_mem_gb + " GB"
    #disks: "local-disk " + select_first([disk_space_gb, disk_size]) + if use_ssd then " SSD" else " HDD"
    preemptible: select_first([preemptible_attempts, 3])
  }

  output {
    File output_vcf = "${output_filename}"
    File output_vcf_index = "${output_filename}.tbi"
  }
}
# Merge GVCFs generated per-interval for the same sample
task MergeGVCFs {
  Array[File] input_vcfs
  Array[File] input_vcfs_indexes
  String output_filename

  String gatk_path

  # Runtime parameters
  String docker
  Int? mem_gb
  Int? disk_space_gb
  Boolean use_ssd = false
  Int? preemptible_attempts

  Int machine_mem_gb = select_first([mem_gb, 3])
  Int command_mem_gb = machine_mem_gb - 1

  command <<<
  set -e

    ${gatk_path} --java-options "-Xmx${command_mem_gb}G"  \
      MergeVcfs \
      --INPUT ${sep=' --INPUT ' input_vcfs} \
      --OUTPUT ${output_filename}
  >>>

  runtime {
    docker: docker
    memory: machine_mem_gb + " GB"
    #disks: "local-disk " + select_first([disk_space_gb, 100]) + if use_ssd then " SSD" else " HDD"
    preemptible: select_first([preemptible_attempts, 3])
  }


  output {
    File output_vcf = "${output_filename}"
    File output_vcf_index = "${output_filename}.tbi"
  }
}

ワークフローの実行

$ java -Dbackend.providers.Local.config.concurrent-job-limit=2 -jar cromwell.jar run  gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4_local.wdl -i gatk4-germline-snps-indels-1.0.4/haplotypecaller-gvcf-gatk4.hg38.wgs.inputs_local.json

出力ファイル

{
  "outputs": {
    "HaplotypeCallerGvcf_GATK4.output_vcf": "/Users/xxxx/Documents/cromwell/cromwell-executions/HaplotypeCallerGvcf_GATK4/c87278d5-f092-4179-86d9-3bf5a6b04a11/call-MergeGVCFs/execution/NA12878_24RG_small.hg38.g.vcf.gz",
    "HaplotypeCallerGvcf_GATK4.output_vcf_index": "/Users/xxxx/Documents/cromwell/cromwell-executions/HaplotypeCallerGvcf_GATK4/c87278d5-f092-4179-86d9-3bf5a6b04a11/call-MergeGVCFs/execution/NA12878_24RG_small.hg38.g.vcf.gz.tbi"
  },
  "id": "c87278d5-f092-4179-86d9-3bf5a6b04a11"
}

今回はここまで

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up