More than 5 years have passed since last update.

GATKを用いた変異解析（SNV, short INDELの検出）のベストプラクティス

Posted at 2018-09-27

はじめに

Broad Instituteによって開発されているGATK(Genome Analysis Toolkit)を使用して変異を検出する。

インプット：生データ（Fastqファイル）
アウトプット：　変異検出結果（VCFファイル）

流れ

参照配列へのマッピング
Duplicatesリードの除去
塩基スコアの再計算
１サンプルごとの変異検出
複数サンプルの変異検出結果の統合
変異スコアの算出

1. 参照配列へのマッピング

マッピングにはBWAを使用する。
事前にインデックスファイルを作成しておく。

bwa index [リファレンス配列]

マッピングをする。

bwa mem -M -R '@RG\tID:[リードグループ]\tSM:[サンプル名]\tPL:[シーケンサー]' [リファレンス配列] [リード１] [リード２] > aln-pe.sam

１つのライブラリーを１つのレーンで読んでいれば、[リードグループ]はサンプル名と同じで良い。

2. Duplicatesリードの除去

まず、マッピング結果をsortする。

samtools sort -o aln-pe-sorted.bam -O BAM aln-pe.sam

Duplicatesリードを除去する（実際には除去せず、印を付けるだけ）。

gatk MarkDuplicates \
  -I aln-pe-sorted.bam \
  -M metrics.txt \
  -O aln-pe_MarkDup.bam

3. 塩基スコアの再計算

既知変異データをもとに、塩基スコアを再計算する。
推奨の既知変異データについては以下を参照。
https://software.broadinstitute.org/gatk/documentation/article.php?id=1247
ダウンロードは以下からできる。
https://software.broadinstitute.org/gatk/download/bundle

gatk BaseRecalibrator \
  -I aln-pe_MarkDup.bam \
  --known-sites [dbsnp] \
  --known-sites [golden_indel] \
  -O recal_data.table \
  -R [リファレンス配列]

gatk ApplyBQSR \
  -I aln-pe_MarkDup.bam \
  -bqsr recal_data.table \
  -O aln-pe_bqsr.bam

4. １サンプルごとの変異検出

gatk HaplotypeCaller \
  -I aln-pe_bqsr.bam \
  -O haplotypecaller.g.vcf \
  --emit-ref-confidence GVCF \
  -R [リファレンス配列]

5. 複数サンプルの変異検出結果の統合

--variantで複数サンプルを指定する。

gatk GenotypeGVCFs \
　　--variant haplotypecaller1.g.vcf \
　　--variant haplotypecaller2.g.vcf \
　　--variant haplotypecaller3.g.vcf \
　　-R [リファレンス配列] \
　　-O haplotypecaller.vcf

6. 変異スコアの算出

既知変異データをもとに、変異スコアを算出する。
詳細は以下を参照。
https://software.broadinstitute.org/gatk/documentation/article.php?id=2805
必要なデータは以下からダウンロードできる。
https://software.broadinstitute.org/gatk/download/bundle

変異スコアの計算（SNP）

gatk VariantRecalibrator \
  -V haplotypecaller.vcf \
  -R [リファレンス配列] \
  -resource hapmap,known=false,training=true,truth=true,prior=15.0:[hapmap] \
  -resource omni,known=false,training=true,truth=true,prior=12.0:[omni] \
  -resource 1000G,known=false,training=true,truth=false,prior=10.0:[1000G] \
  -resource dbsnp,known=true,training=false,truth=false,prior=2.0:[dbsnp] \
  -an DP \
  -an QD \
  -an FS \
  -an SOR \
  -an MQ \
  -an MQRankSum \
  -an ReadPosRankSum \
  -mode SNP \
  -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
  --max-gaussians 8 \
  -O recalibrate_SNP.recal \
  --tranches-file recalibrate_SNP.tranches

変異スコアの適用(SNP)

gatk ApplyVQSR \
  -V haplotypecaller.vcf \
  --recal-file recalibrate_SNP.recal \
  --tranches-file recalibrate_SNP.tranches \
  -mode SNP \
  -ts-filter-level 99.0 \
  -O recalibrated_snps_raw_indels.vcf

変異スコアの計算（INDEL）

gatk VariantRecalibrator \
  -V $recalibrated_snps_raw_indels \
  -R $reference \
  --resource mills,known=false,training=true,truth=true,prior=12.0:[golden_indel] \
  --resource dbsnp,known=true,training=false,truth=false,prior=2.0:[dbsnp] \
  -an QD \
  -an DP \
  -an FS \
  -an SOR \
  -an MQRankSum \
  -an ReadPosRankSum \
  -mode INDEL \
  -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
  --max-gaussians 4 \
  -O recalibrate_INDEL.recal \
  --tranches-file recalibrate_INDEL.tranches

変異スコアの適用(INDEL)

gatk ApplyVQSR \
  -V recalibrated_snps_raw_indels.vcf \
  --recal-file recalibrate_INDEL.recal \
  --tranches-file recalibrate_INDEL.tranches \
  -mode INDEL \
  -ts-filter-level 99.0 \
  -O recalibrated_variants.vcf

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up