Using Clara Parabricks 4.0, Part 1

Posted at 2022-10-03

Using Clara Parabricks V4.0

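What is Clara Parabricks?

A package from NVIDIA that uses GPUs to run genome analysis at high speed.
Up to 3.x it required an annual license fee,
but from 4.x onward it is essentially free of charge.
If you need support, you purchase an "NVIDIA AI Enterprise" license.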

A license is no longer required to use Clara Parabricks. The container works out of the box once downloaded. Users can run Parabricks without any limitations on:
Number of systems used
Number of hours used
Duration of use
Users who would like to have Enterprise Support for Clara Parabricks can purchase NVIDIA AI Enterprise licenses, which provides full-stack support. To learn more about NVIDIA AI Enterprise, please visit https://www.nvidia.com/en-us/data-center/products/ai-enterprise/.

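It mainly uses GPUs to accelerate analysis tools that are built on dynamic programming (DP) algorithms.

  • Mapping Illumina short reads to a reference sequence with BWA
  • Variant detection with GATK

These and similar steps can be sped up by orders of magnitude.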

How do you use Clara Parabricks?

You need a system that meets the following requirements for running Clara Parabricks.

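Hardware

  • GPU
    • Two or more installed
    • Supports one of the CUDA architectures 60, 70, 75, or 80
      • These correspond to the Pascal, Volta, Turing, and Ampere GPU designs
    • Each GPU board has at least 16 GB of memory
  • Host PC
    • With 2 GPUs
      • At least 100 GB of main memory
      • A CPU that can run at least 24 threads
    • With 4 GPUs
      • At least 196 GB of main memory
      • A CPU that can run at least 32 threads
    • With 8 GPUs
      • At least 392 GB of main memory
      • A CPU that can run at least 48 threads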
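
The host-side figures above can be checked with a short shell sketch. The helper names `required_for_gpus` and `check_host` are hypothetical (they are not part of Parabricks), and the sketch assumes a Linux host with `/proc/meminfo` and `nproc` available.

```shell
# Sketch only: compares this host with the memory/thread figures listed above.

# Print "<min_mem_gb> <min_threads>" for a GPU count of 2, 4 or 8.
required_for_gpus() {
    case "$1" in
        2) echo "100 24" ;;
        4) echo "196 32" ;;
        8) echo "392 48" ;;
        *) return 1 ;;
    esac
}

# Report the host's memory and thread count against the figures for N GPUs.
check_host() {
    need="$(required_for_gpus "$1")" || { echo "no figures listed for $1 GPUs"; return 2; }
    need_mem="${need% *}"
    need_threads="${need#* }"
    # MemTotal is in kB and is slightly below the installed RAM,
    # so treat a near miss as a reason to check by hand.
    mem_kb="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
    mem_gb=$(( mem_kb / 1024 / 1024 ))
    threads="$(nproc)"
    echo "memory: ${mem_gb} GB (need ${need_mem}), threads: ${threads} (need ${need_threads})"
    [ "$mem_gb" -ge "$need_mem" ] && [ "$threads" -ge "$need_threads" ]
}
```

For a two-GPU machine, `check_host 2` prints the measured values and returns a non-zero status if either figure falls short.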

Software

  • OS
    • Linux
  • An environment in which nvidia-docker2 runs
    • Docker version 20.10 or later
  • GPU
    • Driver version 465.32 or later

Checking that NVIDIA Docker is working

Run the following command:

$ docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

Example output:

Mon Oct  3 07:44:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:86:00.0 Off |                  Off |
| 30%   54C    P0    85W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:D8:00.0 Off |                  Off |
| 30%   53C    P0    89W / 300W |      0MiB / 49140MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

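Check that the value shown after Driver Version: meets the requirement (515.65.01 in the example above, which is new enough).
If the driver version is older than 465.32, the driver must be updated.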
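
The version comparison can also be scripted. The following is a minimal sketch, assuming GNU `sort` with `-V` (version sort) is available. `version_ge`, `report_min` and `check_versions` are hypothetical helper names. The minimum versions are the ones stated in this article.

```shell
# version_ge A B: succeed when version A is greater than or equal to B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# report_min NAME FOUND REQUIRED: print a one-line verdict.
report_min() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: OK (>= $3)"
    else
        echo "$1 $2: too old, need >= $3"
        return 1
    fi
}

# Query the installed driver and Docker server versions and compare them.
check_versions() {
    rc=0
    # All GPUs on one host share a driver, so the first line is enough.
    driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    docker_ver="$(docker version --format '{{.Server.Version}}')"
    report_min "NVIDIA driver" "$driver" 465.32 || rc=1
    report_min "Docker" "$docker_ver" 20.10 || rc=1
    return "$rc"
}
```

Running `check_versions` on the machine from the example output would report the driver 515.65.01 as OK.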

If NVIDIA Docker is not installed, install the NVIDIA Container Toolkit.

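The fq2bam tool in Clara Parabricks

fq2bam maps reads obtained from a next-generation sequencer to a reference sequence. It performs the following three steps:

  • Maps sequence reads in FASTQ format to the reference sequence
  • Creates a BAM in which duplicate reads are flagged
  • Outputs the data for base quality score recalibration (BQSR), which corrects per-base quality values using known variant information

Prepare the following files as input for the analysis:

  • The reference sequence to map to, and its index
  • A VCF of known variants for base quality recalibration
  • The FASTQ files to analyze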

The run example below uses:

  • Reference sequence and its index files (placed here under the directory where the analysis is run): ${PWD}/reference
    • GCA_000001405.15_GRCh38_no_alt_analysis_set.fna and its companion files
GCA_000001405.15_GRCh38_no_alt_analysis_set.dict     GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.ann  GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.pac
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna      GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bwt  GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.sa
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.amb  GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai
  • VCF of known variants (obtained from the Broad Institute Resource Bundle)
    • Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    • Placed in ${PWD}/variant, downloaded by running the following commands in that directory
curl -LO https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
curl -LO https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
  • Directory that holds the FASTQ files to analyze
    • ${PWD}/parabricks_sample/Data
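
Before launching a run, it can help to confirm that the reference and variant files listed above are in place. The following is a minimal sketch. `missing_inputs` is a hypothetical helper, and the file list simply mirrors the names shown in this article rather than an official list of what fq2bam requires.

```shell
# missing_inputs REF_DIR VARIANT_DIR
# Print every expected input file that is absent or empty.
missing_inputs() {
    ref="$1/GCA_000001405.15_GRCh38_no_alt_analysis_set"
    vcf="$2/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
    for f in "$ref.fna" "$ref.fna.amb" "$ref.fna.ann" "$ref.fna.bwt" \
             "$ref.fna.pac" "$ref.fna.sa" "$ref.fna.fai" "$ref.dict" \
             "$vcf" "$vcf.tbi"; do
        # -s is true only for files that exist and are not empty.
        [ -s "$f" ] || echo "$f"
    done
}
```

Calling `missing_inputs "${PWD}/reference" "${PWD}/variant"` prints nothing when all ten files are present.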

Here the paired-end FASTQ files are named with the prefix sample:

$ export SAMPLE="sample"
  • Read 1 is ${SAMPLE}_1.fq.gz
  • Read 2 is ${SAMPLE}_2.fq.gz

With everything in place, run the following command:

docker run \
    --gpus all \
    -u `id -u`:`id -g` \
    --rm \
    --volume ${PWD}/reference:/reference \
    --volume ${PWD}/variant:/variant \
    --volume ${PWD}/parabricks_sample/Data:/fastq \
    --volume ${PWD}/Parabricks4.0:/outputdir \
    --workdir /tmp \
    nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1 \
    pbrun fq2bam \
    --ref /reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
    --knownSites /variant/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --in-fq /fastq/${SAMPLE}_1.fq.gz /fastq/${SAMPLE}_2.fq.gz \
    --out-bam /outputdir/${SAMPLE}.bam \
    --out-recal-file /outputdir/${SAMPLE}.BQSR-report.txt
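
The command above bind-mounts ${PWD}/Parabricks4.0 as /outputdir and runs the container as the calling user through -u. With a default (rootful) Docker daemon, a missing host directory given to --volume is created by the daemon and owned by root, so the container user may not be able to write the BAM there. The following is a minimal sketch for creating the directory beforehand. `prepare_outdir` is a hypothetical helper name.

```shell
# prepare_outdir DIR
# Create DIR if needed and confirm the current user can write to it.
prepare_outdir() {
    mkdir -p "$1" || return 1
    if [ -w "$1" ]; then
        echo "output directory ready: $1"
    else
        echo "cannot write to $1 as $(id -un)" >&2
        return 1
    fi
}
```

Running `prepare_outdir "${PWD}/Parabricks4.0"` before `docker run` is enough for the layout used here.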

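The sample used here contains about 53 million reads.
The whole fq2bam run (alignment, sorting, duplicate marking, and BQSR) finishes in about 3 minutes, from 07:47:46 to 07:50:29 in the log below.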

[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /fastq/sample_1.fq.gz and /fastq/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[PB Info 2022-Oct-03 07:47:46] ------------------------------------------------------------------------------
[PB Info 2022-Oct-03 07:47:46] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2022-Oct-03 07:47:46] ||                              Version 4.0.0-1                             ||
[PB Info 2022-Oct-03 07:47:46] ||                       GPU-BWA mem, Sorting Phase-I                       ||
[PB Info 2022-Oct-03 07:47:46] ------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[PB Info 2022-Oct-03 07:47:54] GPU-BWA mem
[PB Info 2022-Oct-03 07:47:54] ProgressMeter    Reads           Base Pairs Aligned
[PB Info 2022-Oct-03 07:48:06] 5043564          570000000
[PB Info 2022-Oct-03 07:48:13] 10087128 1160000000
[PB Info 2022-Oct-03 07:48:20] 15130692 1740000000
[PB Info 2022-Oct-03 07:48:26] 20174256 2310000000
[PB Info 2022-Oct-03 07:48:33] 25217820 2900000000
[PB Info 2022-Oct-03 07:48:40] 30261384 3480000000
[PB Info 2022-Oct-03 07:48:48] 35304948 4060000000
[PB Info 2022-Oct-03 07:48:55] 40348512 4650000000
[PB Info 2022-Oct-03 07:49:01] 45392076 5230000000
[PB Info 2022-Oct-03 07:49:08] 50435640 5800000000
[PB Info 2022-Oct-03 07:49:17]
GPU-BWA Mem time: 83.020892 seconds
[PB Info 2022-Oct-03 07:49:17] GPU-BWA Mem is finished.


[main] CMD: /usr/local/parabricks/binaries//bin/bwa mem -Z ./pbOpts.txt /reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna /fastq/sample_1.fq.gz /fastq/sample_2.fq.gz 1
[main] Real time: 90.943 sec; CPU: 2110.174 sec
[PB Info 2022-Oct-03 07:49:17] ------------------------------------------------------------------------------
[PB Info 2022-Oct-03 07:49:17] ||        Program:                      GPU-BWA mem, Sorting Phase-I        ||
[PB Info 2022-Oct-03 07:49:17] ||        Version:                                           4.0.0-1        ||
[PB Info 2022-Oct-03 07:49:17] ||        Start Time:                       Mon Oct  3 07:47:46 2022        ||
[PB Info 2022-Oct-03 07:49:17] ||        End Time:                         Mon Oct  3 07:49:17 2022        ||
[PB Info 2022-Oct-03 07:49:17] ||        Total Time:                            1 minute 31 seconds        ||
[PB Info 2022-Oct-03 07:49:17] ------------------------------------------------------------------------------
[PB Info 2022-Oct-03 07:49:18] ------------------------------------------------------------------------------
[PB Info 2022-Oct-03 07:49:18] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2022-Oct-03 07:49:18] ||                              Version 4.0.0-1                             ||
[PB Info 2022-Oct-03 07:49:18] ||                             Sorting Phase-II                             ||
[PB Info 2022-Oct-03 07:49:18] ------------------------------------------------------------------------------
[PB Info 2022-Oct-03 07:49:18] progressMeter - Percentage
[PB Info 2022-Oct-03 07:49:18] 0.0       0.00 GB
[PB Info 2022-Oct-03 07:49:28] Sorting and Marking: 10.001 seconds
[PB Info 2022-Oct-03 07:49:28] ------------------------------------------------------------------------------
[PB Info 2022-Oct-03 07:49:28] ||        Program:                                  Sorting Phase-II        ||
[PB Info 2022-Oct-03 07:49:28] ||        Version:                                           4.0.0-1        ||
[PB Info 2022-Oct-03 07:49:28] ||        Start Time:                       Mon Oct  3 07:49:18 2022        ||
[PB Info 2022-Oct-03 07:49:28] ||        End Time:                         Mon Oct  3 07:49:28 2022        ||
[PB Info 2022-Oct-03 07:49:28] ||        Total Time:                                     10 seconds        ||
[PB Info 2022-Oct-03 07:49:28] ------------------------------------------------------------------------------
[PB Info 2022-Oct-03 07:49:28] ------------------------------------------------------------------------------
[PB Info 2022-Oct-03 07:49:28] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2022-Oct-03 07:49:28] ||                              Version 4.0.0-1                             ||
[PB Info 2022-Oct-03 07:49:28] ||                         Marking Duplicates, BQSR                         ||
[PB Info 2022-Oct-03 07:49:28] ------------------------------------------------------------------------------
[PB Info 2022-Oct-03 07:49:29] progressMeter -  Percentage
[PB Info 2022-Oct-03 07:49:39] 0.0       19.53 GB
[PB Info 2022-Oct-03 07:49:49] 0.0       19.53 GB
[PB Info 2022-Oct-03 07:49:59] 1.4       19.06 GB
[PB Info 2022-Oct-03 07:50:09] 33.4      13.02 GB
[PB Info 2022-Oct-03 07:50:19] 75.7      4.82 GB
[PB Info 2022-Oct-03 07:50:29] 100.0     0.00 GB
[PB Info 2022-Oct-03 07:50:29] BQSR and writing final BAM:  60.059 seconds
[PB Info 2022-Oct-03 07:50:29] ------------------------------------------------------------------------------
[PB Info 2022-Oct-03 07:50:29] ||        Program:                          Marking Duplicates, BQSR        ||
[PB Info 2022-Oct-03 07:50:29] ||        Version:                                           4.0.0-1        ||
[PB Info 2022-Oct-03 07:50:29] ||        Start Time:                       Mon Oct  3 07:49:28 2022        ||
[PB Info 2022-Oct-03 07:50:29] ||        End Time:                         Mon Oct  3 07:50:29 2022        ||
[PB Info 2022-Oct-03 07:50:29] ||        Total Time:                              1 minute 1 second        ||
[PB Info 2022-Oct-03 07:50:29] ------------------------------------------------------------------------------
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation
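
The per-phase timings are spread across the summary boxes in the log. The following is a small sketch that pulls them out, assuming the log was saved to a file (for example by appending `2>&1 | tee fq2bam.log` to the docker command; the filename is only an illustration). `phase_times` is a hypothetical helper, and it relies on the "Program:" and "Total Time:" line layout shown above.

```shell
# phase_times LOGFILE
# Print "<program>: <total time>" for each summary box in a Parabricks log.
phase_times() {
    awk '
        # Remember the phase name from the "Program:" line of a summary box.
        /[|][|] +Program:/ {
            sub(/.*Program: +/, ""); sub(/ +[|][|].*/, ""); prog = $0
        }
        # Print it together with the duration from the "Total Time:" line.
        /[|][|] +Total Time:/ {
            sub(/.*Total Time: +/, ""); sub(/ +[|][|].*/, ""); print prog ": " $0
        }
    ' "$1"
}
```

For the run above this would list three phases: GPU-BWA mem with Sorting Phase-I, Sorting Phase-II, and Marking Duplicates with BQSR.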

The GPU-accelerated BWA run produced the BAM file, its index file, and the other outputs listed below.

$ ls -a ${PWD}/Parabricks4.0
.  ..  sample.BQSR-report.txt  sample.bam  sample.bam.bai  sample_chrs.txt

That's all for this part :smile:
