More than 5 years have passed since last update.

PyTorchでのprofilerの使い方

Last updated at 2019-12-05Posted at 2019-08-28

はじめに

ここでは、PyTorchで提供されているプロファイラの取り方について説明する。このため、CUDA関数のプロファイルの取り方等は、別記事を参考してほしい。

PyTorchコードに対してプロファイルを外部からとる

PyTorchスクリプトに対して、そのままプロファイルを取ることが出来る。Google Colaboratoryでも%%writefileで書き込めば、そのファイルに対して実行出来る。

python -m torch.utils.bottleneck /path/to/source/script.py [args]

以下の2種類のプロファイルを取得できる。autograd.profilerは、後で出てくるのでここでは述べない。

cProfiler
torch.autograd.profiler.profiler

cProfilerは、Pythonから呼び出した関数毎に記録する仕組みである。なお、同じ名前の関数は、呼び出し回数と処理時間の加算だけで、まとめる。このため、呼出しごとにイベントを作成するautograd profilerよりメモリ消費量が小さい。
余談だが、cProfileそのものは、以下で取ることもできる。ソートオプションは、なくても良い。

python -m cProfile  -s tottime /path/to/source/script.py [args]

　参考資料

TORCH.UTILS.BOTTLENECK
ソースコード
- torch.utils.bottleneck　
cProfileは、PythonインタープリターのCのコードで取っている。このため、Python関数の処理実行時間はわかるものの、逆伝搬関数は、backwardメソッドの延長でまとめて呼ばれるので個別の時間はわからない。なお、コードの入り口は、以下を参照。(ソースコードメモ)Pythonのモジュール実行
- Python プロファイラ
  - Python関数を定義しているソースコード

PyTorchコード内でのプロファイルの取り方

torch.autograd.profiler.profileで順伝搬および逆伝搬のプロファイルを取ることが出来る。このため、cProfileに比べて、逆伝搬のプロファイルをより詳細に取得することが可能になる。

なお、以下のself_cpu_time_totalは、1.1.0から加えられたオプションである。それ以外の指定では、cpu_time_total等を使うのが良い。
また、1.2.0からは、テンソルシェープも取れるようになった。

import torch
x = torch.randn((1, 1), requires_grad=True)
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for _ in range(100):  # any normal python code, really!
      y = x ** 2 
      y.backward()
print(prof)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

なお、一つ一つのプロファイルを、表示する場合profオブジェクトをそのまま出力する。
そして、関数毎にまとめたい場合、prof.key_averages().table()を用いて出力する。
profオブジェクトの出力例は以下である。(関数毎表示であり長いので頭の部分だけ表示)

-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                 Self CPU total %   Self CPU total      CPU total %        CPU total     CPU time avg     CUDA total %       CUDA total    CUDA time avg  Number of Calls
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
pow                                           23.43%         10.363ms           23.43%         10.363ms         10.363ms           23.49%         10.359ms         10.359ms                1
torch::autograd::GraphRoot                     0.03%         12.590us            0.03%         12.590us         12.590us            0.01%          4.096us          4.096us                1
PowBackward0                                  29.11%         12.876ms           29.11%         12.876ms         12.876ms           29.20%         12.878ms         12.878ms                1
pow                                           14.84%          6.566ms           14.84%          6.566ms          6.566ms           14.90%          6.571ms          6.571ms                1

prof.key_averages().table(sort_by="self_cpu_time_total")の出力例は以下である。なお、self_cpu_time_totalは、1.1.0から加えられたオプションであることに注意。(このため、1.1.0以降でないと動かない。)

-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                 Self CPU total %   Self CPU total      CPU total %        CPU total     CPU time avg     CUDA total %       CUDA total    CUDA time avg  Number of Calls
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
torch::autograd::GraphRoot                    30.61%          2.595ms           30.61%          2.595ms         25.949us           22.90%          2.478ms         24.781us              100
pow                                           28.56%          2.422ms           28.56%          2.422ms         12.108us           20.32%          2.199ms         10.993us              200
mul                                           20.93%          1.775ms           20.93%          1.775ms          8.873us           15.90%          1.720ms          8.601us              200
PowBackward0                                  14.19%          1.203ms           45.82%          3.885ms         38.846us           36.37%          3.936ms         39.356us              100
torch::autograd::AccumulateGrad                5.71%        484.264us            5.71%        484.264us          4.843us            4.52%        488.902us          4.889us              100
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 8.479ms
CUDA time total: 10.821ms

仕組み

仕組みは、(Pythonで)torch.autograd.profilerから、(C++のコードである)torch::autograd::profiler名前空間にあるstate変数にプロファイルモードを指定(enableProfiler/disableProfiler)し、取得することにより実現している。profiler classの__enter__および__exit__メソッドでわかる通り、稼働中はイベントをメモリに保持し、終了時にPython側のprofileクラスのfunction_events属性に出力する。
なお、イベントの構造体は、(C++の)RangeEventListであり、recordメソッドで記録している。

参考資料

torch.autograd.profiler.profile
- [key_averageメソッド] (https://pytorch.org/docs/stable/autograd.html#torch.autograd.profiler.profile.key_averages)
- tableメソッド
ソースコード
- Python側
  - class profile(object)
    - contextlib前提なので、__init__, __enter__, __exit__の3つのメソッドがポイント
- C++側
  - torch.autograd.profilerのC++へのマッピング
  - プロファイル状態設定パラメータ
  - struct RangeEventList(イベントを保持する構造体)
  - getTime()
    - CPU側は、CLOCK_MONOTONIC等で計測している。CUDA側は、cudaEventベース
関連ソフトウェア資料
- (CUDA)5.5. Event Management
  - CUDA側は、cudaEventSynchronize関数等で計測している。
- (Python)time — Time access and conversions

ほかのフレームワークでの測定方法

TensorFlowの場合

以下のように、デバイス(GPU/CPU)を指定して作成する。

%%writefile topk.py
import tensorflow as tf
from tensorflow.python.client import timeline

tf.set_random_seed(0)
with tf.device("/GPU:0"):
    x = tf.random_normal([32*8732])
    y, idxes = tf.nn.top_k(x, k=8732*32)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
    
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(y,
                         options=run_options,
                         run_metadata=run_metadata) 

step_stats = run_metadata.step_stats
tl = timeline.Timeline(step_stats)

# Tensorの利用メモリもshow_memory=Trueでとれるが割愛
ctf = tl.generate_chrome_trace_format(show_memory=False,
                                      show_dataflow=True)

with open("timeline.json", "w") as f:
  f.write(ctf)

nvprofと一緒にプロファイルをとる。nvprofは、libcudartレベルの測定である。一方timelineは、TensorFlowのOpレベルの測定である。このためフレームワークのオーバヘッドが余分にかかる。

!nvprof python3 topk.py

プロファイルデータを出力する。なお、まとめてみたい場合、grepを使うと良いかもしれない。

!cat timeline.json | grep -e name -e dur

ここで、durが、マイクロ秒単位の時間である。

        {
            "ph": "X",
            "cat": "Op",
            "name": "TopKV2",
            "pid": 1,
            "tid": 0,
            "ts": 1560516999982492,
            "dur": 1315,
            "args": {
                "name": "TopKV2",
                "op": "TopKV2",
                "input0": "random_normal/RandomStandardNormal",
                "input1": "TopKV2/k"
            }
        },

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up