Pythonスクリプトのチューニングをしないといけない機運が高まってきたので、オライリーから出ている「High Performance Python」を読んだので、そのコードで気になったものを実行してみた。

この本のコードはこちら。
https://github.com/mynameisfiber/high_performance_python

また、今回の実験用に作成したコードはここに置いた。
https://github.com/shu65/diffusion_python

今回の実行環境

OS: Ubuntu 16.04 LTS (Windows10マシン上のVirtual Boxで起動)
ホストCPU: Intel Core i7-7660 CPU @ 2.50GHz
VMのメモリ: 10 GB
Python: 3.6.2

チューニングをするのにVMを使うのは微妙だが、ぱっと使えるのがなかったので妥協。今回は本に書いてあるものを動かして、その効果を実感できれば良いというスタンスなので。

6章行列とベクトルの計算のコード実行

HPCでお馴染みな拡散方程式をお題にした最適なの章で、チューニングのお題としてなじみ深かったので、手元でも確認。

まず、6.2にあるdiffusion_python.pyをcProfileを使ってプロファイルを取った。
本のコードはpython2.7用なので、これをpython 3.6で動くようにして、以下のようにコマンドを叩いてprofileを取得した。

python -m cProfile -s cumulative diffusion_python.py

このときの実行結果は以下の通り

         1007 function calls in 194.996 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  194.996  194.996 {built-in method builtins.exec}
        1    0.001    0.001  194.996  194.996 diffusion_python.py:3(<module>)
        1    1.968    1.968  194.995  194.995 diffusion_python.py:22(run_experiment)
      500  192.495    0.385  193.027    0.386 diffusion_python.py:8(evolve)
      500    0.532    0.001    0.532    0.001 diffusion_python.py:10(<listcomp>)
        1    0.001    0.001    0.001    0.001 diffusion_python.py:25(<listcomp>)
        2    0.000    0.000    0.000    0.000 {built-in method time.time}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

このプロファイル結果とコードを見ると、evolve()が遅いことがわかる。
次に、line_profilerを使ってevolve()のどこの行が遅いのか調べる。
evolve()に@profileというデコレータを付けて、イテレーション数を10に抑えて、以下のようにコマンドを叩いて行ごとのプロファイル結果を取得。

kernprof -l -v diffusion_python_with_line_profiler.py

この時の結果は以下の通り。

Wrote profile results to diffusion_python_with_line_profiler.py.lprof
Timer unit: 1e-06 s

Total time: 12.2728 s
File: diffusion_python_with_line_profiler.py
Function: evolve at line 8

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def evolve(grid, dt, D=1.0):
    10        10           22      2.2      0.0      xmax, ymax = grid_shape
    11        10        10976   1097.6      0.1      new_grid = [[0.0, ] * ymax for _ in range(xmax)]
    12      5130         3533      0.7      0.0      for i in range(xmax):
    13   2626560      1947147      0.7     15.9          for j in range(ymax):
    14                                                       grid_xx = grid[(i + 1) %
    15                                                                      xmax][j] + grid[(i - 1) %
    16   2621440      3823102      1.5     31.2                                             xmax][j] - 2.0 * grid[i][j]
    17                                                       grid_yy = grid[i][(j + 1) % ymax] + \
    18   2621440      3678664      1.4     30.0                  grid[i][(j - 1) % ymax] - 2.0 * grid[i][j]
    19   2621440      2809329      1.1     22.9              new_grid[i][j] = grid[i][j] + D * (grid_xx + grid_yy) * dt
    20        10            7      0.7      0.0      return grid

ちなみに、イテレーション数が500のままだと時間がかかるので注意。
この結果から、13から19行目が遅いことがわかる。
本では11行目も遅いことになっているが、今回は遅くなっていない。Pythonのバージョン違いが原因かな？

ただ、本にならって、まず11行目の部分の最適化のために、new_gridを使いまわすように変更する。コマンドは以下の通り。

$ kernprof -l -v diffusion_python_memory_with_line_profiler.py

この時の結果は以下の通り。

Wrote profile results to diffusion_python_memory_with_line_profiler.py.lprof
Timer unit: 1e-06 s

Total time: 9.96694 s
File: diffusion_python_memory_with_line_profiler.py
Function: evolve at line 8

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def evolve(grid, dt, out, D=1.0):
    10        10           11      1.1      0.0      xmax, ymax = grid_shape
    11      5130        23257      4.5      0.2      for i in range(xmax):
    12   2626560      1490204      0.6     15.0          for j in range(ymax):
    13                                                       grid_xx = grid[(i + 1) %
    14                                                                      xmax][j] + grid[(i - 1) %
    15   2621440      2974338      1.1     29.8                                             xmax][j] - 2.0 * grid[i][j]
    16                                                       grid_yy = grid[i][(j + 1) % ymax] + \
    17   2621440      3215771      1.2     32.3                  grid[i][(j - 1) % ymax] - 2.0 * grid[i][j]
    18   2621440      2263361      0.9     22.7              out[i][j] = grid[i][j] + D * (grid_xx + grid_yy) * dt

本のときよりも時間がかかっていない部分の改良なので、それほど大きく変化はしていない。

次に、メモリの断片化を抑える工夫を行う。本来は本にならってperfを使ってキャッシュミスがどれくらい起きているのかを確認してからやりたいが、VirtualBox上のUbuntuだとキャッシュミスを取得できそうになかったので、今回はあきらめた。

とりあえず、メモリ断片化を抑えるためにnumpyを使ったコードを作りテスト。
まずは500イテレーションでcProfileで測定した結果

         63888 function calls (61836 primitive calls) in 2.333 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    125/1    0.005    0.000    2.333    2.333 {built-in method builtins.exec}
        1    0.000    0.000    2.333    2.333 diffusion_python_numpy.py:3(<module>)
        1    0.038    0.038    2.215    2.215 diffusion_python_numpy.py:18(run_experiment)
      500    0.382    0.001    2.177    0.004 diffusion_python_numpy.py:14(evolve)
      500    1.071    0.002    1.795    0.004 diffusion_python_numpy.py:9(laplacian)
     2000    0.708    0.000    0.724    0.000 numeric.py:1399(roll)
        5    0.001    0.000    0.176    0.035 __init__.py:1(<module>)
    149/1    0.001    0.000    0.117    0.117 <frozen importlib._bootstrap>:958(_find_and_load)
    149/1    0.001    0.000    0.117    0.117 <frozen importlib._bootstrap>:931(_find_and_load_unlocked)
    138/1    0.001    0.000    0.117    0.117 <frozen importlib._bootstrap>:641(_load_unlocked)
    ...

結果が長いので、上位10個の関数を表示。最初の段階で194秒かかっていたのが、すでに2秒。これだけで結構爆速。numpyすごい。

ここらからさらに、インプレイスを駆使してメモリを使いまわすようにします。結果は以下の通り。


         64389 function calls (62337 primitive calls) in 1.346 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    125/1    0.003    0.000    1.346    1.346 {built-in method builtins.exec}
        1    0.000    0.000    1.346    1.346 diffusion_python_numpy_memory.py:3(<module>)
        1    0.002    0.002    1.239    1.239 diffusion_python_numpy_memory.py:24(run_experiment)
      500    0.233    0.000    1.236    0.002 diffusion_python_numpy_memory.py:18(evolve)
      500    0.588    0.001    1.004    0.002 diffusion_python_numpy_memory.py:9(laplacian)
     2000    0.360    0.000    0.369    0.000 numeric.py:1399(roll)
        5    0.000    0.000    0.161    0.032 __init__.py:1(<module>)
    149/1    0.001    0.000    0.107    0.107 <frozen importlib._bootstrap>:958(_find_and_load)
    149/1    0.001    0.000    0.107    0.107 <frozen importlib._bootstrap>:931(_find_and_load_unlocked)
    138/1    0.001    0.000    0.107    0.107 <frozen importlib._bootstrap>:641(_load_unlocked)
...

全体の計算時間だと1秒くらい削れていることがわがる。各関数でみるとrollにかかっていた時間が半分くらいになっている。
（メモリ確保にかかっている時間は、別の関数ではなくその関数の時間としてカウントされるのか。numpy使うときは気にしないといけないかも）

次、numpy.roll()が内部でメモリ確保をしているので、自作のroll_add()を定義して、インプレイスで処理できるようにする。
この結果は以下の通り。

1.155379295349121
         52456 function calls (50404 primitive calls) in 1.307 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    125/1    0.003    0.000    1.307    1.307 {built-in method builtins.exec}
        1    0.000    0.000    1.307    1.307 diffusion_python_numpy_memory2.py:3(<module>)
        1    0.002    0.002    1.156    1.156 diffusion_python_numpy_memory2.py:49(run_experiment)
      500    0.236    0.000    1.154    0.002 diffusion_python_numpy_memory2.py:43(evolve)
      500    0.108    0.000    0.918    0.002 diffusion_python_numpy_memory2.py:34(laplacian)
     2004    0.765    0.000    0.765    0.000 diffusion_python_numpy_memory2.py:9(roll_add)
        5    0.000    0.000    0.253    0.051 __init__.py:1(<module>)
    149/1    0.001    0.000    0.151    0.151 <frozen importlib._bootstrap>:958(_find_and_load)
    149/1    0.001    0.000    0.151    0.151 <frozen importlib._bootstrap>:931(_find_and_load_unlocked)
    138/1    0.001    0.000    0.151    0.151 <frozen importlib._bootstrap>:641(_load_unlocked)
   ...

本の通りだと70％高速化するらしいが、そうなっていない。line_profileの結果をほんと見比べると、本のときはnumpy.roll()のaxis=1の時の速度がaxis=0のときよりも遅いのに比べ、手元の環境だとaxis=0と1の速度が本のときほど差がない。恐らくだが、numpy、もしくはpythonのバージョン違いによる影響なのかもしれない。

とりあえず、これ以降のものはさらに突っ込んだ最適化をやっているが、コードが汚くなるのでやめておく。

まとめ

pythonスクリプトのチューニングを頑張ってやった経験がほぼなかったので、いろいろ挙動をみるとのは面白かった。また、本と挙動が違う部分も多々あったので自分の環境でどうなのか確認することの重要性を再確認することができた。

Pythonスクリプトのチューニング練習メモ

今回の実行環境

6章 行列とベクトルの計算のコード実行

まとめ

6章行列とベクトルの計算のコード実行