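TL;DR
- Measured the execution speed of numba's parallelization options (prompted by reading "Numba で並列処理ができることを知ったので - Qiita")
- Variants compared
  - no numba
  - @jit
  - @jit(nopython=True)
  - @jit(nopython=True, parallel=True)
  - @jit(nopython=True, parallel=True) + numba.prange
  - @njit
  - @njit(parallel=True)
  - @njit(parallel=True) + numba.prange
- For now, @njit(parallel=True) looks like a reasonable default
- When the loop count is large, combining it with numba.prange looks worthwhile
- The effect will vary with the workload, so don't forget to measure and **check the official documentation**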
Test environment
- MacBook Pro (2016)
- Jupyter Notebook
$ system_profiler SPHardwareDataType
Hardware:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro13,3
Processor Name: Intel Core i7
Processor Speed: 2.7 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 8 MB
Memory: 16 GB
Boot ROM Version: 250.0.0.0.0
SMC Version (system): 2.38f7
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.14
BuildVersion: 18A391
$ pyenv --version
pyenv 1.1.5
$ python --version
Python 3.6.2 :: Anaconda, Inc.
$ pip show jupyter numba
Name: jupyter
Version: 1.0.0
Name: numba
Version: 0.40.1
Benchmark
Uses the same code as "Numba で並列処理ができることを知ったので - Qiita".
Benchmark code
Function definitions
import numba
from numba import jit, njit, prange
import random

# Baseline: plain Python, no numba
def calc_pi(NUM):
    counter = 0
    for i in range(NUM):
        x = random.random()
        y = random.random()
        if x*x+y*y < 1.0:
            counter += 1
    pi = 4.0*counter/NUM
    return pi

# @jit with default options
@jit
def calc_pi_jit(NUM):
    counter = 0
    for i in range(NUM):
        x = random.random()
        y = random.random()
        if x*x+y*y < 1.0:
            counter += 1
    pi = 4.0*counter/NUM
    return pi

# @jit in nopython mode
@jit(nopython=True)
def calc_pi_jit_nopython(NUM):
    counter = 0
    for i in range(NUM):
        x = random.random()
        y = random.random()
        if x*x+y*y < 1.0:
            counter += 1
    pi = 4.0*counter/NUM
    return pi

# parallel=True, but the loop still uses a plain range
@jit(nopython=True, parallel=True)
def calc_pi_jit_parallel(NUM):
    counter = 0
    for i in range(NUM):
        x = random.random()
        y = random.random()
        if x*x+y*y < 1.0:
            counter += 1
    pi = 4.0*counter/NUM
    return pi

# parallel=True with prange as the loop
@jit(nopython=True, parallel=True)
def calc_pi_jit_prange(NUM):
    counter = 0
    for i in prange(NUM):
        x = random.random()
        y = random.random()
        if x*x+y*y < 1.0:
            counter += 1
    pi = 4.0*counter/NUM
    return pi

# @njit is shorthand for @jit(nopython=True)
@njit
def calc_pi_njit(NUM):
    counter = 0
    for i in range(NUM):
        x = random.random()
        y = random.random()
        if x*x+y*y < 1.0:
            counter += 1
    pi = 4.0*counter/NUM
    return pi

# @njit with parallel=True, plain range loop
@njit(parallel=True)
def calc_pi_njit_parallel(NUM):
    counter = 0
    for i in range(NUM):
        x = random.random()
        y = random.random()
        if x*x+y*y < 1.0:
            counter += 1
    pi = 4.0*counter/NUM
    return pi

# @njit with parallel=True and prange as the loop
@njit(parallel=True)
def calc_pi_njit_prange(NUM):
    counter = 0
    for i in prange(NUM):
        x = random.random()
        y = random.random()
        if x*x+y*y < 1.0:
            counter += 1
    pi = 4.0*counter/NUM
    return pi
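All eight functions implement the same Monte Carlo estimate: a point drawn uniformly from the unit square lands inside the quarter circle with probability π/4, so 4*counter/NUM approximates π. As a rough sketch of how accurate that estimate can be for a given NUM, the pure-Python snippet below compares the actual error with the binomial standard error sqrt(π(4-π)/NUM). The names estimate_pi and standard_error and the seed value are assumptions made for illustration; they are not part of the benchmark code.

```python
import math
import random


def estimate_pi(num, seed=None):
    """Monte Carlo estimate of pi, same logic as calc_pi above."""
    rng = random.Random(seed)  # local generator, so seeding leaves global state alone
    counter = 0
    for _ in range(num):
        x = rng.random()
        y = rng.random()
        if x * x + y * y < 1.0:
            counter += 1
    return 4.0 * counter / num


def standard_error(num):
    """Standard deviation of the estimator: sqrt(pi * (4 - pi) / num)."""
    return math.sqrt(math.pi * (4.0 - math.pi) / num)


if __name__ == "__main__":
    for num in (1_000, 100_000):
        est = estimate_pi(num, seed=0)  # seed chosen arbitrarily for repeatability
        print(f"num={num}: estimate={est:.5f}, "
              f"error={abs(est - math.pi):.5f}, stderr={standard_error(num):.5f}")
```

The standard error shrinks only with the square root of NUM, which is one reason a benchmark like this one ends up using very large loop counts.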
Measurement
The version without numba takes too long to measure, so it is only run up to num = 1,000,000.
# %timeit is an IPython magic, so run this in Jupyter Notebook / IPython
for i in range(4):
    num = pow(1000, i)
    print(f'{"="*10} num={num} {"="*10}')
    if i < 3:
        # skip the pure-Python version for the largest num
        print("no numba")
        %timeit calc_pi(num)
    print("jit only")
    %timeit calc_pi_jit(num)
    print("jit nopython")
    %timeit calc_pi_jit_nopython(num)
    print("jit parallel")
    %timeit calc_pi_jit_parallel(num)
    print("jit prange")
    %timeit calc_pi_jit_prange(num)
    print("njit only")
    %timeit calc_pi_njit(num)
    print("njit parallel")
    %timeit calc_pi_njit_parallel(num)
    print("njit prange")
    %timeit calc_pi_njit_prange(num)
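Because %timeit only exists inside Jupyter or IPython, the measurement loop cannot be run as a plain Python script. A rough equivalent can be sketched with the standard-library timeit module. The helper name bench, its parameters, and the warm-up call (meant to keep a JIT-compiled function's first-call compilation out of the timing) are assumptions for illustration. The calc_pi below is the pure-Python version used as a stand-in; any of the numba variants could be passed instead.

```python
import random
import statistics
import timeit


def calc_pi(num):
    """Pure-Python Monte Carlo estimate (stand-in for any variant above)."""
    counter = 0
    for _ in range(num):
        x = random.random()
        y = random.random()
        if x * x + y * y < 1.0:
            counter += 1
    return 4.0 * counter / num


def bench(func, num, repeat=7, number=10):
    """Return (mean, stdev) of seconds per call over `repeat` runs."""
    func(num)  # warm-up call; for a numba function this triggers compilation
    timer = timeit.Timer(lambda: func(num))
    per_call = [t / number for t in timer.repeat(repeat=repeat, number=number)]
    return statistics.mean(per_call), statistics.stdev(per_call)


if __name__ == "__main__":
    for i in range(3):
        num = pow(1000, i)
        # fewer calls per run for the slow case, many for the fast ones
        number = 1 if num > 1000 else 100
        mean, stdev = bench(calc_pi, num, repeat=3, number=number)
        print(f"num={num}: {mean:.3e} s ± {stdev:.3e} s per call")
```

Unlike %timeit, this sketch does not pick the number of calls per run automatically, so number has to be chosen by hand for each problem size.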
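Results
Putting this into a table is tedious, so the Jupyter Notebook output is pasted as-is.
- parallel=True on its own made no measurable difference here and doesn't seem to have any particular downside, so adding it by default looks fine
- numba.prange is counterproductive (extra overhead) when the loop count (degree of parallelism) is small
  - Large loop count (10^6 and 10^9): numba.prange helps a lot, roughly 4.3 to 5.3 ms versus 16 to 17 ms, and 4.3 to 4.6 s versus about 16 s
  - Small loop count (1 and 1,000): numba.prange hurts, roughly 44 to 56 µs versus about 0.2 µs and 17 µs respectively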
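Since prange helps for large loop counts and hurts for small ones, one option is to choose the implementation per call. The sketch below is a hypothetical helper, not something numba provides: pick_by_size and the threshold value are assumptions, and the measurements in this article only show that the crossover lies somewhere between 1,000 and 1,000,000 iterations on this machine. The two stub functions exist only so the snippet runs without numba.

```python
def pick_by_size(serial_func, parallel_func, threshold=100_000):
    """Return a function that dispatches on the loop count.

    threshold is a guess; measure on your own machine to tune it.
    """
    def dispatch(num):
        if num >= threshold:
            return parallel_func(num)
        return serial_func(num)
    return dispatch


# Stand-ins that only report which path was taken.
def serial_stub(num):
    return ("serial", num)


def parallel_stub(num):
    return ("parallel", num)


if __name__ == "__main__":
    calc = pick_by_size(serial_stub, parallel_stub)
    print(calc(1_000))      # ('serial', 1000)
    print(calc(1_000_000))  # ('parallel', 1000000)
```

With the functions from this article, the call would look like pick_by_size(calc_pi_njit, calc_pi_njit_prange).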
========== num=1 ==========
no numba
729 ns ± 44.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
jit only
246 ns ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
jit nopython
211 ns ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
jit parallel
203 ns ± 12.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
jit prange
43.7 µs ± 2.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
njit only # fastest
196 ns ± 4.94 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
njit parallel
209 ns ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
njit prange
55.8 µs ± 9.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
========== num=1000 ==========
no numba
344 µs ± 56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
jit only
17.7 µs ± 562 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
jit nopython # fastest
16.6 µs ± 875 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
jit parallel
16.6 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
jit prange
55.4 µs ± 3.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
njit only
17.3 µs ± 983 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
njit parallel
16.7 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
njit prange
52.3 µs ± 4.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
========== num=1000000 ==========
no numba
300 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jit only
16.8 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
jit nopython
17.1 ms ± 953 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
jit parallel
17 ms ± 995 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
jit prange
5.33 ms ± 504 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
njit only
16.4 ms ± 803 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
njit parallel
16.2 ms ± 519 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
njit prange # fastest
4.96 ms ± 565 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
========== num=1000000000 ==========
jit only
16.5 s ± 481 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jit nopython
16.2 s ± 385 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jit parallel
16.4 s ± 876 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jit prange
4.58 s ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
njit only
16.8 s ± 995 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
njit parallel
16.4 s ± 702 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
njit prange # fastest
4.29 s ± 51.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
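To make the comparison easier to read, the njit rows above can be turned into speedup ratios. The sketch below hard-codes the mean timings from the pasted output (converted to seconds) and divides the plain-range version ("njit parallel") by the prange version ("njit prange"). The dictionary layout and the names are just for illustration.

```python
# Mean timings copied from the output above, in seconds:
# num -> (njit parallel with range, njit parallel with prange)
TIMINGS = {
    1: (209e-9, 55.8e-6),
    1_000: (16.7e-6, 52.3e-6),
    1_000_000: (16.2e-3, 4.96e-3),
    1_000_000_000: (16.4, 4.29),
}


def speedup(range_time, prange_time):
    """Ratio > 1 means prange was faster; < 1 means it was slower."""
    return range_time / prange_time


if __name__ == "__main__":
    for num, (t_range, t_prange) in TIMINGS.items():
        ratio = speedup(t_range, t_prange)
        print(f"num={num:>13,}: prange speedup = {ratio:.3f}x")
```

On this 4-core machine the ratio comes out at about 3.3x for 10^6 iterations and 3.8x for 10^9, while for 1 and 1,000 iterations it is well below 1, meaning prange was slower.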
Closing
The effect will vary with the workload, so don't forget to measure and **check the official documentation**.