姫野ベンチマークをPythonで実行し、[numpy, numba]で高速化。
姫野ベンチは、3次元ポアソン方程式をヤコビ法で解く反復処理の速度を図るベンチマークです。
本家のホームページは→こちら Fortran90とCの公式版があります。
20数年前からある古いベンチマークですが、流体解析に必要なメモリ律速のベンチマークに向いています。
Pythonを始めるにあたり、どの程度の速度がでるか興味があって始めました。
ベンチマーク環境
- CPU: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz(Boost時3.8GHz) (4core)
- RAM: 32GB
- OS: CentOS Linux release 7.5.1804 (64bit)
- Fortran compiler: gfortran 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)
- C compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)
- Python: Python 3.6.3 (default, May 13 2018, 22:01:23) (Anaconda 5.1.0)
かなり古いCPUです。1coreでの計測です。
オリジナル版の計測結果
使用したソースとcompile option
- gfortran -Ofast himenoBMTxp.f90 コードのダウンロード(Fortran90)
- gcc -Ofast himenoBMTxpa.c コードのダウンロードの(C, dynamic allocate version)
ベンチマーク結果
単精度 | XS | S | M | L | XL |
---|---|---|---|---|---|
Fortran[Mflops] | 4011 | 3845 | 2591 | 3040 | 2881 |
C [Mflops] | 1808 | 1790 | 1250 | 791 | 407 |
配列サイズ(IxJxK) | 65x33x33 | 129x65x65 | 257x129x129 | 513x257x257 | 1025x513x513 |
メモリ使用量 | 5MB | 31MB | 235MB | 1,854MB | 14,752MB |
メモリ使用量は、Fortran版で/usr/timeの出力。C版は10%弱使用量が少ない。 | |||||
Fortranは安定して速い。Cは配列サイズが大きいと速度低下が大きい。 |
Python版
単精度 | XS | S | M | L | XL |
---|---|---|---|---|---|
pure_python | 1.9 | 2.1 | 2.1 | 2.1 | 2.1 |
numba order='C' | 1281 | 611 | 483 | 420 | 368 |
numba order='F' | 2216 | 1955 | 1558 | 1161 | 870 |
numpy order='C' | 1043 | 954 | 941 | 929 | 715 |
numpy order='F' | 1186 | 1180 | 1099 | 1078 | 821 |
ソースは、後で示します。試したのは、
- pure_python: Fortran版をなるべく変えないようPythonに移植したもの。配列はnumpy.ndarrayを使用
- numba : 上記のpure_python版に@numba.jit(nopython=True)を追加し、エラーが出たところを必要最小限で修正したもの。(nopython=True)は、numbaが失敗した際にpythonにデグレードせずエラーを出させるため。
- numpy : pure_pnuython版のループ部をnumpyの添え字をずらしたviewを与えることで、一行にまとめたもの
※:numba版とnumpy版はndarray作成時のorderを'C'と'F'の両方実施
実行方法
python himenoBMT_pure.py [XS|S|M|L|XL]サイズを指定してください
憶えたてのf-stringsを使っているので、 python3.6以上です。
結果(単精度)
- pure_pythonは、目も当てられない。
- numbaはCとほぼ互角の速度。ソースコードの変更も少なく、forループをそのまま書けば良いので、見通しが良く、コードのメンテナンスも容易。Fortranに対しても半分程度の速度が出るので、予想以上の結果。ifortと比較したいところ。
- numpyでも結構行ける。配列サイズが大きい場合には、numbaよりも速い場合がある。
- orderは'C'よりも'F'の方が速い。これは元のコードのループ順に依存すると思われる。
結論
PythonでもCに匹敵する速度で実行できるコードが書けそう
numbaはすごい、でも@numba.jit(nopython=True)で動かないと戸惑う
以下、ソースコード
まずは、pure_python版、なるべくオリジナルのFortran版を素直に移植したつもりだが、
Classを作る必要があったのか…。Python始めて2カ月の初心者ということで大目に見て下さい。
'''
Created on 2018/06/10
@author: akio
'''
import sys
import time
import numpy as np
class pres:
pres = []
class mtrx:
a = []
b = []
c = []
class bound:
bnd = []
class work:
wrk1 = []
wrk2 = []
class others:
imax = -1
jmax = -1
kmax = -1
mimax = -1
mjmax = -1
mkmax = -1
omega = 0.8
def readparam(size):
grid_set(size)
others.imax = others.mimax-1
others.jmax = others.mjmax-1
others.kmax = others.mkmax-1
return
def grid_set(size):
if size == "xs":
mimax = 65
mjmax = 33
mkmax = 33
elif size == "s":
mimax = 129
mjmax = 65
mkmax = 65
elif size == "m":
mimax=257
mjmax=129
mkmax=129
elif size == "l":
mimax=513
mjmax=257
mkmax=257
elif size == "xl":
mimax=1025
mjmax=513
mkmax=513
else:
print("Invalid input character !! {}".format(size))
sys.exit("-1")
others.mimax = mimax
others.mjmax = mjmax
others.mkmax = mkmax
def initmt():
pres.p[:,:,:] = 0
mtrx.a[:,:,:,:] = 0
mtrx.b[:,:,:,:] = 0
mtrx.c[:,:,:] = 0
bound.bnd[:,:,:] = 0.0
work.wrk1[:,:,:] = 0.0
work.wrk2[:,:,:] = 0.0
mtrx.a[0:others.imax,0:others.jmax,0:others.kmax,0:3] = 1.0
mtrx.a[0:others.imax,0:others.jmax,0:others.kmax,3] = 1.0/6.0
mtrx.c[0:others.imax,0:others.jmax,0:others.kmax,:] = 1.0
bound.bnd[0:others.imax,0:others.jmax,0:others.kmax] = 1.0
for k in range(others.kmax):
pres.p[:,:,k] = ((k-1)*(k-1))/((others.kmax-1)*(others.kmax-1))
def initmem():
dim = (others.mimax, others.mjmax, others.mkmax)
dim4 = (others.mimax, others.mjmax, others.mkmax,4)
dim3 = (others.mimax, others.mjmax, others.mkmax,3)
pres.p = np.ndarray(dim, dtype=np.float32)
mtrx.a = np.ndarray(dim4, dtype=np.float32)
mtrx.b = np.ndarray(dim3, dtype=np.float32)
mtrx.c = np.ndarray(dim3, dtype=np.float32)
bound.bnd = np.ndarray(dim, dtype=np.float32)
work.wrk1 = np.ndarray(dim, dtype=np.float32)
work.wrk2 = np.ndarray(dim, dtype=np.float32)
def jacobi(nn):
#gosa = np.array(0.0, dtype=np.float32)
for loop in range(nn):
gosa = np.zeros((1), dtype=np.float32)
#ss = 0.0 #np.array(0.0, np.float32)
for k in range(1,others.kmax-1):
for j in range(1,others.jmax-1):
for i in range(1,others.imax-1):
s0 = mtrx.a[i,j,k,0]*pres.p[i+1,j,k] \
+mtrx.a[i,j,k,1]*pres.p[i,j+1,k] \
+mtrx.a[i,j,k,2]*pres.p[i,j,k+1] \
+mtrx.b[i,j,k,0]*(pres.p[i+1,j+1,k]-pres.p[i+1,j-1,k] \
-pres.p[i-1,j+1,k]+pres.p[i-1,j-1,k]) \
+mtrx.b[i,j,k,1]*(pres.p[i,j+1,k+1]-pres.p[i,j-1,k+1] \
-pres.p[i,j+1,k-1]+pres.p[i,j-1,k-1]) \
+mtrx.b[i,j,k,2]*(pres.p[i+1,j,k+1]-pres.p[i-1,j,k+1] \
-pres.p[i+1,j,k-1]+pres.p[i-1,j,k-1]) \
+mtrx.c[i,j,k,0]*pres.p[i-1,j,k] \
+mtrx.c[i,j,k,1]*pres.p[i,j-1,k] \
+mtrx.c[i,j,k,2]*pres.p[i,j,k-1]+work.wrk1[i,j,k]
ss=(s0*mtrx.a[i,j,k,3]-pres.p[i,j,k])*bound.bnd[i,j,k]
gosa[0] = gosa[0] + ss*ss
work.wrk2[i,j,k]=pres.p[i,j,k]+ others.omega*ss
if loop < 4: print('loop: {} gosa: {}'.format(loop,gosa[0]))
pres.p[1:others.imax-1,1:others.jmax-1,1:others.kmax-1] = \
work.wrk2[1:others.imax-1,1:others.jmax-1,1:others.kmax-1]
return float(gosa[0])
def second():
return time.time()
if __name__ == '__main__':
ttarget = 10.0 # 60
xmflops2 = 0.0
gosa = 0.0
if len(sys.argv) > 1:
size = sys.argv[1].lower()
else:
size = 's'
readparam(size)
initmem()
initmt()
print(' mimax={} mjmax={} mkmax={}'
.format(others.mimax,others.mjmax,others.mkmax))
print(' imax={} jmax={} kmax={}'
.format(others.imax,others.jmax,others.kmax))
dt= time.get_clock_info('time').resolution
print(' {:s} , {:12.5e}'.format('Time measurement accuracy : ',dt))
## Start measuring
nn=3
print(' Start rehearsal measurement process.')
print(' Measure the performance in 3 times.')
cpu0=second()
## Jacobi iteration
gosa = jacobi(nn)
cpu1=second()
cpu = (cpu1 - cpu0)
flop=(others.kmax-2)*(others.jmax-2)*(others.imax-2)*34.0
if (cpu != 0.0): xmflops2 = flop/cpu*1.0e-6*nn
print(' MFLOPS:{} time(s):{}'
.format(xmflops2,cpu))
#
## end the test loop
nn=int(ttarget/(cpu/3.0))+1
print(' Gosa : {}'.format(gosa))
print('Now, start the actual measurement process.')
print('The loop will be excuted in {} times.'.format(nn))
print('This will take about one minute.')
print('Wait for a while.')
#
## Jacobi iteration
cpu0= second()
## Jacobi iteration
gosa = jacobi(nn)
#
cpu1= second()
cpu = (cpu1 - cpu0)
if (cpu != 0.0): xmflops2 = flop*1.0e-6/cpu*nn
#
### xmflops2=nflop/cpu*1.0e-6*float(nn)
#
print('The loop will be excuted in {} times.'.format(nn))
print(' Gosa : {}'.format(gosa))
print(' MFLOPS:{} time(s):{}'
.format(xmflops2,cpu))
score=xmflops2/82.84
print(' Score based on Pentium III 600MHz : {}'.format(score))
#
# pause
# stop
numbaバージョン
次は、numba版。pure_python版に@numba.jitを付けただけだと、実行時にjacobi関数の中で、pres.pやmtrx.aは知らない!と怒られたので、全て引数で渡すように変更した。結果では、order='C'よりもorder='F'の方が速かったが、ループ順をオリジナルのkji順からijk順に変更するとorder='C'の方が速い。実は、オリジナルのkjiでorder='F'よりも、ijk順でorder='C'の方が1割ほど速かった(複雑になるので、数字は載せていません)。
'''
Created on 2018/06/10
@author: akio
'''
import sys
import time
import numpy as np
import numba
class pres:
pres = []
class mtrx:
a = []
b = []
c = []
class bound:
bnd = []
class work:
wrk1 = []
wrk2 = []
class others:
imax = -1
jmax = -1
kmax = -1
mimax = -1
mjmax = -1
mkmax = -1
omega = 0.8
def readparam(size):
grid_set(size)
others.imax = others.mimax-1
others.jmax = others.mjmax-1
others.kmax = others.mkmax-1
return
def grid_set(size):
if size == "xs":
mimax = 65
mjmax = 33
mkmax = 33
elif size == "s":
mimax = 129
mjmax = 65
mkmax = 65
elif size == "m":
mimax=257
mjmax=129
mkmax=129
elif size == "l":
mimax=513
mjmax=257
mkmax=257
elif size == "xl":
mimax=1025
mjmax=513
mkmax=513
else:
print("Invalid input character !! {}".format(size))
sys.exit("-1")
others.mimax = mimax
others.mjmax = mjmax
others.mkmax = mkmax
def initmt():
pres.p[:,:,:] = 0
mtrx.a[:,:,:,:] = 0
mtrx.b[:,:,:,:] = 0
mtrx.c[:,:,:] = 0
bound.bnd[:,:,:] = 0.0
work.wrk1[:,:,:] = 0.0
work.wrk2[:,:,:] = 0.0
mtrx.a[0:others.imax,0:others.jmax,0:others.kmax,0:3] = 1.0
mtrx.a[0:others.imax,0:others.jmax,0:others.kmax,3] = 1.0/6.0
mtrx.c[0:others.imax,0:others.jmax,0:others.kmax,:] = 1.0
bound.bnd[0:others.imax,0:others.jmax,0:others.kmax] = 1.0
for k in range(others.kmax):
pres.p[:,:,k] = ((k-1)*(k-1))/((others.kmax-1)*(others.kmax-1))
def initmem():
dim = (others.mimax, others.mjmax, others.mkmax)
dim4 = (others.mimax, others.mjmax, others.mkmax,4)
dim3 = (others.mimax, others.mjmax, others.mkmax,3)
atype = np.float32
aorder = 'F'
pres.p = np.ndarray(dim, dtype=atype, order=aorder)
mtrx.a = np.ndarray(dim4, dtype=atype, order=aorder)
mtrx.b = np.ndarray(dim3, dtype=atype, order=aorder)
mtrx.c = np.ndarray(dim3, dtype=atype, order=aorder)
bound.bnd = np.ndarray(dim, dtype=atype, order=aorder)
work.wrk1 = np.ndarray(dim, dtype=atype, order=aorder)
work.wrk2 = np.ndarray(dim, dtype=atype, order=aorder)
#@numba.jit('(i8,f8,i8,i8,i8,i8,i8,i8,f8,f4[:,:,:],f4[:,:,:,:],f4[:,:,:,:],f4[:,:,:,:],f4[:,:,:],f4[:,:,:],f4[:,:,:])',nopython=True)
@numba.jit(nopython=True)
def jacobi(nn,mimax,mjmax,mkmax,imax,jmax,kmax,omega \
,p,a,b,c,bnd,wrk1,wrk2):
gosa = np.zeros((1), dtype=np.float32)
for loop in range(nn):
gosa[0] = 0.0
#ss = np.array(0.0, np.float32)
for k in range(1,kmax-1):
for j in range(1,jmax-1):
for i in range(1,imax-1):
s0 = a[i,j,k,0]*p[i+1,j,k] \
+a[i,j,k,1]*p[i,j+1,k] \
+a[i,j,k,2]*p[i,j,k+1] \
+b[i,j,k,0]*(p[i+1,j+1,k]-p[i+1,j-1,k] \
-p[i-1,j+1,k]+p[i-1,j-1,k]) \
+b[i,j,k,1]*(p[i,j+1,k+1]-p[i,j-1,k+1] \
-p[i,j+1,k-1]+p[i,j-1,k-1]) \
+b[i,j,k,2]*(p[i+1,j,k+1]-p[i-1,j,k+1] \
-p[i+1,j,k-1]+p[i-1,j,k-1]) \
+c[i,j,k,0]*p[i-1,j,k] \
+c[i,j,k,1]*p[i,j-1,k] \
+c[i,j,k,2]*p[i,j,k-1]+wrk1[i,j,k]
ss=(s0*a[i,j,k,3]-p[i,j,k])*bnd[i,j,k]
gosa[0] = gosa[0]+ss*ss
wrk2[i,j,k]=p[i,j,k]+ omega*ss
#if loop < 4: print(f'loop: {loop} gosa: {gosa}')
#print('loop: {} gosa: {}'.format(loop,gosa))
#print(gosa)
p[1:imax-1,1:jmax-1,1:kmax-1] = \
wrk2[1:imax-1,1:jmax-1,1:kmax-1]
return float(gosa[0])
def second():
return time.time()
if __name__ == '__main__':
ttarget = 60.0 # 60
xmflops2 = 0.0
gosa = 0.0
if len(sys.argv) > 1:
size = sys.argv[1].lower()
else:
size = 's'
readparam(size)
initmem()
initmt()
''' numba jit'''
nn=1
gosa = jacobi(nn,others.mimax,others.mjmax,others.mkmax \
,others.imax,others.jmax,others.kmax,others.omega \
,pres.p,mtrx.a,mtrx.b,mtrx.c,bound.bnd,work.wrk1,work.wrk2)
initmt()
print(f' mimax={others.mimax} mjmax={others.mjmax} mkmax={others.mkmax}')
print(f' imax={others.imax} jmax={others.jmax} kmax={others.kmax}')
dt= time.get_clock_info('time').resolution
print(' %a , %10.5e' % ('Time measurement accuracy : ',dt))
## Start measuring
nn=3
print(' Start rehearsal measurement process.')
print(' Measure the performance in 3 times.')
cpu0=second()
## Jacobi iteration
gosa = jacobi(nn,others.mimax,others.mjmax,others.mkmax \
,others.imax,others.jmax,others.kmax,others.omega \
,pres.p,mtrx.a,mtrx.b,mtrx.c,bound.bnd,work.wrk1,work.wrk2)
cpu1=second()
cpu = (cpu1 - cpu0)
flop=(others.kmax-2)*(others.jmax-2)*(others.imax-2)*34.0
if (cpu != 0.0): xmflops2 = flop/cpu*1.0e-6*nn
print(f' MFLOPS:{xmflops2} time(s):{cpu} {gosa}')
#
## end the test loop
nn=int(ttarget/(cpu/3.0))+1
print('Now, start the actual measurement process.')
print(f'The loop will be excuted in {nn} times.')
print('This will take about one minute.')
print('Wait for a while.')
#
## Jacobi iteration
cpu0= second()
## Jacobi iteration
gosa = jacobi(nn,others.mimax,others.mjmax,others.mkmax \
,others.imax,others.jmax,others.kmax,others.omega \
,pres.p,mtrx.a,mtrx.b,mtrx.c,bound.bnd,work.wrk1,work.wrk2)
#
cpu1= second()
cpu = (cpu1 - cpu0)
if (cpu != 0.0): xmflops2 = flop*1.0e-6/cpu*nn
#
### xmflops2=nflop/cpu*1.0e-6*float(nn)
#
print(f' Loop executed for {nn} times')
print(f' Gosa : {gosa}')
print(f' MFLOPS: {xmflops2} time(s):{cpu}')
score=xmflops2/82.84
print(f' Score based on Pentium III 600MHz : {score}')
#
# pause
# stop
numpyバージョン
下にソースのコア部分の抜粋を示します。コメントを参照してください。
def jacobi(nn,mimax,mjmax,mkmax,imax,jmax,kmax,omega \
,p,a,b,c,bnd,wrk1,wrk2,s0,ss):
p111 = p[1:imax-1,1:jmax-1,1:kmax-1]
p211 = p[2:imax ,1:jmax-1,1:kmax-1]
p121 = p[1:imax-1,2:jmax ,1:kmax-1]
p112 = p[1:imax-1,1:jmax-1,2:kmax ]
'''
中略。
上記のように、添字をずらしたndarrayのviewを作り、
'''
for loop in range(nn):
```
k,j,iのループを無くして、
numpy配列の演算に置き換えた。
※注意事項(はまったところ)
s0[:,:,:]=...の "[:,:,:]"を忘れて、s0=とすると、
s0を計算結果の新しいndarrayで上書きしてしまうので注意。
```
s0111[:,:,:] = a0*p211 \
+a1* p121 \
+a2* p112 \
+b0*( p221 - p201 \
-p021 + p001) \
+b1*( p122 - p102 \
-p120 + p100) \
+b2*( p212 - p012 \
-p210 + p010) \
+c0*p011 \
+c1*p101 \
+c2*p110 + w1
ss111[:,:,:] = (s0111*a3-p111)*bnd111
以下、numpy版の全体です。
'''
Created on 2018/06/10
@author: akio
'''
import sys
import time
from ctypes import c_float
import numpy as np
import numba
class pres:
pres = []
class mtrx:
a = []
b = []
c = []
class bound:
bnd = []
class work:
wrk1 = []
wrk2 = []
class others:
imax = -1
jmax = -1
kmax = -1
mimax = -1
mjmax = -1
mkmax = -1
omega = 0.8
def readparam(size):
grid_set(size)
others.imax = others.mimax-1
others.jmax = others.mjmax-1
others.kmax = others.mkmax-1
return
def grid_set(size):
if size == "xs":
mimax = 65
mjmax = 33
mkmax = 33
elif size == "s":
mimax = 129
mjmax = 65
mkmax = 65
elif size == "m":
mimax=257
mjmax=129
mkmax=129
elif size == "l":
mimax=513
mjmax=257
mkmax=257
elif size == "xl":
mimax=1025
mjmax=513
mkmax=513
else:
print("Invalid input character !! {}".format(size))
sys.exit("-1")
others.mimax = mimax
others.mjmax = mjmax
others.mkmax = mkmax
def initmt():
pres.p[:,:,:] = 0
mtrx.a[:,:,:,:] = 0
mtrx.b[:,:,:,:] = 0
mtrx.c[:,:,:] = 0
bound.bnd[:,:,:] = 0.0
work.wrk1[:,:,:] = 0.0
work.wrk2[:,:,:] = 0.0
work.s0[:,:,:] = 0.0
work.ss[:,:,:] = 0.0
mtrx.a[0:others.imax,0:others.jmax,0:others.kmax,0:3] = 1.0
mtrx.a[0:others.imax,0:others.jmax,0:others.kmax,3] = 1.0/6.0
mtrx.c[0:others.imax,0:others.jmax,0:others.kmax,:] = 1.0
bound.bnd[0:others.imax,0:others.jmax,0:others.kmax] = 1.0
for k in range(others.kmax):
pres.p[:,:,k] = ((k-1)*(k-1))/((others.kmax-1)*(others.kmax-1))
def initmem():
dim = (others.mimax, others.mjmax, others.mkmax)
dim4 = (others.mimax, others.mjmax, others.mkmax,4)
dim3 = (others.mimax, others.mjmax, others.mkmax,3)
atype = np.float32
aorder = 'F'
pres.p = np.ndarray(dim, dtype=atype, order=aorder)
mtrx.a = np.ndarray(dim4, dtype=atype, order=aorder)
mtrx.b = np.ndarray(dim3, dtype=atype, order=aorder)
mtrx.c = np.ndarray(dim3, dtype=atype, order=aorder)
bound.bnd = np.ndarray(dim, dtype=atype, order=aorder)
work.wrk1 = np.ndarray(dim, dtype=atype, order=aorder)
work.wrk2 = np.ndarray(dim, dtype=atype, order=aorder)
work.s0 = np.ndarray(dim, dtype=atype, order=aorder)
work.ss = np.ndarray(dim, dtype=atype, order=aorder)
#@numba.jit('(i8,f8,i8,i8,i8,i8,i8,i8,f8,f4[:,:,:],f4[:,:,:,:],f4[:,:,:,:],f4[:,:,:,:],f4[:,:,:],f4[:,:,:],f4[:,:,:])',nopython=True)
#@numba.jit(nopython=True)
def jacobi(nn,mimax,mjmax,mkmax,imax,jmax,kmax,omega \
,p,a,b,c,bnd,wrk1,wrk2,s0,ss):
#gosa = np.zeros((1), dtype=np.float32)
a0 = a[1:imax-1,1:jmax-1,1:kmax-1,0]
a1 = a[1:imax-1,1:jmax-1,1:kmax-1,1]
a2 = a[1:imax-1,1:jmax-1,1:kmax-1,2]
a3 = a[1:imax-1,1:jmax-1,1:kmax-1,3]
b0 = b[1:imax-1,1:jmax-1,1:kmax-1,0]
b1 = b[1:imax-1,1:jmax-1,1:kmax-1,1]
b2 = b[1:imax-1,1:jmax-1,1:kmax-1,2]
c0 = c[1:imax-1,1:jmax-1,1:kmax-1,0]
c1 = c[1:imax-1,1:jmax-1,1:kmax-1,1]
c2 = c[1:imax-1,1:jmax-1,1:kmax-1,2]
p111 = p[1:imax-1,1:jmax-1,1:kmax-1]
p211 = p[2:imax ,1:jmax-1,1:kmax-1]
p121 = p[1:imax-1,2:jmax ,1:kmax-1]
p112 = p[1:imax-1,1:jmax-1,2:kmax ]
p221 = p[2:imax ,2:jmax ,1:kmax-1]
p201 = p[2:imax ,0:jmax-2,1:kmax-1]
p021 = p[0:imax-2,2:jmax ,1:kmax-1]
p001 = p[0:imax-2,0:jmax-2,1:kmax-1]
p122 = p[1:imax-1,2:jmax ,2:kmax ]
p102 = p[1:imax-1,0:jmax-2,2:kmax ]
p120 = p[1:imax-1,2:jmax ,0:kmax-2]
p100 = p[1:imax-1,0:jmax-2,0:kmax-2]
p212 = p[2:imax ,1:jmax-1,2:kmax ]
p012 = p[0:imax-2,1:jmax-1,2:kmax ]
p210 = p[2:imax ,1:jmax-1,0:kmax-2]
p010 = p[0:imax-2,1:jmax-1,0:kmax-2]
p011 = p[0:imax-2,1:jmax-1,1:kmax-1]
p101 = p[1:imax-1,0:jmax-2,1:kmax-1]
p110 = p[1:imax-1,1:jmax-1,0:kmax-2]
w1 = wrk1[1:imax-1,1:jmax-1,1:kmax-1]
w2 = wrk2[1:imax-1,1:jmax-1,1:kmax-1]
s0111 = s0[1:imax-1,1:jmax-1,1:kmax-1]
ss111 = ss[1:imax-1,1:jmax-1,1:kmax-1]
bnd111 = bnd[1:imax-1,1:jmax-1,1:kmax-1]
for loop in range(nn):
s0111[:,:,:] = a0*p211 \
+a1* p121 \
+a2* p112 \
+b0*( p221 - p201 \
-p021 + p001) \
+b1*( p122 - p102 \
-p120 + p100) \
+b2*( p212 - p012 \
-p210 + p010) \
+c0*p011 \
+c1*p101 \
+c2*p110 + w1
ss111[:,:,:] = (s0111*a3-p111)*bnd111
w2[:,:,:] = p111+ omega*ss111
ss111[:,:,:] = ss111*ss111
gosa = float(np.sum(ss111, dtype=np.float32))
#if loop < 4: print(f'loop: {loop} gosa: {gosa}')
#print('loop: {} gosa: {}'.format(loop,gosa))
if loop < 3: print(gosa)
p111[:,:,:] = \
w2
return float(gosa)
def second():
return time.time()
if __name__ == '__main__':
ttarget = 60.0 # 60
xmflops2 = 0.0
gosa = 0.0
if len(sys.argv) > 1:
size = sys.argv[1].lower()
else:
size = 's'
readparam(size)
initmem()
initmt()
print(f' mimax={others.mimax} mjmax={others.mjmax} mkmax={others.mkmax}')
print(f' imax={others.imax} jmax={others.jmax} kmax={others.kmax}')
dt= time.get_clock_info('time').resolution
print(' %a , %10.5e' % ('Time measurement accuracy : ',dt))
## Start measuring
nn=3
print(' Start rehearsal measurement process.')
print(' Measure the performance in 3 times.')
cpu0=second()
## Jacobi iteration
gosa = jacobi(nn,others.mimax,others.mjmax,others.mkmax \
,others.imax,others.jmax,others.kmax,others.omega \
,pres.p,mtrx.a,mtrx.b,mtrx.c,bound.bnd \
,work.wrk1,work.wrk2,work.s0,work.ss)
cpu1=second()
cpu = (cpu1 - cpu0)
flop=(others.kmax-2)*(others.jmax-2)*(others.imax-2)*34.0
if (cpu != 0.0): xmflops2 = flop/cpu*1.0e-6*nn
print(f' MFLOPS:{xmflops2} time(s):{cpu} {gosa}')
#
## end the test loop
nn=int(ttarget/(cpu/3.0))+1
print('Now, start the actual measurement process.')
print(f'The loop will be excuted in {nn} times.')
print('This will take about one minute.')
print('Wait for a while.')
#
## Jacobi iteration
cpu0= second()
## Jacobi iteration
gosa = jacobi(nn,others.mimax,others.mjmax,others.mkmax \
,others.imax,others.jmax,others.kmax,others.omega \
,pres.p,mtrx.a,mtrx.b,mtrx.c,bound.bnd \
,work.wrk1,work.wrk2,work.s0,work.ss)
#
cpu1= second()
cpu = (cpu1 - cpu0)
if (cpu != 0.0): xmflops2 = flop*1.0e-6/cpu*nn
#
### xmflops2=nflop/cpu*1.0e-6*float(nn)
#
print(f' Loop executed for {nn} times')
print(f' Gosa : {gosa}')
print(f' MFLOPS: {xmflops2} time(s):{cpu}')
score=xmflops2/82.84
print(f' Score based on Pentium III 600MHz : {score}')
#
# pause
# stop
後書き
ソースコードを添付すると無駄に長くなってしまい、読み難くすみません。