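1. Introduction
Three Raspberry Pi 4 boards (8 GB model) landed on my desk, so for a change I built a small cluster out of them. When it came to deciding what to run, I went for the usual choice, HPL, and was promptly thrown around by the mix of old and new information on the net. Once I picked out only the parts that are actually correct, getting HPL running turned out to be simple and quick. Back in the Raspberry Pi 3 days, setup took a whole day, including compiling ATLAS by hand. With the Raspberry Pi 4 the OS is 64-bit and the libraries are all properly packaged, so a single apt command is enough to get the environment ready. I am leaving this here partly as a note to myself.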
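2. Prerequisites
- Raspberry Pi 4 8GB × 3
Use a power supply of around 20 W for each board. A cooling fan is also a must.
- microSD card × 3
HPL does not need storage performance, so there is no need to be picky here.
Install Ubuntu Server 22.04.3 LTS (64-bit) on each card.
- Ethernet
Use wired connections, for the sake of stability.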
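3. Setup commands
The following commands take you all the way to a standalone HPL run on a single board.
At the vi HPL.dat step, enter the contents listed after the commands.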
$ sudo apt update
$ sudo apt upgrade -y
$ sudo apt install -y make gcc libopenmpi-dev libopenblas-dev
$ wget https://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz
$ tar zxvf hpl-2.3.tar.gz
$ cd hpl-2.3/
$ ./configure
$ make
$ cd testing/
$ vi HPL.dat
$ ./xhpl
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
10000 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
1 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
Here is the result.
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 10000
NB : 128
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 10000 128 1 1 46.42 1.4365e+01
HPL_pdgesv() start time Sun Nov 5 13:10:11 2023
HPL_pdgesv() end time Sun Nov 5 13:10:57 2023
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 5.53092509e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
In this example the result is about 14.4 GFLOPS (1.4365e+01 in the Gflops column).
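As a sanity check, the Gflops column can be recomputed from N and the wall time. The sketch below assumes the usual HPL operation count of 2/3·N³ + 3/2·N²; the function name hpl_gflops is just something I made up for this note.

```shell
# Recompute the Gflops figure from the problem size and the wall time.
# Assumed operation count: 2/3*N^3 + 3/2*N^2.
hpl_gflops() {
  # $1 = problem size N, $2 = wall time in seconds
  awk -v n="$1" -v t="$2" 'BEGIN {
    flops = (2.0 / 3.0) * n * n * n + 1.5 * n * n
    printf "%.2f\n", flops / t / 1e9
  }'
}

# N and Time taken from the WR11C2R4 line of the run above.
hpl_gflops 10000 46.42
```

The value will not match the last digit of the HPL output exactly, because the Time column is itself rounded to two decimals.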
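4. Three-node cluster
Following this article, register the primary's key on each secondary. Once that is done, confirm that you can ssh from the primary to each secondary without being asked for a password.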
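For reference, the key registration usually boils down to ssh-keygen plus ssh-copy-id. The following is only a sketch under my own assumptions: the helper names run and setup_passwordless_ssh, the DRY_RUN switch, the ed25519 key type and the user name ubuntu are all illustrative and not taken from the article mentioned above.

```shell
# Hypothetical helper: with DRY_RUN set, print the command instead of running it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Register the primary's public key on each secondary.
# $1 = remote user name, remaining arguments = secondary hosts
setup_passwordless_ssh() {
  user="$1"; shift
  key="$HOME/.ssh/id_ed25519"
  # Create a key pair on the primary only if none exists yet.
  [ -f "$key" ] || run ssh-keygen -t ed25519 -N "" -f "$key"
  for host in "$@"; do
    # Copies the public key; asks for the remote password once.
    run ssh-copy-id -i "$key.pub" "$user@$host"
    # BatchMode disables password prompts, so this fails if key login is not working.
    run ssh -o BatchMode=yes "$user@$host" hostname
  done
}

# Example (prints the commands only):
# DRY_RUN=1 setup_passwordless_ssh ubuntu 192.168.0.3 192.168.0.4
```

Run it on the primary with DRY_RUN set first to see what would be executed, then again without it.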
Next, create a machinefile that lists the hosts.
<IP address of the primary>
<IP address of secondary 1>
<IP address of secondary 2>
Example
192.168.0.2
192.168.0.3
192.168.0.4
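The plain list of addresses is what I used. As far as I know, Open MPI falls back to the detected core count when no slot count is given, which is presumably why this works on the 4-core Pi 4. If you prefer to state the slots explicitly, a hostfile can be generated as sketched below; write_machinefile is a made-up helper name and slots=4 assumes four cores per board.

```shell
# Write an Open MPI hostfile with an explicit slot count per host.
# $1 = output file, $2 = slots per host, remaining arguments = host addresses
write_machinefile() {
  out="$1"; slots="$2"; shift 2
  : > "$out"   # start from an empty file
  for host in "$@"; do
    printf '%s slots=%s\n' "$host" "$slots" >> "$out"
  done
}

# Example:
# write_machinefile machinefile 4 192.168.0.2 192.168.0.3 192.168.0.4
```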
Once that is done, go to hpl-2.3/testing/ on the primary and create an HPL.dat for three nodes.
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
50000 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
3 Ps
4 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
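The Ns value decides how much memory the run needs: the N × N matrix of doubles takes N² × 8 bytes, spread over all processes. The sketch below works that out and also suggests an N for a given amount of memory. The helper names are made up, and the 80% fraction is only a commonly quoted rule of thumb that I am assuming here, not something I tuned.

```shell
# Memory taken by the N x N double-precision matrix, in GiB.
hpl_matrix_gib() {
  awk -v n="$1" 'BEGIN { printf "%.1f\n", n * n * 8 / (1024 * 1024 * 1024) }'
}

# Largest multiple of NB whose matrix fits in a fraction of total memory.
# $1 = total memory in GiB, $2 = fraction to use (assumption), $3 = NB
hpl_suggest_n() {
  awk -v mem="$1" -v frac="$2" -v nb="$3" 'BEGIN {
    n = int(sqrt(frac * mem * 1024 * 1024 * 1024 / 8))
    print int(n / nb) * nb
  }'
}

hpl_matrix_gib 50000        # matrix size for the Ns value used above
hpl_suggest_n 24 0.8 128    # 3 boards x 8 GiB, using an assumed 80%
```

With these inputs the suggested N comes out close to the 50000 used above. Note that the memory actually available to Linux on an 8 GB board is somewhat less than 8 GiB.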
Then run it with the following command.
$ mpirun --np 12 --hostfile machinefile ./xhpl
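The process count given to mpirun has to cover the P × Q grid in HPL.dat (3 × 4 = 12 here); HPL refuses to run if it gets fewer processes than that. A small check like the one below can catch a mismatch before a long run. It is a sketch that assumes the standard single-grid layout with Ps on line 11 and Qs on line 12, and hpl_grid_procs is a name I made up.

```shell
# Number of MPI processes a single-grid HPL.dat asks for (P x Q).
# Assumes the standard layout: Ps on line 11, Qs on line 12.
hpl_grid_procs() {
  awk 'NR == 11 { p = $1 } NR == 12 { q = $1 } END { print p * q }' "$1"
}

# Example: launch only if 12 processes are enough for the grid.
# [ "$(hpl_grid_procs HPL.dat)" -le 12 ] && mpirun --np 12 --hostfile machinefile ./xhpl
```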
xhpl now starts running across the three boards. It takes a while (about 35 minutes in this run). The result looks like this.
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 50000
NB : 128
PMAP : Row-major process mapping
P : 3
Q : 4
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 50000 128 3 4 2128.47 3.9153e+01
HPL_pdgesv() start time Sun Nov 5 08:49:39 2023
HPL_pdgesv() end time Sun Nov 5 09:25:08 2023
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 3.30604606e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
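That is about 39.2 GFLOPS (3.9153e+01 in the Gflops column).
Conclusion
So we now live in a world where a handful of single-board computers on your desk can deliver roughly 40 GFLOPS. The Raspberry Pi 5 looks promising too.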