More than 5 years have passed since last update.

リバティ・フィッシュ株式会社

RasPiでHPL

Last updated at 2018-09-09Posted at 2018-09-08

あるあるネタですが、RasPi3でHPLをビルドして実行します。

1.ハードウェア環境の準備

1.ヒートシンクを準備しましょう(必須)
2.小型扇風機を準備しましょう(必須)
3.安定した電源を準備しましょう
4.クラスタを組む時はスループット性能の良いHUBも準備しましょう

2.OSの準備

定番のアップデートは済ませましょう。

pi@rpicluster00:~ $ sudo apt-get update
pi@rpicluster00:~ $ sudo apt-get upgrade

これは最低限必要な感じ。

pi@rpicluster00:~ $ sudo apt-get install gfortran automake

任意でホスト名の設定
https://qiita.com/hinemoss/items/fd304d1c1754c37f6879

任意で固定IPアドレスの設定
https://qiita.com/hinemoss/items/ac29e51c9087d932b430

3.BLAS(ATLAS)のビルド

まずはBLAS(Basic Linear Algebra Subprograms)のビルドです。BLASはベクトルと行列の演算をしてくれるライブラリらしいです。詳しくはwikiでどうぞ。

3-1.atlasのダウンロード

pi@rpicluster00:~ $ cd
pi@rpicluster00:~ $ wget https://sourceforge.net/projects/math-atlas/files/Stable/3.10.3/atlas3.10.3.tar.bz2

3-2.atlasの解凍

pi@rpicluster00:~ $ tar xjvf atlas3.10.3.tar.bz2

3-3.ビルド用のディレクトリ作成

pi@rpicluster00:~ $ mkdir atlas-build
pi@rpicluster00:~ $ cd atlas-build/

3-4.CPUの性能が落ちないように設定

pi@rpicluster00:~/atlas-build $ echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

3-5.気合いを入れてconfigureとmake

扇風機準備OK?
もちろん置いとくだけじゃダメです。
ちゃんとグルグル回ってます?
ホントに?

pi@rpicluster00:~/atlas-build $ ../ATLAS/configure
pi@rpicluster00:~/atlas-build $ make

たまに4core使ってみたり、色々最適化しながら進みます。大体12時間から14時間ぐらいかかります。僕が用意した扇風機は気が効いていて、10時間で自動停止してくれます。あらステキ！でも、これでは正しい値が取れないので失敗です。2回目は、扇風機を途中で再起動しました。

3-6.ビルドできたらインストール

参考にさせていただいたページではinstallはなかったんですが、僕はこの後のビルドに支障がでたのでinstallしときます。

pi@rpicluster00:~/atlas-build $ sudo make install

4.mpichのビルド

次はmpichのビルドです。mpichは分散処理の時のメッセージをうまくやりとりしてくれるソフトウェアらしいです。こちらも詳しくはwikiでどうぞ。

4-1.mpichのダウンロードと解凍

pi@rpicluster00:~/atlas-build $ cd
pi@rpicluster00:~ $ wget http://www.mpich.org/static/downloads/3.2/mpich-3.2.tar.gz
pi@rpicluster00:~ $ tar xzvf mpich-3.2.tar.gz

4-2.configureとmake(その2)

こっちはatlasほどは気合いはいりません。けど、扇風機のスイッチON！
あと、時間もそこそこかかります。

pi@rpicluster00:~ $ cd mpich-3.2
pi@rpicluster00:~/mpich-3.2 $ ./configure
pi@rpicluster00:~/mpich-3.2 $ make -j 4
pi@rpicluster00:~/mpich-3.2 $ sudo make install

5.HPLのビルド

いよいよ、HPLのビルドです。

5-1.ダウンロードとカスタマイズ前処理

pi@rpicluster00:~/mpich-3.2 $ cd
pi@rpicluster00:~ $ wget http://www.netlib.org/benchmark/hpl/hpl-2.2.tar.gz
pi@rpicluster00:~ $ tar xzvf hpl-2.2.tar.gz
pi@rpicluster00:~ $ cd hpl-2.2/setup
pi@rpicluster00:~/hpl-2.2/setup $ sh make_generic
pi@rpicluster00:~/hpl-2.2/setup $ cp Make.UNKNOWN ../Make.rpi
pi@rpicluster00:~/hpl-2.2/setup $ cd ..
pi@rpicluster00:~/hpl-2.2 $

5-2.makeファイルのカスタマイズ

エディタでMake.rpiを開いて、以下の箇所を修正します。

pi@rpicluster00:~/hpl-2.2 $ vi Make.rpi

ARCH = rpi
TOPdir       = $(HOME)/hpl-2.2
MPdir        = /usr/local
MPinc        = -I $(MPdir)/include
MPlib        = /usr/local/lib/libmpich.so
LAdir        = /home/pi/atlas-build
LAinc        =
LAlib        = $(LAdir)/lib/libf77blas.a $(LAdir)/lib/libatlas.a

5-3.気合を入れてmake

扇風機をONしましょう。

pi@rpicluster00:~/hpl-2.2 $ make arch=rpi

5-4.HPL定義ファイルの編集

エディタで定義ファイルの編集をします。

pi@rpicluster00:~/hpl-2.2 $ cd bin/rpi
pi@rpicluster00:~/hpl-2.2/bin/rpi $ vi HPL.dat

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10000        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

Nsは10000を設定していますが、利用できるメモリによって調整します。Nsの値が大きいほど、結果は良くなる傾向にあります。僕の場合、4nodeでの実行時、20000ぐらいは指定できました。

PsとQsは利用したいcore数によって調整します。
1x1だとRasPiは4coreのうちの1つを使用して計算してくれるようです。
2x2だと4core、2x8や4x4だと4nodeで4coreを使用して計算してくれるようです。
2core以上を使用する場合は、別のファイルを用意して、使用したいnodeを記述します。

5-5.実行(1node x 1core)

ちなみに、メモリの空き状況です。

pi@rpicluster00:~/hpl-2.2/bin/rpi $ free -h
             total       used       free     shared    buffers     cached
Mem:          970M        81M       889M       6.5M       8.5M        39M
-/+ buffers/cache:        32M       938M
Swap:           0B         0B         0B
pi@rpicluster00:~/hpl-2.2/bin/rpi $

CPUのクロックは1.2GHzでがんばるように設定しています。
https://qiita.com/hinemoss/items/65d89f92840ab8ee6ba2

では、扇風機をONして、1x1で実行してみます！
ちなみにmicroSDカードへのアクセスランプが頻繁に点滅したり、点灯したままだとペケです。GPUに割り当てられているメモリを16MBにするなり、少しでも使えるメモリを増やしましょう。それでもだめならNsの値を減らしましょう。
実行中、メモリが足りているか不安になります。そういうときは計算に影響はありますが、別にコンソールを作ってfreeで様子を見ましょう。足りてそうなら一度止めて、計算をやり直しましょう。足りていない時は、計算が止まることもありますし、応答が遅くなくこともあります。

pi@rpicluster00:~/hpl-2.2/bin/rpi $ ./xhpl
================================================================================
HPLinpack 2.2  --  High-Performance Linpack benchmark  --   February 24, 2016
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   10000
NB     :     128
PMAP   : Row-major process mapping
P      :       1
Q      :       1
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       10000   128     1     1             522.02              1.277e+00
HPL_pdgesv() start time Sat Sep  8 22:36:00 2018

HPL_pdgesv() end time   Sat Sep  8 22:44:42 2018

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0016257 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
pi@rpicluster00:~/hpl-2.2/bin/rpi $

結果は、指数表示されている 1.277e+00 Gflops ですね。やったね！

5-6.実行(1node x 4core)

1nodeで4coreを使用したい場合は、計算で使用したいホストのIPアドレスの一覧が記述されてたファイルを作成します。名前は何でも良いのですが、例えばnodelistというファイルに、以下の内容を記述して

localhost
localhost
localhost
localhost

定義ファイルも調整して

10000        Ns
2            Ps
2            Qs

実行してみます。

pi@rpicluster00:~/hpl-2.2/bin/rpi $ mpiexec -f nodelist ./xhpl
================================================================================
HPLinpack 2.2  --  High-Performance Linpack benchmark  --   February 24, 2016
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   10000
NB     :     128
PMAP   : Row-major process mapping
P      :       2
Q      :       2
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       10000   128     2     2             159.78              4.173e+00
HPL_pdgesv() start time Sat Sep  8 22:56:32 2018

HPL_pdgesv() end time   Sat Sep  8 22:59:12 2018

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0016969 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
pi@rpicluster00:~/hpl-2.2/bin/rpi $

結果は、4.173e+00 Gflops。単純にx4にはならないようです。そりゃそうですよね！

5-7.実行(4node x 4core)

参考までに、4nodeで実行した際の設定と結果。
実行には、他のnodeにパスワードなしでssh接続できるように設定が必要です。
https://qiita.com/hinemoss/items/2bdc2293c84bc1001214

172.16.1.100
172.16.1.100
172.16.1.100
172.16.1.100
172.16.1.101
172.16.1.101
172.16.1.101
172.16.1.101
172.16.1.102
172.16.1.102
172.16.1.102
172.16.1.102
172.16.1.103
172.16.1.103
172.16.1.103
172.16.1.103

20000        Ns
2            Ps
8            Qs

pi@rpicluster00:~/hpl-2.2/bin/rpi $ mpiexec -f nodelist ./xhpl
================================================================================
HPLinpack 2.2  --  High-Performance Linpack benchmark  --   February 24, 2016
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   20000
NB     :     128
PMAP   : Row-major process mapping
P      :       2
Q      :       8
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     8             469.24              1.137e+01
HPL_pdgesv() start time Sat Sep  8 23:13:03 2018

HPL_pdgesv() end time   Sat Sep  8 23:20:53 2018

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0009450 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
pi@rpicluster00:~/hpl-2.2/bin/rpi $

指数表示されているのが 1.137e+01 Gflops なのでx10して 11.3Gflopsぐらいですね。

参考にさせていただいたのはコチラ
メモのページ　-　チラシの裏メモ別紙
http://d.hatena.ne.jp/debslink/20160724/1469358050
http://d.hatena.ne.jp/debslink/20160801/1469980976
http://d.hatena.ne.jp/debslink/20160827/1472285472
http://d.hatena.ne.jp/debslink/20160902/1472819758

Compute Nodes
https://computenodes.net/2018/06/28/building-hpl-an-atlas-for-the-raspberry-pi/

HowToForge
https://www.howtoforge.com/tutorial/hpl-high-performance-linpack-benchmark-raspberry-pi/

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up