Overview
I got the chance to use an SX-Aurora TSUBASA (SXAT), so I ran the STREAM benchmark on it. The measured bandwidth came close to the catalog specification, which is very promising.
Environment
- Vector Engine Type 10C
- nfort 3.5.1
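STREAM benchmark
The STREAM benchmark measures memory bandwidth and the corresponding computation rate. It is used mainly to evaluate memory bandwidth, and you can find articles that use it to measure the memory bandwidth of NVIDIA GPUs and of Intel Xeon Phi.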
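For reference, the four operations that STREAM times (Copy, Scale, Add, Triad) can be sketched as below. This is a minimal illustration, not code taken from stream.f: the module name `stream_kernels` and the subroutine names are hypothetical. The comments give the number of 8-byte words moved per element, which is what the reported rate is based on.

```fortran
! Hypothetical sketch of the four STREAM kernels (names are not from stream.f).
module stream_kernels
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Copy: dst = src (1 load + 1 store = 2 words per element)
  subroutine copy_kernel(src, dst)
    real(dp), intent(in)  :: src(:)
    real(dp), intent(out) :: dst(:)
    integer :: j
    do j = 1, size(src)
      dst(j) = src(j)
    end do
  end subroutine copy_kernel

  ! Scale: dst = s * src (2 words per element)
  subroutine scale_kernel(src, dst, s)
    real(dp), intent(in)  :: src(:), s
    real(dp), intent(out) :: dst(:)
    integer :: j
    do j = 1, size(src)
      dst(j) = s * src(j)
    end do
  end subroutine scale_kernel

  ! Add: dst = x + y (2 loads + 1 store = 3 words per element)
  subroutine add_kernel(x, y, dst)
    real(dp), intent(in)  :: x(:), y(:)
    real(dp), intent(out) :: dst(:)
    integer :: j
    do j = 1, size(x)
      dst(j) = x(j) + y(j)
    end do
  end subroutine add_kernel

  ! Triad: dst = x + s * y (3 words per element)
  subroutine triad_kernel(x, y, dst, s)
    real(dp), intent(in)  :: x(:), y(:), s
    real(dp), intent(out) :: dst(:)
    integer :: j
    do j = 1, size(x)
      dst(j) = x(j) + s * y(j)
    end do
  end subroutine triad_kernel
end module stream_kernels
```

Each loop does almost no arithmetic per byte moved, which is why the measured time is dominated by memory traffic rather than by floating-point throughput.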
STREAM has also been run on SX-Aurora TSUBASA, as can be confirmed from the PDF slides of a symposium presentation.
The STREAM benchmark is available from the University of Virginia page https://www.cs.virginia.edu/stream/ → Source Code Directory. The source is provided as a C version, stream.c, and a FORTRAN version, stream.f. This article uses the FORTRAN version.
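To build the STREAM benchmark, you need the downloaded stream.f plus an implementation of the external procedure mysecond() that stream.f references. mysecond is a function that returns the wall-clock time (in seconds) as a double-precision real. I implemented it in Fortran as follows.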
double precision function mysecond()
    use, intrinsic :: iso_fortran_env
    implicit none
    ! Note: cpu_time returns processor (CPU) time, not wall-clock time.
    call cpu_time(mysecond)
end function mysecond
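Because the function is meant to return wall-clock time, a variant based on the standard `system_clock` intrinsic is one alternative. The sketch below is an assumption on my part, not what was used for the measurements in this article: I have not checked it with nfort, and how `cpu_time` accounts for multiple threads is processor-dependent, so the two versions may or may not report the same numbers.

```fortran
! Hypothetical wall-clock variant of mysecond (not the version used for
! the results in this article).
double precision function mysecond()
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer(int64) :: ticks, ticks_per_second
  ! system_clock reports elapsed ticks and the tick rate in ticks per second
  call system_clock(ticks, ticks_per_second)
  mysecond = dble(ticks) / dble(ticks_per_second)
end function mysecond
```

It keeps the same name and return type, so it could be dropped into mysecond.f90 in place of the `cpu_time` version without touching stream.f.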
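STREAM also assumes that memory bandwidth is reported in MB/s, and its output field does not have enough digits to display SXAT's memory bandwidth. I therefore modified stream.f as follows so that it reports GB/s.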
- WRITE (*,FMT=9050) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D6,
+ WRITE (*,FMT=9050) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D9,
- 9040 FORMAT ('Function',5x,'Rate (MB/s) Avg time Min time Max time'
+ 9040 FORMAT ('Function',5x,'Rate (GB/s) Avg time Min time Max time'
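The expression in the modified WRITE statement can be checked by hand. The sketch below mirrors `n*bytes(j)*nbpw/mintime(j)/1.0D9`, where `bytes(j)` is the number of words moved per element and `nbpw` is the number of bytes per word. The module and function names are hypothetical and exist only for illustration.

```fortran
! Hypothetical helper that mirrors the rate expression in the modified stream.f.
module stream_rate
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Transfer rate in GB/s, with 1 GB = 1.0e9 bytes
  pure function rate_gbs(n, words, nbpw, seconds) result(rate)
    integer, intent(in)  :: n        ! array length
    integer, intent(in)  :: words    ! words moved per element
    integer, intent(in)  :: nbpw     ! bytes per word
    real(dp), intent(in) :: seconds  ! minimum time for the kernel
    real(dp) :: rate
    rate = real(n, dp) * real(words, dp) * real(nbpw, dp) / seconds / 1.0e9_dp
  end function rate_gbs
end module stream_rate
```

For example, Copy with n = 80000000 moves 2 words of 8 bytes per element, or 1.28 GB in total. `rate_gbs(80000000, 2, 8, 0.0040d0)` gives 320 GB/s, which in the original MB/s unit would be 320,000. The printed times are rounded to four decimals, so a value recomputed from them only approximately matches the printed rate.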
I also increased the array size to a suitably large value.
- PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
+ PARAMETER (n=80000000,offset=0,ndim=n+offset,ntimes=10)
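The effect of this change on memory use is easy to estimate, since STREAM works on three arrays of n double-precision words each. The helper below is a hypothetical sketch (the names are mine, not from stream.f) that computes the footprint in units of 1024*1024 bytes.

```fortran
! Hypothetical helper: total size of the three STREAM arrays.
module stream_footprint
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Footprint of three arrays of n words, in units of 1024*1024 bytes
  pure function footprint_mib(n, nbpw) result(mib)
    integer, intent(in) :: n     ! array length
    integer, intent(in) :: nbpw  ! bytes per word
    real(dp) :: mib
    mib = 3.0_dp * real(nbpw, dp) * real(n, dp) / (1024.0_dp * 1024.0_dp)
  end function footprint_mib
end module stream_footprint
```

With n = 80000000 and 8-byte words this gives about 1831, against roughly 45 for the default n = 2000000. The larger arrays are far bigger than any cache, so the timings reflect main-memory bandwidth.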
OpenMP parallelization
I built with nfort and enabled OpenMP.
$ nfort -fopenmp stream.f mysecond.f90
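As an illustration of what enabling OpenMP changes, here is a minimal sketch of a Triad-style loop annotated with an OpenMP directive. It is written for this article rather than copied from stream.f, and the names `triad_omp_sketch` and `triad_omp` are hypothetical.

```fortran
! Hypothetical sketch of an OpenMP-annotated Triad loop.
module triad_omp_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine triad_omp(a, b, c, s)
    real(dp), intent(out) :: a(:)
    real(dp), intent(in)  :: b(:), c(:), s
    integer :: j
    ! With OpenMP enabled, iterations are divided among the threads.
    ! Without it, the directive lines are treated as ordinary comments.
    !$omp parallel do
    do j = 1, size(a)
      a(j) = b(j) + s * c(j)
    end do
    !$omp end parallel do
  end subroutine triad_omp
end module triad_omp_sketch
```

Because the directives are comments to a compiler that is not in OpenMP mode, the same source builds either as a serial or as a threaded program depending only on the compile options.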
The number of OpenMP threads is controlled by the environment variable OMP_NUM_THREADS. I ran with 1, 2, 4, and 8 threads.
$ export OMP_NUM_THREADS=1
$ ./a.out
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
----------------------------------------------
STREAM Version $Revision: 5.6 $
----------------------------------------------
Array size = 80000000
Offset = 0
The total memory requirement is 1831 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------
Number of Threads = 1
----------------------------------------------
Printing one line per active thread....
----------------------------------------------------
Your clock granularity appears to be less than one microsecond
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function Rate (GB/s) Avg time Min time Max time
Copy: 320.4268 0.0040 0.0040 0.0040
Scale: 316.4575 0.0041 0.0040 0.0041
Add: 296.4091 0.0065 0.0065 0.0065
Triad: 331.0094 0.0058 0.0058 0.0058
----------------------------------------------------
Solution Validates!
----------------------------------------------------
$ export OMP_NUM_THREADS=2
$ ./a.out
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
----------------------------------------------
STREAM Version $Revision: 5.6 $
----------------------------------------------
Array size = 80000000
Offset = 0
The total memory requirement is 1831 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------
Number of Threads = 2
----------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
----------------------------------------------------
Your clock granularity appears to be less than one microsecond
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function Rate (GB/s) Avg time Min time Max time
Copy: 553.4130 0.0023 0.0023 0.0023
Scale: 553.4364 0.0023 0.0023 0.0023
Add: 577.8854 0.0033 0.0033 0.0033
Triad: 580.4489 0.0033 0.0033 0.0033
----------------------------------------------------
Solution Validates!
----------------------------------------------------
$ export OMP_NUM_THREADS=4
$ ./a.out
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
----------------------------------------------
STREAM Version $Revision: 5.6 $
----------------------------------------------
Array size = 80000000
Offset = 0
The total memory requirement is 1831 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------
Number of Threads = 4
----------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
----------------------------------------------------
Your clock granularity appears to be less than one microsecond
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function Rate (GB/s) Avg time Min time Max time
Copy: 711.2026 0.0018 0.0018 0.0018
Scale: 712.0048 0.0018 0.0018 0.0018
Add: 726.4547 0.0026 0.0026 0.0026
Triad: 721.9842 0.0027 0.0027 0.0027
----------------------------------------------------
Solution Validates!
----------------------------------------------------
$ export OMP_NUM_THREADS=8
$ ./a.out
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
----------------------------------------------
STREAM Version $Revision: 5.6 $
----------------------------------------------
Array size = 80000000
Offset = 0
The total memory requirement is 1831 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------
Number of Threads = 8
----------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
----------------------------------------------------
Your clock granularity appears to be less than one microsecond
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function Rate (GB/s) Avg time Min time Max time
Copy: 701.7189 0.0018 0.0018 0.0018
Scale: 698.7224 0.0018 0.0018 0.0018
Add: 713.1366 0.0027 0.0027 0.0027
Triad: 719.1415 0.0027 0.0027 0.0027
----------------------------------------------------
Solution Validates!
----------------------------------------------------
With four or more threads the transfer rate reaches about 700 GB/s (699 to 726 GB/s across the four kernels), close to the catalog value of 0.75 TB/s.
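To put "close to the catalog value" in numbers, the measured rates can be expressed as a percentage of 750 GB/s. The function below is a hypothetical helper written for this comparison.

```fortran
! Hypothetical helper: measured rate as a percentage of a peak value.
module peak_fraction
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  elemental function percent_of_peak(measured_gbs, peak_gbs) result(pct)
    real(dp), intent(in) :: measured_gbs  ! measured rate in GB/s
    real(dp), intent(in) :: peak_gbs      ! catalog rate in GB/s
    real(dp) :: pct
    pct = 100.0_dp * measured_gbs / peak_gbs
  end function percent_of_peak
end module peak_fraction
```

Applied to the Triad results above (331.0094, 580.4489, 721.9842, and 719.1415 GB/s for 1, 2, 4, and 8 threads), this gives about 44.1%, 77.4%, 96.3%, and 95.9% of the catalog value. The rate is already near its ceiling at four threads, and going to eight does not improve it.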
Automatic parallelization
nfort also has an option that performs automatic parallelization without using OpenMP.
$ nfort -mparallel stream.f mysecond.f90
In the results, the thread count is not displayed because OpenMP is not enabled, but the rates are about the same as in the 8-thread OpenMP run.
$ ./a.out
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
----------------------------------------------
STREAM Version $Revision: 5.6 $
----------------------------------------------
Array size = 80000000
Offset = 1
The total memory requirement is 1831 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------
----------------------------------------------
Printing one line per active thread....
----------------------------------------------------
Your clock granularity appears to be less than one microsecond
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function Rate (GB/s) Avg time Min time Max time
Copy: 697.7239 0.0018 0.0018 0.0018
Scale: 697.5061 0.0018 0.0018 0.0018
Add: 721.0836 0.0027 0.0027 0.0027
Triad: 719.9239 0.0027 0.0027 0.0027
----------------------------------------------------
Solution Validates!
----------------------------------------------------
Summary
I ran the STREAM benchmark on SX-Aurora TSUBASA. The results came close to the catalog value, so applications that are bound by memory bandwidth can be expected to perform well.