Trying out NEC Fortran and SX-Aurora TSUBASA (Part 3)

Posted at 2022-12-15, last updated at 2022-12-18

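Overview

I got the chance to use SX-Aurora TSUBASA (SXAT), so I ran the stream benchmark on it. The measured bandwidth came close to the catalog specification, which is very promising.

Environment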

Vector Engine Type 10C
nfort 3.5.1

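stream benchmark

The stream benchmark measures memory bandwidth and the corresponding computation rate for simple vector kernels. It is used mainly to evaluate memory bandwidth, and you can find articles that use it to evaluate the memory bandwidth of NVIDIA GPUs and of Intel Xeon Phi.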
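As a rough illustration of what is being timed, the four kernels reported in the output (Copy, Scale, Add, Triad) can be sketched as below. This is a minimal sketch, not the benchmark itself: the program name, the small array size, and the initial values are chosen only for illustration, and the timing and validation code of stream.f is omitted.

```fortran
program stream_kernels
    use, intrinsic :: iso_fortran_env, only: real64
    implicit none
    integer, parameter :: n = 1000          ! small illustrative size, not the benchmark size
    real(real64), parameter :: scalar = 3.0_real64
    real(real64), allocatable :: a(:), b(:), c(:)
    integer :: j

    allocate(a(n), b(n), c(n))
    a = 1.0_real64
    b = 2.0_real64
    c = 0.0_real64

    ! Copy: one load and one store per element (2 words)
    do j = 1, n
        c(j) = a(j)
    end do

    ! Scale: one load and one store per element (2 words)
    do j = 1, n
        b(j) = scalar*c(j)
    end do

    ! Add: two loads and one store per element (3 words)
    do j = 1, n
        c(j) = a(j) + b(j)
    end do

    ! Triad: two loads and one store per element (3 words)
    do j = 1, n
        a(j) = b(j) + scalar*c(j)
    end do

    ! With the initial values above: a = 15, b = 3, c = 4
    print '(a,3f8.2)', 'a(1), b(1), c(1) = ', a(1), b(1), c(1)
end program stream_kernels
```

Each kernel does almost no arithmetic per byte moved, which is why the measured rate reflects memory bandwidth rather than floating-point throughput.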

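The PDF slides of a symposium presentation confirm that the stream benchmark has also been run on SX-Aurora TSUBASA.

The stream benchmark is available from the University of Virginia page https://www.cs.virginia.edu/stream/ → Source Code Directory. The source comes in a C version, stream.c, and a FORTRAN version, stream.f. This article uses the FORTRAN version.

To build the stream benchmark, you need to implement the external procedure mysecond(), which stream.f references, in addition to the downloaded stream.f. mysecond is a function that returns the wall-clock time in seconds as a double-precision real. I implemented it in Fortran as follows.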

mysecond.f90

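double precision function mysecond()
    use, intrinsic :: iso_fortran_env
    implicit none

    ! The result variable mysecond receives the time in seconds.
    ! Note: the intrinsic cpu_time reports processor time, not wall-clock time.
    call cpu_time(mysecond)
end function mysecond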
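The text describes mysecond as returning wall-clock time, whereas the intrinsic cpu_time reports processor time, and how processor time is accumulated across threads depends on the compiler. A wall-clock variant based on the intrinsic system_clock might look like the sketch below. The name mysecond_wall and the small driver program are hypothetical and were not used for the measurements in this article; to link it with stream.f, the function would have to be named mysecond.

```fortran
program time_check
    implicit none
    double precision, external :: mysecond_wall
    double precision :: t0, t1

    t0 = mysecond_wall()
    t1 = mysecond_wall()
    ! Elapsed wall-clock time between the two calls, in seconds
    print '(a,es12.4)', 'elapsed (s) = ', t1 - t0
end program time_check

double precision function mysecond_wall()
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer(int64) :: ticks, ticks_per_second

    ! system_clock returns the current tick count of a real-time clock
    ! together with the number of ticks per second.
    call system_clock(ticks, ticks_per_second)
    mysecond_wall = dble(ticks)/dble(ticks_per_second)
end function mysecond_wall
```

Using 64-bit integers for the tick count usually selects a finer clock resolution than the default integer kind, although the actual resolution is processor dependent.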

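In addition, the stream benchmark assumes that memory bandwidth is reported in MB/s, and its output field does not have enough digits to display the memory bandwidth of SXAT. I therefore modified stream.f as follows.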

stream.f:236

-          WRITE (*,FMT=9050) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D6,
+          WRITE (*,FMT=9050) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D9,

stream.f:247

- 9040 FORMAT ('Function',5x,'Rate (MB/s)  Avg time   Min time  Max time'
+ 9040 FORMAT ('Function',5x,'Rate (GB/s)  Avg time   Min time  Max time'
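To see why the unit matters, the rate expression n*bytes(j)*nbpw/mintime(j) can be evaluated by hand for the Triad kernel. The sketch below assumes that Triad counts 3 words per element (two loads and one store), that the rate is printed with an f10.4 field (as the column spacing of the output suggests), and it uses the rounded 1-thread minimum time of 0.0058 s and the array length of 80000000 from the runs in this article. The program and variable names are illustrative.

```fortran
program stream_rate
    use, intrinsic :: iso_fortran_env, only: real64
    implicit none
    integer, parameter :: n = 80000000       ! array length used in this article
    integer, parameter :: nbpw = 8           ! bytes per DOUBLE PRECISION word
    integer, parameter :: words = 3          ! assumed Triad traffic: 2 loads + 1 store
    real(real64), parameter :: mintime = 0.0058_real64  ! rounded Triad min time, 1 thread
    real(real64) :: nbytes

    ! Bytes moved in one pass over the arrays: 80000000 * 3 * 8 = 1.92e9
    nbytes = real(n, real64)*real(words*nbpw, real64)

    ! In MB/s the value is about 331034, which does not fit in f10.4,
    ! so the field is filled with asterisks.
    print '(a,f10.4)', 'Rate (MB/s) = ', nbytes/mintime/1.0e6_real64

    ! In GB/s the value is about 331 and fits comfortably.
    print '(a,f10.4)', 'Rate (GB/s) = ', nbytes/mintime/1.0e9_real64
end program stream_rate
```

The GB/s value computed this way is about 331, in line with the 331.0094 reported for Triad with one thread; the small difference comes from using the rounded minimum time.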

I also increased the array size to a suitably large value.

-      PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
+      PARAMETER (n=80000000,offset=0,ndim=n+offset,ntimes=10)
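The memory footprint implied by this array size can be checked with a short calculation. The sketch below assumes that the benchmark uses three double-precision arrays of length n and reports the total in units of 1024*1024 bytes; the program and variable names are illustrative.

```fortran
program stream_memory
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer(int64), parameter :: n = 80000000_int64   ! array length after the change
    integer(int64), parameter :: nbpw = 8_int64       ! bytes per DOUBLE PRECISION word
    integer(int64), parameter :: narrays = 3_int64    ! assumed: arrays a, b, and c
    integer(int64) :: total_bytes

    total_bytes = narrays*nbpw*n                      ! 1,920,000,000 bytes
    ! Integer division truncates 1831.05... to 1831
    print '(a,i0,a)', 'Total memory: ', total_bytes/(1024_int64*1024_int64), ' MB'
end program stream_memory
```

The result, 1831 MB, matches the "total memory requirement" line in the benchmark output. 64-bit integers are used here so that the byte count does not overflow if n is increased further.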

OpenMP parallelization

Build with nfort and enable OpenMP.

$ nfort -fopenmp stream.f mysecond.f90

The number of OpenMP threads is controlled with the environment variable OMP_NUM_THREADS. I ran the benchmark with 1, 2, 4, and 8 threads.

$ export OMP_NUM_THREADS=1
$ ./a.out 
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 ----------------------------------------------
 STREAM Version $Revision: 5.6 $
 ----------------------------------------------
 Array size =   80000000
 Offset     =          0
 The total memory requirement is 1831 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------
 Number of Threads =  1
 ----------------------------------------------
 Printing one line per active thread....
 ----------------------------------------------------
 Your clock granularity appears to be less than one microsecond
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (GB/s)  Avg time   Min time  Max time
Copy:        320.4268      0.0040      0.0040      0.0040
Scale:       316.4575      0.0041      0.0040      0.0041
Add:         296.4091      0.0065      0.0065      0.0065
Triad:       331.0094      0.0058      0.0058      0.0058
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------

$ export OMP_NUM_THREADS=2
$ ./a.out 
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 ----------------------------------------------
 STREAM Version $Revision: 5.6 $
 ----------------------------------------------
 Array size =   80000000
 Offset     =          0
 The total memory requirement is 1831 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------
 Number of Threads =  2
 ----------------------------------------------
 Printing one line per active thread....
 Printing one line per active thread....
 ----------------------------------------------------
 Your clock granularity appears to be less than one microsecond
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (GB/s)  Avg time   Min time  Max time
Copy:        553.4130      0.0023      0.0023      0.0023
Scale:       553.4364      0.0023      0.0023      0.0023
Add:         577.8854      0.0033      0.0033      0.0033
Triad:       580.4489      0.0033      0.0033      0.0033
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------

$ export OMP_NUM_THREADS=4
$ ./a.out 
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 ----------------------------------------------
 STREAM Version $Revision: 5.6 $
 ----------------------------------------------
 Array size =   80000000
 Offset     =          0
 The total memory requirement is 1831 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------
 Number of Threads =  4
 ----------------------------------------------
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 ----------------------------------------------------
 Your clock granularity appears to be less than one microsecond
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (GB/s)  Avg time   Min time  Max time
Copy:        711.2026      0.0018      0.0018      0.0018
Scale:       712.0048      0.0018      0.0018      0.0018
Add:         726.4547      0.0026      0.0026      0.0026
Triad:       721.9842      0.0027      0.0027      0.0027
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------

$ export OMP_NUM_THREADS=8
$ ./a.out 
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 ----------------------------------------------
 STREAM Version $Revision: 5.6 $
 ----------------------------------------------
 Array size =   80000000
 Offset     =          0
 The total memory requirement is 1831 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------
 Number of Threads =  8
 ----------------------------------------------
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 ----------------------------------------------------
 Your clock granularity appears to be less than one microsecond
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (GB/s)  Avg time   Min time  Max time
Copy:        701.7189      0.0018      0.0018      0.0018
Scale:       698.7224      0.0018      0.0018      0.0018
Add:         713.1366      0.0027      0.0027      0.0027
Triad:       719.1415      0.0027      0.0027      0.0027
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------

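With 4 or more threads the transfer rate reaches roughly 700 GB/s (about 698 to 726 GB/s, depending on the kernel), which is close to the catalog value of 0.75 TB/s.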
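How close the results are to the catalog value can be made concrete by expressing the Triad rates as a fraction of 0.75 TB/s. The sketch below uses the Triad numbers from the four runs above, taking 0.75 TB/s as 750 GB/s; the program and variable names are illustrative.

```fortran
program peak_fraction
    use, intrinsic :: iso_fortran_env, only: real64
    implicit none
    real(real64), parameter :: peak = 750.0_real64   ! catalog value 0.75 TB/s, in GB/s
    integer, parameter :: threads(4) = [1, 2, 4, 8]
    ! Triad rates in GB/s, copied from the benchmark output above
    real(real64), parameter :: triad(4) = [331.0094_real64, 580.4489_real64, &
                                           721.9842_real64, 719.1415_real64]
    integer :: i

    do i = 1, size(threads)
        print '(i2,a,f9.4,a,f5.1,a)', threads(i), ' threads: ', triad(i), &
            ' GB/s = ', 100.0_real64*triad(i)/peak, ' % of catalog value'
    end do
end program peak_fraction
```

This gives roughly 44 %, 77 %, 96 %, and 96 % for 1, 2, 4, and 8 threads, so the bandwidth is already saturated with 4 threads.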

Automatic parallelization

nfort has an option that performs automatic parallelization without using OpenMP.

$ nfort -mparallel  stream.f mysecond.f90

In the output, the number of threads is not shown because OpenMP is not enabled, but the results are similar to those of the 8-thread OpenMP run.

$ ./a.out 
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 ----------------------------------------------
 STREAM Version $Revision: 5.6 $
 ----------------------------------------------
 Array size =   80000000
 Offset     =          1
 The total memory requirement is 1831 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------
 ----------------------------------------------
 Printing one line per active thread....
 ----------------------------------------------------
 Your clock granularity appears to be less than one microsecond
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (GB/s)  Avg time   Min time  Max time
Copy:        697.7239      0.0018      0.0018      0.0018
Scale:       697.5061      0.0018      0.0018      0.0018
Add:         721.0836      0.0027      0.0027      0.0027
Triad:       719.9239      0.0027      0.0027      0.0027
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
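The difference between the two builds can be pictured with a single Triad-style loop. In the sketch below, the OpenMP directive is only honored when OpenMP is enabled at compile time; otherwise the directive lines are ordinary comments, and whether the loop runs in parallel is left to the compiler's automatic parallelization. That nfort parallelizes this particular loop is an assumption rather than something verified here, and the program name and array size are illustrative.

```fortran
program triad_loop
    use, intrinsic :: iso_fortran_env, only: real64
    implicit none
    integer, parameter :: n = 1000000       ! illustrative size, not the benchmark size
    real(real64), parameter :: scalar = 3.0_real64
    real(real64), allocatable :: a(:), b(:), c(:)
    integer :: j

    allocate(a(n), b(n), c(n))
    b = 2.0_real64
    c = 1.0_real64

    ! With OpenMP enabled, the directive splits the iterations across threads.
    ! Without it, these two directive lines are plain comments and the loop
    ! is an ordinary do loop with independent iterations, which is the kind
    ! of loop an automatic parallelizer can distribute by itself.
!$omp parallel do
    do j = 1, n
        a(j) = b(j) + scalar*c(j)
    end do
!$omp end parallel do

    ! 2 + 3*1 = 5 for every element
    print '(a,f6.2)', 'a(n) = ', a(n)
end program triad_loop
```

The result is the same either way; only the mechanism that distributes the iterations differs.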

Summary

I ran the stream benchmark on SX-Aurora TSUBASA. The results are close to the catalog value, so memory-bandwidth-bound applications can be expected to perform well.
