Winter is here! Get started with GPU machine learning using an RX470 8GB mining edition and ROCm TensorFlow (CIFAR10 7,600 examples/sec @ 88W)

Posted on 2019-01-09

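Winter is here!

Get started with GPU machine learning using an RX470 and ROCm TensorFlow!

A used RX470 8GB mining edition costs only a little over 6,500 yen including tax (as of January 10, 2019), so it is an easy way to try this out!
For all you brilliant TensorFlow elementary schoolers, that is within reach of your New Year's gift money.

The RX470 mining edition has 8 GB of memory, which is good for getting started with machine learning, but it has no display output. Pair it with an Intel CPU that has integrated graphics, or set up Linux with a separate GPU for display output.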

Configuration

Installing ROCm TensorFlow

Good luck with the installation!

Checking that the GPU is recognized

rocminfo tells you.

$ /opt/rocm/bin/rocminfo

=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 2700X Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0                                  
  Queue Min Size:          0                                  
  Queue Max Size:          0                                  
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768KB                            
  Chip ID:                 0                                  
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):3700                               
  BDFID:                   0                                  
  Compute Unit:            16                                 
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16418944KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16418944KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx900                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128                                
  Queue Min Size:          4096                               
  Queue Max Size:          131072                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16KB                               
  Chip ID:                 26751                              
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):1590                               
  BDFID:                   10240                              
  Compute Unit:            56                                 
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64                                 
  Workgroup Max Size:      1024                               
  Workgroup Max Size Per Dimension:
    Dim[0]:                  67109888                           
    Dim[1]:                  671089664                          
    Dim[2]:                  0                                  
  Grid Max Size:           4294967295                         
  Waves Per CU:            40                                 
  Max Work-item Per CU:    2560                               
  Grid Max Size per Dimension:
    Dim[0]:                  4294967295                         
    Dim[1]:                  4294967295                         
    Dim[2]:                  4294967295                         
  Max number Of fbarriers Per Workgroup:32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224KB                          
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64KB                               
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx900          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Dimension: 
        Dim[0]:                  67109888                           
        Dim[1]:                  1024                               
        Dim[2]:                  16777217                           
      Workgroup Max Size:      1024                               
      Grid Max Dimension:      
        x                        4294967295                         
        y                        4294967295                         
        z                        4294967295                         
      Grid Max Size:           4294967295                         
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx803                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128                                
  Queue Min Size:          4096                               
  Queue Max Size:          131072                             
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16KB                               
  Chip ID:                 26591                              
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):1130                               
  BDFID:                   10496                              
  Compute Unit:            32                                 
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64                                 
  Workgroup Max Size:      1024                               
  Workgroup Max Size Per Dimension:
    Dim[0]:                  67109888                           
    Dim[1]:                  687866880                          
    Dim[2]:                  0                                  
  Grid Max Size:           4294967295                         
  Waves Per CU:            40                                 
  Max Work-item Per CU:    2560                               
  Grid Max Size per Dimension:
    Dim[0]:                  4294967295                         
    Dim[1]:                  4294967295                         
    Dim[2]:                  4294967295                         
  Max number Of fbarriers Per Workgroup:32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8388608KB                          
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64KB                               
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Dimension: 
        Dim[0]:                  67109888                           
        Dim[1]:                  1024                               
        Dim[2]:                  16777217                           
      Workgroup Max Size:      1024                               
      Grid Max Dimension:      
        x                        4294967295                         
        y                        4294967295                         
        z                        4294967295                         
      Grid Max Size:           4294967295                         
      FBarrier Max Size:       32                                 
*** Done ***  

Even the mining edition is recognized as a regular RX470 (gfx803).
(However, compared with a regular RX470, the compute unit count seems to be reduced from 36 to 32, which triggers the MIOpen warning described later.)
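The agent list is long, so it can help to boil it down to the fields that matter here. The following is a minimal sketch, assuming the `rocminfo` text layout shown above (`Agent N` headers followed by `Name:`, `Device Type:` and `Compute Unit:` lines); the function name `parse_agents` and the shortened `SAMPLE` text are made up for illustration.

```python
import re

def parse_agents(text):
    """Return one dict per 'Agent N' block found in rocminfo-style output."""
    agents = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"Agent \d+", stripped):
            current = {}
            agents.append(current)
            continue
        if current is None or ":" not in stripped:
            continue
        key, _, value = stripped.partition(":")
        key, value = key.strip(), value.strip()
        # Keep only the first occurrence: 'Name' appears again under 'ISA Info'.
        if key in ("Name", "Device Type", "Compute Unit") and key not in current:
            current[key] = value
    return agents

# Shortened excerpt of the output above (assumption: same field names).
SAMPLE = """\
*******
Agent 2
*******
  Name:                    gfx900
  Device Type:             GPU
  Compute Unit:            56
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx900
*******
Agent 3
*******
  Name:                    gfx803
  Device Type:             GPU
  Compute Unit:            32
"""

agents = parse_agents(SAMPLE)
for agent in agents:
    print(agent["Name"], agent["Device Type"], agent["Compute Unit"])
```

To use it on a real machine, the text could come from running `/opt/rocm/bin/rocminfo` and capturing its standard output instead of `SAMPLE`.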

rocm-smi shows the state of the GPUs.

$ sudo /opt/rocm/bin/rocm-smi

========================        ROCm System Management Interface        ========================
================================================================================================
GPU   Temp   AvgPwr   SCLK    MCLK    PCLK           Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
0     68c    11.0W    991Mhz  700Mhz  8.0GT/s, x16   8.63%   manual  150W     0%        0%       0%       
1     41c    10.162W  751Mhz  300Mhz  8.0GT/s, x16   36.86%  manual  115W     0%        0%       0%       
================================================================================================
========================               End of ROCm SMI Log              ========================

The -u option shows GPU usage.

$ sudo /opt/rocm/bin/rocm-smi -u

========================        ROCm System Management Interface        ========================
================================================================================================
GPU[0] 		: Current GPU use: 0%
GPU[1] 		: Current GPU use: 0%
================================================================================================
========================               End of ROCm SMI Log              ========================

Let's look at the list of supported clocks.

$ /opt/rocm/bin/rocm-smi -s

GPU[1] 		: Supported GPU clock frequencies on GPU1
GPU[1] 		: 0: 300Mhz 
GPU[1] 		: 1: 466Mhz 
GPU[1] 		: 2: 751Mhz 
GPU[1] 		: 3: 1019Mhz 
GPU[1] 		: 4: 1074Mhz 
GPU[1] 		: 5: 1126Mhz *
GPU[1] 		: 6: 1129Mhz 
GPU[1] 		: 7: 1130Mhz 
GPU[1] 		: 
GPU[1] 		: Supported GPU Memory clock frequencies on GPU1
GPU[1] 		: 0: 300Mhz 
GPU[1] 		: 1: 2000Mhz *
GPU[1] 		: 
GPU[1] 		: Supported PCIE clock frequencies on GPU1
GPU[1] 		: 0: 2.5GT/s, x8 
GPU[1] 		: 1: 8.0GT/s, x16 *
GPU[1] 		: 
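The level numbers in this list are what `--setsclk` refers to later, and the `*` marks the level currently in use. As a rough sketch, the listing can be parsed like this, assuming the line format shown above; `parse_clock_levels` and the shortened `SAMPLE` are hypothetical names and data for illustration.

```python
import re

# Section header, e.g. "Supported GPU clock frequencies on GPU1".
HEADER = re.compile(r"Supported (.+?) frequencies on GPU")
# Level line, e.g. "GPU[1] : 5: 1126Mhz *" (the '*' marks the active level).
LEVEL = re.compile(r"GPU\[\d+\]\s*:\s*(\d+):\s*(.*?)\s*(\*?)\s*$")

def parse_clock_levels(text):
    """Map each clock domain to a list of (level, frequency, is_current)."""
    domains = {}
    current = None
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            current = domains.setdefault(header.group(1), [])
            continue
        level = LEVEL.match(line)
        if level and current is not None:
            current.append(
                (int(level.group(1)), level.group(2), level.group(3) == "*")
            )
    return domains

# Shortened excerpt of the listing above.
SAMPLE = """\
GPU[1] : Supported GPU clock frequencies on GPU1
GPU[1] : 2: 751Mhz
GPU[1] : 5: 1126Mhz *
GPU[1] : 7: 1130Mhz
GPU[1] :
GPU[1] : Supported GPU Memory clock frequencies on GPU1
GPU[1] : 0: 300Mhz
GPU[1] : 1: 2000Mhz *
"""

levels = parse_clock_levels(SAMPLE)
active = [lvl for lvl, _, is_current in levels["GPU clock"] if is_current]
print("active sclk level:", active)
```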

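Running CIFAR10

Let's run CIFAR10, a benchmark commonly used for image recognition.

This machine also has a VEGA installed, so the run is restricted to the RX470 (Ellesmere) only.

The CUDA_VISIBLE_DEVICES environment variable, which selects the GPU to run on, also works with ROCm TensorFlow.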
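One way to apply this from a launcher script is to set the variable only for the child process that runs the training. This is a minimal sketch: the helper `env_for_gpu` is hypothetical, and the device index 1 is an assumption based on the RX470 being listed as GPU 1 by rocm-smi, which may not match the index the framework uses on every system.

```python
import os
import subprocess
import sys

def env_for_gpu(index, base_env=None):
    """Copy the environment and expose only GPU `index` to the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return env

# Assumption: the RX470 is device 1 (it is GPU 1 in the rocm-smi output).
env = env_for_gpu(1)

# Stand-in child process that only echoes the variable; replace the command
# with the actual training script invocation.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
print("child sees CUDA_VISIBLE_DEVICES =", result.stdout.strip())
```

Setting the variable in the shell before starting the training script has the same effect; the point is that it must be in place before the framework initializes the GPUs.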

When training starts, MIOpen prints a warning about an unreadable file:

MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx803_32.cd.pdb.txt

Training still runs despite the warning (a regular RX470 apparently looks for gfx803_36.cd.pdb.txt instead).

With --setsclk 2

...
2019-01-10 00:58:18.344006: step 68340, loss = 0.74 (5744.6 examples/sec; 0.022 sec/batch)
2019-01-10 00:58:18.566252: step 68350, loss = 0.84 (5759.6 examples/sec; 0.022 sec/batch)
2019-01-10 00:58:18.788830: step 68360, loss = 0.83 (5750.5 examples/sec; 0.022 sec/batch)
2019-01-10 00:58:19.010505: step 68370, loss = 0.73 (5774.4 examples/sec; 0.022 sec/batch)
...
1     52c    65.185W  751Mhz  2000Mhz 8.0GT/s, x16   36.86%  manual  115W     0%        0%       100%

With --setsclk 2, throughput was about 5,700 examples/sec at 65 W.
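The rounded figure can be reproduced by averaging the `examples/sec` values in the log. A minimal sketch, assuming the log line format shown above; the regular expression and the helper name `throughputs` are illustrative, not part of the CIFAR10 script.

```python
import re
from statistics import mean

# The four log lines quoted above for --setsclk 2.
LOG = """\
2019-01-10 00:58:18.344006: step 68340, loss = 0.74 (5744.6 examples/sec; 0.022 sec/batch)
2019-01-10 00:58:18.566252: step 68350, loss = 0.84 (5759.6 examples/sec; 0.022 sec/batch)
2019-01-10 00:58:18.788830: step 68360, loss = 0.83 (5750.5 examples/sec; 0.022 sec/batch)
2019-01-10 00:58:19.010505: step 68370, loss = 0.73 (5774.4 examples/sec; 0.022 sec/batch)
"""

PATTERN = re.compile(r"\(([\d.]+) examples/sec; ([\d.]+) sec/batch\)")

def throughputs(log_text):
    """Extract every examples/sec value from the training log text."""
    return [float(match.group(1)) for match in PATTERN.finditer(log_text)]

rates = throughputs(LOG)
print(f"{len(rates)} samples, mean {mean(rates):.1f} examples/sec")
```

With only four samples this is a spot check rather than a benchmark; a longer log window would smooth out outliers such as the occasional slow step.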

With --setsclk 5

2019-01-10 00:56:07.237249: step 61170, loss = 0.81 (7683.0 examples/sec; 0.017 sec/batch)
2019-01-10 00:56:07.404545: step 61180, loss = 0.73 (7651.1 examples/sec; 0.017 sec/batch)
2019-01-10 00:56:07.570738: step 61190, loss = 0.76 (7701.9 examples/sec; 0.017 sec/batch)
2019-01-10 00:56:07.817193: step 61200, loss = 0.94 (5193.6 examples/sec; 0.025 sec/batch)
2019-01-10 00:56:07.985480: step 61210, loss = 0.72 (7606.3 examples/sec; 0.017 sec/batch)
1     58c    88.184W  1126Mhz 2000Mhz 8.0GT/s, x16   36.86%  manual  115W     0%        0%       100%  

With --setsclk 5, throughput was about 7,600 examples/sec at 88 W.

That seems roughly on par with an NVIDIA GTX 1070.
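To compare the two clock settings, it is useful to look at throughput per watt rather than raw throughput. The sketch below uses only the rounded figures quoted in this section (65 W and 5,700 examples/sec, 88 W and 7,600 examples/sec); the helper name `examples_per_joule` is made up for illustration.

```python
# (sclk setting, AvgPwr in W from rocm-smi, throughput in examples/sec),
# using the rounded figures quoted in the text.
runs = [
    ("--setsclk 2", 65.0, 5700.0),
    ("--setsclk 5", 88.0, 7600.0),
]

def examples_per_joule(power_w, examples_per_sec):
    """Throughput per watt; numerically the same as examples per joule."""
    return examples_per_sec / power_w

for label, power_w, eps in runs:
    print(f"{label}: {examples_per_joule(power_w, eps):.1f} examples/sec per W")

speedup = runs[1][2] / runs[0][2]
power_ratio = runs[1][1] / runs[0][1]
print(f"speedup {speedup:.2f}x for {power_ratio:.2f}x the power")
```

On these numbers the two settings come out at nearly the same efficiency (about 88 versus 86 examples/sec per watt), so the higher clock buys roughly a third more throughput for roughly a third more power.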

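Running PRNet

This was inference only, but I confirmed that PRNet, which reconstructs a 3D face shape from an image, runs.

Running waveglow-tensorflow

Let's run waveglow-tensorflow, which uses WaveGlow to generate nice-sounding speech from text!

I confirmed that it runs with the fix applied; the alternative is to wait for MIOpen 1.7.1.

Also, the original settings in hparms.py need more than 8 GB of memory.
Lowering wavnet_channels and wavenet_layers to, for example, 256 and 7 makes training possible.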

Trying out mining

When the GPU is not busy with machine learning, let it mine. Note that hash rates under ROCm are not great.

Let's mine ETH with ethminer.

 m 01:12:18 ethminer 0:04 A3 54.52 Mh { cl0 34.91 | cl1 19.62 }

With --setsclk 2, the RX470 reached 19.5 Mh @ 80W.
(For reference, the VEGA (cl0) did 34 Mh @ 100W.)

As of January 10, 2019, 19.6 Mh of ETH mining earns about 0.0018 ETH/day ($0.28/day).
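Whether this is worth it depends on electricity cost, which the figures above do not include. The following back-of-the-envelope sketch uses the quoted revenue and the 80 W power draw; the electricity price `usd_per_kwh` is a purely hypothetical value and should be replaced with your own tariff.

```python
# Figures quoted in the text (January 10, 2019).
eth_per_day = 0.0018      # ETH mined per day at about 19.6 Mh
usd_per_day = 0.28        # the same revenue in USD
board_power_w = 80.0      # RX470 power draw while mining

# Hypothetical electricity price (assumption, not from the article).
usd_per_kwh = 0.20

implied_eth_price = usd_per_day / eth_per_day
energy_kwh_per_day = board_power_w * 24 / 1000
power_cost_per_day = energy_kwh_per_day * usd_per_kwh
net_per_day = usd_per_day - power_cost_per_day

print(f"implied ETH price: ${implied_eth_price:.2f}")
print(f"energy used:       {energy_kwh_per_day:.2f} kWh/day")
print(f"power cost:        ${power_cost_per_day:.2f}/day")
print(f"net:               ${net_per_day:.2f}/day")
```

At 80 W the card uses 1.92 kWh per day, so under this assumed price the electricity costs more than the $0.28/day the mining brings in.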

TODO

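  • Run rocm-smi without sudo privileges (adding an entry to /etc/sudoers is one option, but...)
  • Set out on a journey to establish a scheme by which brilliant TensorFlow youngsters can ascend, faster than anyone in human history, into brilliant ROCm + TensorFlow youngsters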