Training Time Evaluation, Part 3 (NVIDIA Tesla V100) [Deep Learning with TensorFlow 18]

(The table of contents is here.)


Introduction

As of 2017, these are the AWS instance types that look suitable for deep learning (prices are for us-east-1):

| Instance | GPU Model | GPU Architecture | GPUs | GPU Memory (GB) | vCPUs | Main Memory (GiB) | $/hour |
|---|---|---|---|---|---|---|---|
| p2.xlarge | Tesla K80 | Kepler | 1 | 12 | 4 | 61 | $0.90 |
| p2.8xlarge | Tesla K80 | Kepler | 8 | 96 | 32 | 488 | $7.20 |
| p2.16xlarge | Tesla K80 | Kepler | 16 | 192 | 64 | 732 | $14.40 |
| p3.2xlarge | Tesla V100 | Volta | 1 | 16 | 8 | 61 | $3.06 |
| p3.8xlarge | Tesla V100 | Volta | 4 | 64 | 32 | 244 | $12.24 |
| p3.16xlarge | Tesla V100 | Volta | 8 | 128 | 64 | 488 | $24.48 |
| g3.4xlarge | Tesla M60 | Maxwell | 1 | 8 | 16 | 122 | $1.14 |
| g3.8xlarge | Tesla M60 | Maxwell | 2 | 16 | 32 | 244 | $2.28 |
| g3.16xlarge | Tesla M60 | Maxwell | 4 | 32 | 64 | 488 | $4.56 |

Of these, I tried the following:


  • p2.xlarge

  • p2.8xlarge

  • p3.2xlarge

  • p3.8xlarge

  • g3.4xlarge

This post records the results.

I'm aware of the excellent post AWS GPU インスタンスにおける ChainerMN の分散効率に対する評価 (an evaluation of ChainerMN's distributed scaling efficiency on AWS GPU instances), but I wanted to benchmark TensorFlow + Inception-v3, which is what I use day to day.


Environment


  • Ubuntu 16.04

  • Python 3.5.2

  • TensorFlow 1.4.0



  • CUDA 9.0

  • cuDNN 7.0.3
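
Before benchmarking, it's worth confirming that TensorFlow actually sees the GPUs on this stack. A minimal sanity check, assuming the TensorFlow 1.x `device_lib` API:

```python
# Minimal GPU-visibility check for TensorFlow 1.4 / CUDA 9.0 / cuDNN 7.0.3.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)  # expect 1.4.0

# Each K80 / M60 / V100 exposed to the instance appears as /device:GPU:N.
gpus = [d for d in device_lib.list_local_devices() if d.device_type == 'GPU']
for gpu in gpus:
    print(gpu.name, gpu.physical_device_desc)
print('GPUs visible: %d' % len(gpus))
```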


Benchmark setup


Data


  • Several hundred thousand images I have on hand


Parameters


  • Batch size: 32 images/GPU (so the effective batch scales with GPU count; see the sketch below)
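
Since the batch size is fixed per GPU, the effective (global) batch per training step scales with the number of GPUs. A quick illustration using the GPU counts from the table above (plain arithmetic, not benchmark code):

```python
# Effective batch per training step = per-GPU batch x number of GPUs.
PER_GPU_BATCH = 32
GPU_COUNTS = {'p2.xlarge': 1, 'p2.8xlarge': 8,
              'p3.2xlarge': 1, 'p3.8xlarge': 4,
              'g3.4xlarge': 1}

for instance, gpus in sorted(GPU_COUNTS.items()):
    print('%-11s global batch = %3d images/step'
          % (instance, PER_GPU_BATCH * gpus))
```

One step on p2.8xlarge therefore consumes 256 images versus 128 on p3.8xlarge, which is why examples/sec, not sec/batch, is the number to compare across instances.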


Summary

| Instance | GPUs | $/hour | images/sec (train) | images/sec (prediction) |
|---|---|---|---|---|
| p2.xlarge | 1 | $0.90 | 23 | 85 |
| p2.8xlarge | 8 | $7.20 | 158 | - |
| p3.2xlarge | 1 | $3.06 | 120 | 460 |
| p3.8xlarge | 4 | $12.24 | 300 | - |
| g3.4xlarge | 1 | $1.14 | 30 | 114 |


  • Comparing single-GPU instances:

    • p3 is more than 5x faster than p2, so the 3.4x price difference is reasonable.

    • g3 is also faster than p2; this may be down not only to the GPU architecture but also to the vCPU count (augmentation and other preprocessing run on the CPU).

      • Note, however, that g3's GPU memory is only 8 GB.

  • Comparing multi-GPU instances:

    • Compared to p2.8xlarge, p3.8xlarge processes roughly twice as many images with half as many GPUs.

      • p3 was not stable, occasionally exceeding 350 images/sec, so with a longer run the gap might widen further.

    • In terms of throughput per unit time, p3 is also the more reasonably priced option here (see the sketch after this list).

  • p3's GPUs have 16 GB of memory each, which is another advantage.
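
To make the price/performance argument concrete, here is the training throughput per dollar implied by the summary table (a rough back-of-the-envelope from the numbers above, ignoring instance startup and other overhead):

```python
# Training images processed per dollar, from the summary table above.
RESULTS = {  # instance: ($/hour, train images/sec)
    'p2.xlarge':  (0.90,   23),
    'p2.8xlarge': (7.20,  158),
    'p3.2xlarge': (3.06,  120),
    'p3.8xlarge': (12.24, 300),
    'g3.4xlarge': (1.14,   30),
}

for instance, (price, ips) in sorted(RESULTS.items()):
    print('%-11s %7.0f train images per dollar'
          % (instance, ips * 3600 / price))
# p3.2xlarge: ~141k images/$ vs p2.xlarge:  ~92k images/$
# p3.8xlarge:  ~88k images/$ vs p2.8xlarge: ~79k images/$
```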

In short: you should switch to p3 instances.

The p2 instances have served me really well since their release a year ago, and I'm very grateful for them.

Without them, the alternative would have been buying my own hardware, and even once it was a generation behind I would probably have kept using it for a while, because replacing it would have felt wasteful.


Below are the raw timing logs.
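
As a reading guide: the examples/sec reported in the train logs is simply the global batch size divided by sec/batch, and the prediction logs behave the same way with a batch size of 32. A quick consistency check against the step-50 lines below (simple arithmetic, not part of the benchmark):

```python
# examples/sec ~= (per-GPU batch x GPUs) / sec_per_batch
print(32 * 1 / 1.368)  # p2.xlarge   train step 50 -> ~23.4 examples/sec
print(32 * 8 / 1.621)  # p2.8xlarge  train step 50 -> ~157.9 examples/sec
print(32 * 4 / 0.519)  # p3.8xlarge  train step 50 -> ~246.6 examples/sec
print(32 / 0.373)      # p2.xlarge   prediction    -> ~85.8 examples/sec
```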


p2.xlarge


Train

...

2017-11-12 11:58:26.154666: step 50, loss = 10.27 (23.4 examples/sec; 1.368 sec/batch)
2017-11-12 11:58:39.707417: step 60, loss = 10.46 (24.2 examples/sec; 1.322 sec/batch)
2017-11-12 11:58:53.614182: step 70, loss = 10.49 (24.2 examples/sec; 1.321 sec/batch)
2017-11-12 11:59:07.331854: step 80, loss = 10.52 (22.7 examples/sec; 1.407 sec/batch)
2017-11-12 11:59:20.699837: step 90, loss = 10.51 (23.6 examples/sec; 1.354 sec/batch)
2017-11-12 11:59:34.194514: step 100, loss = 10.43 (24.0 examples/sec; 1.334 sec/batch)
...


Prediction

...

2017-11-12 12:01:29.300842: [20 batches out of 361] (65.4 examples/sec; 0.489sec/batch)
2017-11-12 12:01:36.742039: [40 batches out of 361] (86.0 examples/sec; 0.372sec/batch)
2017-11-12 12:01:44.206848: [60 batches out of 361] (85.7 examples/sec; 0.373sec/batch)
2017-11-12 12:01:51.658785: [80 batches out of 361] (85.9 examples/sec; 0.373sec/batch)
2017-11-12 12:01:59.132704: [100 batches out of 361] (85.6 examples/sec; 0.374sec/batch)
...


p2.8xlarge


Train

...

2017-11-12 12:10:01.230087: step 50, loss = 9.85 (158.0 examples/sec; 1.621 sec/batch)
2017-11-12 12:10:17.320288: step 60, loss = 10.15 (158.9 examples/sec; 1.612 sec/batch)
2017-11-12 12:10:33.440182: step 70, loss = 10.05 (158.5 examples/sec; 1.615 sec/batch)
2017-11-12 12:10:49.687708: step 80, loss = 10.08 (158.4 examples/sec; 1.616 sec/batch)
2017-11-12 12:11:05.788696: step 90, loss = 9.97 (153.6 examples/sec; 1.667 sec/batch)
2017-11-12 12:11:22.081270: step 100, loss = 10.02 (156.5 examples/sec; 1.636 sec/batch)
...


p3.2xlarge


Train

...

2017-11-12 12:04:15.472686: step 50, loss = 10.44 (120.1 examples/sec; 0.267 sec/batch)
2017-11-12 12:04:18.148277: step 60, loss = 10.87 (119.2 examples/sec; 0.268 sec/batch)
2017-11-12 12:04:20.847246: step 70, loss = 10.57 (115.9 examples/sec; 0.276 sec/batch)
2017-11-12 12:04:23.542208: step 80, loss = 10.45 (121.5 examples/sec; 0.263 sec/batch)
2017-11-12 12:04:26.242949: step 90, loss = 10.57 (108.9 examples/sec; 0.294 sec/batch)
2017-11-12 12:04:28.954628: step 100, loss = 10.28 (121.2 examples/sec; 0.264 sec/batch)
...


Prediction

...

2017-11-12 12:06:46.337320: [20 batches out of 361] (203.4 examples/sec; 0.157sec/batch)
2017-11-12 12:06:47.764470: [40 batches out of 361] (448.5 examples/sec; 0.071sec/batch)
2017-11-12 12:06:49.163141: [60 batches out of 361] (457.6 examples/sec; 0.070sec/batch)
2017-11-12 12:06:50.522376: [80 batches out of 361] (470.9 examples/sec; 0.068sec/batch)
2017-11-12 12:06:51.864781: [100 batches out of 361] (476.8 examples/sec; 0.067sec/batch)
...


p3.8xlarge


Train

...

2017-11-12 12:57:38.765977: step 50, loss = 10.26 (246.8 examples/sec; 0.519 sec/batch)
2017-11-12 12:57:43.982571: step 60, loss = 10.42 (239.2 examples/sec; 0.535 sec/batch)
2017-11-12 12:57:48.856624: step 70, loss = 10.42 (297.2 examples/sec; 0.431 sec/batch)
2017-11-12 12:57:53.047382: step 80, loss = 10.29 (262.7 examples/sec; 0.487 sec/batch)
2017-11-12 12:57:57.801625: step 90, loss = 10.28 (259.2 examples/sec; 0.494 sec/batch)
2017-11-12 12:58:02.835147: step 100, loss = 10.28 (317.7 examples/sec; 0.403 sec/batch)
2017-11-12 12:58:08.852637: step 110, loss = 10.23 (307.0 examples/sec; 0.417 sec/batch)
2017-11-12 12:58:13.527606: step 120, loss = 10.31 (235.5 examples/sec; 0.544 sec/batch)
2017-11-12 12:58:18.482038: step 130, loss = 10.32 (279.6 examples/sec; 0.458 sec/batch)
2017-11-12 12:58:22.696454: step 140, loss = 10.56 (307.3 examples/sec; 0.416 sec/batch)
...


g3.4xlarge


Train

...

2017-11-12 12:21:40.593635: step 50, loss = 10.46 (30.8 examples/sec; 1.039 sec/batch)
2017-11-12 12:21:50.991231: step 60, loss = 10.53 (30.7 examples/sec; 1.042 sec/batch)
2017-11-12 12:22:01.384176: step 70, loss = 10.39 (30.8 examples/sec; 1.039 sec/batch)
2017-11-12 12:22:11.773941: step 80, loss = 10.73 (30.9 examples/sec; 1.037 sec/batch)
2017-11-12 12:22:22.164275: step 90, loss = 10.37 (30.8 examples/sec; 1.039 sec/batch)
2017-11-12 12:22:32.557663: step 100, loss = 10.47 (30.8 examples/sec; 1.040 sec/batch)
...


Prediction

...

2017-11-12 12:25:36.470306: [20 batches out of 361] (85.4 examples/sec; 0.375sec/batch)
2017-11-12 12:25:42.063093: [40 batches out of 361] (114.4 examples/sec; 0.280sec/batch)
2017-11-12 12:25:47.645907: [60 batches out of 361] (114.6 examples/sec; 0.279sec/batch)
2017-11-12 12:25:53.237244: [80 batches out of 361] (114.5 examples/sec; 0.280sec/batch)
2017-11-12 12:25:58.831566: [100 batches out of 361] (114.4 examples/sec; 0.280sec/batch)
...