Training Time Evaluation, Part 3 (NVIDIA Tesla V100) [Deep Learning with TensorFlow 18]

Last updated at 2017-11-16, posted at 2017-11-16

Introduction

As of 2017, these are the AWS instance types that look usable for deep learning. (Prices are for us-east-1.)

Instance type	GPU model	GPU architecture	GPUs	GPU memory (GB)	vCPUs	Main memory (GiB)	$/hour
p2.xlarge	Tesla K80	Kepler	1	12	4	61	$0.90
p2.8xlarge	Tesla K80	Kepler	8	96	32	488	$7.20
p2.16xlarge	Tesla K80	Kepler	16	192	64	732	$14.40
p3.2xlarge	Tesla V100	Volta	1	16	8	61	$3.06
p3.8xlarge	Tesla V100	Volta	4	64	32	244	$12.24
p3.16xlarge	Tesla V100	Volta	8	128	64	488	$24.48
g3.4xlarge	Tesla M60	Maxwell	1	8	16	122	$1.14
g3.8xlarge	Tesla M60	Maxwell	2	16	32	244	$2.28
g3.16xlarge	Tesla M60	Maxwell	4	32	64	488	$4.56

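Of these, I tried the following:

p2.xlarge
p2.8xlarge
p3.2xlarge
p3.8xlarge
g3.4xlarge

This post is a record of the results.

I am aware of the excellent post "AWS GPU インスタンスにおける ChainerMN の分散効率に対する評価" (an evaluation of ChainerMN's distributed efficiency on AWS GPU instances), but I wanted to try TensorFlow + Inception-v3, which is what I use regularly.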

Environment

Ubuntu 16.04
Python 3.5.2
TensorFlow 1.4.0
- the build from the previous article
CUDA 9.0
cuDNN 7.0.3

Target

Inception-v3
https://github.com/tensorflow/models

Data

Several hundred thousand images I have on hand

Parameters

Batch size: 32 images/GPU
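The training logs further down report both examples/sec and sec/batch. As a rough sanity check, the two appear to be linked by the global batch size (32 images per GPU times the number of GPUs). This relation is inferred from the logged numbers, not taken from the training script, and the function name is hypothetical.

```python
def examples_per_sec(batch_per_gpu, num_gpus, sec_per_batch):
    """Throughput implied by one step that processes batch_per_gpu images on every GPU."""
    return batch_per_gpu * num_gpus / sec_per_batch


# sec/batch values copied from the first logged step of each instance in this post
print(round(examples_per_sec(32, 1, 1.368), 1))  # p2.xlarge, logged as 23.4 examples/sec
print(round(examples_per_sec(32, 8, 1.621), 1))  # p2.8xlarge, logged as 158.0 examples/sec
print(round(examples_per_sec(32, 4, 0.519), 1))  # p3.8xlarge, logged as 246.8 examples/sec
```

The computed values land within rounding distance of the logged ones, which suggests that sec/batch on the multi-GPU instances refers to one step across all GPUs.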

Summary

Instance type	GPUs	$/hour	images/sec (train)	images/sec (prediction)
p2.xlarge	1	$0.90	23	85
p2.8xlarge	8	$7.20	158	-
p3.2xlarge	1	$3.06	120	460
p3.8xlarge	4	$12.24	300	-
g3.4xlarge	1	$1.14	30	114
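One way to read the table is to convert each row into images trained per dollar. The sketch below uses only the throughput and price figures from the table; the dictionary and function names are made up for illustration, and the prices are the 2017 us-east-1 on-demand prices quoted in this post.

```python
# name: (images/sec during training, $/hour), copied from the summary table
TRAIN = {
    "p2.xlarge": (23, 0.90),
    "p2.8xlarge": (158, 7.20),
    "p3.2xlarge": (120, 3.06),
    "p3.8xlarge": (300, 12.24),
    "g3.4xlarge": (30, 1.14),
}


def images_per_dollar(images_per_sec, dollars_per_hour):
    """Number of images processed for one dollar of instance time."""
    return images_per_sec * 3600 / dollars_per_hour


for name, (ips, price) in TRAIN.items():
    print("{:12s} {:10,.0f} images/$".format(name, images_per_dollar(ips, price)))
```

The same function works for the prediction column by swapping in those throughput values.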

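Comparing the single-GPU instances:
- p3 is more than 5x faster than p2, so the 3.4x price difference is reasonable
- g3 is also faster than p2; this may be related not only to the GPU architecture but also to the number of CPUs (augmentation and the other preprocessing steps run on the CPU)
  - Note, however, that the g3 GPU has only 8 GB of memory
Comparing the multi-GPU instances:
- p3.8xlarge has half as many GPUs as p2.8xlarge but processes about twice as many images per second
  - p3 throughput was unstable, occasionally exceeding 350 images/sec, so the gap might widen with a longer measurement
- In terms of throughput per unit of cost, p3 is again the more reasonably priced option
p3 also has the advantage of 16 GB of GPU memory per GPU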
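The ratios quoted in the comparison can be reproduced from the summary table. The sketch below also computes a simple multi-GPU scaling efficiency (multi-GPU throughput divided by single-GPU throughput times the GPU count); that metric is an addition for illustration and is not something the measurements above were designed around.

```python
def scaling_efficiency(multi_ips, single_ips, num_gpus):
    """Fraction of ideal linear scaling achieved by the multi-GPU instance."""
    return multi_ips / (single_ips * num_gpus)


# Single GPU: p3.2xlarge vs p2.xlarge (training images/sec and $/hour from the table)
speedup_single = 120 / 23
price_ratio_single = 3.06 / 0.90

# Multi GPU: p3.8xlarge (4 GPUs) vs p2.8xlarge (8 GPUs)
speedup_multi = 300 / 158
price_ratio_multi = 12.24 / 7.20

print("single GPU: {:.1f}x faster at {:.1f}x the price".format(speedup_single, price_ratio_single))
print("multi GPU:  {:.1f}x faster at {:.1f}x the price".format(speedup_multi, price_ratio_multi))
print("p2.8xlarge scaling efficiency: {:.2f}".format(scaling_efficiency(158, 23, 8)))
print("p3.8xlarge scaling efficiency: {:.2f}".format(scaling_efficiency(300, 120, 4)))
```

In both comparisons the speedup exceeds the price ratio, which is what makes p3 the cheaper option per image.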

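In short, it is time to switch to p3 instances.
The p2 instances have served me very well since their release a year ago, and I am extremely grateful for them.
Without them I would have had to buy hardware, and even after it fell a generation behind I would probably have kept using the old hardware for a while because replacing it would have felt wasteful.

The processing time logs follow.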

p2.xlarge

Train

...
2017-11-12 11:58:26.154666: step 50, loss = 10.27 (23.4 examples/sec; 1.368 sec/batch)
2017-11-12 11:58:39.707417: step 60, loss = 10.46 (24.2 examples/sec; 1.322 sec/batch)
2017-11-12 11:58:53.614182: step 70, loss = 10.49 (24.2 examples/sec; 1.321 sec/batch)
2017-11-12 11:59:07.331854: step 80, loss = 10.52 (22.7 examples/sec; 1.407 sec/batch)
2017-11-12 11:59:20.699837: step 90, loss = 10.51 (23.6 examples/sec; 1.354 sec/batch)
2017-11-12 11:59:34.194514: step 100, loss = 10.43 (24.0 examples/sec; 1.334 sec/batch)
...
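To turn log excerpts like the one above into a single throughput figure, the lines can be parsed with a regular expression. This is a minimal sketch that assumes the line format stays exactly as shown here; the pattern and function names are hypothetical and not part of the training script.

```python
import re
from statistics import mean

# Matches e.g. "step 50, loss = 10.27 (23.4 examples/sec; 1.368 sec/batch)"
TRAIN_LINE = re.compile(
    r"step (\d+), loss = ([\d.]+) \(([\d.]+) examples/sec; ([\d.]+) sec/batch\)"
)


def parse_train_log(text):
    """Return a list of (step, loss, examples_per_sec, sec_per_batch) tuples."""
    return [
        (int(m.group(1)), float(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in TRAIN_LINE.finditer(text)
    ]


# The p2.xlarge training excerpt from this post
LOG = """\
2017-11-12 11:58:26.154666: step 50, loss = 10.27 (23.4 examples/sec; 1.368 sec/batch)
2017-11-12 11:58:39.707417: step 60, loss = 10.46 (24.2 examples/sec; 1.322 sec/batch)
2017-11-12 11:58:53.614182: step 70, loss = 10.49 (24.2 examples/sec; 1.321 sec/batch)
2017-11-12 11:59:07.331854: step 80, loss = 10.52 (22.7 examples/sec; 1.407 sec/batch)
2017-11-12 11:59:20.699837: step 90, loss = 10.51 (23.6 examples/sec; 1.354 sec/batch)
2017-11-12 11:59:34.194514: step 100, loss = 10.43 (24.0 examples/sec; 1.334 sec/batch)
"""

rows = parse_train_log(LOG)
avg_rate = mean(r[2] for r in rows)
print("{} steps parsed, mean {:.1f} examples/sec".format(len(rows), avg_rate))
```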

Prediction

...
2017-11-12 12:01:29.300842: [20 batches out of 361] (65.4 examples/sec; 0.489sec/batch)
2017-11-12 12:01:36.742039: [40 batches out of 361] (86.0 examples/sec; 0.372sec/batch)
2017-11-12 12:01:44.206848: [60 batches out of 361] (85.7 examples/sec; 0.373sec/batch)
2017-11-12 12:01:51.658785: [80 batches out of 361] (85.9 examples/sec; 0.373sec/batch)
2017-11-12 12:01:59.132704: [100 batches out of 361] (85.6 examples/sec; 0.374sec/batch)
...
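In the prediction excerpt above, the first reported interval is clearly slower than the rest (65.4 vs roughly 86 examples/sec), presumably because of warm-up. A steady-state figure can be estimated by dropping that first report. The sketch below assumes the log format shown here and 32 images per batch (consistent with examples/sec times sec/batch in the log); the names are hypothetical.

```python
import re
from statistics import mean

# Matches e.g. "[40 batches out of 361] (86.0 examples/sec; 0.372sec/batch)"
PRED_LINE = re.compile(
    r"\[(\d+) batches out of (\d+)\] \(([\d.]+) examples/sec; ([\d.]+)sec/batch\)"
)


def steady_state_rate(text, skip=1):
    """Mean examples/sec over the reported intervals, ignoring the first `skip` reports."""
    rates = [float(m.group(3)) for m in PRED_LINE.finditer(text)]
    return mean(rates[skip:])


# The p2.xlarge prediction excerpt from this post
LOG = """\
2017-11-12 12:01:29.300842: [20 batches out of 361] (65.4 examples/sec; 0.489sec/batch)
2017-11-12 12:01:36.742039: [40 batches out of 361] (86.0 examples/sec; 0.372sec/batch)
2017-11-12 12:01:44.206848: [60 batches out of 361] (85.7 examples/sec; 0.373sec/batch)
2017-11-12 12:01:51.658785: [80 batches out of 361] (85.9 examples/sec; 0.373sec/batch)
2017-11-12 12:01:59.132704: [100 batches out of 361] (85.6 examples/sec; 0.374sec/batch)
"""

rate = steady_state_rate(LOG)
total_images = 361 * 32  # assumption: 32 images per batch, as in training
est_seconds = total_images / rate
print("steady state: {:.1f} examples/sec, about {:.0f} sec for all 361 batches".format(rate, est_seconds))
```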

p2.8xlarge

...
2017-11-12 12:10:01.230087: step 50, loss = 9.85 (158.0 examples/sec; 1.621 sec/batch)
2017-11-12 12:10:17.320288: step 60, loss = 10.15 (158.9 examples/sec; 1.612 sec/batch)
2017-11-12 12:10:33.440182: step 70, loss = 10.05 (158.5 examples/sec; 1.615 sec/batch)
2017-11-12 12:10:49.687708: step 80, loss = 10.08 (158.4 examples/sec; 1.616 sec/batch)
2017-11-12 12:11:05.788696: step 90, loss = 9.97 (153.6 examples/sec; 1.667 sec/batch)
2017-11-12 12:11:22.081270: step 100, loss = 10.02 (156.5 examples/sec; 1.636 sec/batch)
...

p3.2xlarge

Train

...
2017-11-12 12:04:15.472686: step 50, loss = 10.44 (120.1 examples/sec; 0.267 sec/batch)
2017-11-12 12:04:18.148277: step 60, loss = 10.87 (119.2 examples/sec; 0.268 sec/batch)
2017-11-12 12:04:20.847246: step 70, loss = 10.57 (115.9 examples/sec; 0.276 sec/batch)
2017-11-12 12:04:23.542208: step 80, loss = 10.45 (121.5 examples/sec; 0.263 sec/batch)
2017-11-12 12:04:26.242949: step 90, loss = 10.57 (108.9 examples/sec; 0.294 sec/batch)
2017-11-12 12:04:28.954628: step 100, loss = 10.28 (121.2 examples/sec; 0.264 sec/batch)
...

Prediction

...
2017-11-12 12:06:46.337320: [20 batches out of 361] (203.4 examples/sec; 0.157sec/batch)
2017-11-12 12:06:47.764470: [40 batches out of 361] (448.5 examples/sec; 0.071sec/batch)
2017-11-12 12:06:49.163141: [60 batches out of 361] (457.6 examples/sec; 0.070sec/batch)
2017-11-12 12:06:50.522376: [80 batches out of 361] (470.9 examples/sec; 0.068sec/batch)
2017-11-12 12:06:51.864781: [100 batches out of 361] (476.8 examples/sec; 0.067sec/batch)
...

p3.8xlarge

Train

...
2017-11-12 12:57:38.765977: step 50, loss = 10.26 (246.8 examples/sec; 0.519 sec/batch)
2017-11-12 12:57:43.982571: step 60, loss = 10.42 (239.2 examples/sec; 0.535 sec/batch)
2017-11-12 12:57:48.856624: step 70, loss = 10.42 (297.2 examples/sec; 0.431 sec/batch)
2017-11-12 12:57:53.047382: step 80, loss = 10.29 (262.7 examples/sec; 0.487 sec/batch)
2017-11-12 12:57:57.801625: step 90, loss = 10.28 (259.2 examples/sec; 0.494 sec/batch)
2017-11-12 12:58:02.835147: step 100, loss = 10.28 (317.7 examples/sec; 0.403 sec/batch)
2017-11-12 12:58:08.852637: step 110, loss = 10.23 (307.0 examples/sec; 0.417 sec/batch)
2017-11-12 12:58:13.527606: step 120, loss = 10.31 (235.5 examples/sec; 0.544 sec/batch)
2017-11-12 12:58:18.482038: step 130, loss = 10.32 (279.6 examples/sec; 0.458 sec/batch)
2017-11-12 12:58:22.696454: step 140, loss = 10.56 (307.3 examples/sec; 0.416 sec/batch)
...
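The p3.8xlarge numbers above fluctuate much more than those of the other instances. One rough way to quantify this is the coefficient of variation (standard deviation divided by mean) of the logged rates. The sketch below applies it to the p3.8xlarge and p2.8xlarge excerpts from this post; with so few samples it is only indicative, and the helper name is hypothetical. For the ten steps shown, the p3.8xlarge mean comes to roughly 275 examples/sec.

```python
from statistics import mean, stdev

# examples/sec values copied from the training log excerpts in this post
P3_8XLARGE = [246.8, 239.2, 297.2, 262.7, 259.2, 317.7, 307.0, 235.5, 279.6, 307.3]
P2_8XLARGE = [158.0, 158.9, 158.5, 158.4, 153.6, 156.5]


def summarize(rates):
    """Mean, range, and coefficient of variation of a list of throughput samples."""
    m = mean(rates)
    return {"mean": m, "min": min(rates), "max": max(rates), "cv": stdev(rates) / m}


for name, rates in (("p2.8xlarge", P2_8XLARGE), ("p3.8xlarge", P3_8XLARGE)):
    s = summarize(rates)
    print("{}: mean {:.1f}, range {:.1f}-{:.1f}, CV {:.1%}".format(
        name, s["mean"], s["min"], s["max"], s["cv"]))
```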

g3.4xlarge

Train

...
2017-11-12 12:21:40.593635: step 50, loss = 10.46 (30.8 examples/sec; 1.039 sec/batch)
2017-11-12 12:21:50.991231: step 60, loss = 10.53 (30.7 examples/sec; 1.042 sec/batch)
2017-11-12 12:22:01.384176: step 70, loss = 10.39 (30.8 examples/sec; 1.039 sec/batch)
2017-11-12 12:22:11.773941: step 80, loss = 10.73 (30.9 examples/sec; 1.037 sec/batch)
2017-11-12 12:22:22.164275: step 90, loss = 10.37 (30.8 examples/sec; 1.039 sec/batch)
2017-11-12 12:22:32.557663: step 100, loss = 10.47 (30.8 examples/sec; 1.040 sec/batch)
...

Prediction

...
2017-11-12 12:25:36.470306: [20 batches out of 361] (85.4 examples/sec; 0.375sec/batch)
2017-11-12 12:25:42.063093: [40 batches out of 361] (114.4 examples/sec; 0.280sec/batch)
2017-11-12 12:25:47.645907: [60 batches out of 361] (114.6 examples/sec; 0.279sec/batch)
2017-11-12 12:25:53.237244: [80 batches out of 361] (114.5 examples/sec; 0.280sec/batch)
2017-11-12 12:25:58.831566: [100 batches out of 361] (114.4 examples/sec; 0.280sec/batch)
...
