Automating Keras and NVIDIA Environment Setup with Ansible

Posted at 2018-09-18, last updated at 2018-09-19

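Introduction

I repeatedly set up Keras environments on AWS EC2 and on on-premises machines.
With plans to grow to between 10 and 100 machines, I automated the setup with Ansible.
The setup follows the "TensorFlow GPU support" section of "Install TensorFlow on Ubuntu".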

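Overview of the automated setup

Edit the IP addresses in hosts, then run ansible-playbook to build everything automatically.
I recommend reading through the steps in this article first so that you understand the flow before running it.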

$ vim hosts
[keras]
xxx.xxx.xxx.xxx
yyy.yyy.yyy.yyy
zzz.zzz.zzz.zzz

$ ansible-playbook keras.yaml

Prerequisites

This article does not explain Ansible basics
Ubuntu Server 16.04 LTS
Python3
Tesla K80

Walkthrough

NVIDIA driver

Version 384.x or higher is required
This setup uses the nvidia-384 package from Ubuntu Server 16.04 LTS
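
As a rough illustration of the "384.x or higher" requirement, the following Python sketch compares a driver version string against that minimum. The function name driver_meets_minimum is hypothetical and not part of the original setup. On a real host the version string could be obtained with nvidia-smi --query-gpu=driver_version --format=csv,noheader, which is an assumption about how you would feed the function.

```python
def driver_meets_minimum(version, minimum_major=384):
    """Return True if a driver version such as '384.130' is 384.x or higher."""
    # Only the major component matters for the "384.x or higher" rule.
    major = int(version.strip().split(".")[0])
    return major >= minimum_major


if __name__ == "__main__":
    # Version string taken from the nvidia-smi output shown in this article.
    print(driver_meets_minimum("384.130"))
```

A check like this could run after the playbook to flag hosts where an older driver is still loaded.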

CUDA

CUDA 9.0 is required
Package names such as cuda-cublas-9-0 pin the version

cuDNN

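cuDNN 7.2 is required
Previously, I created an account on the NVIDIA site, downloaded the package, and installed it by hand
In fact, registering NVIDIA's machine learning repository below removes those manual steps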
http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb

NCCL

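NCCL (NVIDIA Collective Communications Library) is a library for multi-GPU, multi-node communication
TensorFlow GPU support lists it as optional, but I installed it to confirm that it works
This package can also be installed from the machine learning repository above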

TensorRT

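TensorRT is a runtime that accelerates inference
TensorFlow GPU support lists it as optional, but I installed it to confirm that it works
Installing cuda9.0 makes TensorRT available for installation; this is equivalent to registering the machine learning repository below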
http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvinfer-runtime-trt-repo-ubuntu1604-4.0.1-ga-cuda9.0_1-1_amd64.deb

Ansible

These are the files. The only change you need to make is the IP addresses.

keras.yaml

The playbook copies a modified rc.local to configure the NVIDIA driver; that file is explained in its own section

---
- hosts: keras
  become: true
  tasks:
    - name: Install nvidia driver
      apt:
        name: nvidia-384
        update_cache: yes
    - name: Copy rc.local
      copy:
        src: rc.local
        dest: /etc/rc.local
        mode: 0755
    - name: Add nvidia key
      apt_key:
        url: http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
    - name: Install nvidia repos
      apt:
        deb: "{{ item }}"
      with_items:
        - http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.2.148-1_amd64.deb
        - http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
        - http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvinfer-runtime-trt-repo-ubuntu1604-4.0.1-ga-cuda9.0_1-1_amd64.deb
    - name: Install nvidia CUDA, cuDNN, CUPTI, NCCL, TensorRT
      apt:
        name: "{{ item }}"
        update_cache: yes
      with_items:
        - cuda-command-line-tools-9-0
        - cuda-cublas-9-0
        - cuda-cufft-9-0
        - cuda-curand-9-0
        - cuda-cusolver-9-0
        - cuda-cusparse-9-0
        - libcudnn7=7.2.1.38-1+cuda9.0
        - libnccl2=2.2.13-1+cuda9.0
        - libnvinfer4=4.1.2-1+cuda9.0
    - name: Install python3-pip
      apt:
        name: python3-pip
    - name: Install TensorFlow, Keras
      pip:
        name: "{{ item }}"
      with_items:
        - tensorflow-gpu
        - keras
    - name: Download Keras MNIST CNN
      get_url:
        url: https://raw.githubusercontent.com/keras-team/keras/master/examples/mnist_cnn.py
        dest: /home/ubuntu/mnist_cnn.py
        mode: 0644
        owner: ubuntu
        group: ubuntu
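
Because the playbook pins CUDA in two different ways (a -9-0 suffix in the toolkit package names and a +cuda9.0 suffix in the library versions), a small consistency check can catch a mismatched pin before it reaches the fleet. The Python sketch below is only an illustration: the helper names cuda_version_of and mismatched are hypothetical, and the two regular expressions assume the naming patterns seen in this playbook.

```python
import re

# Package specs copied from the playbook.
PACKAGES = [
    "cuda-command-line-tools-9-0",
    "cuda-cublas-9-0",
    "cuda-cufft-9-0",
    "cuda-curand-9-0",
    "cuda-cusolver-9-0",
    "cuda-cusparse-9-0",
    "libcudnn7=7.2.1.38-1+cuda9.0",
    "libnccl2=2.2.13-1+cuda9.0",
    "libnvinfer4=4.1.2-1+cuda9.0",
]


def cuda_version_of(spec):
    """Extract the CUDA version ('9.0') encoded in a package spec, or None."""
    # Pinned library packages carry the CUDA version as a '+cudaX.Y' suffix.
    pinned = re.search(r"\+cuda(\d+)\.(\d+)$", spec)
    if pinned:
        return "{}.{}".format(*pinned.groups())
    # CUDA toolkit packages encode it in the package name as '-X-Y'.
    named = re.match(r"cuda-.+-(\d+)-(\d+)$", spec)
    if named:
        return "{}.{}".format(*named.groups())
    return None


def mismatched(specs, expected="9.0"):
    """Return specs whose CUDA version is not the expected one (or is unknown)."""
    return [spec for spec in specs if cuda_version_of(spec) != expected]


if __name__ == "__main__":
    # An empty list means every pin agrees on the expected CUDA version.
    print(mismatched(PACKAGES))
```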

rc.local

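nvidia-smi -pm 1 enables persistence mode so that the settings persist
nvidia-smi --auto-boost-default=0 disables the auto boost feature
nvidia-smi -ac 2505,875 sets the K80 application clocks to their maximum (memory 2505 MHz, graphics 875 MHz)
The V100 and P100 values are included as comments for reference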

#!/bin/sh -e
#
# rc.local
#
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
#
# In order to enable or disable this script just change the execution
# bits.
#
# By default this script does nothing.

nvidia-smi -pm 1
nvidia-smi --auto-boost-default=0
# V100
# nvidia-smi -ac 877,1530
# P100
# nvidia-smi -ac 715,1328
# K80
nvidia-smi -ac 2505,875

exit 0
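
If the fleet mixes GPU models, the clock values in this rc.local could be kept in one lookup table instead of being commented in and out by hand. The Python sketch below shows the idea under that assumption. The table name APPLICATION_CLOCKS and the function application_clock_command are hypothetical; the clock pairs are the ones listed in the rc.local, in the memory,graphics order that nvidia-smi -ac expects.

```python
# Application clocks as (memory MHz, graphics MHz), copied from rc.local.
APPLICATION_CLOCKS = {
    "V100": (877, 1530),
    "P100": (715, 1328),
    "K80": (2505, 875),
}


def application_clock_command(model):
    """Build the nvidia-smi argument list that sets application clocks."""
    memory_mhz, graphics_mhz = APPLICATION_CLOCKS[model]
    return ["nvidia-smi", "-ac", "{},{}".format(memory_mhz, graphics_mhz)]


if __name__ == "__main__":
    # Show the command line for the K80 used in this article.
    print(" ".join(application_clock_command("K80")))
```

The function only builds the command; running it on a host (for example with subprocess) is left out so that the sketch stays runnable on a machine without a GPU.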

/etc/ansible/ansible.cfg

I changed the following setting, which belongs in the [defaults] section

host_key_checking = False

/etc/ansible/hosts

Replace xxx.xxx.xxx.xxx and the other placeholders with the actual IP addresses of your machines
The Python interpreter is set to Python 3
The user name is set to ubuntu

[keras]
xxx.xxx.xxx.xxx
yyy.yyy.yyy.yyy
zzz.zzz.zzz.zzz

[keras:vars]
ansible_python_interpreter=/usr/bin/python3
ansible_user=ubuntu
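
With 10 to 100 machines, typing the addresses by hand becomes error-prone, so the inventory could be generated instead. The following Python sketch renders a file in the same shape as this hosts file. The function name render_inventory and the 10.0.0.x addresses are hypothetical, chosen only for illustration; the group name, user, and interpreter path are the values used in this article.

```python
def render_inventory(addresses, group="keras", user="ubuntu",
                     interpreter="/usr/bin/python3"):
    """Render an INI-style Ansible inventory with one group and its vars."""
    lines = ["[{}]".format(group)]
    lines.extend(addresses)
    lines.append("")
    lines.append("[{}:vars]".format(group))
    lines.append("ansible_python_interpreter={}".format(interpreter))
    lines.append("ansible_user={}".format(user))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical private addresses, for illustration only.
    hosts = ["10.0.0.{}".format(n) for n in range(11, 14)]
    print(render_inventory(hosts), end="")
```

The output can be redirected to a file and passed to ansible-playbook with the -i option, or written to /etc/ansible/hosts.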

ansible-playbook

Run the automated setup

$ ansible-playbook keras.yaml

Verification

After rebooting the servers, check the following

Keras MNIST CNN

$ python3 mnist_cnn.py
Using TensorFlow backend.
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
11493376/11490434 [==============================] - 4s 0us/step
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
2018-09-18 16:45:15.263356: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-09-18 16:45:15.412813: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-18 16:45:15.413209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8755
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-09-18 16:45:15.413237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-18 16:45:16.921143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-18 16:45:16.921187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-09-18 16:45:16.921202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-09-18 16:45:16.922123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
60000/60000 [==============================] - 12s 200us/step - loss: 0.2574 - acc: 0.9199 - val_loss: 0.0546 - val_acc: 0.9826
Epoch 2/12
60000/60000 [==============================] - 8s 129us/step - loss: 0.0871 - acc: 0.9745 - val_loss: 0.0403 - val_acc: 0.9858
Epoch 3/12
60000/60000 [==============================] - 8s 129us/step - loss: 0.0665 - acc: 0.9806 - val_loss: 0.0358 - val_acc: 0.9885
Epoch 4/12
60000/60000 [==============================] - 8s 129us/step - loss: 0.0550 - acc: 0.9834 - val_loss: 0.0311 - val_acc: 0.9896
Epoch 5/12
60000/60000 [==============================] - 8s 130us/step - loss: 0.0481 - acc: 0.9856 - val_loss: 0.0313 - val_acc: 0.9896
Epoch 6/12
60000/60000 [==============================] - 8s 130us/step - loss: 0.0425 - acc: 0.9869 - val_loss: 0.0279 - val_acc: 0.9908
Epoch 7/12
60000/60000 [==============================] - 8s 130us/step - loss: 0.0373 - acc: 0.9884 - val_loss: 0.0272 - val_acc: 0.9905
Epoch 8/12
60000/60000 [==============================] - 8s 130us/step - loss: 0.0351 - acc: 0.9892 - val_loss: 0.0248 - val_acc: 0.9918
Epoch 9/12
60000/60000 [==============================] - 8s 130us/step - loss: 0.0316 - acc: 0.9902 - val_loss: 0.0270 - val_acc: 0.9918
Epoch 10/12
60000/60000 [==============================] - 8s 130us/step - loss: 0.0304 - acc: 0.9911 - val_loss: 0.0251 - val_acc: 0.9916
Epoch 11/12
60000/60000 [==============================] - 8s 130us/step - loss: 0.0292 - acc: 0.9910 - val_loss: 0.0259 - val_acc: 0.9914
Epoch 12/12
60000/60000 [==============================] - 8s 129us/step - loss: 0.0272 - acc: 0.9917 - val_loss: 0.0301 - val_acc: 0.9915
Test loss: 0.03007666835412515
Test accuracy: 0.9915
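
Reading this log by eye does not scale to many hosts, so one option is to scan the captured output for the "physical GPU" line that TensorFlow prints when it actually places work on a GPU. The Python sketch below assumes the log format shown in this article and that the output has been captured to a string (TensorFlow writes these lines to stderr). The names GPU_LINE and gpus_in_log are hypothetical.

```python
import re

# Matches the "... -> physical GPU (device: 0, name: Tesla K80, ..." log line.
GPU_LINE = re.compile(r"physical GPU \(device: (\d+), name: ([^,]+),")


def gpus_in_log(log_text):
    """Return (device index, GPU name) pairs that TensorFlow reported creating."""
    return [(int(index), name) for index, name in GPU_LINE.findall(log_text)]


if __name__ == "__main__":
    # Sample line copied from the log in this article.
    sample = (
        "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 "
        "with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, "
        "pci bus id: 0000:00:1e.0, compute capability: 3.7)"
    )
    print(gpus_in_log(sample))
```

An empty result would suggest that training fell back to the CPU, which is worth investigating before trusting the timings.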

nvidia-smi

Check GPU utilization and other metrics with nvidia-smi

$ watch -n 0.1 nvidia-smi
Every 0.1s: nvidia-smi                                                                                                                                                                                                                                  Tue Sep 18 16:45:35 2018

Tue Sep 18 16:45:35 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   70C    P0   126W / 149W |  10959MiB / 11439MiB |     78%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3238      C   python3                                    10946MiB |
+-----------------------------------------------------------------------------+
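
For scripted checks, nvidia-smi can also print selected fields as CSV through its --query-gpu and --format options, which is easier to parse than the table. The Python sketch below is an illustration under stated assumptions: the function names parse_gpu_stats and query_gpu_stats are hypothetical, and the parser assumes every field is a plain integer (a GPU that reports a field as unsupported would need extra handling).

```python
import subprocess

# Query utilization and memory as bare numbers, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse one 'util, used, total' line per GPU into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({"util_percent": util, "used_mib": used, "total_mib": total})
    return stats


def query_gpu_stats():
    """Run nvidia-smi on this host and return the parsed statistics."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_stats(output)


if __name__ == "__main__":
    # Sample values matching the nvidia-smi table in this article.
    print(parse_gpu_stats("78, 10959, 11439\n"))
```

On a GPU host, calling query_gpu_stats() during training should report a non-zero utilization if the job is using the GPU.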

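Conclusion

Until now, installing cuDNN meant creating an account, logging in, downloading, and installing by hand; thanks to the machine learning repository, the setup can now be automated
It was a pleasant discovery that NCCL and TensorRT no longer need manual steps either
Next, I plan to set up a Horovod environment that makes use of NCCL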
