Installing nvidia-docker and using multiple CUDA versions side by side on one machine (for Ubuntu 14.04/16.04/18.04)

Last updated at 2019-10-23, posted at 2019-04-02

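Installing nvidia-docker

This article assumes Ubuntu 14.04/16.04/18.04 users.
If you use a different OS, please check the official site.

Official site

conda and Docker reportedly do not work well together, so I recommend carrying out these steps in an environment where conda is not installed.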

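What is nvidia-docker?

A Docker runtime for deep learning workloads that makes the host's NVIDIA driver and GPUs available inside containers.
With nvidia-docker, programs that depend on different CUDA versions can run on a single server.

Example:
TensorFlow 1.12.0 requires CUDA 9.0
TensorFlow 1.13.0 requires CUDA 10.0
Both can be supported on the same machine.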
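
The pairing above can be written down as a small lookup table. The following is only a sketch: the dictionary holds the two pairings stated here plus the TensorFlow 1.1.0 / CUDA 8.0 pairing used later in this article, and `required_cuda` is a hypothetical helper name, not part of any library.

```python
# Pairings taken from this article only; extend the table after checking
# each library's official documentation.
TF_TO_CUDA = {
    "1.1.0": "8.0",
    "1.12.0": "9.0",
    "1.13.0": "10.0",
}


def required_cuda(tf_version):
    """Return the CUDA version this article pairs with a TensorFlow version."""
    try:
        return TF_TO_CUDA[tf_version]
    except KeyError:
        raise ValueError("no CUDA pairing recorded for TensorFlow " + tf_version)


print(required_cuda("1.12.0"))
print(required_cuda("1.13.0"))
```

Keeping the table in one place makes it easier to decide which container image a given project needs.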

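Installing the NVIDIA driver

The required driver version changes frequently depending on the library and its version, so check the official site of the library you use.

▼ TensorFlow official site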
https://www.tensorflow.org/install/gpu

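Installing Docker

If Docker is not installed yet, install it on Ubuntu by following the official guide linked below.
The steps that follow are an excerpt from that guide.
https://docs.docker.com/install/linux/docker-ce/ubuntu/

Uninstall old versions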

.sh

$ sudo apt-get remove docker docker-engine docker.io containerd runc

Install the required packages

.sh

$ sudo apt-get update
$ sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

Add Docker's official GPG key

.sh

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Verify the key by the last 8 characters of its fingerprint

.sh

$ sudo apt-key fingerprint 0EBFCD88
    
pub   rsa4096 2017-02-22 [SCEA]
      9DC8 5822 9FC7 DD38 854A  E2D8 8D81 803C 0EBF CD88
uid           [ unknown] Docker Release (CE deb) <docker@docker.com>
sub   rsa4096 2017-02-22 [S]
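
The check above amounts to comparing the last 8 hexadecimal characters of the printed fingerprint with the ID passed to `apt-key fingerprint`. As a minimal sketch of that comparison (`short_key_id` is a hypothetical helper introduced only for illustration):

```python
def short_key_id(fingerprint):
    """Return the last 8 hex characters of a GPG fingerprint, upper-cased."""
    compact = "".join(fingerprint.split())  # drop the spaces between groups
    return compact[-8:].upper()


# Fingerprint exactly as printed by `apt-key fingerprint` above.
printed = "9DC8 5822 9FC7 DD38 854A  E2D8 8D81 803C 0EBF CD88"
print(short_key_id(printed) == "0EBFCD88")
```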

Add the Docker repository

.sh

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

Installing Docker CE

If you do not need a specific version, install with the following commands.

.sh

$ sudo apt-get update
$ sudo apt-get install docker-ce docker-ce-cli containerd.io
$ sudo docker run hello-world

If you want a specific version, list the available versions with apt-cache madison.


$ apt-cache madison docker-ce

  docker-ce | 5:18.09.1~3-0~ubuntu-xenial | https://download.docker.com/linux/ubuntu  xenial/stable amd64 Packages
  docker-ce | 5:18.09.0~3-0~ubuntu-xenial | https://download.docker.com/linux/ubuntu  xenial/stable amd64 Packages
  docker-ce | 18.06.1~ce~3-0~ubuntu       | https://download.docker.com/linux/ubuntu  xenial/stable amd64 Packages
  docker-ce | 18.06.0~ce~3-0~ubuntu       | https://download.docker.com/linux/ubuntu  xenial/stable amd64 Packages
  ...

Then install with the version specified.

$ sudo apt-get install docker-ce=<VERSION_STRING> docker-ce-cli=<VERSION_STRING> containerd.io
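
The value to substitute for <VERSION_STRING> is the second column of the `apt-cache madison` output. As a rough sketch, assuming the three-column, pipe-separated layout shown above, the column can be extracted like this (`madison_versions` is a hypothetical helper):

```python
def madison_versions(output):
    """Extract the version column from `apt-cache madison` style output."""
    versions = []
    for line in output.splitlines():
        columns = [column.strip() for column in line.split("|")]
        # Skip lines such as "..." that do not have the three-column layout.
        if len(columns) == 3 and columns[0]:
            versions.append(columns[1])
    return versions


sample = """\
  docker-ce | 5:18.09.1~3-0~ubuntu-xenial | https://download.docker.com/linux/ubuntu  xenial/stable amd64 Packages
  docker-ce | 18.06.1~ce~3-0~ubuntu       | https://download.docker.com/linux/ubuntu  xenial/stable amd64 Packages
  ...
"""
print(madison_versions(sample))
```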

Using Docker without sudo

Out of the box, Docker commands only work when prefixed with sudo.
The following commands add your user to the docker group so that you can run Docker without sudo.

$ sudo groupadd docker
$ sudo gpasswd -a $USER docker
$ sudo systemctl restart docker

The change takes effect after you log out and reconnect.
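
Group membership is read when a session starts, which is why a fresh login is needed. A small sketch for checking whether the current session already has the group, using Python's standard `grp` and `os` modules (`process_in_group` is a hypothetical helper name):

```python
import grp
import os


def process_in_group(group_name):
    """Return True if the current process runs with the given group."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group does not exist on this machine.
        return False
    return gid == os.getegid() or gid in os.getgroups()


# False in a session that was opened before the user was added to the group.
print(process_in_group("docker"))
```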

Installing NVIDIA Docker


# If nvidia-docker 1.0 is installed, it must be removed first.
# Skip this step if it was never installed.
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Run nvidia-smi in a container to confirm that the driver is usable
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
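
The `distribution=$(. /etc/os-release;echo $ID$VERSION_ID)` line above builds a string such as `ubuntu16.04`, which then selects the repository list URL. The sketch below reproduces that logic on a sample string so the result can be inspected without touching the system; `distribution_id` and `repo_list_url` are hypothetical helper names, and the sample text is an assumed excerpt of an Ubuntu 16.04 /etc/os-release.

```python
def distribution_id(os_release_text):
    """Concatenate ID and VERSION_ID from os-release style text."""
    fields = {}
    for line in os_release_text.splitlines():
        key, separator, value = line.partition("=")
        if separator:
            fields[key.strip()] = value.strip().strip('"')
    return fields["ID"] + fields["VERSION_ID"]


def repo_list_url(distribution):
    """Build the nvidia-docker list URL used in the commands above."""
    return ("https://nvidia.github.io/nvidia-docker/"
            + distribution + "/nvidia-docker.list")


sample = 'NAME="Ubuntu"\nID=ubuntu\nVERSION_ID="16.04"\n'
print(repo_list_url(distribution_id(sample)))
```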

If you get the usual nvidia-smi output back, as shown below, the installation succeeded!

$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Unable to find image 'nvidia/cuda:9.0-base' locally

9.0-base: Pulling from nvidia/cuda
7b722c1070cd: Pull complete 
5fbf74db61f1: Pull complete 
ed41cb72e5c9: Pull complete 
7ea47a67709e: Pull complete 
52efd3da8bcd: Pull complete 
eea82f174227: Pull complete 
0d7845ca9ae6: Pull complete 
Digest: sha256:6c77adf17b3e0188550afa02f88adc326195d845971a017c2317d0cf88f8b50b
Status: Downloaded newer image for nvidia/cuda:9.0-base
Tue Apr  2 13:42:05 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
| 41%   28C    P8    34W / 250W |    777MiB / 10986MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
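
Note that the header reports CUDA Version 10.0 even though the image is 9.0-base: the value in the nvidia-smi header reflects the CUDA version supported by the host driver, not the toolkit inside the container. As a sketch, the two numbers can be pulled out of the header line with a regular expression (`parse_smi_header` is a hypothetical helper, and the pattern assumes the header layout shown above):

```python
import re

HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)")


def parse_smi_header(text):
    """Return (driver_version, cuda_version) or None if no header is found."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


sample = "| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |"
print(parse_smi_header(sample))
```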

Using other CUDA versions

Pick the CUDA version you need from the link below and specify its tag in place of <TARGET_VERSION>.
This makes it easy to use many different versions of CUDA!

$ docker run --runtime=nvidia --rm nvidia/cuda:<TARGET_VERSION> nvidia-smi

▼ Official NVIDIA CUDA images on Docker Hub
https://hub.docker.com/r/nvidia/cuda/tags
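
To try several tags in a row, the command above can be generated per tag. This is a sketch only: `smoke_test_command` is a hypothetical helper, and the tags in the loop are the ones that appear in this article; whether a tag is still published has to be checked on Docker Hub.

```python
def smoke_test_command(tag):
    """Build the nvidia-smi smoke-test command for one nvidia/cuda tag."""
    return ["docker", "run", "--runtime=nvidia", "--rm",
            "nvidia/cuda:" + tag, "nvidia-smi"]


for tag in ["9.0-base", "9.0-cudnn7-runtime", "8.0-cudnn5-runtime"]:
    print(" ".join(smoke_test_command(tag)))
```

On a machine where nvidia-docker is set up, the returned list could be passed to `subprocess.run` instead of being printed.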

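Testing containers with multiple versions

First, try a CUDA 9.0 container.
Look up the tag on Docker Hub and specify it.

About the tag names

base: minimal configuration
runtime: extends base
devel: extends runtime
For development or training, choosing devel is generally a safe default (for example, nvcc is only included in devel).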
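
The tags used in this article follow the pattern CUDA version, optional cuDNN version, then flavor (for example 9.0-base or 9.0-cudnn7-runtime). A sketch that composes a tag from those parts is shown below; `compose_tag` is a hypothetical helper, and not every combination it can produce necessarily exists on Docker Hub.

```python
FLAVORS = ("base", "runtime", "devel")


def compose_tag(cuda_version, flavor, cudnn_major=None):
    """Compose an nvidia/cuda tag such as '9.0-cudnn7-runtime'."""
    if flavor not in FLAVORS:
        raise ValueError("flavor must be one of " + ", ".join(FLAVORS))
    parts = [cuda_version]
    if cudnn_major is not None:
        parts.append("cudnn" + str(cudnn_major))
    parts.append(flavor)
    return "-".join(parts)


print(compose_tag("9.0", "base"))
print(compose_tag("9.0", "runtime", cudnn_major=7))
print(compose_tag("8.0", "runtime", cudnn_major=5))
```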

docker run --runtime=nvidia --name cuda90 -d -it -p 8888:8888 -v /home/ubuntu/workspace:/workspace nvidia/cuda:9.0-cudnn7-runtime

Enter the container.

docker exec -it cuda90 /bin/bash

Set up Ubuntu inside the container.

apt update 
apt -y upgrade
apt install -y wget

Any Python environment will do here.
This article uses Anaconda.

Look up the installer URL for the version you need on the official site.
https://www.anaconda.com/distribution/#linux

Pass that URL to wget and run the installer.

wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
sh Anaconda3-2019.10-Linux-x86_64.sh

After the installation finishes, reload the shell configuration so that the updated environment variables take effect.

source ~/.bashrc

Now let's train a model on MNIST.

cd 
cd workspace
mkdir cuda90
cd cuda90

Create a Python 3.6 environment.

conda create -n python36_cuda90 python=3.6
conda activate python36_cuda90

To leave the virtual environment:

conda deactivate

Install the libraries

pip install tensorflow-gpu==1.12.0 scikit-learn

apt-get install vim
vim train_mnist_tf1_12.py

Paste in the following content.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf


N_DIGITS = 10  # Number of digits.
X_FEATURE = 'x'  # Name of the input feature.


def conv_model(features, labels, mode):
  """2-layer convolution model."""
  # Reshape feature to 4d tensor with 2nd and 3rd dimensions being
  # image width and height final dimension being the number of color channels.
  feature = tf.reshape(features[X_FEATURE], [-1, 28, 28, 1])

  # First conv layer will compute 32 features for each 5x5 patch
  with tf.variable_scope('conv_layer1'):
    h_conv1 = tf.layers.conv2d(
        feature,
        filters=32,
        kernel_size=[5, 5],
        padding='same',
        activation=tf.nn.relu)
    h_pool1 = tf.layers.max_pooling2d(
        h_conv1, pool_size=2, strides=2, padding='same')

  # Second conv layer will compute 64 features for each 5x5 patch.
  with tf.variable_scope('conv_layer2'):
    h_conv2 = tf.layers.conv2d(
        h_pool1,
        filters=64,
        kernel_size=[5, 5],
        padding='same',
        activation=tf.nn.relu)
    h_pool2 = tf.layers.max_pooling2d(
        h_conv2, pool_size=2, strides=2, padding='same')
    # reshape tensor into a batch of vectors
    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])

  # Densely connected layer with 1024 neurons.
  h_fc1 = tf.layers.dense(h_pool2_flat, 1024, activation=tf.nn.relu)
  h_fc1 = tf.layers.dropout(
      h_fc1, 
      rate=0.5, 
      training=(mode == tf.estimator.ModeKeys.TRAIN))

  # Compute logits (1 per class) and compute loss.
  logits = tf.layers.dense(h_fc1, N_DIGITS, activation=None)

  # Compute predictions.
  predicted_classes = tf.argmax(logits, 1)
  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {
        'class': predicted_classes,
        'prob': tf.nn.softmax(logits)
    }
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)

  # Compute loss.
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

  # Create training op.
  if mode == tf.estimator.ModeKeys.TRAIN:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

  # Compute evaluation metrics.
  eval_metric_ops = {
      'accuracy': tf.metrics.accuracy(
          labels=labels, predictions=predicted_classes)
  }
  return tf.estimator.EstimatorSpec(
      mode, loss=loss, eval_metric_ops=eval_metric_ops)


def main(unused_args):
  tf.logging.set_verbosity(tf.logging.INFO)

  ### Download and load MNIST dataset.
  mnist = tf.contrib.learn.datasets.DATASETS['mnist']('/tmp/mnist')
  train_input_fn = tf.estimator.inputs.numpy_input_fn(
      x={X_FEATURE: mnist.train.images},
      y=mnist.train.labels.astype(np.int32),
      batch_size=100,
      num_epochs=None,
      shuffle=True)
  # Evaluate on the held-out test split rather than on the training data.
  test_input_fn = tf.estimator.inputs.numpy_input_fn(
      x={X_FEATURE: mnist.test.images},
      y=mnist.test.labels.astype(np.int32),
      num_epochs=1,
      shuffle=False)

  ### Linear classifier.
  feature_columns = [
      tf.feature_column.numeric_column(
          X_FEATURE, shape=mnist.train.images.shape[1:])]

  classifier = tf.estimator.LinearClassifier(
      feature_columns=feature_columns, n_classes=N_DIGITS)
  classifier.train(input_fn=train_input_fn, steps=200)
  scores = classifier.evaluate(input_fn=test_input_fn)
  print('Accuracy (LinearClassifier): {0:f}'.format(scores['accuracy']))

  ### Convolutional network
  classifier = tf.estimator.Estimator(model_fn=conv_model)
  classifier.train(input_fn=train_input_fn, steps=200)
  scores = classifier.evaluate(input_fn=test_input_fn)
  print('Accuracy (conv_model): {0:f}'.format(scores['accuracy']))


if __name__ == '__main__':
  tf.app.run()

Run it.

python train_mnist_tf1_12.py

If the output ends with a line such as
Accuracy (conv_model): 0.809927
showing that the model has trained, then training with Python 3.6, TensorFlow 1.12, and CUDA 9.0 succeeded.
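
When comparing runs across containers, it can help to collect the reported accuracies from the script output. The following is a sketch that assumes the `Accuracy (<name>): <value>` line format printed by the script above; `parse_accuracies` is a hypothetical helper.

```python
import re

ACCURACY_RE = re.compile(r"^Accuracy \((?P<model>[^)]+)\): (?P<value>\d+\.\d+)$")


def parse_accuracies(log_text):
    """Map each model name to the accuracy reported in the log text."""
    results = {}
    for line in log_text.splitlines():
        match = ACCURACY_RE.match(line.strip())
        if match:
            results[match.group("model")] = float(match.group("value"))
    return results


# The value below is the one quoted in this article.
sample = "INFO:tensorflow:some other log line\nAccuracy (conv_model): 0.809927\n"
print(parse_accuracies(sample))
```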

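Next, testing CUDA 8.0

Exit the CUDA 9.0 container and create a CUDA 8.0 container.

exit
docker container ls
(check the [container ID])
docker container stop [container ID]

The key point is to specify a CUDA 8.0 tag.
The cuDNN version must also match what the library depends on, so check the library's requirements and pick the appropriate version.
(Here, tensorflow-gpu==1.1.0 is used, which requires cuDNN 5.)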

docker run --runtime=nvidia --name cuda80 -d -it -p 8888:8888 -v /home/ubuntu/workspace:/workspace nvidia/cuda:8.0-cudnn5-runtime

Repeat the same steps in this container.


docker exec -it cuda80 /bin/bash
apt update 
apt -y upgrade
apt install -y wget
wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
sh Anaconda3-2019.10-Linux-x86_64.sh
source ~/.bashrc

cd 
cd workspace
mkdir cuda80
cd cuda80

conda create -n python36_cuda80 python=3.6
conda activate python36_cuda80

Install the libraries

pip install tensorflow-gpu==1.1.0 scikit-learn

apt-get install vim
vim train_mnist_tf1_1.py

Paste in the following content.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
from sklearn import metrics
import tensorflow as tf

layers = tf.contrib.layers
learn = tf.contrib.learn


def max_pool_2x2(tensor_in):
  return tf.nn.max_pool(
      tensor_in, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')


def conv_model(feature, target, mode):
  """2-layer convolution model."""
  # Convert the target to a one-hot tensor of shape (batch_size, 10) and
  # with a on-value of 1 for each one-hot vector of length 10.
  target = tf.one_hot(tf.cast(target, tf.int32), 10, 1, 0)

  # Reshape feature to 4d tensor with 2nd and 3rd dimensions being
  # image width and height final dimension being the number of color channels.
  feature = tf.reshape(feature, [-1, 28, 28, 1])

  # First conv layer will compute 32 features for each 5x5 patch
  with tf.variable_scope('conv_layer1'):
    h_conv1 = layers.convolution2d(
        feature, 32, kernel_size=[5, 5], activation_fn=tf.nn.relu)
    h_pool1 = max_pool_2x2(h_conv1)

  # Second conv layer will compute 64 features for each 5x5 patch.
  with tf.variable_scope('conv_layer2'):
    h_conv2 = layers.convolution2d(
        h_pool1, 64, kernel_size=[5, 5], activation_fn=tf.nn.relu)
    h_pool2 = max_pool_2x2(h_conv2)
    # reshape tensor into a batch of vectors
    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])

  # Densely connected layer with 1024 neurons.
  h_fc1 = layers.dropout(
      layers.fully_connected(
          h_pool2_flat, 1024, activation_fn=tf.nn.relu),
      keep_prob=0.5,
      is_training=mode == tf.contrib.learn.ModeKeys.TRAIN)

  # Compute logits (1 per class) and compute loss.
  logits = layers.fully_connected(h_fc1, 10, activation_fn=None)
  loss = tf.losses.softmax_cross_entropy(target, logits)

  # Create a tensor for training op.
  train_op = layers.optimize_loss(
      loss,
      tf.contrib.framework.get_global_step(),
      optimizer='SGD',
      learning_rate=0.001)

  return tf.argmax(logits, 1), loss, train_op


def main(unused_args):
  ### Download and load MNIST dataset.
  mnist = learn.datasets.load_dataset('mnist')

  ### Linear classifier.
  feature_columns = learn.infer_real_valued_columns_from_input(
      mnist.train.images)
  classifier = learn.LinearClassifier(
      feature_columns=feature_columns, n_classes=10)
  classifier.fit(mnist.train.images,
                 mnist.train.labels.astype(np.int32),
                 batch_size=100,
                 steps=1000)
  score = metrics.accuracy_score(mnist.test.labels,
                                 list(classifier.predict(mnist.test.images)))
  print('Accuracy: {0:f}'.format(score))

  ### Convolutional network
  classifier = learn.Estimator(model_fn=conv_model)
  classifier.fit(mnist.train.images,
                 mnist.train.labels,
                 batch_size=100,
                 steps=20000)
  score = metrics.accuracy_score(mnist.test.labels,
                                 list(classifier.predict(mnist.test.images)))
  print('Accuracy: {0:f}'.format(score))


if __name__ == '__main__':
  tf.app.run()

python train_mnist_tf1_1.py

If training runs, it worked.

ImportError: libcudnn.so.5: cannot open shared object file: No such file or directory

If you see an error like this, you need to switch to a different nvidia-docker image.

Find a suitable tag on Docker Hub and start over from creating the container.

docker run --runtime=nvidia --name cuda80 -d -it -p 8888:8888 -v /home/ubuntu/workspace:/workspace nvidia/cuda:<VERSION>
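
The ImportError message itself names the cuDNN major version the library expects, which narrows down which tag to look for. A sketch that extracts it is shown below (`missing_cudnn_major` is a hypothetical helper, and the pattern assumes the `libcudnn.so.<N>` naming seen in the message above):

```python
import re


def missing_cudnn_major(error_message):
    """Return the cuDNN major version named in the error, or None."""
    match = re.search(r"libcudnn\.so\.(\d+)", error_message)
    return int(match.group(1)) if match else None


message = ("ImportError: libcudnn.so.5: cannot open shared object file: "
           "No such file or directory")
major = missing_cudnn_major(message)
print("look for a tag containing cudnn" + str(major))
```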
