
Training TensorFlow on GPUs with Kubernetes (1)

Last updated at 2017-12-09, posted at 2017-12-08

These are my own notes on the key configuration points for running training with Azure Container Service and TensorFlow.

Here is a link to the best tutorial I know of. I am basically following it.

wbuchwalter/tensorflow-k8s-azure

# 1. Use a base image built for GPUs

The first point: TensorFlow publishes official images, and you need to use the GPU variant. The CPU image is around 400 MB, whereas the GPU image is around 1 to 2 GB, but that is to be expected.

tensorflow/tensorflow

# 2. Provision Kubernetes on GPU VMs

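Provision the cluster in westus2 or uksouth and specify a GPU VM size, for example Standard_NC6. Deploying a GPU cluster takes longer than a regular one (10 to 15 minutes) because the NVIDIA drivers are installed during deployment. Capacity is limited in some regions, so check availability first.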

For example:

```
az group create --name RemoveGPU --location westus2
az acs create --agent-vm-size Standard_NC6 --resource-group RemoveGPU --name gpucluster --orchestrator-type kubernetes --agent-count 2 --location westus2
```

Fetch the kubeconfig as follows.

```
az acs kubernetes get-credentials --name gpucluster --resource-group RemoveGPU
```


Note that if a kubeconfig already exists, this command does not delete `~/.kube/config`. It adds the new cluster to the file, so you can use the new cluster by switching contexts.

To verify:

```
$ kubectl get nodes
NAME                        STATUS    ROLES     AGE       VERSION
k8s-agentpool0-03479696-0   Ready     agent     21h       v1.7.9
k8s-agentpool0-03479696-1   Ready     agent     21h       v1.7.9
k8s-agentpool0-03479696-2   Ready     agent     17h       v1.7.9
k8s-master-03479696-0       Ready     master    21h       v1.7.9
$ kubectl describe node k8s-agentpool0-03479696-0
:
Capacity:
 alpha.kubernetes.io/nvidia-gpu:  1
:
```
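Checking every node by hand with `kubectl describe` gets tedious, so here is a minimal sketch that reads the same capacity field from `kubectl get nodes -o json`. The function name `gpu_capacity` and the `SAMPLE` data are my own assumptions for illustration. The sample is made up and heavily trimmed, and it assumes the master node reports no GPU.

```python
import json

# Resource name as shown in the `kubectl describe node` output above.
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def gpu_capacity(nodes_json):
    """Map each node name to its GPU capacity, given `kubectl get nodes -o json` text."""
    result = {}
    for node in json.loads(nodes_json)["items"]:
        capacity = node["status"].get("capacity", {})
        # Quantities arrive as strings such as "1"; nodes without the key count as 0.
        result[node["metadata"]["name"]] = int(capacity.get(GPU_RESOURCE, 0))
    return result


# Made-up, heavily trimmed sample in the shape of the kubectl JSON output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "k8s-agentpool0-03479696-0"},
         "status": {"capacity": {"cpu": "6", GPU_RESOURCE: "1"}}},
        {"metadata": {"name": "k8s-master-03479696-0"},
         "status": {"capacity": {"cpu": "2"}}},
    ]
})

if __name__ == "__main__":
    for name, gpus in gpu_capacity(SAMPLE).items():
        print(name, gpus)
```

Against a real cluster, pass the text produced by `kubectl get nodes -o json` instead of `SAMPLE`.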


If you want to change the number of agents:

```
az acs scale -g RemoveGPU -n gpucluster --new-agent-count 3
```
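If you recreate clusters often, the commands in this section can be generated from a few parameters. The following is only a sketch: the helper names `provisioning_commands` and `scale_command` are hypothetical, and it uses nothing beyond the `az` flags shown above. It builds the argument lists without running them.

```python
import shlex


def provisioning_commands(group, cluster, location="westus2",
                          vm_size="Standard_NC6", agents=2):
    """Return the az invocations used above, each as an argument list."""
    return [
        ["az", "group", "create", "--name", group, "--location", location],
        ["az", "acs", "create",
         "--agent-vm-size", vm_size,
         "--resource-group", group,
         "--name", cluster,
         "--orchestrator-type", "kubernetes",
         "--agent-count", str(agents),
         "--location", location],
        ["az", "acs", "kubernetes", "get-credentials",
         "--name", cluster, "--resource-group", group],
    ]


def scale_command(group, cluster, new_count):
    """Return the az invocation that changes the number of agents."""
    return ["az", "acs", "scale", "-g", group, "-n", cluster,
            "--new-agent-count", str(new_count)]


if __name__ == "__main__":
    commands = provisioning_commands("RemoveGPU", "gpucluster")
    commands.append(scale_command("RemoveGPU", "gpucluster", 3))
    for cmd in commands:
        # Print only; hand each list to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Keeping each command as a list rather than a single string avoids shell quoting problems if a resource group name ever contains spaces.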


# Writing the YAML file for GPU images

## Choosing a GPU image

### Images for TensorFlow

* tensorflow/tensorflow:1.4.0-gpu-py3 (GPU)
* tensorflow/tensorflow:1.4.0-py3 (CPU)

### CNTK

* microsoft/cntk:2.2-gpu-python3.5-cuda8.0-cudnn6.0 (GPU)
* microsoft/cntk:2.2-python3.5 (CPU)
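When the same job has to run on both CPU and GPU clusters, choosing the image can be reduced to a lookup. This is a minimal sketch using only the four tags listed above. The function name `pick_image` is my own, not part of any library.

```python
# Image tags copied from the lists above, keyed by (framework, use_gpu).
IMAGES = {
    ("tensorflow", True): "tensorflow/tensorflow:1.4.0-gpu-py3",
    ("tensorflow", False): "tensorflow/tensorflow:1.4.0-py3",
    ("cntk", True): "microsoft/cntk:2.2-gpu-python3.5-cuda8.0-cudnn6.0",
    ("cntk", False): "microsoft/cntk:2.2-python3.5",
}


def pick_image(framework, use_gpu):
    """Return the image tag for a framework, or raise ValueError if unknown."""
    key = (framework.lower(), bool(use_gpu))
    if key not in IMAGES:
        raise ValueError(f"no image known for framework {framework!r}")
    return IMAGES[key]


if __name__ == "__main__":
    print(pick_image("tensorflow", use_gpu=True))
    print(pick_image("CNTK", use_gpu=False))
```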

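## Requesting a GPU

Kubernetes allocates CPU to a container by default, but it does not do the same for GPUs, so you have to request the GPU explicitly in the resource settings of the YAML.

```
containers:
- name: tensorflow
  image: tensorflow/tensorflow:latest-gpu
  resources:
    limits:
      alpha.kubernetes.io/nvidia-gpu: 1
```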


That is roughly what it looks like.
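The same GPU request can also be produced from code, which helps when the GPU count varies per job. This is a sketch under my own assumptions: `gpu_container` is a hypothetical helper, and the output is JSON, which `kubectl` accepts in place of YAML.

```python
import json

# Resource name used in the YAML above (alpha API as of Kubernetes 1.7).
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def gpu_container(name, image, gpus=1):
    """Build a container entry whose limits request the given number of GPUs."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    return {
        "name": name,
        "image": image,
        "resources": {"limits": {GPU_RESOURCE: gpus}},
    }


if __name__ == "__main__":
    spec = {"containers": [gpu_container("tensorflow",
                                         "tensorflow/tensorflow:latest-gpu")]}
    print(json.dumps(spec, indent=2))
```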

## Sharing the drivers with the container

Set things up so that the container can reference the NVIDIA drivers installed on the node:

```
    volumeMounts: # Where the drivers should be mounted in the container
    - name: lib
      mountPath: /usr/local/nvidia/lib64
    - name: libcuda
      mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
  volumes: # Where the drivers are located on the node
    - name: lib
      hostPath:
        path: /usr/lib/nvidia-384
    - name: libcuda
      hostPath:
        path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
```
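To show where the fragment above sits in a complete manifest, here is a sketch that assembles a whole Pod with the GPU limit and both driver mounts. The function name `gpu_pod` and the Pod name `tf-gpu-demo` are hypothetical. The host paths are the ones from this article and assume the 384 driver series is installed on the node.

```python
import json

# (volume name, path on the node, mount path in the container), from the YAML above.
DRIVER_VOLUMES = [
    ("lib", "/usr/lib/nvidia-384", "/usr/local/nvidia/lib64"),
    ("libcuda", "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
     "/usr/lib/x86_64-linux-gnu/libcuda.so.1"),
]


def gpu_pod(pod_name, image, gpus=1):
    """Build a Pod manifest with a GPU limit and the NVIDIA driver mounts."""
    container = {
        "name": "tensorflow",
        "image": image,
        "resources": {"limits": {"alpha.kubernetes.io/nvidia-gpu": gpus}},
        # Where the drivers should appear inside the container.
        "volumeMounts": [{"name": name, "mountPath": mount}
                         for name, _, mount in DRIVER_VOLUMES],
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "containers": [container],
            # Where the drivers live on the node.
            "volumes": [{"name": name, "hostPath": {"path": host}}
                        for name, host, _ in DRIVER_VOLUMES],
        },
    }


if __name__ == "__main__":
    pod = gpu_pod("tf-gpu-demo", "tensorflow/tensorflow:latest-gpu")
    print(json.dumps(pod, indent=2))
```

Saving the printed JSON to a file and passing it to `kubectl create -f` should work the same way as a YAML manifest.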


# Summary

Create a GPU cluster, then set up the following three things and you are good to go:

* Choose a GPU image
* Request a GPU
* Make the NVIDIA drivers visible to the container
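The three checklist items can be turned into a quick check on a manifest before submitting it. This is a sketch only: `missing_gpu_settings` is a hypothetical helper, and testing for "gpu" in the image tag is a heuristic that happens to hold for the images named in this article.

```python
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def missing_gpu_settings(pod):
    """Return the checklist items that a Pod manifest (as a dict) is missing."""
    spec = pod.get("spec", {})
    host_volumes = {v["name"] for v in spec.get("volumes", []) if "hostPath" in v}
    problems = []
    for container in spec.get("containers", []):
        name = container.get("name", "?")
        # 1. GPU image. Heuristic (assumption): GPU images carry "gpu" in the tag.
        tag = container.get("image", "").rpartition(":")[2]
        if "gpu" not in tag:
            problems.append(f"{name}: image is not a GPU image")
        # 2. GPU request under resources.limits.
        limits = container.get("resources", {}).get("limits", {})
        if not limits.get(GPU_RESOURCE):
            problems.append(f"{name}: no GPU requested")
        # 3. At least one hostPath volume (the drivers) mounted in the container.
        mounted = {m["name"] for m in container.get("volumeMounts", [])}
        if not mounted & host_volumes:
            problems.append(f"{name}: no driver volume mounted")
    return problems


if __name__ == "__main__":
    cpu_only = {"spec": {"containers": [
        {"name": "tensorflow", "image": "tensorflow/tensorflow:1.4.0-py3"}]}}
    for problem in missing_gpu_settings(cpu_only):
        print(problem)
```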

Next time: handy tools for k8s, things to consider when scaling, and so on.
