Training TensorFlow with GPUs on Kubernetes (3)

Posted at 2017-12-09, last updated at 2017-12-09

This third part covers distributed training with TfJob, which I have not fully digested myself yet. These are notes to myself.

Here is the best tutorial I know of. I am mostly following it.

wbuchwalter/tensorflow-k8s-azure

There are two key points:

・Change the model code
・Write the TfJob spec with a master / worker / parameter server layout

That is the gist.

Changing the model code

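Add the code below. In outline, it does the following:

・Reads the TF_CONFIG environment variable
・Builds the cluster configuration from it
・Builds the server configuration from that
・Determines whether this process is the master

The master creates the session and saves summaries. Workers do neither.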

main.py


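  # Needs `import json`, `import os`, and `import tensorflow as tf` at the top of main.py.
  # TF_CONFIG is a JSON string describing the cluster and this process's task.
  tf_config_json = os.environ.get("TF_CONFIG", "{}")
  tf_config = json.loads(tf_config_json)

  # Build the cluster spec and identify this task within it.
  task = tf_config.get("task", {})
  cluster_spec = tf_config.get("cluster", {})
  cluster_spec_object = tf.train.ClusterSpec(cluster_spec)
  job_name = task["type"]
  task_id = task["index"]

  # Start the in-process gRPC server for this task.
  server_def = tf.train.ServerDef(
      cluster=cluster_spec_object.as_cluster_def(),
      protocol="grpc",
      job_name=job_name,
      task_index=task_id)
  server = tf.train.Server(server_def)

  # Only the master acts as chief (creates the session, saves summaries).
  is_chief = (job_name == 'master')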
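
To see the TF_CONFIG handling without a cluster, here is a minimal sketch of the parsing step using only the standard library. The sample value, its host names, and the `parse_role` helper are hypothetical, written for illustration; the real value is injected into each pod. Falling back to a single "master" task when TF_CONFIG is empty is an assumption of this sketch, not something main.py above does.

```python
import json

# Hypothetical TF_CONFIG for the second worker; host names are made up.
SAMPLE_TF_CONFIG = json.dumps({
    "cluster": {
        "master": ["master-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
})


def parse_role(tf_config_json):
    """Return (job_name, task_index, is_chief) from a TF_CONFIG JSON string."""
    tf_config = json.loads(tf_config_json or "{}")
    task = tf_config.get("task", {})
    # Assumption: with no task info, behave as a single local master.
    job_name = task.get("type", "master")
    task_index = int(task.get("index", 0))
    return job_name, task_index, job_name == "master"


print(parse_role(SAMPLE_TF_CONFIG))  # ('worker', 1, False)
```

With this fallback the same script can also be run locally, where TF_CONFIG is not set.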

In the YAML, specify the master, the workers, and the parameter server. Only the master gets the AzureFile volume configuration.

module6-ex2-gpu.yaml

apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: module6-ex2
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          volumes:
            - name: azurefile
              azureFile:
                  secretName: azure-secret
                  shareName: acsshare
                  readOnly: false
          containers:
            - image: tsuyoshiushio/minstexp:gpu
              name: tensorflow
              resources:
                requests:
                  alpha.kubernetes.io/nvidia-gpu: 1
              volumeMounts:
                - mountPath: /tmp/tensorflow
                  subPath: module6-ex2 # Again we isolate the logs in a new directory on Azure Files
                  name: azurefile
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: tsuyoshiushio/minstexp:gpu
              name: tensorflow
              resources:
                requests:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
  tensorboard:
    logDir: /tmp/tensorflow/logs
    serviceType: LoadBalancer # We request a public IP for our TensorBoard instance
    volumes:
      - name: azurefile
        azureFile:
            secretName: azure-secret
            shareName: acsshare
    volumeMounts:
      - mountPath: /tmp/tensorflow/ # This could be any other path. All that matters is that logDir reflects it.
        subPath: module6-ex2 # This should match the directory our Master is actually writing in
        name: azurefile
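
As a quick sanity check on cluster capacity, the GPU requests in this manifest can be added up. This is a rough sketch: the `REPLICAS` table is copied by hand from the YAML above, and it assumes the PS replica, which has no template here, requests no GPU.

```python
# Replica layout copied from the manifest above:
# replica type -> (replicas, GPUs requested per replica).
# Assumption: the PS entry has no template, so it is counted as 0 GPUs.
REPLICAS = {
    "MASTER": (1, 1),
    "WORKER": (2, 1),
    "PS": (1, 0),
}


def total_gpus(replicas):
    """Sum replicas * per-replica GPU request over all replica types."""
    return sum(count * gpus for count, gpus in replicas.values())


print(total_gpus(REPLICAS))  # 1*1 + 2*1 + 1*0 = 3
```

Under that assumption, the job needs three schedulable GPUs in the cluster before all of its pods can start.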

Results

$ kubectl get jobs
NAME                                 DESIRED   SUCCESSFUL   AGE
module7-tf-paint-0-0-master-qdna-0   1         1            4h
module7-tf-paint-0-1-master-yuxn-0   1         1            4h
module7-tf-paint-0-2-master-eytn-0   1         1            4h
module7-tf-paint-1-0-master-eeie-0   1         1            4h
module7-tf-paint-1-1-master-xtmz-0   1         1            4h
module7-tf-paint-1-2-master-ffr4-0   1         1            4h
module7-tf-paint-2-0-master-evil-0   1         1            4h
module7-tf-paint-2-1-master-3mza-0   1         1            4h
module7-tf-paint-2-2-master-vab2-0   1         1            4h
