Kubeflow 1.0 on AWS #3: Running a TFJob

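Introduction

This article is about building Kubeflow 1.0 on AWS.
The main goal is to verify that things work; production use is not considered at all.

Previously

Kubeflow 1.0 on AWS #2: Creating a Notebook

In this article

I run the example TFJobs to confirm that the basic functionality works.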

References

Preparing EFS as shared storage

EFS is used as the data store. S3 would also be an option, but I plan to try that later.
I followed the steps in this article:
https://qiita.com/asahi0301/items/1116c1f030db3136ff49

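I created efs-sc (StorageClass), efs-pv (PV), and efs-claim (PVC), with the PVC in the anonymous namespace. PersistentVolume and StorageClass are cluster-scoped, so the -n anonymous flag below only affects the PVC. In the commands, k is an alias for kubectl.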

k apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"

cat <<EOF > efs.yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxxxx  # replace with your own EFS file system ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
EOF

k apply -n anonymous -f efs.yaml
persistentvolume/efs-pv created
persistentvolumeclaim/efs-claim created
storageclass.storage.k8s.io/efs-sc created
# Verify
k get pvc -n anonymous | grep efs
efs-claim         Bound    efs-pv                                     5Gi        RWX            efs-sc         117s
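
To script the Bound check above instead of grepping the table, something like the following sketch could be used. The function names `pvc_phase` and `require_bound_pvc` and the `KUBECTL` override variable are my own, hypothetical names; the only assumption about the cluster is that the PVC reports its state in `.status.phase`.

```shell
# Hypothetical helper: KUBECTL can be overridden (for example with a stub in tests).
KUBECTL="${KUBECTL:-kubectl}"

# Print the phase (Pending, Bound, Lost) of a PVC.
# $1 = namespace, $2 = PVC name
pvc_phase() {
  "$KUBECTL" -n "$1" get pvc "$2" -o jsonpath='{.status.phase}'
}

# Return 0 only when the PVC is Bound; otherwise report the phase on stderr.
require_bound_pvc() {
  phase="$(pvc_phase "$1" "$2")"
  if [ "$phase" != "Bound" ]; then
    echo "PVC $2 in namespace $1 is '$phase', expected Bound" >&2
    return 1
  fi
}
```

For the setup in this article the call would be `require_bound_pvc anonymous efs-claim`, run before submitting any job that mounts the claim.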

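TFJob (single worker)

Running the job

Run TensorFlow training on MNIST. Save the manifest below as tf_job_mnist.yaml.
The key point is to disable sidecar injection with `sidecar.istio.io/inject: "false"`.
Without it, the Envoy sidecar keeps running even after training finishes and the TensorFlow container stops, so the TFJob stays in the Running state forever.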

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "mnist"
  namespace: anonymous
spec:
  cleanPodPolicy: None 
  tfReplicaSpecs:
    Worker:
      replicas: 1 
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
                - "--log_dir=/train/logs"
                - "--learning_rate=0.01"
                - "--batch_size=150"
              volumeMounts:
                - mountPath: "/train"
                  name: "training"
          volumes:
            - name: "training"
              persistentVolumeClaim:
                claimName: "efs-claim"

k apply -f tf_job_mnist.yaml
tfjob.kubeflow.org/mnist created
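
Because a missing annotation only shows up later as a job that never finishes, it can help to check the manifest before applying it. The following is a rough sketch, not a YAML parser: it just compares the number of `template:` lines with the number of `sidecar.istio.io/inject: "false"` lines. The function name `check_sidecar_disabled` is hypothetical.

```shell
# Return 0 when the manifest has at least one "template:" block and the same
# number of sidecar.istio.io/inject: "false" annotation lines.
# This is a line count only; it does not verify which block each annotation is in.
# $1 = path to the TFJob manifest
check_sidecar_disabled() {
  manifest="$1"
  templates=$(grep -c '^ *template:' "$manifest" || true)
  disabled=$(grep -c 'sidecar.istio.io/inject: *"false"' "$manifest" || true)
  [ "$templates" -gt 0 ] && [ "$templates" -eq "$disabled" ]
}
```

A possible use is `check_sidecar_disabled tf_job_mnist.yaml && k apply -f tf_job_mnist.yaml`, so that a manifest without the annotation is never submitted.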

Verification

k -n anonymous get tfjobs
NAME    STATE     AGE
mnist   Running   16m

k -n anonymous get pod
NAME             READY   STATUS    RESTARTS   AGE
mnist-worker-0   2/2     Running   0          15s

k -n anonymous logs -f mnist-worker-0 tensorflow
WARNING:tensorflow:From /var/tf_mnist/mnist_with_summaries.py:39: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:252: wrapped_fn (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please use urllib or similar directly.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: __init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2020-03-26 06:35:37.447521: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1164
Accuracy at step 10: 0.777
Accuracy at step 20: 0.8484
Accuracy at step 30: 0.8958
Accuracy at step 40: 0.9104
Accuracy at step 50: 0.9235
Accuracy at step 60: 0.9296
Accuracy at step 70: 0.9308
Accuracy at step 80: 0.9347
Accuracy at step 90: 0.9348
Adding run metadata for 99
Accuracy at step 100: 0.9388
Accuracy at step 110: 0.9457
Accuracy at step 120: 0.9472
Accuracy at step 130: 0.9491
Accuracy at step 140: 0.9486
Accuracy at step 150: 0.9493
Accuracy at step 160: 0.9532
Accuracy at step 170: 0.9497
Accuracy at step 180: 0.9489
Accuracy at step 190: 0.9545
Adding run metadata for 199
(continues)

Confirming completion

k  get tfjobs -n anonymous
NAME    STATE       AGE
mnist   Succeeded   6m57s
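
Rather than re-running the command by hand until the state changes, the check can be wrapped in a small polling loop. This is a minimal sketch that parses the same NAME/STATE/AGE table shown above; `tfjob_state`, `wait_for_tfjob`, and the `KUBECTL` variable are hypothetical names, and treating `Failed` as the failure state is my assumption.

```shell
# Hypothetical helper: KUBECTL can be overridden (for example with a stub in tests).
KUBECTL="${KUBECTL:-kubectl}"

# Read `kubectl get tfjobs` table output on stdin and print the STATE of one job.
# $1 = TFJob name
tfjob_state() {
  awk -v name="$1" 'NR > 1 && $1 == name { print $2 }'
}

# Poll until the job is Succeeded (return 0) or Failed (return 1).
# Returns 2 when the attempts run out first.
# $1 = namespace, $2 = TFJob name, $3 = attempts (default 60), $4 = seconds between polls (default 10)
wait_for_tfjob() {
  ns="$1"; job="$2"; attempts="${3:-60}"; interval="${4:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state="$("$KUBECTL" -n "$ns" get tfjobs | tfjob_state "$job")"
    case "$state" in
      Succeeded) return 0 ;;
      Failed) return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  return 2
}
```

For the job above, `wait_for_tfjob anonymous mnist` would block for up to about ten minutes with the default settings.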

Checking EFS

Preparing a pod to inspect the EFS contents

Any pod that mounts the same PVC will do; for example, create a Deployment like this:

eks-toolkit-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eks-toolkit-deployment
  namespace: anonymous
  labels:
    app: eks-toolkit
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eks-toolkit
  template:
    metadata:
      labels:
        app: eks-toolkit
    spec:
      containers:
      - name: eks-toolkit
        image: asahi0301/eks-toolkit
        command: ["tail"]
        args: ["-f", "/dev/null"] 
        volumeMounts:
        - mountPath: "/data"
          name: "data"
      volumes:
      - name: "data"
        persistentVolumeClaim:
          claimName: "efs-claim" 

Verification

Apply the Deployment, then open a shell in the pod and confirm that EFS is mounted.

k apply -f eks-toolkit-deployment.yaml

k -n anonymous exec -it eks-toolkit-deployment-7f699fd967-8jfvm -- bash
Defaulting container name to eks-toolkit.
Use 'kubectl describe pod/eks-toolkit-deployment-7f699fd967-8jfvm -n anonymous' to see all of the containers in this pod.
bash-4.2# 
bash-4.2# 
bash-4.2# ls
bash-4.2# df
Filesystem                                       1K-blocks    Used        Available Use% Mounted on
overlay                                           20959212 4965980         15993232  24% /
tmpfs                                                65536       0            65536   0% /dev
tmpfs                                              3932516       0          3932516   0% /sys/fs/cgroup
fs-xxxx.efs.us-west-2.amazonaws.com:/ 9007199254739968   23552 9007199254716416   1% /data
/dev/nvme0n1p1                                    20959212 4965980         15993232  24% /etc/hosts
shm                                                  65536    4772            60764   8% /dev/shm
tmpfs                                              3932516      12          3932504   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                              3932516       0          3932516   0% /proc/acpi
tmpfs                                              3932516       0          3932516   0% /sys/firmware

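Looking inside EFS shows that the logs were saved. The job mounted the volume at /train and wrote to /train/logs, so the same files appear under /data/logs here.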

bash-4.2# pwd
/data/logs
bash-4.2# ls
test  train
bash-4.2# 
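
The pod name used above contains a generated suffix, so it differs on every rollout. As a sketch, the pod can be looked up through the `app=eks-toolkit` label from the Deployment and the listing run without an interactive shell. The names `first_pod_for_label`, `list_efs_logs`, and `KUBECTL` are hypothetical.

```shell
# Hypothetical helper: KUBECTL can be overridden (for example with a stub in tests).
KUBECTL="${KUBECTL:-kubectl}"

# Print the name of the first pod matching a label selector.
# $1 = namespace, $2 = label selector such as app=eks-toolkit
first_pod_for_label() {
  "$KUBECTL" -n "$1" get pod -l "$2" -o jsonpath='{.items[0].metadata.name}'
}

# List the training logs on EFS through the toolkit pod (mounted at /data).
list_efs_logs() {
  pod="$(first_pod_for_label anonymous app=eks-toolkit)" || return 1
  if [ -z "$pod" ]; then
    echo "no eks-toolkit pod found" >&2
    return 1
  fi
  "$KUBECTL" -n anonymous exec "$pod" -- ls /data/logs
}
```

Calling `list_efs_logs` should print the same `test` and `train` directories as the interactive session, provided the toolkit pod is running.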

TFJob (distributed training)

Preparing the YAML

Next, I try distributed training with the sample code.

tf_job_dist_mnist.yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "dist-mnist-pct"
  namespace: anonymous
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: emsixteeen/tf-dist-mnist-test:1.0
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: emsixteeen/tf-dist-mnist-test:1.0

Run

k apply -f tf_job_dist_mnist.yaml

Verification

k -n anonymous get tfjobs
NAME             STATE       AGE
dist-mnist-pct   Succeeded   3m41s
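
With one PS and two workers there are three pods to look at. Judging from `mnist-worker-0` in the single-worker run, pod names appear to follow a `<job>-<replica type in lowercase>-<index>` pattern; that pattern is my assumption, not something confirmed here for the PS replica. Under that assumption, the expected names can be generated like this (`tfjob_pod_names` is a hypothetical helper):

```shell
# Print the pod names expected for a TFJob, assuming the
# <job>-<replica type>-<index> naming seen in mnist-worker-0.
# $1 = TFJob name, remaining args = type=count pairs such as ps=1 worker=2
tfjob_pod_names() {
  job="$1"
  shift
  for spec in "$@"; do
    type="${spec%%=*}"    # text before "="
    count="${spec##*=}"   # text after "="
    i=0
    while [ "$i" -lt "$count" ]; do
      echo "${job}-${type}-${i}"
      i=$((i + 1))
    done
  done
}
```

For example, `for p in $(tfjob_pod_names dist-mnist-pct ps=1 worker=2); do k -n anonymous logs "$p"; done` would fetch the logs of each replica, as long as the pods still exist.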

Summary

I ran training with TFJob.
I tried both a single worker and distributed training.
This time I used EFS as shared storage, and I would like to try S3 soon.
