Introduction
This article walks through building Kubeflow 1.0 on AWS.
The main goal is to verify that it works; production use is not considered at all.
Previous articles
Kubeflow 1.0 on AWS #2: Creating a Notebook
In this article
We run the example TFJob to confirm that the basic functionality works.
References
- https://www.kubeflow.org/docs/components/training/tftraining/
- https://raw.githubusercontent.com/kubeflow/tf-operator/master/examples/v1/mnist_with_summaries/tfevent-volume/tfevent-pv.yaml
- https://raw.githubusercontent.com/kubeflow/tf-operator/master/examples/v1/mnist_with_summaries/tfevent-volume/tfevent-pvc.yaml
- https://raw.githubusercontent.com/kubeflow/tf-operator/master/examples/v1/mnist_with_summaries/tf_job_mnist.yaml
Preparing shared storage with EFS
EFS is used as the data store. Using S3 is also an option, which I plan to try later.
I followed the steps in this article:
https://qiita.com/asahi0301/items/1116c1f030db3136ff49
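I created efs-sc (StorageClass), efs-pv (PV), and efs-claim (PVC), applying them with the namespace anonymous. StorageClass and PV are cluster-scoped, so only the PVC actually lives in that namespace. In the commands below, k is shorthand (an alias) for kubectl.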
k apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
cat <<EOF > efs.yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxxxx  ## change this to the value for your environment
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
EOF
k apply -n anonymous -f efs.yaml
persistentvolume/efs-pv created
persistentvolumeclaim/efs-claim created
storageclass.storage.k8s.io/efs-sc created
# Verify
k get pvc -n anonymous | grep efs
efs-claim Bound efs-pv 5Gi RWX efs-sc 117s
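The Bound check above can also be scripted. The following is a minimal sketch, not part of the original procedure: the function names pvc_status and wait_for_pvc_bound are hypothetical, and it assumes kubectl is on the PATH and that the table layout matches the output shown above (name in the first column, status in the second).

```shell
# Print the STATUS column for the named PVC, reading `kubectl get pvc`
# table output from stdin. (Hypothetical helper, not from the article.)
pvc_status() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Poll until the PVC reports Bound, giving up after a number of attempts.
# Assumes kubectl is configured for the target cluster.
wait_for_pvc_bound() {
  pvc="$1"; ns="$2"; attempts="${3:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status="$(kubectl get pvc -n "$ns" | pvc_status "$pvc")"
    if [ "$status" = "Bound" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}
```

For example, `wait_for_pvc_bound efs-claim anonymous` would return 0 once the claim is bound, which is convenient before submitting a job that mounts it.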
TFJob (single worker)
Running the job
We run TensorFlow MNIST training.
The key point is to disable sidecar injection with `sidecar.istio.io/inject: "false"`.
Without it, the Envoy sidecar keeps running after training finishes and the TensorFlow container exits, so the TFJob stays in the Running state forever.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "mnist"
  namespace: anonymous
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
                - "--log_dir=/train/logs"
                - "--learning_rate=0.01"
                - "--batch_size=150"
              volumeMounts:
                - mountPath: "/train"
                  name: "training"
          volumes:
            - name: "training"
              persistentVolumeClaim:
                claimName: "efs-claim"
k apply -f tf_job_mnist.yaml
tfjob.kubeflow.org/mnist created
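Since the job only completes when no Envoy sidecar is left running, it can be useful to check which containers the worker pod actually has. This is a rough sketch under a few assumptions: the helper names are hypothetical, kubectl is on the PATH, and the injected sidecar container uses Istio's usual name istio-proxy.

```shell
# Print the container names of a pod, space-separated.
# Assumes kubectl is configured for the target cluster.
pod_containers() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.containers[*].name}'
}

# Exit 0 if the space-separated container list on stdin includes the
# Istio sidecar (assumed to be named istio-proxy), 1 otherwise.
has_istio_sidecar() {
  tr ' ' '\n' | grep -qx 'istio-proxy'
}
```

A possible use is `pod_containers mnist-worker-0 anonymous | has_istio_sidecar && echo "sidecar was injected"`; if that message appears, the annotation did not take effect for that pod.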
Verification
k -n anonymous get tfjobs
NAME STATE AGE
mnist Running 16m
k -n anonymous get pod
NAME READY STATUS RESTARTS AGE
mnist-worker-0 2/2 Running 0 15s
k -n anonymous logs -f mnist-worker-0 tensorflow
WARNING:tensorflow:From /var/tf_mnist/mnist_with_summaries.py:39: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:252: wrapped_fn (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please use urllib or similar directly.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: __init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2020-03-26 06:35:37.447521: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1164
Accuracy at step 10: 0.777
Accuracy at step 20: 0.8484
Accuracy at step 30: 0.8958
Accuracy at step 40: 0.9104
Accuracy at step 50: 0.9235
Accuracy at step 60: 0.9296
Accuracy at step 70: 0.9308
Accuracy at step 80: 0.9347
Accuracy at step 90: 0.9348
Adding run metadata for 99
Accuracy at step 100: 0.9388
Accuracy at step 110: 0.9457
Accuracy at step 120: 0.9472
Accuracy at step 130: 0.9491
Accuracy at step 140: 0.9486
Accuracy at step 150: 0.9493
Accuracy at step 160: 0.9532
Accuracy at step 170: 0.9497
Accuracy at step 180: 0.9489
Accuracy at step 190: 0.9545
Adding run metadata for 199
(continues)
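When the log is long, pulling out just the accuracy lines makes progress easier to follow. The sketch below assumes the exact "Accuracy at step N: value" line format shown in the log above; the function names are hypothetical.

```shell
# Print "step accuracy" pairs from training log lines on stdin,
# e.g. "Accuracy at step 10: 0.777" becomes "10 0.777".
accuracy_by_step() {
  awk '/^Accuracy at step/ { sub(":", "", $4); print $4, $5 }'
}

# Print only the most recent accuracy value (nothing if none was logged).
latest_accuracy() {
  accuracy_by_step | awk '{ acc = $2 } END { if (NR > 0) print acc }'
}
```

It could be combined with the log command used above, for example `k -n anonymous logs mnist-worker-0 tensorflow | latest_accuracy`.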
Confirming completion
k get tfjobs -n anonymous
NAME STATE AGE
mnist Succeeded 6m57s
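Rather than re-running the command by hand until the state changes, the check can be wrapped in a polling loop. This is only a sketch: the helper names are hypothetical, it assumes kubectl is on the PATH, and it relies on the NAME/STATE column layout shown above, treating Succeeded and Failed as the terminal states.

```shell
# Print the STATE column for the named TFJob, reading `kubectl get tfjobs`
# table output from stdin.
tfjob_state() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Return 0 when the state is terminal (Succeeded or Failed), 1 otherwise.
is_terminal_state() {
  case "$1" in
    Succeeded|Failed) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll until the TFJob reaches a terminal state.
# Returns 0 on Succeeded and 1 on Failed.
wait_for_tfjob() {
  job="$1"; ns="$2"; interval="${3:-10}"
  while :; do
    state="$(kubectl get tfjobs -n "$ns" | tfjob_state "$job")"
    if is_terminal_state "$state"; then
      echo "$job: $state"
      if [ "$state" = "Succeeded" ]; then return 0; else return 1; fi
    fi
    sleep "$interval"
  done
}
```

For instance, `wait_for_tfjob mnist anonymous` would block until the job finishes. Note that the loop has no timeout, so a job that never leaves Running (for example because of an injected sidecar) would keep it waiting.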
Checking EFS
Preparing a pod to inspect the EFS contents
Anything that reuses the same PVC will do; for example, create a Deployment like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eks-toolkit-deployment
  namespace: anonymous
  labels:
    app: eks-toolkit
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eks-toolkit
  template:
    metadata:
      labels:
        app: eks-toolkit
    spec:
      containers:
        - name: eks-toolkit
          image: asahi0301/eks-toolkit
          command: ["tail"]
          args: ["-f", "/dev/null"]
          volumeMounts:
            - mountPath: "/data"
              name: "data"
      volumes:
        - name: "data"
          persistentVolumeClaim:
            claimName: "efs-claim"
Verification
Open a shell in the pod and confirm that EFS is mounted.
k apply -f eks-toolkit-deployment.yaml
k -n anonymous exec -it eks-toolkit-deployment-7f699fd967-8jfvm bash
Defaulting container name to eks-toolkit.
Use 'kubectl describe pod/eks-toolkit-deployment-7f699fd967-8jfvm -n anonymous' to see all of the containers in this pod.
bash-4.2#
bash-4.2#
bash-4.2# ls
bash-4.2# df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 20959212 4965980 15993232 24% /
tmpfs 65536 0 65536 0% /dev
tmpfs 3932516 0 3932516 0% /sys/fs/cgroup
fs-xxxx.efs.us-west-2.amazonaws.com:/ 9007199254739968 23552 9007199254716416 1% /data
/dev/nvme0n1p1 20959212 4965980 15993232 24% /etc/hosts
shm 65536 4772 60764 8% /dev/shm
tmpfs 3932516 12 3932504 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 3932516 0 3932516 0% /proc/acpi
tmpfs 3932516 0 3932516 0% /sys/firmware
Looking at the contents of EFS, we can see that the logs have been saved.
bash-4.2# pwd
/data/logs
bash-4.2# ls
test train
bash-4.2#
TFJob (distributed training)
Preparing the YAML
Next, we try distributed training with the sample code.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "dist-mnist-pct"
  namespace: anonymous
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: emsixteeen/tf-dist-mnist-test:1.0
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: emsixteeen/tf-dist-mnist-test:1.0
Run
k apply -f tf_job_dist_mnist.yaml
Verification
k -n anonymous get tfjobs
NAME STATE AGE
dist-mnist-pct Succeeded 3m41s
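With one PS and two workers, it helps to know which pods to look for when checking logs. The sketch below is an assumption rather than something stated in this article: it presumes pods are named job-type-index, following the pattern of the mnist-worker-0 pod seen in the single-worker run. The function name expected_pods is hypothetical.

```shell
# Print the pod names expected for one replica type of a TFJob,
# assuming the <job>-<type>-<index> naming seen with mnist-worker-0.
expected_pods() {
  job="$1"; type="$2"; count="$3"
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "${job}-${type}-${i}"
    i=$((i + 1))
  done
}
```

For this job, `expected_pods dist-mnist-pct worker 2` lists the two worker pod names, which could then be passed one by one to `k -n anonymous logs`.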
Summary
We ran training with TFJob.
Both single-worker and distributed training were tried.
EFS served as the shared storage this time; I would like to try S3 in the near future.