More than 5 years have passed since last update.

EKSにKubeFlow1.0.2を構築する

Last updated at 2020-07-28Posted at 2020-07-27

はじめに

Installing Kubeflowページにある「Cloud deployment」をやってみます。全体の手順はこちらに従っています。

※ Q.なぜGCPを使わないのか A.会社でAWSを使うからです。
k8sを使うだけならGCPが圧倒的に使いやすいです。比較記事を参照してください。

環境

開発PC（macOS Catalina 10.15.6（19G73））
AWS

手順

AWS CLI のインストール

AWS CLIをインストールします。手順はこちらを参考にしてください。
AWS CLIのコンフィグレーションを実行しておきます。手順はこちらを参考にしてください。

$ aws --version
aws-cli/1.18.105 Python/3.6.5 Darwin/19.6.0 botocore/1.17.28

kubectl のインストール

開発PCにkubectlをインストールします手順はこちらを参考にしてください。Homebrewを使ってインストールできます。

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.4", GitCommit:"c96aede7b5205121079932896c4ad89bb93260af", GitTreeState:"clean", BuildDate:"2020-06-18T02:59:13Z", GoVersion:"go1.14.3", Compiler:"gc", Platform:"darwin/amd64"}

eksctl (version 0.1.31 or newer)と aws-iam-authenticator のインストール

eksctlをインストールします手順はこちらを参考にしてください。Homebrewを使ってインストールできます。

$ eksctl version
0.24.0

AWS IAM Authenticator for Kubernetes のインストールですが、AWS CLI のバージョン 1.16.156 以降をインストールしている場合はAWS CLIに内包されているのでインストールは不要です。代わりにaws eks get-tokenコマンドを使用します。今回はAWS CLIを使用します。
インストールする場合の手順はこちらを参考にしてください。

EKS クラスタ作成

EKSの解説はこちらの記事がとてもわかり易いので、事前に一読しておくことを推奨します。

eksctlを使用してクラスタを作成します。Overview of Deployment on Existing Clustersによると、2020年7月現在、Kubeflow 1.0 はKubernetesのバージョン1.14または1.15のみ動作確認が行われています。一方でAWSのユーザーガイドによると、eksctlがサポートしているKubernetesのバージョンは1.15、1.16、1.17です。したがって今回は両方を満たすバージョン1.15を使用します。

ワーカーノードの要件はMinimum system requirementsに記載があります。

4 CPU
50 GB storage
12 GB memory

これを満たすインスタンスとしてm5.largeを使用します。インスタンスのスペックはこちらから確認できます。ボリュームサイズはクラスタ作成時のコマンドで指定します。

次のコマンドでクラスタを作成します。今回はkubeflowという名前のクラスタを作成します。リージョンは東京を選択しました（使用可能なリージョンはこちら）。ワーカーノード数は3、ボリュームのサイズは70GBとしています（ワーカーノード数が少ないとセカンダリIPが足りなくてPendingしてしまいますので3以上推奨です。）。また、ワーカーノードはEC2を使用するので、--managedオプションを使用します。
コマンドの詳細はEKS解説記事が分かりやすいです。またはeksctlの公式ドキュメントを参照してください。

$ eksctl create cluster \
--name kubeflow \
--version 1.15 \
--region ap-northeast-1 \
--nodegroup-name standard-workers \
--node-type m5.large \
--nodes 3 \
--nodes-min 2 \
--nodes-max 3 \
--node-volume-size 70 \
--managed

eksctlは、AWSの複数のリソースを作成してkubernetesを使用するための環境を作成します。eksctlが具体的に何をやっているのかはクラスメソッドさんの記事が分かりやすいです。

次のようなログが標準出力に出力され、クラスタが作成され始めます。クラスターのプロビジョニングには通常、10 ~ 15 分かかります。

[ℹ]  eksctl version 0.24.0
[ℹ]  using region ap-northeast-1
[ℹ]  setting availability zones to [ap-northeast-1a ap-northeast-1c ap-northeast-1d]
[ℹ]  subnets for ap-northeast-1a - public:192.168.0.0/19 private:192.168.96.0/19
[ℹ]  subnets for ap-northeast-1c - public:192.168.32.0/19 private:192.168.128.0/19
[ℹ]  subnets for ap-northeast-1d - public:192.168.64.0/19 private:192.168.160.0/19
[ℹ]  nodegroup "standard-workers" will use "ami-0bb8387c124770444" [AmazonLinux2/1.15]
[ℹ]  using Kubernetes version 1.15
[ℹ]  creating EKS cluster "kubeflow" in "ap-northeast-1" region with un-managed nodes
[ℹ]  will create 2 separate CloudFormation stacks for cluster itself and the initial nodegroup
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=ap-northeast-1 --cluster=kubeflow'
[ℹ]  CloudWatch logging will not be enabled for cluster "kubeflow" in "ap-northeast-1"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=ap-northeast-1 --cluster=kubeflow'
[ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "kubeflow" in "ap-northeast-1"
[ℹ]  2 sequential tasks: { create cluster control plane "kubeflow", 2 sequential sub-tasks: { no tasks, create nodegroup "standard-workers" } }
[ℹ]  building cluster stack "eksctl-kubeflow-cluster"
[ℹ]  deploying stack "eksctl-kubeflow-cluster"

作成されたら次のコマンドで応答が返ってくるか確認します。

$ kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   19m

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-16T00:04:31Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-14f01f", GitCommit:"14f01fe8f04411d5e187b220034ca2117d79f7de", GitTreeState:"clean", BuildDate:"2020-05-23T21:32:47Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

コンソールで作成されたリソースを確認することが出来ます。

EKSが使えるか、少し試してみましょう。

$ kubectl create deployment hello-node --image=k8s.gcr.io/echoserver:1.4
deployment.apps/hello-node created

$ kubectl get deployments
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
hello-node   1/1     1            1           72s

$ kubectl get replicasets
NAME                   DESIRED   CURRENT   READY   AGE
hello-node-bcbcd7f76   1         1         1       2m17s

$ kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
hello-node-bcbcd7f76-skspt   1/1     Running   0          8s

これでサンプルデプロイメントが作成されました。
次にkubectl expose コマンドを使用してPodをインターネットに公開します。

$ kubectl expose deployment hello-node --type=LoadBalancer --port=8080
service/hello-node exposed

$ kubectl get services
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP                                                                    PORT(S)          AGE
hello-node   LoadBalancer   10.100.115.107   xxxxx.ap-northeast-1.elb.amazonaws.com   8080:30979/TCP   31s
kubernetes   ClusterIP      10.100.0.1       <none>                                                                         443/TCP          40m

xxxxx.ap-northeast-1.elb.amazonaws.com:8080にアクセスすると次のレスポンスが返ってきてサーバが公開されていることがわかります。

CLIENT VALUES:
client_address=xx.xx.xx.xx
command=GET
real path=/
query=nil
request_version=1.1
request_uri=http://xxxxx.ap-northeast-1.elb.amazonaws.com:8080/

SERVER VALUES:
server_version=nginx: 1.10.0 - lua: 10001

HEADERS RECEIVED:
accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding=gzip, deflate
accept-language=ja-JP,ja;q=0.9,en-US;q=0.8,en;q=0.7
cache-control=max-age=0
connection=keep-alive
host=xxxxx.ap-northeast-1.elb.amazonaws.com:8080
upgrade-insecure-requests=1
user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36
BODY:
-no body in request-

確認できたらリソースを削除します。

$ kubectl delete service hello-node
service "hello-node" deleted

$ kubectl delete deployment hello-node
deployment.extensions "hello-node" deleted

このHelloWorldはこちらで確認できます。

以上でkubernetesクラスタの準備が完了しました。

KubeFlow構築

こちらの手順に従ってインストールしていきます。

まずkfctlをインストールします。Macなのでplatformはdarwinです。

$ wget https://github.com/kubeflow/kfctl/releases/download/v1.0.2/kfctl_v1.0.2-0-ga476281_darwin.tar.gz
$ tar -xvf kfctl_v1.0.2-0-ga476281_darwin.tar.gz
$ mv kfctl /usr/local/bin/kfctl
$ rm kfctl_v1.0.2-0-ga476281_darwin.tar.gz

パスが通っているかの確認ついでにバージョンを確認します。

$ kfctl version
kfctl v1.0.2-0-ga476281

インストールに必要な変数をセットします。

$ export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_aws.v1.0.2.yaml"
$ export AWS_CLUSTER_NAME=kubeflow
$ export KF_NAME=${AWS_CLUSTER_NAME}
$ export BASE_DIR=~/kubeflow
$ export KF_DIR=${BASE_DIR}/${KF_NAME}

コンフィグファイルをダウンロードします。

$ mkdir -p ${KF_DIR}
$ cd ${KF_DIR}

$ wget -O kfctl_aws.yaml $CONFIG_URI
$ export CONFIG_FILE=${KF_DIR}/kfctl_aws.yaml

コンフィグを編集します。KubeFlowではv1.0.1からKubeFlowのServiceAccountにAWS IAM Roleが使用できるようなのですが、v1.0.2でやってみたところ、Roleが作成できないというエラーが発生してしまい、うまくいきませんでした。そこで従来のUse Node Group Roleの方法でRoleを設定します。

コンフィグ中のkubeflow-aws を ${AWS_CLUSTER_NAME}に置き換えます。

$ sed -i'.bak' -e 's/kubeflow-aws/'"$AWS_CLUSTER_NAME"'/' ${CONFIG_FILE}
$ aws iam list-roles \
    | jq -r ".Roles[] \
    | select(.RoleName \
    | startswith(\"eksctl-$AWS_CLUSTER_NAME\") and contains(\"NodeInstanceRole\")) \
    .RoleName"

eksctl-kubeflow-nodegroup-standar-NodeInstanceRole-1X5H1J5YPHLPC

標準出力に出力されたRole名をコピーし、コンフィグの.spec.plugins.spec.rolesに設定します。

  plugins:
  - kind: KfAwsPlugin
    metadata:
      name: aws
    spec:
      auth:
        basicAuth:
          password:
            name: password
          username: admin
      region: ap-northeast-1
      roles:
      - eksctl-kubeflow-nodegroup-standar-NodeInstanceRole-1X5H1J5YPHLPC

KubeFlowをデプロイします。

$ cd ${KF_DIR}
$ kfctl apply -V -f ${CONFIG_FILE}

全てのリソースがreadyになるのを待ちます。次のコマンドで確認できます。

$ kubectl -n kubeflow get all

STATUSを見てRunningまたはCompletedになっていれば問題ないと思います。nvidia-device-plugin-daemonsetがPendingのままなのですが、GPUインスタンスを使っていないからだと思います（多分）。なので今回はスルーします。

NAME                                                               READY   STATUS      RESTARTS   AGE
pod/admission-webhook-bootstrap-stateful-set-0                     1/1     Running     0          19h
pod/admission-webhook-deployment-569558c8b6-dbwqc                  1/1     Running     0          19h
pod/alb-ingress-controller-7c4f854447-d2pwk                        1/1     Running     0          19h
pod/application-controller-stateful-set-0                          1/1     Running     0          19h
pod/argo-ui-7ffb9b6577-5rkzb                                       1/1     Running     0          19h
pod/centraldashboard-659bd78c-j5zqs                                1/1     Running     0          19h
pod/jupyter-web-app-deployment-679d5f5dc4-2xwn2                    1/1     Running     0          19h
pod/katib-controller-7f58569f7d-dsb92                              1/1     Running     1          19h
pod/katib-db-manager-54b66f9f9d-g6pxr                              1/1     Running     1          19h
pod/katib-mysql-dcf7dcbd5-mvq42                                    1/1     Running     0          19h
pod/katib-ui-6f97756598-5fc7k                                      1/1     Running     0          19h
pod/kfserving-controller-manager-0                                 2/2     Running     1          19h
pod/metacontroller-0                                               1/1     Running     0          19h
pod/metadata-db-65fb5b695d-442jv                                   1/1     Running     0          19h
pod/metadata-deployment-65ccddfd4c-xn7cp                           1/1     Running     0          19h
pod/metadata-envoy-deployment-7754f56bff-fjcck                     1/1     Running     0          19h
pod/metadata-grpc-deployment-5c6db9749-pb752                       1/1     Running     4          19h
pod/metadata-ui-7c85545947-nl8h4                                   1/1     Running     0          19h
pod/minio-6b67f98977-jcz5j                                         1/1     Running     0          19h
pod/ml-pipeline-6cf777c7bc-d7dlc                                   1/1     Running     0          19h
pod/ml-pipeline-ml-pipeline-visualizationserver-6d744dd449-rr8df   1/1     Running     0          19h
pod/ml-pipeline-persistenceagent-5c549847fd-4c5bb                  1/1     Running     0          19h
pod/ml-pipeline-scheduledworkflow-674777d89c-rvrtc                 1/1     Running     0          19h
pod/ml-pipeline-ui-549b5b6744-hkv7d                                1/1     Running     0          19h
pod/ml-pipeline-viewer-controller-deployment-fc7f7cb65-thjm5       1/1     Running     0          19h
pod/mpi-operator-548d8cdbbd-ndwkc                                  1/1     Running     0          19h
pod/mysql-85bc64f5c4-gvwph                                         1/1     Running     0          19h
pod/notebook-controller-deployment-5c55f5845b-n9xgn                1/1     Running     0          19h
pod/nvidia-device-plugin-daemonset-5rmbb                           0/1     Pending     0          19h
pod/nvidia-device-plugin-daemonset-7pqv2                           0/1     Pending     0          19h
pod/nvidia-device-plugin-daemonset-j4jc7                           1/1     Running     0          18m
pod/profiles-deployment-57d44f597d-mhl87                           2/2     Running     0          19h
pod/pytorch-operator-cf8c5c497-gbt6z                               1/1     Running     0          19h
pod/seldon-controller-manager-6b4b969447-5vzqx                     1/1     Running     0          19h
pod/spark-operatorcrd-cleanup-h8454                                0/2     Completed   0          19h
pod/spark-operatorsparkoperator-76dd5f5688-7d4fv                   1/1     Running     0          19h
pod/spartakus-volunteer-57d9875c96-sr8lj                           1/1     Running     0          19h
pod/tensorboard-5f685f9d79-745tf                                   1/1     Running     0          19h
pod/tf-job-operator-5fb85c5fb7-79hn9                               1/1     Running     0          19h
pod/workflow-controller-689d6c8846-t2wvp                           1/1     Running     0          19h

NAME                                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/admission-webhook-service                      ClusterIP   10.100.51.140    <none>        443/TCP             19h
service/application-controller-service                 ClusterIP   10.100.15.143    <none>        443/TCP             19h
service/argo-ui                                        NodePort    10.100.18.201    <none>        80:31463/TCP        19h
service/centraldashboard                               ClusterIP   10.100.16.227    <none>        80/TCP              19h
service/jupyter-web-app-service                        ClusterIP   10.100.60.58     <none>        80/TCP              19h
service/katib-controller                               ClusterIP   10.100.11.97     <none>        443/TCP,8080/TCP    19h
service/katib-db-manager                               ClusterIP   10.100.250.190   <none>        6789/TCP            19h
service/katib-mysql                                    ClusterIP   10.100.27.243    <none>        3306/TCP            19h
service/katib-ui                                       ClusterIP   10.100.215.210   <none>        80/TCP              19h
service/kfserving-controller-manager-metrics-service   ClusterIP   10.100.40.10     <none>        8443/TCP            19h
service/kfserving-controller-manager-service           ClusterIP   10.100.41.230    <none>        443/TCP             19h
service/kfserving-webhook-server-service               ClusterIP   10.100.124.190   <none>        443/TCP             19h
service/metadata-db                                    ClusterIP   10.100.174.250   <none>        3306/TCP            19h
service/metadata-envoy-service                         ClusterIP   10.100.80.246    <none>        9090/TCP            19h
service/metadata-grpc-service                          ClusterIP   10.100.247.63    <none>        8080/TCP            19h
service/metadata-service                               ClusterIP   10.100.59.187    <none>        8080/TCP            19h
service/metadata-ui                                    ClusterIP   10.100.234.158   <none>        80/TCP              19h
service/minio-service                                  ClusterIP   10.100.78.186    <none>        9000/TCP            19h
service/ml-pipeline                                    ClusterIP   10.100.0.209     <none>        8888/TCP,8887/TCP   19h
service/ml-pipeline-ml-pipeline-visualizationserver    ClusterIP   10.100.115.213   <none>        8888/TCP            19h
service/ml-pipeline-tensorboard-ui                     ClusterIP   10.100.148.88    <none>        80/TCP              19h
service/ml-pipeline-ui                                 ClusterIP   10.100.112.177   <none>        80/TCP              19h
service/mysql                                          ClusterIP   10.100.92.26     <none>        3306/TCP            19h
service/notebook-controller-service                    ClusterIP   10.100.107.0     <none>        443/TCP             19h
service/profiles-kfam                                  ClusterIP   10.100.176.124   <none>        8081/TCP            19h
service/pytorch-operator                               ClusterIP   10.100.53.116    <none>        8443/TCP            19h
service/seldon-webhook-service                         ClusterIP   10.100.11.184    <none>        443/TCP             19h
service/tensorboard                                    ClusterIP   10.100.165.203   <none>        9000/TCP            19h
service/tf-job-operator                                ClusterIP   10.100.7.225     <none>        8443/TCP            19h

NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/nvidia-device-plugin-daemonset   3         3         1       3            1           <none>          19h

NAME                                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/admission-webhook-deployment                  1/1     1            1           19h
deployment.apps/alb-ingress-controller                        1/1     1            1           19h
deployment.apps/argo-ui                                       1/1     1            1           19h
deployment.apps/centraldashboard                              1/1     1            1           19h
deployment.apps/jupyter-web-app-deployment                    1/1     1            1           19h
deployment.apps/katib-controller                              1/1     1            1           19h
deployment.apps/katib-db-manager                              1/1     1            1           19h
deployment.apps/katib-mysql                                   1/1     1            1           19h
deployment.apps/katib-ui                                      1/1     1            1           19h
deployment.apps/metadata-db                                   1/1     1            1           19h
deployment.apps/metadata-deployment                           1/1     1            1           19h
deployment.apps/metadata-envoy-deployment                     1/1     1            1           19h
deployment.apps/metadata-grpc-deployment                      1/1     1            1           19h
deployment.apps/metadata-ui                                   1/1     1            1           19h
deployment.apps/minio                                         1/1     1            1           19h
deployment.apps/ml-pipeline                                   1/1     1            1           19h
deployment.apps/ml-pipeline-ml-pipeline-visualizationserver   1/1     1            1           19h
deployment.apps/ml-pipeline-persistenceagent                  1/1     1            1           19h
deployment.apps/ml-pipeline-scheduledworkflow                 1/1     1            1           19h
deployment.apps/ml-pipeline-ui                                1/1     1            1           19h
deployment.apps/ml-pipeline-viewer-controller-deployment      1/1     1            1           19h
deployment.apps/mpi-operator                                  1/1     1            1           19h
deployment.apps/mysql                                         1/1     1            1           19h
deployment.apps/notebook-controller-deployment                1/1     1            1           19h
deployment.apps/profiles-deployment                           1/1     1            1           19h
deployment.apps/pytorch-operator                              1/1     1            1           19h
deployment.apps/seldon-controller-manager                     1/1     1            1           19h
deployment.apps/spark-operatorsparkoperator                   1/1     1            1           19h
deployment.apps/spartakus-volunteer                           1/1     1            1           19h
deployment.apps/tensorboard                                   1/1     1            1           19h
deployment.apps/tf-job-operator                               1/1     1            1           19h
deployment.apps/workflow-controller                           1/1     1            1           19h

NAME                                                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/admission-webhook-deployment-569558c8b6                  1         1         1       19h
replicaset.apps/alb-ingress-controller-7c4f854447                        1         1         1       19h
replicaset.apps/argo-ui-7ffb9b6577                                       1         1         1       19h
replicaset.apps/centraldashboard-659bd78c                                1         1         1       19h
replicaset.apps/jupyter-web-app-deployment-679d5f5dc4                    1         1         1       19h
replicaset.apps/katib-controller-7f58569f7d                              1         1         1       19h
replicaset.apps/katib-db-manager-54b66f9f9d                              1         1         1       19h
replicaset.apps/katib-mysql-dcf7dcbd5                                    1         1         1       19h
replicaset.apps/katib-ui-6f97756598                                      1         1         1       19h
replicaset.apps/metadata-db-65fb5b695d                                   1         1         1       19h
replicaset.apps/metadata-deployment-65ccddfd4c                           1         1         1       19h
replicaset.apps/metadata-envoy-deployment-7754f56bff                     1         1         1       19h
replicaset.apps/metadata-grpc-deployment-5c6db9749                       1         1         1       19h
replicaset.apps/metadata-ui-7c85545947                                   1         1         1       19h
replicaset.apps/minio-6b67f98977                                         1         1         1       19h
replicaset.apps/ml-pipeline-6cf777c7bc                                   1         1         1       19h
replicaset.apps/ml-pipeline-ml-pipeline-visualizationserver-6d744dd449   1         1         1       19h
replicaset.apps/ml-pipeline-persistenceagent-5c549847fd                  1         1         1       19h
replicaset.apps/ml-pipeline-scheduledworkflow-674777d89c                 1         1         1       19h
replicaset.apps/ml-pipeline-ui-549b5b6744                                1         1         1       19h
replicaset.apps/ml-pipeline-viewer-controller-deployment-fc7f7cb65       1         1         1       19h
replicaset.apps/mpi-operator-548d8cdbbd                                  1         1         1       19h
replicaset.apps/mysql-85bc64f5c4                                         1         1         1       19h
replicaset.apps/notebook-controller-deployment-5c55f5845b                1         1         1       19h
replicaset.apps/profiles-deployment-57d44f597d                           1         1         1       19h
replicaset.apps/pytorch-operator-cf8c5c497                               1         1         1       19h
replicaset.apps/seldon-controller-manager-6b4b969447                     1         1         1       19h
replicaset.apps/spark-operatorsparkoperator-76dd5f5688                   1         1         1       19h
replicaset.apps/spartakus-volunteer-57d9875c96                           1         1         1       19h
replicaset.apps/tensorboard-5f685f9d79                                   1         1         1       19h
replicaset.apps/tf-job-operator-5fb85c5fb7                               1         1         1       19h
replicaset.apps/workflow-controller-689d6c8846                           1         1         1       19h

NAME                                                        READY   AGE
statefulset.apps/admission-webhook-bootstrap-stateful-set   1/1     19h
statefulset.apps/application-controller-stateful-set        1/1     19h
statefulset.apps/kfserving-controller-manager               1/1     19h
statefulset.apps/metacontroller                             1/1     19h

NAME                                  COMPLETIONS   DURATION   AGE
job.batch/spark-operatorcrd-cleanup   1/1           69s        19h

istio-ingressgatewayを使用してポートフォワードします。

$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80

127.0.0.1:8080にアクセスするとダッシュボードが開きます。

「Start Setup」をクリックします。

namespaceを設定します。今回はデフォルトのanonymousとしました。

これでKubeFlowを使う準備が整いました。

構築の次へ

KubeFlowの準備ができたら次は下記のようなことを試してみてください。

また、今のままだとKubeFlowに認証がないので、Cognitoを使って認証の仕組みを用意する必要があります。ロードバランサを作成して外部からアクセスできるようにした方が便利です。また、クラスター作成時に設定をオプションで与えていましたが、コンフィグファイルcluster.yamlを作成したほうが便利です。

おまけ

クラスターのスケール

notebookのpodを作ろうと思ったらInsufficient podsになってしまったので増強させます。
クラスターのスケールは下記のように行います。

eksctl scale nodegroup --cluster=kubeflow --nodes=5 standard-workers --nodes-max 5

リソース削除

KubeFlowを削除します。

cd ${KF_DIR}
kfctl delete -f ${CONFIG_FILE}

kubernetesを削除します。

$ eksctl delete cluster kubeflow

クラスター削除時に次のようなエラーが出るかもしれません。

[ℹ]  eksctl version 0.24.0
[ℹ]  using region ap-northeast-1
[ℹ]  deleting EKS cluster "kubeflow"
[ℹ]  deleted 0 Fargate profile(s)
[✔]  kubeconfig has been updated
[ℹ]  cleaning up LoadBalancer services
[ℹ]  2 sequential tasks: { delete nodegroup "standard-workers", delete cluster control plane "kubeflow" [async] }
[ℹ]  will delete stack "eksctl-kubeflow-nodegroup-standard-workers"
[ℹ]  waiting for stack "eksctl-kubeflow-nodegroup-standard-workers" to get deleted
[✖]  unexpected status "DELETE_FAILED" while waiting for CloudFormation stack "eksctl-kubeflow-nodegroup-standard-workers"
[ℹ]  fetching stack events in attempt to troubleshoot the root cause of the failure
[✖]  AWS::CloudFormation::Stack/eksctl-kubeflow-nodegroup-standard-workers: DELETE_FAILED – "The following resource(s) failed to delete: [NodeInstanceRole]. "
[✖]  AWS::IAM::Role/NodeInstanceRole: DELETE_FAILED – "Cannot delete entity, must delete policies first. (Service: AmazonIdentityManagement; Status Code: 409; Error Code: DeleteConflict; Request ID: 238db5e1-9d5e-4698-9d3e-eecfe5b229aa)"
[ℹ]  1 error(s) occurred while deleting cluster with nodegroup(s)
[✖]  waiting for CloudFormation stack "eksctl-kubeflow-nodegroup-standard-workers": ResourceNotReady: failed waiting for successful resource state
Error: failed to delete cluster with nodegroup(s)

そのときはIAMのコンソールからNodeで検索してロールを削除し、もう一度実行します。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up