vSphere Machine Learning Extension is a Kubeflow distribution developed by VMware for vSphere environments. It includes the following Kubeflow core components along with new components designed for the vSphere platform, and it is published as a Carvel Package for vSphere with Tanzu environments.
- Jupyter Notebooks
- Kubeflow Pipelines
- Machine Learning Metadata DB (MLMD)
- Katib Hyperparameter Tuning System
- Model Training Operators
- Central Dashboard
Requirements
To use vSphere Machine Learning Extension, the following environment is required. In this post, I enabled vSphere Machine Learning Extension in an environment where vGPUs are available to Tanzu Kubernetes Grid through vSphere with Tanzu and NVIDIA AI Enterprise.
- vSphere with Tanzu (vSphere 7 or vSphere 8)
- An environment where vGPU is available
- Kubernetes 1.21 or later
- Required resources:
  - 4 CPUs
  - 16 GB memory
  - 50 GB storage
Deploying Tanzu Kubernetes Grid
To use NVIDIA A100 GPUs, I created a VirtualMachineClass named a100-40g and then used the TanzuKubernetesCluster v1alpha3 API to build a cluster with two GPU nodes in a gpu-pool node pool and two CPU nodes using the best-effort-large VM class (a sketch of the VM class and the full cluster manifest follow below). Because container images that use GPUs tend to be large, additional storage volumes are mounted on the GPU nodes at /var/lib/containerd and /var/lib/kubelet.
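A vGPU-enabled VirtualMachineClass looks roughly like the following. This is a minimal sketch: the CPU/memory sizing and the NVIDIA vGPU profile name are assumptions for illustration, not the exact values used here.
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: a100-40g
spec:
  hardware:
    cpus: 16            # sizing assumed for illustration
    memory: 64Gi
    devices:
      vgpuDevices:
      - profileName: grid_a100-40c   # vGPU profile name assumed
The TanzuKubernetesCluster manifest is as follows.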
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: gpu-cluster
  annotations:
    run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
spec:
  distribution:
    fullVersion: v1.24.9---vmware.1-tkg.4
    version: v1.24.9---vmware.1-tkg.4
  settings:
    network:
      cni:
        name: antrea
      pods:
        cidrBlocks:
        - 172.20.0.0/16
      serviceDomain: cluster.local
      services:
        cidrBlocks:
        - 10.96.0.0/16
    storage:
      defaultClass: k8s-storage
  topology:
    controlPlane:
      replicas: 3
      storageClass: k8s-storage
      vmClass: best-effort-large
      tkr:
        reference:
          name: v1.24.9---vmware.1-tkg.4
    nodePools:
    - name: gpu-pool
      replicas: 2
      storageClass: gpu-k8s
      vmClass: a100-40g
      tkr:
        reference:
          name: v1.24.9---vmware.1-tkg.4
      volumes:
      - name: containerd
        capacity:
          storage: 100Gi
        mountPath: /var/lib/containerd
        storageClass: k8s-storage
      - name: kubelet
        capacity:
          storage: 50Gi
        mountPath: /var/lib/kubelet
        storageClass: unity-k8s
    - name: best-effort-large
      replicas: 2
      storageClass: k8s-storage
      vmClass: best-effort-large
      tkr:
        reference:
          name: v1.24.9---vmware.1-tkg.4
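Apply the manifest in the vSphere Namespace on the Supervisor cluster, then log in to the new workload cluster with the kubectl vsphere plugin (the server address and vSphere Namespace below are placeholders):
$ kubectl apply -f gpu-cluster.yaml
$ kubectl vsphere login --server=<supervisor-address> \
    --tanzu-kubernetes-cluster-name gpu-cluster \
    --tanzu-kubernetes-cluster-namespace <vsphere-namespace>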
Once cluster creation completes, you have a Tanzu Kubernetes Grid cluster consisting of three control plane nodes and four worker nodes.
$ kubectl get node
NAME STATUS ROLES AGE VERSION
gpu-cluster-59lhp-5bvf4 Ready control-plane 6m21s v1.24.9+vmware.1
gpu-cluster-59lhp-jtvsk Ready control-plane 3m43s v1.24.9+vmware.1
gpu-cluster-59lhp-tn6jl Ready control-plane 8m26s v1.24.9+vmware.1
gpu-cluster-best-effort-large-7bq6q-75cdf8b85-6qx6l Ready <none> 6m56s v1.24.9+vmware.1
gpu-cluster-best-effort-large-7bq6q-75cdf8b85-mmrw7 Ready <none> 6m58s v1.24.9+vmware.1
gpu-cluster-gpu-pool-4nqht-575454ddc5-sjxtb Ready <none> 6m39s v1.24.9+vmware.1
gpu-cluster-gpu-pool-4nqht-575454ddc5-wj6dw Ready <none> 6m35s v1.24.9+vmware.1
Installing the GPU Operator
Register the NVIDIA AI Enterprise Helm repository and install the GPU Operator. (License registration and related setup are required but omitted here; the details are covered in a separate article.)
$ kubectl create ns gpu-operator
$ helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=${NGC_TOKEN}
$ helm install --wait gpu-operator nvaie/gpu-operator-4-0 -n gpu-operator
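For reference, the omitted licensing setup usually amounts to creating an image pull secret for nvcr.io/nvaie and a ConfigMap holding the NVIDIA license client token before the Helm install. The following is a rough sketch with assumed file names, not the exact steps used; depending on the chart version, driver.licensingConfig.configMapName may also need to be set to licensing-config at install time.
$ kubectl create secret docker-registry ngc-secret -n gpu-operator \
    --docker-server=nvcr.io/nvaie \
    --docker-username='$oauthtoken' \
    --docker-password=${NGC_TOKEN}
$ kubectl create configmap licensing-config -n gpu-operator \
    --from-file=gridd.conf \
    --from-file=client_configuration_token.tok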
When the GPU Operator installation completes, Pods such as the GPU device plugin start up.
$ kubectl get pod -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-tntqk 1/1 Running 0 12m
gpu-feature-discovery-xrwj8 1/1 Running 0 12m
gpu-operator-5d49c4db78-8tc6j 1/1 Running 0 13m
gpu-operator-node-feature-discovery-master-5678c7dbb4-zssc7 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-55w27 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-68lb7 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-8tzvd 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-p6pv4 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-t2hkv 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-tvh6b 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-zczjg 1/1 Running 0 13m
nvidia-container-toolkit-daemonset-c4tp8 1/1 Running 0 12m
nvidia-container-toolkit-daemonset-nsm49 1/1 Running 0 12m
nvidia-cuda-validator-88v2v 0/1 Completed 0 10m
nvidia-cuda-validator-rzxq2 0/1 Completed 0 10m
nvidia-dcgm-exporter-57254 1/1 Running 0 12m
nvidia-dcgm-exporter-n97kq 1/1 Running 0 12m
nvidia-device-plugin-daemonset-gqgxb 1/1 Running 0 12m
nvidia-device-plugin-daemonset-hcl57 1/1 Running 0 12m
nvidia-driver-daemonset-7sqzs 1/1 Running 0 13m
nvidia-driver-daemonset-jc85m 1/1 Running 0 13m
nvidia-operator-validator-nmwt6 1/1 Running 0 12m
nvidia-operator-validator-nrmjf 1/1 Running 0 12m
Nodes with a vGPU are labeled with nvidia.com/gpu.count.
$ kubectl get node -L nvidia.com/gpu.count
NAME STATUS ROLES AGE VERSION GPU.COUNT
gpu-cluster-59lhp-5bvf4 Ready control-plane 19m v1.24.9+vmware.1
gpu-cluster-59lhp-jtvsk Ready control-plane 16m v1.24.9+vmware.1
gpu-cluster-59lhp-tn6jl Ready control-plane 21m v1.24.9+vmware.1
gpu-cluster-best-effort-large-7bq6q-75cdf8b85-6qx6l Ready <none> 19m v1.24.9+vmware.1
gpu-cluster-best-effort-large-7bq6q-75cdf8b85-mmrw7 Ready <none> 19m v1.24.9+vmware.1
gpu-cluster-gpu-pool-4nqht-575454ddc5-sjxtb Ready <none> 19m v1.24.9+vmware.1 1
gpu-cluster-gpu-pool-4nqht-575454ddc5-wj6dw Ready <none> 19m v1.24.9+vmware.1 1
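To verify that a vGPU can actually be scheduled, a throwaway Pod along the following lines works (the CUDA image tag is an assumption; any image containing nvidia-smi will do):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:11.8.0-base-ubuntu20.04   # image tag assumed
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
If scheduling succeeds, kubectl logs gpu-test should show the A100 in the nvidia-smi output.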
Deploying Kubeflow
In vSphere Machine Learning Extension, Kubeflow is provided as a Carvel Package.
First, create a namespace:
kubectl create ns carvel-kubeflow
Add the Kubeflow Carvel Package repository for vSphere Machine Learning Extension.
tanzu package repository add kubeflow-carvel-repo -n carvel-kubeflow \
--url projects.registry.vmware.com/kubeflow/kubeflow-carvel-repo:1.6.1
Confirm that the repository has been added and reconciled.
$ tanzu package repository list -A
NAMESPACE NAME SOURCE STATUS
carvel-kubeflow kubeflow-carvel-repo (imgpkg) projects.registry.vmware.com/kubeflow/kubeflow-carvel-repo:1.6.1 Reconcile succeeded
Check the packages included in the added repository.
$ tanzu package available list -n carvel-kubeflow
NAME DISPLAY-NAME
kubeflow.community.tanzu.vmware.com kubeflow
$ tanzu package available list -n carvel-kubeflow kubeflow.community.tanzu.vmware.com
NAME VERSION RELEASED-AT
kubeflow.community.tanzu.vmware.com 1.6.0 -
kubeflow.community.tanzu.vmware.com 1.6.1 -
Create the values.yaml used to install kubeflow.community.tanzu.vmware.com with the following content.
cat <<EOF > values.yaml
service_type: "LoadBalancer"
IP_address: ""
CD_REGISTRATION_FLOW: True
EOF
The parameters that can be set in values.yaml can be listed with the command below. Among them are the settings for Dex, which is used for dashboard authentication.
tanzu package available get kubeflow.community.tanzu.vmware.com/1.6.1 -n carvel-kubeflow --values-schema
KEY                                  DEFAULT                                      TYPE     DESCRIPTION
Dex.config                           |-                                           string   Configuration file of Dex
                                       issuer: http://dex.auth.svc.cluster.local:5556/dex
                                       storage:
                                         type: kubernetes
                                         config:
                                           inCluster: true
                                       web:
                                         http: 0.0.0.0:5556
                                       logger:
                                         level: "debug"
                                         format: text
                                       oauth2:
                                         skipApprovalScreen: true
                                       enablePasswordDB: true
                                       staticPasswords:
                                       - email: user@example.com
                                         hash: $2y$12$4K/VkmDd1q1Orb3xAt82zu8gk7Ad6ReFR4LCP9UeYE90NLiN9Df72
                                         # https://github.com/dexidp/dex/pull/1601/commits
                                         # FIXME: Use hashFromEnv instead
                                         username: user
                                         userID: "15841185641784"
                                       staticClients:
                                       # https://github.com/dexidp/dex/pull/1664
                                       - idEnv: OIDC_CLIENT_ID
                                         redirectURIs: ["/login/oidc"]
                                         name: 'Dex Login Application'
                                         secretEnv: OIDC_CLIENT_SECRET
Dex.use_external                     false                                        boolean  If set to True, the embedded Dex will not be created, and you will need to configure OIDC_Authservice with external IdP manually
IP_address                           ""                                           string   EXTERNAL_IP address of istio-ingressgateway, valid only if service_type is LoadBalancer
OIDC_Authservice.OIDC_PROVIDER       http://dex.auth.svc.cluster.local:5556/dex   string   URL to your OIDC provider. AuthService expects to find information about your OIDC provider at OIDC_PROVIDER/.well-known/openid-configuration, and will use this information to contact your OIDC provider and initiate an OIDC flow later on
OIDC_Authservice.OIDC_SCOPES         profile email groups                         string   Comma-separated list of scopes to request access to. The openid scope is always added.
OIDC_Authservice.OIDC_CLIENT_ID      kubeflow-oidc-authservice                    string   AuthService will use this Client ID when it needs to contact your OIDC provider and initiate an OIDC flow
OIDC_Authservice.OIDC_CLIENT_SECRET  pUBnBOY80SnXgjibTYM9ZWNzY2xreNGQok           string   AuthService will use this Client Secret to authenticate itself against your OIDC provider in combination with CLIENT_ID when attempting to access your OIDC Provider's protected endpoints
OIDC_Authservice.REDIRECT_URL        /login/oidc                                  string   AuthService will pass this URL to the OIDC provider when initiating an OIDC flow, so the OIDC provider knows where it needs to send the OIDC authorization code to. It defaults to AUTHSERVICE_URL_PREFIX/oidc/callback. This assumes that you have configured your API Gateway to pass all requests under a hostname to Authservice for authentication
OIDC_Authservice.SKIP_AUTH_URI       /dex                                         string   Comma-separated list of URL path-prefixes for which to bypass authentication. For example, if SKIP_AUTH_URL contains /my_app/ then requests to <url>/my_app/* are allowed without checking any credentials. Contains nothing by default
OIDC_Authservice.USERID_CLAIM        email                                        string   Claim whose value will be used as the userid (default email)
OIDC_Authservice.USERID_HEADER       kubeflow-userid                              string   Name of the header containing the user-id that will be added to the upstream request
OIDC_Authservice.USERID_PREFIX       ""                                           string   Prefix to add to the userid, which will be the value of the USERID_HEADER
OIDC_Authservice.OIDC_AUTH_URL       /dex/auth                                    string   AuthService will initiate an Authorization Code OIDC flow by hitting this URL. Normally discovered automatically through the OIDC Provider's well-known endpoint
imageswap_labels                     true                                         boolean  Add labels k8s.twr.io/imageswap: enabled to Kubeflow namespaces, which enable imageswap webhook to swap images
service_type                         LoadBalancer                                 string   Service type of istio-ingressgateway. Available options: "LoadBalancer" or "NodePort"
CD_REGISTRATION_FLOW                 true                                         boolean  Turn on Registration Flow, so that Kubeflow Central Dashboard will prompt new users to create a namespace (profile)
Install the Kubeflow package, specifying the values.yaml created above.
tanzu package install kubeflow -n carvel-kubeflow \
--wait-check-interval 5s \
--wait-timeout 30m0s \
--package kubeflow.community.tanzu.vmware.com \
--version 1.6.1 \
--values-file values.yaml
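Reconciliation can take a while. In another terminal, you can watch the kapp-controller App resource backing the install, and if it fails, its status carries a readable error message (a sketch):
$ kubectl get app kubeflow -n carvel-kubeflow -w
$ kubectl get app kubeflow -n carvel-kubeflow -o jsonpath='{.status.usefulErrorMessage}'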
When the package installation succeeds, its status becomes Reconcile succeeded.
$ tanzu package installed list -n carvel-kubeflow
NAME PACKAGE-NAME PACKAGE-VERSION STATUS
kubeflow kubeflow.community.tanzu.vmware.com 1.6.1 Reconcile succeeded
The various Pods required by Kubeflow are created in the kubeflow namespace.
$ kubectl get pod -n kubeflow
NAME READY STATUS RESTARTS AGE
admission-webhook-deployment-df7f58494-4dllm 1/1 Running 0 9m36s
cache-deployer-deployment-597d44b9db-phfhc 2/2 Running 1 (10m ago) 10m
cache-server-784484f445-bmgwp 2/2 Running 0 10m
centraldashboard-c6f5c9d6c-b7gbf 2/2 Running 0 9m36s
jupyter-web-app-deployment-c658f5f8-xgts8 1/1 Running 0 9m36s
katib-controller-55d9c7c5f-brr5m 1/1 Running 0 9m35s
katib-db-manager-6b5fddbb7f-gwkdb 1/1 Running 0 9m35s
katib-mysql-7b8697c5d4-2j59f 1/1 Running 0 9m36s
katib-ui-57d9844f79-ttmxw 1/1 Running 0 9m35s
kserve-controller-manager-0 2/2 Running 0 9m35s
kserve-models-web-app-7d5f9588dc-vtx59 2/2 Running 0 9m35s
kubeflow-pipelines-profile-controller-7bdfc6bd4d-jqn2g 1/1 Running 0 10m
metacontroller-0 1/1 Running 0 10m
metadata-envoy-deployment-58f6995cf7-vd22p 1/1 Running 0 10m
metadata-grpc-deployment-57f79496b9-4nvdz 2/2 Running 3 (10m ago) 10m
metadata-writer-6f444c4846-kx672 2/2 Running 0 10m
minio-568564b878-rjmh5 2/2 Running 0 10m
ml-pipeline-7466979cdd-r2mtb 2/2 Running 0 10m
ml-pipeline-persistenceagent-64854cfcf-6vqjm 2/2 Running 0 10m
ml-pipeline-scheduledworkflow-86466fc686-4rpfk 2/2 Running 0 10m
ml-pipeline-ui-6c8b547c5c-c9hxx 2/2 Running 0 10m
ml-pipeline-viewer-crd-d6d6f646-tmd4h 2/2 Running 1 (10m ago) 10m
ml-pipeline-visualizationserver-5c959f7d79-m7z8d 2/2 Running 0 10m
mysql-78df8bc87b-d2vrc 2/2 Running 0 10m
notebook-controller-deployment-7bc97485dc-6b2ld 2/2 Running 1 (9m28s ago) 9m36s
profiles-deployment-75478d849c-pzwc7 3/3 Running 1 (8m25s ago) 8m43s
tensorboard-controller-deployment-7b744d76fd-86frx 3/3 Running 1 (7m51s ago) 8m9s
tensorboards-web-app-deployment-c6964646f-ppg4m 1/1 Running 0 8m9s
training-operator-585bc859cd-r5wjr 1/1 Running 0 8m9s
volumes-web-app-deployment-5cd9b6749d-6czsb 1/1 Running 0 8m9s
workflow-controller-5cf5b47b-wsnqz 2/2 Running 1 (10m ago) 10m
Using Kubeflow
Kubeflow provides a GUI through the Central Dashboard, and the endpoint is exposed by the Istio Ingress Gateway.
$ kubectl get service -n istio-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
authservice ClusterIP 10.96.180.245 <none> 8080/TCP 22m
cluster-local-gateway ClusterIP 10.96.139.166 <none> 15020/TCP,80/TCP 20m
istio-ingressgateway LoadBalancer 10.96.68.134 10.44.187.1 15021:31472/TCP,80:31632/TCP,443:31012/TCP,31400:30853/TCP,15443:32129/TCP 22m
istiod ClusterIP 10.96.133.214 <none> 15010/TCP,15012/TCP,443/TCP,15014/TCP 22m
knative-local-gateway ClusterIP 10.96.145.211 <none> 80/TCP 21m
Accessing the EXTERNAL-IP of istio-ingressgateway brings up the Dex login screen. The default username and password are user@example.com and 12341234.
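These defaults come from the Dex.config staticPasswords entry shown in the values schema above. For anything beyond a lab environment, you would override Dex.config in values.yaml with your own bcrypt hash; one way to generate it (assuming the Python bcrypt package is installed):
$ python3 -c 'import bcrypt; print(bcrypt.hashpw(b"MyNewPassword", bcrypt.gensalt(12)).decode())'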
On first login, you are prompted to create a namespace (Profile).
A namespace (Profile) is created with the specified name, and the login completes.
$ kubectl get profiles.kubeflow.org
NAME AGE
kubeflow-user-example-com 16h
masanara 20m
$ kubectl get namespace masanara
NAME STATUS AGE
masanara Active 21m
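Profiles can also be created declaratively instead of through the registration flow. A minimal sketch, with a hypothetical profile name and user:
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: team-a                  # also becomes the namespace name
spec:
  owner:
    kind: User
    name: user-a@example.com    # hypothetical user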
Selecting Notebooks from the menu and clicking "New Notebook" at the top right of the screen lets you create a dedicated Notebook.
Since a Notebook runs as a Pod, you can specify the Docker image it uses. Here, I built jupyter-tensorflow-cuda:v1.6.1 from the Kubeflow git repository, pushed the image to Harbor, and used it.
$ git clone https://github.com/kubeflow/kubeflow.git -b v1.6.1
$ cd kubeflow/components/example-notebook-servers/jupyter-tensorflow-full
$ make docker-build-cuda
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
jupyter-tensorflow-cuda-full v1.6.1 942dbf145dd3 35 minutes ago 7.87GB
jupyter-tensorflow-cuda v1.6.1 0f240490b09d 38 minutes ago 6.77GB
jupyter v1.6.1 284a091577e7 51 minutes ago 1.1GB
base v1.6.1 e6510aa0ed6d 55 minutes ago 437MB
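Pushing the built image to Harbor then looks roughly like this (the registry hostname and project are placeholders):
$ docker tag jupyter-tensorflow-cuda:v1.6.1 harbor.example.com/library/jupyter-tensorflow-cuda:v1.6.1
$ docker push harbor.example.com/library/jupyter-tensorflow-cuda:v1.6.1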
Shortly after the Notebook is created, its status becomes Running.
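Under the hood, the web app creates a Notebook custom resource; a rough sketch of the equivalent manifest (the image path is the hypothetical Harbor location above, and one vGPU is requested):
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: demo
  namespace: masanara
spec:
  template:
    spec:
      containers:
      - name: demo
        image: harbor.example.com/library/jupyter-tensorflow-cuda:v1.6.1  # image path assumed
        resources:
          limits:
            nvidia.com/gpu: 1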
The Notebook resource creates a Pod and a PVC through a StatefulSet, and its UI is exposed as a VirtualService.
$ kubectl tree notebook demo
W1103 10:24:14.734663 56234 warnings.go:70] Capabilities API in run API group is deprecated, use Capabilities API in core API group
NAMESPACE NAME READY REASON AGE
masanara Notebook/demo - 12m
masanara ├─Service/demo - 12m
masanara │ └─EndpointSlice/demo-rqt8b - 12m
masanara ├─StatefulSet/demo - 12m
masanara │ ├─ControllerRevision/demo-74895549b - 12m
masanara │ └─Pod/demo-0 True 12m
masanara └─VirtualService/notebook-masanara-demo - 12m
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
demo-volume Bound pvc-29f4aa2c-4c2a-46f4-b08c-fc0528496bb9 10Gi RWO k8s-storage 13m
Clicking "CONNECT" in Notebooks launches Jupyter Notebook, and running nvidia-smi confirmed that the GPU is usable from the Pod.
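As an additional check from a notebook terminal, you can confirm that TensorFlow itself detects the device (assuming the TensorFlow image built above):
$ python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'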