vSphere Machine Learning Extension is a Kubeflow distribution developed by VMware for vSphere environments. It includes the following Kubeflow core components along with new components designed for the vSphere platform, and it is published as a Carvel Package for vSphere with Tanzu environments.
- Jupyter Notebooks
- Kubeflow Pipelines
- Machine Learning Metadata DB (MLMD)
- Katib Hyperparameter Tuning System
- Model Training Operators
- Central Dashboard
Requirements
To use vSphere Machine Learning Extension, the following environment is required. In this post, I enabled vSphere Machine Learning Extension in an environment where vGPUs are available to Tanzu Kubernetes Grid through vSphere with Tanzu and NVIDIA AI Enterprise.
- vSphere with Tanzu (vSphere 7 or vSphere 8)
- An environment where vGPU is available
- Kubernetes 1.21 or later
- Required resources:
  - 4 CPUs
  - 16 GB memory
  - 50 GB storage
Deploying Tanzu Kubernetes Grid
To use NVIDIA A100 GPUs, I created a VirtualMachineClass named a100-40g and then used the TanzuKubernetesCluster v1alpha3 API to build a cluster with two GPU nodes in a gpu-pool node pool and two CPU nodes using the best-effort-large VM class (a sketch of the VM class and the full cluster manifest follow below). Because container images that use GPUs tend to be large, additional storage volumes are mounted on the GPU nodes at /var/lib/containerd and /var/lib/kubelet.
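A vGPU-enabled VirtualMachineClass looks roughly like the following. This is a minimal sketch: the CPU/memory sizing and the NVIDIA vGPU profile name are assumptions for illustration, not the exact values used here.
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: a100-40g
spec:
  hardware:
    cpus: 16            # sizing assumed for illustration
    memory: 64Gi
    devices:
      vgpuDevices:
      - profileName: grid_a100-40c   # vGPU profile name assumed
The TanzuKubernetesCluster manifest is as follows.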
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: gpu-cluster
  annotations:
    run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
spec:
  distribution:
    fullVersion: v1.24.9---vmware.1-tkg.4
    version: v1.24.9---vmware.1-tkg.4
  settings:
    network:
      cni:
        name: antrea
      pods:
        cidrBlocks:
        - 172.20.0.0/16
      serviceDomain: cluster.local
      services:
        cidrBlocks:
        - 10.96.0.0/16
    storage:
      defaultClass: k8s-storage
  topology:
    controlPlane:
      replicas: 3
      storageClass: k8s-storage
      vmClass: best-effort-large
      tkr:
        reference:
          name: v1.24.9---vmware.1-tkg.4
    nodePools:
    - name: gpu-pool
      replicas: 2
      storageClass: gpu-k8s
      vmClass: a100-40g
      tkr:
        reference:
          name: v1.24.9---vmware.1-tkg.4
      volumes:
      - name: containerd
        capacity:
          storage: 100Gi
        mountPath: /var/lib/containerd
        storageClass: k8s-storage
      - name: kubelet
        capacity:
          storage: 50Gi
        mountPath: /var/lib/kubelet
        storageClass: unity-k8s
    - name: best-effort-large
      replicas: 2
      storageClass: k8s-storage
      vmClass: best-effort-large
      tkr:
        reference:
          name: v1.24.9---vmware.1-tkg.4
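Apply the manifest in the vSphere Namespace on the Supervisor cluster, then log in to the new workload cluster with the kubectl vsphere plugin (the server address and vSphere Namespace below are placeholders):
$ kubectl apply -f gpu-cluster.yaml
$ kubectl vsphere login --server=<supervisor-address> \
    --tanzu-kubernetes-cluster-name gpu-cluster \
    --tanzu-kubernetes-cluster-namespace <vsphere-namespace>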
Once cluster creation completes, you have a Tanzu Kubernetes Grid cluster consisting of three control plane nodes and four worker nodes.
$ kubectl get node
NAME STATUS ROLES AGE VERSION
gpu-cluster-59lhp-5bvf4 Ready control-plane 6m21s v1.24.9+vmware.1
gpu-cluster-59lhp-jtvsk Ready control-plane 3m43s v1.24.9+vmware.1
gpu-cluster-59lhp-tn6jl Ready control-plane 8m26s v1.24.9+vmware.1
gpu-cluster-best-effort-large-7bq6q-75cdf8b85-6qx6l Ready <none> 6m56s v1.24.9+vmware.1
gpu-cluster-best-effort-large-7bq6q-75cdf8b85-mmrw7 Ready <none> 6m58s v1.24.9+vmware.1
gpu-cluster-gpu-pool-4nqht-575454ddc5-sjxtb Ready <none> 6m39s v1.24.9+vmware.1
gpu-cluster-gpu-pool-4nqht-575454ddc5-wj6dw Ready <none> 6m35s v1.24.9+vmware.1
Installing the GPU Operator
Register the NVIDIA AI Enterprise Helm repository and install the GPU Operator. (License registration and related setup are required but omitted here; the details are covered in a separate article.)
$ kubectl create ns gpu-operator
$ helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=${NGC_TOKEN}
$ helm install --wait gpu-operator nvaie/gpu-operator-4-0 -n gpu-operator
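For reference, the omitted licensing setup usually amounts to creating an image pull secret for nvcr.io/nvaie and a ConfigMap holding the NVIDIA license client token before the Helm install. The following is a rough sketch with assumed file names, not the exact steps used; depending on the chart version, driver.licensingConfig.configMapName may also need to be set to licensing-config at install time.
$ kubectl create secret docker-registry ngc-secret -n gpu-operator \
    --docker-server=nvcr.io/nvaie \
    --docker-username='$oauthtoken' \
    --docker-password=${NGC_TOKEN}
$ kubectl create configmap licensing-config -n gpu-operator \
    --from-file=gridd.conf \
    --from-file=client_configuration_token.tok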
When the GPU Operator installation completes, Pods such as the GPU device plugin start up.
$ kubectl get pod -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-tntqk 1/1 Running 0 12m
gpu-feature-discovery-xrwj8 1/1 Running 0 12m
gpu-operator-5d49c4db78-8tc6j 1/1 Running 0 13m
gpu-operator-node-feature-discovery-master-5678c7dbb4-zssc7 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-55w27 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-68lb7 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-8tzvd 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-p6pv4 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-t2hkv 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-tvh6b 1/1 Running 0 13m
gpu-operator-node-feature-discovery-worker-zczjg 1/1 Running 0 13m
nvidia-container-toolkit-daemonset-c4tp8 1/1 Running 0 12m
nvidia-container-toolkit-daemonset-nsm49 1/1 Running 0 12m
nvidia-cuda-validator-88v2v 0/1 Completed 0 10m
nvidia-cuda-validator-rzxq2 0/1 Completed 0 10m
nvidia-dcgm-exporter-57254 1/1 Running 0 12m
nvidia-dcgm-exporter-n97kq 1/1 Running 0 12m
nvidia-device-plugin-daemonset-gqgxb 1/1 Running 0 12m
nvidia-device-plugin-daemonset-hcl57 1/1 Running 0 12m
nvidia-driver-daemonset-7sqzs 1/1 Running 0 13m
nvidia-driver-daemonset-jc85m 1/1 Running 0 13m
nvidia-operator-validator-nmwt6 1/1 Running 0 12m
nvidia-operator-validator-nrmjf 1/1 Running 0 12m
Nodes with a vGPU are labeled with nvidia.com/gpu.count.
$ kubectl get node -L nvidia.com/gpu.count
NAME STATUS ROLES AGE VERSION GPU.COUNT
gpu-cluster-59lhp-5bvf4 Ready control-plane 19m v1.24.9+vmware.1
gpu-cluster-59lhp-jtvsk Ready control-plane 16m v1.24.9+vmware.1
gpu-cluster-59lhp-tn6jl Ready control-plane 21m v1.24.9+vmware.1
gpu-cluster-best-effort-large-7bq6q-75cdf8b85-6qx6l Ready <none> 19m v1.24.9+vmware.1
gpu-cluster-best-effort-large-7bq6q-75cdf8b85-mmrw7 Ready <none> 19m v1.24.9+vmware.1
gpu-cluster-gpu-pool-4nqht-575454ddc5-sjxtb Ready <none> 19m v1.24.9+vmware.1 1
gpu-cluster-gpu-pool-4nqht-575454ddc5-wj6dw Ready <none> 19m v1.24.9+vmware.1 1
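To verify that a vGPU can actually be scheduled, a throwaway Pod along the following lines works (the CUDA image tag is an assumption; any image containing nvidia-smi will do):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:11.8.0-base-ubuntu20.04   # image tag assumed
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
If scheduling succeeds, kubectl logs gpu-test should show the A100 in the nvidia-smi output.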
Deploying Kubeflow
In vSphere Machine Learning Extension, Kubeflow is provided as a Carvel Package.
First, create a namespace:
kubectl create ns carvel-kubeflow
Add the Kubeflow Carvel Package repository for vSphere Machine Learning Extension.
tanzu package repository add kubeflow-carvel-repo -n carvel-kubeflow \
--url projects.registry.vmware.com/kubeflow/kubeflow-carvel-repo:1.6.1
Confirm that the repository has been added and reconciled.
$ tanzu package repository list -A
NAMESPACE NAME SOURCE STATUS
carvel-kubeflow kubeflow-carvel-repo (imgpkg) projects.registry.vmware.com/kubeflow/kubeflow-carvel-repo:1.6.1 Reconcile succeeded
Check the packages included in the added repository.
$ tanzu package available list -n carvel-kubeflow
NAME DISPLAY-NAME
kubeflow.community.tanzu.vmware.com kubeflow
$ tanzu package available list -n carvel-kubeflow kubeflow.community.tanzu.vmware.com
NAME VERSION RELEASED-AT
kubeflow.community.tanzu.vmware.com 1.6.0 -
kubeflow.community.tanzu.vmware.com 1.6.1 -
Create the values.yaml used to install kubeflow.community.tanzu.vmware.com with the following content.
cat <<EOF > values.yaml
service_type: "LoadBalancer"
IP_address: ""
CD_REGISTRATION_FLOW: True
EOF
The parameters that can be set in values.yaml can be listed with the command below. Among them are the settings for Dex, which is used for dashboard authentication.
tanzu package available get kubeflow.community.tanzu.vmware.com/1.6.1 -n carvel-kubeflow --values-schema
KEY                                  DEFAULT                                      TYPE     DESCRIPTION
Dex.config                           |-                                           string   Configuration file of Dex
                                       issuer: http://dex.auth.svc.cluster.local:5556/dex
                                       storage:
                                         type: kubernetes
                                         config:
                                           inCluster: true
                                       web:
                                         http: 0.0.0.0:5556
                                       logger:
                                         level: "debug"
                                         format: text
                                       oauth2:
                                         skipApprovalScreen: true
                                       enablePasswordDB: true
                                       staticPasswords:
                                       - email: user@example.com
                                         hash: $2y$12$4K/VkmDd1q1Orb3xAt82zu8gk7Ad6ReFR4LCP9UeYE90NLiN9Df72
                                         # https://github.com/dexidp/dex/pull/1601/commits
                                         # FIXME: Use hashFromEnv instead
                                         username: user
                                         userID: "15841185641784"
                                       staticClients:
                                       # https://github.com/dexidp/dex/pull/1664
                                       - idEnv: OIDC_CLIENT_ID
                                         redirectURIs: ["/login/oidc"]
                                         name: 'Dex Login Application'
                                         secretEnv: OIDC_CLIENT_SECRET
Dex.use_external                     false                                        boolean  If set to True, the embedded Dex will not be created, and you will need to configure OIDC_Authservice with external IdP manually
IP_address                           ""                                           string   EXTERNAL_IP address of istio-ingressgateway, valid only if service_type is LoadBalancer
OIDC_Authservice.OIDC_PROVIDER       http://dex.auth.svc.cluster.local:5556/dex   string   URL to your OIDC provider. AuthService expects to find information about your OIDC provider at OIDC_PROVIDER/.well-known/openid-configuration, and will use this information to contact your OIDC provider and initiate an OIDC flow later on
OIDC_Authservice.OIDC_SCOPES         profile email groups                         string   Comma-separated list of scopes to request access to. The openid scope is always added.
OIDC_Authservice.OIDC_CLIENT_ID      kubeflow-oidc-authservice                    string   AuthService will use this Client ID when it needs to contact your OIDC provider and initiate an OIDC flow
OIDC_Authservice.OIDC_CLIENT_SECRET  pUBnBOY80SnXgjibTYM9ZWNzY2xreNGQok           string   AuthService will use this Client Secret to authenticate itself against your OIDC provider in combination with CLIENT_ID when attempting to access your OIDC Provider's protected endpoints
OIDC_Authservice.REDIRECT_URL        /login/oidc                                  string   AuthService will pass this URL to the OIDC provider when initiating an OIDC flow, so the OIDC provider knows where it needs to send the OIDC authorization code to. It defaults to AUTHSERVICE_URL_PREFIX/oidc/callback. This assumes that you have configured your API Gateway to pass all requests under a hostname to Authservice for authentication
OIDC_Authservice.SKIP_AUTH_URI       /dex                                         string   Comma-separated list of URL path-prefixes for which to bypass authentication. For example, if SKIP_AUTH_URL contains /my_app/ then requests to <url>/my_app/* are allowed without checking any credentials. Contains nothing by default
OIDC_Authservice.USERID_CLAIM        email                                        string   Claim whose value will be used as the userid (default email)
OIDC_Authservice.USERID_HEADER       kubeflow-userid                              string   Name of the header containing the user-id that will be added to the upstream request
OIDC_Authservice.USERID_PREFIX       ""                                           string   Prefix to add to the userid, which will be the value of the USERID_HEADER
OIDC_Authservice.OIDC_AUTH_URL       /dex/auth                                    string   AuthService will initiate an Authorization Code OIDC flow by hitting this URL. Normally discovered automatically through the OIDC Provider's well-known endpoint
imageswap_labels                     true                                         boolean  Add labels k8s.twr.io/imageswap: enabled to Kubeflow namespaces, which enable imageswap webhook to swap images
service_type                         LoadBalancer                                 string   Service type of istio-ingressgateway. Available options: "LoadBalancer" or "NodePort"
CD_REGISTRATION_FLOW                 true                                         boolean  Turn on Registration Flow, so that Kubeflow Central Dashboard will prompt new users to create a namespace (profile)
Install the Kubeflow package, specifying the values.yaml created above.
tanzu package install kubeflow -n carvel-kubeflow \
--wait-check-interval 5s \
--wait-timeout 30m0s \
--package kubeflow.community.tanzu.vmware.com \
--version 1.6.1 \
--values-file values.yaml
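Reconciliation can take a while. In another terminal, you can watch the kapp-controller App resource backing the install, and if it fails, its status carries a readable error message (a sketch):
$ kubectl get app kubeflow -n carvel-kubeflow -w
$ kubectl get app kubeflow -n carvel-kubeflow -o jsonpath='{.status.usefulErrorMessage}'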
When the package installation succeeds, its status becomes Reconcile succeeded.
$ tanzu package installed list -n carvel-kubeflow
NAME PACKAGE-NAME PACKAGE-VERSION STATUS
kubeflow kubeflow.community.tanzu.vmware.com 1.6.1 Reconcile succeeded
The various Pods required by Kubeflow are created in the kubeflow namespace.
$ kubectl get pod -n kubeflow
NAME READY STATUS RESTARTS AGE
admission-webhook-deployment-df7f58494-4dllm 1/1 Running 0 9m36s
cache-deployer-deployment-597d44b9db-phfhc 2/2 Running 1 (10m ago) 10m
cache-server-784484f445-bmgwp 2/2 Running 0 10m
centraldashboard-c6f5c9d6c-b7gbf 2/2 Running 0 9m36s
jupyter-web-app-deployment-c658f5f8-xgts8 1/1 Running 0 9m36s
katib-controller-55d9c7c5f-brr5m 1/1 Running 0 9m35s
katib-db-manager-6b5fddbb7f-gwkdb 1/1 Running 0 9m35s
katib-mysql-7b8697c5d4-2j59f 1/1 Running 0 9m36s
katib-ui-57d9844f79-ttmxw 1/1 Running 0 9m35s
kserve-controller-manager-0 2/2 Running 0 9m35s
kserve-models-web-app-7d5f9588dc-vtx59 2/2 Running 0 9m35s
kubeflow-pipelines-profile-controller-7bdfc6bd4d-jqn2g 1/1 Running 0 10m
metacontroller-0 1/1 Running 0 10m
metadata-envoy-deployment-58f6995cf7-vd22p 1/1 Running 0 10m
metadata-grpc-deployment-57f79496b9-4nvdz 2/2 Running 3 (10m ago) 10m
metadata-writer-6f444c4846-kx672 2/2 Running 0 10m
minio-568564b878-rjmh5 2/2 Running 0 10m
ml-pipeline-7466979cdd-r2mtb 2/2 Running 0 10m
ml-pipeline-persistenceagent-64854cfcf-6vqjm 2/2 Running 0 10m
ml-pipeline-scheduledworkflow-86466fc686-4rpfk 2/2 Running 0 10m
ml-pipeline-ui-6c8b547c5c-c9hxx 2/2 Running 0 10m
ml-pipeline-viewer-crd-d6d6f646-tmd4h 2/2 Running 1 (10m ago) 10m
ml-pipeline-visualizationserver-5c959f7d79-m7z8d 2/2 Running 0 10m
mysql-78df8bc87b-d2vrc 2/2 Running 0 10m
notebook-controller-deployment-7bc97485dc-6b2ld 2/2 Running 1 (9m28s ago) 9m36s
profiles-deployment-75478d849c-pzwc7 3/3 Running 1 (8m25s ago) 8m43s
tensorboard-controller-deployment-7b744d76fd-86frx 3/3 Running 1 (7m51s ago) 8m9s
tensorboards-web-app-deployment-c6964646f-ppg4m 1/1 Running 0 8m9s
training-operator-585bc859cd-r5wjr 1/1 Running 0 8m9s
volumes-web-app-deployment-5cd9b6749d-6czsb 1/1 Running 0 8m9s
workflow-controller-5cf5b47b-wsnqz 2/2 Running 1 (10m ago) 10m
Using Kubeflow
Kubeflow provides a GUI through the Central Dashboard, and the endpoint is exposed by the Istio Ingress Gateway.
$ kubectl get service -n istio-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
authservice ClusterIP 10.96.180.245 <none> 8080/TCP 22m
cluster-local-gateway ClusterIP 10.96.139.166 <none> 15020/TCP,80/TCP 20m
istio-ingressgateway LoadBalancer 10.96.68.134 10.44.187.1 15021:31472/TCP,80:31632/TCP,443:31012/TCP,31400:30853/TCP,15443:32129/TCP 22m
istiod ClusterIP 10.96.133.214 <none> 15010/TCP,15012/TCP,443/TCP,15014/TCP 22m
knative-local-gateway ClusterIP 10.96.145.211 <none> 80/TCP 21m
Accessing the EXTERNAL-IP of istio-ingressgateway brings up the Dex login screen. The default username and password are user@example.com and 12341234.
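These defaults come from the Dex.config staticPasswords entry shown in the values schema above. For anything beyond a lab environment, you would override Dex.config in values.yaml with your own bcrypt hash; one way to generate it (assuming the Python bcrypt package is installed):
$ python3 -c 'import bcrypt; print(bcrypt.hashpw(b"MyNewPassword", bcrypt.gensalt(12)).decode())'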
On first login, you are prompted to create a namespace (Profile).
A namespace (Profile) is created with the specified name, and the login completes.
$ kubectl get profiles.kubeflow.org
NAME AGE
kubeflow-user-example-com 16h
masanara 20m
$ kubectl get namespace masanara
NAME STATUS AGE
masanara Active 21m
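Profiles can also be created declaratively instead of through the registration flow. A minimal sketch, with a hypothetical profile name and user:
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: team-a                  # also becomes the namespace name
spec:
  owner:
    kind: User
    name: user-a@example.com    # hypothetical user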
Selecting Notebooks from the menu and clicking "New Notebook" at the top right of the screen lets you create a dedicated Notebook.
Since a Notebook runs as a Pod, you can specify the Docker image it uses. Here, I built jupyter-tensorflow-cuda:v1.6.1 from the Kubeflow git repository, pushed the image to Harbor, and used it.
$ git clone https://github.com/kubeflow/kubeflow.git -b v1.6.1
$ cd kubeflow/components/example-notebook-servers/jupyter-tensorflow-full
$ make docker-build-cuda
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
jupyter-tensorflow-cuda-full v1.6.1 942dbf145dd3 35 minutes ago 7.87GB
jupyter-tensorflow-cuda v1.6.1 0f240490b09d 38 minutes ago 6.77GB
jupyter v1.6.1 284a091577e7 51 minutes ago 1.1GB
base v1.6.1 e6510aa0ed6d 55 minutes ago 437MB
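Pushing the built image to Harbor then looks roughly like this (the registry hostname and project are placeholders):
$ docker tag jupyter-tensorflow-cuda:v1.6.1 harbor.example.com/library/jupyter-tensorflow-cuda:v1.6.1
$ docker push harbor.example.com/library/jupyter-tensorflow-cuda:v1.6.1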
Shortly after the Notebook is created, its status becomes Running.
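Under the hood, the web app creates a Notebook custom resource; a rough sketch of the equivalent manifest (the image path is the hypothetical Harbor location above, and one vGPU is requested):
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: demo
  namespace: masanara
spec:
  template:
    spec:
      containers:
      - name: demo
        image: harbor.example.com/library/jupyter-tensorflow-cuda:v1.6.1  # image path assumed
        resources:
          limits:
            nvidia.com/gpu: 1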
The Notebook resource creates a Pod and a PVC through a StatefulSet, and its UI is exposed as a VirtualService.
$ kubectl tree notebook demo
W1103 10:24:14.734663 56234 warnings.go:70] Capabilities API in run API group is deprecated, use Capabilities API in core API group
NAMESPACE NAME READY REASON AGE
masanara Notebook/demo - 12m
masanara ├─Service/demo - 12m
masanara │ └─EndpointSlice/demo-rqt8b - 12m
masanara ├─StatefulSet/demo - 12m
masanara │ ├─ControllerRevision/demo-74895549b - 12m
masanara │ └─Pod/demo-0 True 12m
masanara └─VirtualService/notebook-masanara-demo - 12m
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
demo-volume Bound pvc-29f4aa2c-4c2a-46f4-b08c-fc0528496bb9 10Gi RWO k8s-storage 13m
Clicking "CONNECT" in Notebooks launches Jupyter Notebook, and running nvidia-smi confirmed that the GPU is usable from the Pod.
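As an additional check from a notebook terminal, you can confirm that TensorFlow itself detects the device (assuming the TensorFlow image built above):
$ python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'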