Oracle Cloud Infrastructure Advent Calendar 2023

@takuya_0301 (Takuya Niita), Oracle Corporation Japan

Trying Out Time-slicing GPU on OKE

Posted at 2023-12-14 (last updated at 2023-12-14)

Introduction

This article is the Day 15 entry in Series 1 of the Oracle Cloud Infrastructure Advent Calendar 2023.

Have you heard of Time-slicing GPU?

With generative AI and large language models (LLMs) fueling an AI arms race, GPUs are now fiercely contested.

Unlike CPUs, however, GPUs still cannot be casually oversubscribed.

So in this article I try out Time-slicing GPU, which allows a single GPU to be scheduled across multiple containers.

With it, multiple Pods on Kubernetes can share a single GPU.

Kubernetes generally relies on the k8s-device-plugin provided by NVIDIA to make GPUs visible to the cluster.

Recent versions of k8s-device-plugin support Time-slicing GPU, but the k8s-device-plugin bundled with OKE is not yet a version that supports it.

For that reason, this article uses the NVIDIA GPU Operator to set up Time-slicing GPU.

Let's get started.

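Provisioning OKE

Provision OKE by following the usual OKE tutorial, keeping the following points in mind.

Create a CPU node pool in addition to the GPU node pool
- Some NVIDIA GPU Operator components run on CPU(-only) nodes
- This article uses a single VM.GPU.A10.1 node (one GPU) as the GPU node
In the GPU node pool, run the following script via cloud-init
- Unless an alias for /sbin/ldconfig (/sbin/ldconfig.real) is created, the NVIDIA GPU Operator initialization does not complete successfully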

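#!/bin/bash
# Fetch the OKE init script from the instance metadata and run it
curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode >/var/run/oke-init.sh
bash /var/run/oke-init.sh
# Create the /sbin/ldconfig.real alias needed for NVIDIA GPU Operator initialization
sudo ln -s /sbin/ldconfig /sbin/ldconfig.real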

Installing the NVIDIA GPU Operator

The NVIDIA GPU Operator can be installed easily with Helm.

First, create the following values.yaml.

values.yaml

# Default values for gpu-operator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

platform:
  openshift: false

nfd:
  enabled: true
  nodefeaturerules: false

# deprecated: use PodSecurityAdmission (PSA) controls instead
psp:
  enabled: false

psa:
  enabled: false

cdi:
  enabled: false
  default: false

sandboxWorkloads:
  enabled: false
  defaultWorkload: "container"

daemonsets:
  labels: {}
  annotations: {}
  priorityClassName: system-node-critical
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # configuration for controlling update strategy("OnDelete" or "RollingUpdate") of GPU Operands
  # note that driver Daemonset is always set with OnDelete to avoid unintended disruptions
  updateStrategy: "RollingUpdate"
  # configuration for controlling rolling update of GPU Operands
  rollingUpdate:
    # maximum number of nodes to simultaneously apply pod updates on.
    # can be specified either as number or percentage of nodes. Default 1.
    maxUnavailable: "1"

validator:
  repository: nvcr.io/nvidia/cloud-native
  image: gpu-operator-validator
  # If version is not specified, then default is to use chart.AppVersion
  version: v23.9.0
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  args: []
  resources: {}
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: "false"

operator:
  repository: nvcr.io/nvidia
  image: gpu-operator
  # If version is not specified, then default is to use chart.AppVersion
  version: v23.9.0
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  priorityClassName: system-node-critical
  defaultRuntime: docker
  runtimeClass: nvidia
  use_ocp_driver_toolkit: false
  # cleanup CRD on chart un-install
  cleanupCRD: false
  # upgrade CRD on chart upgrade, requires --disable-openapi-validation flag
  # to be passed during helm upgrade.
  upgradeCRD: false
  initContainer:
    image: cuda
    repository: nvcr.io/nvidia
    version: 12.2.2-base-ubi8
    imagePullPolicy: IfNotPresent
  tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
  annotations:
    openshift.io/scc: restricted-readonly
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: "node-role.kubernetes.io/master"
                operator: In
                values: [""]
        - weight: 1
          preference:
            matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: In
                values: [""]
  logging:
    # Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano')
    timeEncoding: epoch
    # Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
    level: info
    # Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn)
    # Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error)
    develMode: false
  resources:
    limits:
      cpu: 500m
      memory: 350Mi
    requests:
      cpu: 200m
      memory: 100Mi

mig:
  strategy: single

driver:
  enabled: true
  nvidiaDriverCRD:
    enabled: false
    deployDefaultCR: true
    driverType: gpu
    nodeSelector: {}
  # use pre-compiled packages for NVIDIA driver installation.
  # only supported for as a tech-preview feature on ubuntu22.04 kernels.
  usePrecompiled: false
  repository: nvcr.io/nvidia
  image: driver
  version: "535.104.12"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  startupProbe:
    initialDelaySeconds: 60
    periodSeconds: 10
    # nvidia-smi can take longer than 30s in some cases
    # ensure enough timeout is set
    timeoutSeconds: 60
    failureThreshold: 120
  rdma:
    enabled: false
    useHostMofed: false
  upgradePolicy:
    # global switch for automatic upgrade feature
    # if set to false all other options are ignored
    autoUpgrade: true
    # how many nodes can be upgraded in parallel
    # 0 means no limit, all nodes will be upgraded in parallel
    maxParallelUpgrades: 1
    # maximum number of nodes with the driver installed, that can be unavailable during
    # the upgrade. Value can be an absolute number (ex: 5) or
    # a percentage of total nodes at the start of upgrade (ex:
    # 10%). Absolute number is calculated from percentage by rounding
    # up. By default, a fixed value of 25% is used.'
    maxUnavailable: 25%
    # options for waiting on pod(job) completions
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    # options for gpu pod deletion
    gpuPodDeletion:
      force: false
      timeoutSeconds: 300
      deleteEmptyDir: false
    # options for node drain (`kubectl drain`) before the driver reload
    # this is required only if default GPU pod deletions done by the operator
    # are not sufficient to re-install the driver
    drain:
      enable: false
      force: false
      podSelector: ""
      # It's recommended to set a timeout to avoid infinite drain in case non-fatal error keeps happening on retries
      timeoutSeconds: 300
      deleteEmptyDir: false
  manager:
    image: k8s-driver-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.6.4
    imagePullPolicy: IfNotPresent
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      - name: DRAIN_USE_FORCE
        value: "false"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "false"
  env: []
  resources: {}
  # Private mirror repository configuration
  repoConfig:
    configMapName: ""
  # custom ssl key/certificate configuration
  certConfig:
    name: ""
  # vGPU licensing configuration
  licensingConfig:
    configMapName: ""
    nlsEnabled: true
  # vGPU topology daemon configuration
  virtualTopology:
    config: ""
  # kernel module configuration for NVIDIA driver
  kernelModuleConfig:
    name: ""

toolkit:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.14.3-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
  installDir: "/usr/local/nvidia"

devicePlugin:
  enabled: true
  repository: nvcr.io/nvidia
  image: k8s-device-plugin
  version: v0.14.2-ubi8
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  args: []
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY
      value: envvar
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
  resources: {}
  # Plugin configuration
  # Use "name" to either point to an existing ConfigMap or to create a new one with a list of configurations(i.e with create=true).
  # Use "data" to build an integrated ConfigMap from a set of configurations as
  # part of this helm chart. An example of setting "data" might be:
  # config:
  #   name: device-plugin-config
  #   create: true
  #   data:
  #     default: |-
  #       version: v1
  #       flags:
  #         migStrategy: none
  #     mig-single: |-
  #       version: v1
  #       flags:
  #         migStrategy: single
  #     mig-mixed: |-
  #       version: v1
  #       flags:
  #         migStrategy: mixed
  config:
    # Create a ConfigMap (default: false)
    create: true
    # ConfigMap name (either existing or to create a new one with create=true above)
    name: "nvidia-device-plugin"
    # Default config name within the ConfigMap
    default: "any"
    # Data section for the ConfigMap to create (i.e only applies when create=true)
    data:
      {
        "any": "version: v1\nflags:\n  migStrategy: none\nsharing:\n  timeSlicing:\n    resources:\n    - name: nvidia.com/gpu\n      replicas: 10",
      }
# standalone dcgm hostengine
dcgm:
  # disabled by default to use embedded nv-hostengine by exporter
  enabled: false
  repository: nvcr.io/nvidia/cloud-native
  image: dcgm
  version: 3.2.6-1-ubuntu20.04
  imagePullPolicy: IfNotPresent
  hostPort: 5555
  args: []
  env: []
  resources: {}

dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.2.6-3.1.9-ubuntu20.04
  imagePullPolicy: IfNotPresent
  env:
    - name: DCGM_EXPORTER_LISTEN
      value: ":9400"
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
  resources: {}
  serviceMonitor:
    enabled: false
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []
    # - source_labels:
    #     - __meta_kubernetes_pod_node_name
    #   regex: (.*)
    #   target_label: instance
    #   replacement: $1
    #   action: replace

gfd:
  enabled: true
  repository: nvcr.io/nvidia
  image: gpu-feature-discovery
  version: v0.8.2-ubi8
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env:
    - name: GFD_SLEEP_INTERVAL
      value: 60s
    - name: GFD_FAIL_ON_INIT_ERROR
      value: "true"
  resources: {}

migManager:
  enabled: true
  repository: nvcr.io/nvidia/cloud-native
  image: k8s-mig-manager
  version: v0.5.5-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env:
    - name: WITH_REBOOT
      value: "false"
  resources: {}
  config:
    name: "default-mig-parted-config"
    default: "all-disabled"
  gpuClientsConfig:
    name: ""

nodeStatusExporter:
  enabled: false
  repository: nvcr.io/nvidia/cloud-native
  image: gpu-operator-validator
  # If version is not specified, then default is to use chart.AppVersion
  #version: ""
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  resources: {}

gds:
  enabled: false
  repository: nvcr.io/nvidia/cloud-native
  image: nvidia-fs
  version: "2.16.1"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  args: []

vgpuManager:
  enabled: false
  repository: ""
  image: vgpu-manager
  version: ""
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
  driverManager:
    image: k8s-driver-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.6.4
    imagePullPolicy: IfNotPresent
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "false"
      - name: ENABLE_AUTO_DRAIN
        value: "false"

There are many settings, but the important part is the following.
This is where the ConfigMap for Time-slicing GPU that the NVIDIA GPU Operator loads is configured.

  config:
    # Create a ConfigMap (default: false)
    create: true
    # ConfigMap name (either existing or to create a new one with create=true above)
    name: "nvidia-device-plugin"
    # Default config name within the ConfigMap
    default: "any"
    # Data section for the ConfigMap to create (i.e only applies when create=true)
    data:
      {
        "any": "version: v1\nflags:\n  migStrategy: none\nsharing:\n  timeSlicing:\n    resources:\n    - name: nvidia.com/gpu\n      replicas: 10",
      }

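create is the flag that controls whether the ConfigMap is generated automatically when the NVIDIA GPU Operator is installed.
This article lets it be generated automatically, but creating it beforehand also works. In that case, set this flag to false.

Next, name is simply the name of the ConfigMap.
With create=true, set the name of the ConfigMap to be generated; with create=false, set the name of the ConfigMap you created beforehand.

Next, default is the name of the configuration inside the ConfigMap.
With create=true, set the configuration name to use inside the generated ConfigMap; with create=false, set the configuration name inside the ConfigMap you created beforehand.

Finally, data.
With create=true, it defines the data field of the generated ConfigMap; with create=false, it does not need to be set.

The data field used in this article is shown below.
replicas: 10 means that the GPU is divided into 10 slices.
This allows up to 10 containers to be scheduled on the same GPU.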

  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 10
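
The escaped one-line string in values.yaml and the block form above describe the same configuration. As a minimal sketch, the following Python builds the configuration text for a given replica count and JSON-encodes it to produce the one-line form. The helper names `time_slicing_config` and `values_entry` are hypothetical and are not part of any NVIDIA tooling, and the `replicas < 1` check is only a sanity check of my own.

```python
import json


def time_slicing_config(replicas: int, resource: str = "nvidia.com/gpu") -> str:
    """Return the device-plugin config text used in this article's values.yaml.

    Hypothetical helper for illustration only; the layout mirrors the
    one-line string in values.yaml.
    """
    if replicas < 1:
        raise ValueError("replicas must be a positive integer")
    return "\n".join([
        "version: v1",
        "flags:",
        "  migStrategy: none",
        "sharing:",
        "  timeSlicing:",
        "    resources:",
        f"    - name: {resource}",
        f"      replicas: {replicas}",
    ])


def values_entry(config_name: str, config_text: str) -> str:
    """Render one `"name": "escaped text"` entry for devicePlugin.config.data."""
    # json.dumps escapes each newline as \n, which yields the one-line form.
    return f"{json.dumps(config_name)}: {json.dumps(config_text)}"


if __name__ == "__main__":
    # Prints the entry for the "any" configuration with 10 replicas.
    print(values_entry("any", time_slicing_config(10)))
```

Changing the argument of `time_slicing_config` gives the corresponding string for a different slice count, which can then be pasted into the data section.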

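Install the operator using the values.yaml above. The command assumes that the NVIDIA Helm repository has already been added under the name nvidia.
This article installs it into the kube-system namespace.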

$ helm install -f values.yaml -n kube-system --generate-name --wait nvidia/gpu-operator

After installation, initialization runs, and the cluster eventually reaches a state like the following.
Only the NVIDIA GPU Operator components in kube-system are shown.

kube-system   gpu-feature-discovery-tfk4l                                       2/2     Running     0               26m
kube-system   gpu-operator-1701944502-node-feature-discovery-gc-5f55fb85942td   1/1     Running     0               26m
kube-system   gpu-operator-1701944502-node-feature-discovery-master-55f5k2rs7   1/1     Running     0               26m
kube-system   gpu-operator-1701944502-node-feature-discovery-worker-dzwbx       1/1     Running     0               26m
kube-system   gpu-operator-1701944502-node-feature-discovery-worker-jqsm5       1/1     Running     0               26m
kube-system   gpu-operator-1701944502-node-feature-discovery-worker-pmwkj       1/1     Running     0               26m
kube-system   gpu-operator-1701944502-node-feature-discovery-worker-ztd4x       1/1     Running     0               26m
kube-system   gpu-operator-85b78946dd-2pb65                                     1/1     Running     0               26m
kube-system   nvidia-container-toolkit-daemonset-4r29g                          1/1     Running     0               26m
kube-system   nvidia-cuda-validator-78jgz                                       0/1     Completed   0               26m
kube-system   nvidia-dcgm-exporter-tcmkp                                        1/1     Running     0               26m
kube-system   nvidia-device-plugin-daemonset-pmg6l                              2/2     Running     0               26m
kube-system   nvidia-gpu-device-plugin-z25qz                                    1/1     Running     0               47m
kube-system   nvidia-operator-validator-dkxlq                                   1/1     Running     0               26m

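This completes the installation of the NVIDIA GPU Operator.

Verification

Once the NVIDIA GPU Operator is installed, the GPU resources reported in the Node capacity change.

Before installation, the capacity looks like this.
Because this article uses VM.GPU.A10.1, "nvidia.com/gpu" is 1.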

$ kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
{
  "name": "10.0.10.63",
  "capacity": {
    "cpu": "30",
    "ephemeral-storage": "37177616Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "246857100Ki",
    "nvidia.com/gpu": "1",
    "pods": "31"
  }
}

Because the slice count is set to 10, after installation the value becomes "nvidia.com/gpu": "10".

$ kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
{
  "name": "10.0.10.63",
  "capacity": {
    "cpu": "30",
    "ephemeral-storage": "37177616Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "246857100Ki",
    "nvidia.com/gpu": "10",
    "pods": "31"
  }
}
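
The same check can be scripted without jq. The following is a minimal Python sketch that extracts the advertised "nvidia.com/gpu" capacity per node from JSON shaped like the output of `kubectl get nodes -o json`. The function name `gpu_capacity` and the sample node "cpu-node" are hypothetical; the GPU node values are taken from the output above.

```python
import json


def gpu_capacity(node_list: dict) -> dict:
    """Map node name -> advertised nvidia.com/gpu capacity, skipping non-GPU nodes."""
    result = {}
    for node in node_list.get("items", []):
        capacity = node.get("status", {}).get("capacity", {})
        if "nvidia.com/gpu" in capacity:
            # Kubernetes reports extended resource quantities as strings.
            result[node["metadata"]["name"]] = int(capacity["nvidia.com/gpu"])
    return result


if __name__ == "__main__":
    # Trimmed sample shaped like `kubectl get nodes -o json`.
    # "cpu-node" is a made-up name standing in for a CPU-only node.
    sample = json.loads("""
    {
      "items": [
        {"metadata": {"name": "10.0.10.63"},
         "status": {"capacity": {"cpu": "30", "nvidia.com/gpu": "10"}}},
        {"metadata": {"name": "cpu-node"},
         "status": {"capacity": {"cpu": "4"}}}
      ]
    }
    """)
    print(gpu_capacity(sample))
```

In practice the dictionary passed to `gpu_capacity` would come from parsing the real kubectl output rather than the embedded sample.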

In this state, let's create a simple Deployment.

nginx-gpu.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-nginx
  labels:
    app: nginx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          resources:
            limits:
              nvidia.com/gpu: 1

$ kubectl apply -f nginx-gpu.yaml

If this is deployed without the NVIDIA GPU Operator, only one GPU is available, so every Pod except one stays Pending, as shown below.

$ kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
gpu-nginx-6678649c6-kxnkv   1/1     Running   0          31s
gpu-nginx-6678649c6-mlb5j   0/1     Pending   0          31s
gpu-nginx-6678649c6-nddht   0/1     Pending   0          31s
gpu-nginx-6678649c6-rzk7g   0/1     Pending   0          31s
gpu-nginx-6678649c6-xrwnq   0/1     Pending   0          31s

In this environment, Time-slicing GPU exposes the GPU as 10 replicas, so all Pods start, as shown below.

$ kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
gpu-nginx-6678649c6-kxnkv   1/1     Running   0          31s
gpu-nginx-6678649c6-mlb5j   1/1     Running   0          31s
gpu-nginx-6678649c6-nddht   1/1     Running   0          31s
gpu-nginx-6678649c6-rzk7g   1/1     Running   0          31s
gpu-nginx-6678649c6-xrwnq   1/1     Running   0          31s
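
The difference between the two results comes down to simple arithmetic on the advertised capacity. The following Python sketch models it under simplifying assumptions: a single GPU node, each Pod requesting exactly one "nvidia.com/gpu", and no other resource constraints. The function names are hypothetical, and this is not how the Kubernetes scheduler is implemented.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Capacity the node advertises when each physical GPU is split into `replicas` slices."""
    return physical_gpus * replicas


def place_pods(requested_pods: int, capacity: int) -> tuple:
    """Return (running, pending) counts for Pods that each request one nvidia.com/gpu."""
    running = min(requested_pods, capacity)
    return running, requested_pods - running


if __name__ == "__main__":
    # Without time-slicing: one A10 GPU advertised as 1 resource, 5 Pods requested.
    print(place_pods(5, advertised_gpus(1, 1)))
    # With replicas: 10, the same GPU is advertised as 10 resources.
    print(place_pods(5, advertised_gpus(1, 10)))
```

With the values from this article, the first call corresponds to one Running Pod and four Pending Pods, and the second to all five Pods Running.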

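With that, Time-slicing GPU is working through the NVIDIA GPU Operator!

Summary

In this article I tried out Time-slicing GPU using the NVIDIA GPU Operator.
I expect that this will eventually be possible on OKE with k8s-device-plugin alone.

GPUs have not yet reached the point where their resources can be used flexibly (and affordably), but I expect various technologies to make this possible over time.

Note that Time-slicing GPU only shares a single GPU, so processing capacity does not increase in proportion to the number of replicas.
In addition, memory limits are not taken into account (in other words, OOM can occur), so you need to consider measures such as introducing a mechanism on the GPU workload side to control memory usage.
If you want to use GPUs safely without that kind of consideration, I recommend using Multi-Instance GPU (MIG) or a similar feature on high-end GPU instances such as A100/H100, even if it costs somewhat more.
On OCI, MIG can be adopted using OKE and A100 instances.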
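
As one way to think about workload-side memory control, the following Python sketch computes an even per-Pod memory budget when several Pods share one GPU. This is an illustration under stated assumptions and not something the original setup configures: the function names are hypothetical, the 10% headroom is an arbitrary safety margin, and 24576 MiB is used as a nominal figure for the 24 GB A10. Nothing here enforces the limit; each workload would have to apply the budget itself, for example through a framework-level setting such as PyTorch's `torch.cuda.set_per_process_memory_fraction`.

```python
def per_pod_memory_fraction(replicas: int, headroom: float = 0.1) -> float:
    """Fraction of GPU memory each Pod may use if `replicas` Pods share one GPU.

    `headroom` is the share of memory deliberately left unallocated
    (an assumed safety margin, not a value from the article).
    """
    if replicas < 1:
        raise ValueError("replicas must be a positive integer")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    return (1.0 - headroom) / replicas


def per_pod_memory_mib(total_mib: int, replicas: int, headroom: float = 0.1) -> int:
    """Per-Pod budget in MiB, rounded down."""
    return int(total_mib * per_pod_memory_fraction(replicas, headroom))


if __name__ == "__main__":
    # Nominal 24 GiB GPU shared by 10 replicas, as in this article's config.
    print(per_pod_memory_fraction(10))
    print(per_pod_memory_mib(24576, 10))
```

An even split is the simplest policy; workloads with very different memory needs would call for per-workload budgets instead.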
