
Using Initializers to prevent unintended GPU allocation with K8s + NVIDIA device plugin

Background

  • In a Kubernetes + NVIDIA device plugin for Kubernetes + nvidia-docker environment, starting a container without requesting any GPUs attaches all of the node's GPUs to it (so, for example, a malicious user can ignore quotas and use every GPU on whatever node the container lands on).
    https://github.com/NVIDIA/k8s-device-plugin#running-gpu-jobs

With a GPU request (normal behavior)

gpu-request.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
$ kubectl apply -f gpu-request.yaml
$ kubectl logs gpu-pod
Sat Dec  1 08:39:42 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0    31W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   38C    P0    36W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Without a GPU request (the problem case)

gpu-norequest.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-norequest
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      command: ["nvidia-smi"]
$ kubectl apply -f gpu-norequest.yaml
$ kubectl logs gpu-pod-norequest
Sat Dec  1 08:46:25 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   35C    P0    35W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    34W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0    32W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0    31W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   38C    P0    36W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   37C    P0    33W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   38C    P0    33W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   37C    P0    32W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Example solution

Approach

Largely following PFN's approach, we forcibly inject the environment variable NVIDIA_VISIBLE_DEVICES=none into:

  • Pods whose resources do not mention nvidia.com/gpu
  • Pods that request nvidia.com/gpu: 0
    • The device plugin cannot hook these either, so they too see all GPUs

The difference is that for Dynamic Admission Control we use Initializers rather than Admission Webhooks.
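For reference, the injection has the same effect as if the user had written the environment variable into the Pod spec by hand, like this (a hypothetical fragment of gpu-norequest.yaml after mutation):

```yaml
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      command: ["nvidia-smi"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: none
```

With NVIDIA_VISIBLE_DEVICES=none, nvidia-docker exposes no GPU devices to the container.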

What is an Initializer?

Initializers are useful for admins to force policies (e.g., the AlwaysPullImages admission controller), or to inject defaults (e.g., the DefaultStorageClass admission controller), etc.

It is a kind of Dynamic Admission Control: it can intervene between a user's resource request and the moment the resource actually comes up, rewriting the request along the way. Admins use it to enforce policies and the like.

How an Initializer works

Implementing an Initializer requires two main pieces:

  • an InitializerConfiguration
  • an Initializer Controller

InitializerConfiguration

An InitializerConfiguration is simply a K8s resource.
When you create an InitializerConfiguration with match rules like the ones below, the specified initializer is automatically registered in metadata.initializers[] of every matching resource.

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: InitializerConfiguration
metadata:
  name: gpu
initializers:
  - name: gpu.initializer.kubernetes.io
    rules:
      - apiGroups:
          - "*"
        apiVersions:
          - "v1"
        resources:
          - pods

In this case, gpu.initializer.kubernetes.io is added to metadata.initializers[] of every Pod.
A resource whose metadata.initializers[] is non-empty stays pending until the initializer has processed it.
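While the Pod is pending, its metadata carries the initializer under a pending list; the shape is roughly the following (sketched from the v1alpha1 API, with our Pod name as an example):

```yaml
metadata:
  name: gpu-pod-norequest
  initializers:
    pending:
      - name: gpu.initializer.kubernetes.io
```

Once the pending list is empty, the Pod is considered initialized and is allowed to start.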

Initializer Controller

The Initializer Controller is something you implement yourself.
You can implement it however you like, but the basic pattern is a resident program that uses a ServiceAccount with cluster-admin privileges to watch every target resource (Pods, in our case) and, for each one:

  • rewrites the resource contents
    • (injects NVIDIA_VISIBLE_DEVICES=none)
  • removes itself from metadata.initializers[]
    • (removes gpu.initializer.kubernetes.io)

then applies the result.
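The controller's core mutation step can be sketched as follows. This is a minimal illustration with simplified stand-in structs, not the real implementation: an actual controller would use the k8s.io/api/core/v1 types and a watch via client-go, and would have to list Pods with includeUninitialized=true, since uninitialized objects are hidden from clients by default.

```go
package main

import "fmt"

// Simplified stand-ins for the relevant parts of a Pod object.
type EnvVar struct{ Name, Value string }

type Container struct {
	Name   string
	Env    []EnvVar
	Limits map[string]string // stand-in for resources.limits
}

type Pod struct {
	Initializers []string // stand-in for metadata.initializers.pending[*].name
	Containers   []Container
}

const initializerName = "gpu.initializer.kubernetes.io"

// initialize performs the two steps described above: inject
// NVIDIA_VISIBLE_DEVICES=none into every container that does not
// request nvidia.com/gpu (or requests 0), then remove this
// initializer from the pending list so the Pod can start.
func initialize(p *Pod) {
	// Initializers run in order; only act when we are first in line.
	if len(p.Initializers) == 0 || p.Initializers[0] != initializerName {
		return
	}
	for i := range p.Containers {
		c := &p.Containers[i]
		if g, ok := c.Limits["nvidia.com/gpu"]; !ok || g == "0" {
			c.Env = append(c.Env, EnvVar{"NVIDIA_VISIBLE_DEVICES", "none"})
		}
	}
	p.Initializers = p.Initializers[1:] // remove ourselves
}

func main() {
	pod := Pod{
		Initializers: []string{initializerName},
		Containers:   []Container{{Name: "cuda-container"}},
	}
	initialize(&pod)
	fmt.Println(pod.Containers[0].Env, len(pod.Initializers))
	// prints: [{NVIDIA_VISIBLE_DEVICES none}] 0
}
```

In the real controller this mutation runs inside the watch loop, and the modified Pod is written back with an update call so that kube-apiserver releases it from the pending state.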

Overall flow

initializers.png (figure: the overall flow)
Incidentally, the InitializerConfiguration applies not only when a Pod is created directly, but also when Pods are created indirectly by a ReplicaSet, DaemonSet, and so on.
※The component that actually inserts the initializers metadata is kube-apiserver, which reads the InitializerConfiguration.

Trying it out

Enabling Initializers

Initializers is an alpha feature, so it is off by default.
Enable it by following "Enable initializers alpha feature" in the docs.
On GKE, create a cluster with alpha features enabled.
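On a self-managed cluster, enabling it comes down to two kube-apiserver flags, roughly as below (flag names as documented for the 1.9–1.13 era; adjust to your version, and append Initializers to whatever plugin list you already pass):

```shell
kube-apiserver \
  --enable-admission-plugins=Initializers \
  --runtime-config=admissionregistration.k8s.io/v1alpha1=true
```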

Deploy

$ kubectl apply -f https://raw.githubusercontent.com/takmatsu/gpu-initializer/master/manifests/rolebindings.yaml
$ kubectl apply -f https://raw.githubusercontent.com/takmatsu/gpu-initializer/master/manifests/configmaps.yaml
$ kubectl apply -f https://raw.githubusercontent.com/takmatsu/gpu-initializer/master/manifests/deployment.yaml
$ kubectl apply -f https://raw.githubusercontent.com/takmatsu/gpu-initializer/master/manifests/initializer-configuration.yaml

Verify

$ kubectl apply -f gpu-norequest.yaml
$ kubectl describe pod gpu-pod-norequest
~
Containers:
  cuda-container:
    Container ID:  docker://b4db40e7c6d23d0e8a982fcfd8dee590c3fa517208a4e340a3cfc2880c859e7b
    Image:         nvidia/cuda:9.0-devel
    Image ID:      docker-pullable://nvidia/cuda@sha256:3cdf1b5becfde8772e15dab594bc76de1cbbefd6c0f8533748854ab47e109ad1
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-smi
    State:          Terminated
      Reason:       Error
      Exit Code:    6
      Started:      Sat, 01 Dec 2018 18:20:25 +0900
      Finished:     Sat, 01 Dec 2018 18:20:25 +0900
    Last State:     Terminated
      Reason:       Error
      Exit Code:    6
      Started:      Sat, 01 Dec 2018 18:19:55 +0900
      Finished:     Sat, 01 Dec 2018 18:19:55 +0900
    Ready:          False
    Restart Count:  3
    Environment:
      NVIDIA_VISIBLE_DEVICES:  none
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-8z4dk (ro)
~
$ kubectl logs gpu-pod-norequest
No devices were found

NVIDIA_VISIBLE_DEVICES: none has been injected automatically 🎉

However...

Whoa, whoa, whoa!!!
(The Initializers feature never graduated from alpha; the community decided to drop it in favor of admission webhooks.)

Summary

  • Use Mutating Admission Webhooks (beta and enabled by default since 1.9)!
  • Think carefully before using alpha features!
  • Keeping up with community developments matters!
