More than 5 years have passed since last update.

Machine Learning on EKS #5: Configuring Cluster Autoscaler


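Introduction

In this series, we run machine learning workloads on Amazon EKS.

Series contents

Machine Learning on EKS #1: Preparation
Machine Learning on EKS #2: Creating the Cluster
Machine Learning on EKS #3: Creating Managed Worker Nodes
Machine Learning on EKS #4: Creating GPU Managed Worker Nodes
Machine Learning on EKS #5: Configuring Cluster Autoscaler (this article)

Goal of this article

This time, we set up Cluster Autoscaler so that the worker nodes themselves scale automatically.

Create an IAM policy for Cluster Autoscaler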

https://aws.amazon.com/jp/premiumsupport/knowledge-center/eks-cluster-autoscaler-setup/
Following the guide above, create an IAM policy named "ClusterAutoScaler".

The policy looks like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup"
            ],
            "Resource": "*"
        }
    ]
}
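
The policy can be created in the IAM console or from the command line. As a minimal sketch, assuming the AWS CLI is installed and configured with permission to create IAM policies, the JSON above can be saved to a file and registered as shown below. The file name cluster-autoscaler-policy.json and the function name are choices made for this example, not part of the original procedure. The AWS call is wrapped in a function so that nothing is sent to AWS until you invoke it.

```shell
# Save the policy document shown above.
# The file name is an assumption made for this sketch.
cat > cluster-autoscaler-policy.json <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup"
            ],
            "Resource": "*"
        }
    ]
}
EOF

# Hypothetical helper: registers the file as an IAM policy
# named "ClusterAutoScaler" in the current AWS account.
create_cluster_autoscaler_policy() {
  aws iam create-policy \
    --policy-name ClusterAutoScaler \
    --policy-document file://cluster-autoscaler-policy.json
}
```

Calling create_cluster_autoscaler_policy once should return the new policy's ARN, which is the value passed to --attach-policy-arn later in this article.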

Enable the IAM OIDC provider

Instead of the commonly seen approach of attaching an IAM role to the EC2 instances, this article uses a service account.
First, run the following command to enable the IAM OIDC provider:

eksctl utils associate-iam-oidc-provider --cluster=ml --region=us-west-2 --approve
[ℹ]  eksctl version 0.13.0
[ℹ]  using region us-west-2
[ℹ]  will create IAM Open ID Connect provider for cluster "ml" in "us-west-2"
[✔]  created IAM Open ID Connect provider for cluster "ml" in "us-west-2"

Create a service account for Cluster Autoscaler

Replace XXXXXXXXX with your own AWS account ID.

eksctl create iamserviceaccount --cluster=ml --name=cluster-autoscaler --namespace=kube-system --attach-policy-arn=arn:aws:iam::XXXXXXXXX:policy/ClusterAutoScaler --region=us-west-2 --approve
[ℹ]  eksctl version 0.13.0
[ℹ]  using region us-west-2
[ℹ]  1 iamserviceaccount (kube-system/cluster-autoscaler) was included (based on the include/exclude rules)
[!]  serviceaccounts that exists in Kubernetes will be excluded, use --override-existing-serviceaccounts to override
[ℹ]  1 task: { 2 sequential sub-tasks: { create IAM role for serviceaccount "kube-system/cluster-autoscaler", create serviceaccount "kube-system/cluster-autoscaler" } }
[ℹ]  building iamserviceaccount stack "eksctl-ml-addon-iamserviceaccount-kube-system-cluster-autoscaler"
[ℹ]  deploying stack "eksctl-ml-addon-iamserviceaccount-kube-system-cluster-autoscaler"
[ℹ]  created serviceaccount "kube-system/cluster-autoscaler"

Confirm, just to be safe, that the service account was created (k is used here as a shell alias for kubectl):

k get sa -n kube-system | grep cluster-autoscaler
cluster-autoscaler                   1         119s
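
eksctl links the service account to the IAM role through an annotation rather than through the node's instance profile. One way to see that link, sketched here on the assumption that kubectl points at the ml cluster, is to read the eks.amazonaws.com/role-arn annotation from the service account. The function name is made up for this example.

```shell
# Hypothetical helper: print the IAM role ARN recorded on the
# cluster-autoscaler service account. The dots inside the
# annotation key are escaped so that jsonpath treats the key
# as a single field name.
show_autoscaler_role_arn() {
  kubectl get serviceaccount cluster-autoscaler \
    --namespace kube-system \
    --output jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}
```

If the output is empty, the service account probably existed before eksctl ran and was skipped, as the warning about --override-existing-serviceaccounts in the log above suggests.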

Deploy Cluster Autoscaler

Create the following yml and apply it:

# ---
# apiVersion: v1
# kind: ServiceAccount
# metadata:
#  labels:
#    k8s-addon: cluster-autoscaler.addons.k8s.io
#    k8s-app: cluster-autoscaler
#  name: cluster-autoscaler
#  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resourceNames: ["cluster-autoscaler"]
    resources: ["leases"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create","list","watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
    verbs: ["delete", "get", "update", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: k8s.gcr.io/cluster-autoscaler:v1.14.7
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled
            #- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-bundle.crt"

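Key points:

  • The service account is not created by this yml (that section is commented out), since eksctl already created it in the previous step.
  • As shown below, the cluster name is left out of the auto-discovery setting to keep the yml more generic, although this is apparently not the recommended configuration.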
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled
            #- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>
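
With tag-based auto-discovery, Cluster Autoscaler manages the Auto Scaling groups that carry the listed tag keys. Filtering only on k8s.io/cluster-autoscaler/enabled means that any group in the account and region with that tag is picked up, including groups that belong to another cluster, which is likely why adding the cluster-name tag is the recommended form. As a rough check, assuming the AWS CLI is configured, the following hypothetical helper lists which Auto Scaling groups carry Cluster Autoscaler tags in us-west-2.

```shell
# Hypothetical helper: list Auto Scaling groups (ResourceId) and
# the Cluster Autoscaler tag keys attached to them. The JMESPath
# query keeps only tags whose key starts with the
# k8s.io/cluster-autoscaler/ prefix.
list_cluster_autoscaler_tags() {
  aws autoscaling describe-tags \
    --region us-west-2 \
    --query "Tags[?starts_with(Key, 'k8s.io/cluster-autoscaler/')].[ResourceId, Key]" \
    --output text
}
```

If groups from more than one cluster show up with the enabled tag, switching to the commented-out line that also names the cluster is the safer choice.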

Verification

Apply the following yml and confirm that auto scaling kicks in:

gpu-app-deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app-deployment
  labels:
    app: gpu-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
        - name: digits-container
          image: nvidia/digits:6.0
          resources:
            limits:
              nvidia.com/gpu: 1

k apply -f gpu-app-deployment.yml
deployment.apps/gpu-app-deployment created

# Because GPU resources are insufficient, the second Pod stays Pending at first. After a while,
# Cluster Autoscaler kicks in, one EC2 node is added (2 -> 3 nodes), and the pod gets created.
k get pod -w
NAME                                  READY   STATUS    RESTARTS   AGE
gpu-app-deployment-5998b859d9-cfn66   1/1     Running   0          42s
gpu-app-deployment-5998b859d9-w6w6c   0/1     Pending   0          42s
gpu-app-deployment-5998b859d9-w6w6c   0/1     Pending   0          107s
gpu-app-deployment-5998b859d9-w6w6c   0/1     Pending   0          2m17s
gpu-app-deployment-5998b859d9-w6w6c   0/1     ContainerCreating   0          2m17s
gpu-app-deployment-5998b859d9-w6w6c   1/1     Running             0          3m23s


k get node
NAME                                            STATUS   ROLES    AGE   VERSION
ip-192-168-109-12.us-west-2.compute.internal    Ready    <none>   19h   v1.14.8-eks-b8860f
ip-192-168-134-74.us-west-2.compute.internal    Ready    <none>   72m   v1.14.8-eks-b8860f
ip-192-168-178-187.us-west-2.compute.internal   Ready    <none>   86s   v1.14.8-eks-b8860f
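
Besides watching pods and nodes, the scale-up can also be confirmed from Kubernetes events: when Cluster Autoscaler decides to add a node for a pending pod, it records an event with the reason TriggeredScaleUp on that pod. A minimal sketch, assuming the Deployment was created in the default namespace as above (the function name is hypothetical):

```shell
# Hypothetical helper: list the scale-up events that Cluster
# Autoscaler recorded for pods in the default namespace.
show_scale_up_events() {
  kubectl get events \
    --namespace default \
    --field-selector reason=TriggeredScaleUp
}
```

Events are typically retained only for a limited time, so this check is best run shortly after the new node appears.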

Finally, clean up


k delete -f gpu-app-deployment.yml 
deployment.apps "gpu-app-deployment" deleted

Summary

In this article, we configured Cluster Autoscaler and confirmed that the number of GPU instances actually increased.
