ZOZO Advent Calendar 2024

Launching Ray Cluster Worker Nodes on Spot VMs

Posted at 2024-12-05, last updated at 2024-12-05

This is the Day 6 article of the ZOZO Advent Calendar 2024, Vol. 9.

Introduction

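Yesterday I posted an article about Ray Cluster titled "Specifying a Service Account for Ray Cluster Worker Nodes" (Ray ClusterのWorker NodeでService Accountを指定する).
Today's post continues on the same topic. For a brief overview of Ray Cluster, see the Day 1 article, "Accessing a Ray Cluster on GKE through an external Load Balancer" (GKE上に構成したRay Clusterに外部Load Balancer経由でアクセスする).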

This article shows how to run the Pods for Ray Cluster Worker Nodes on Google Cloud Spot VMs.

Characteristics of Spot VMs

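Google Cloud Spot VMs cost less than standard VMs in exchange for a few constraints.
Most notably, they can be preempted at any time when Google Cloud runs short of capacity. In return, the discount is large, at 60-91%.

They are a poor fit for workloads such as APIs that need VMs running at all times, but they reduce costs for workloads such as batch jobs that need VMs only for relatively short periods. Ray Cluster Worker Nodes mostly run one-off batch jobs, so Spot VMs suit them well.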
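To make the 60-91% discount range concrete, here is a small Python sketch that converts a standard hourly price into the corresponding Spot price range. The function name spot_price_range and the 0.10 USD/hour input are assumptions made up for illustration; they are not real Google Cloud prices.

```python
# Discount range quoted above for Spot VMs (60-91% off the standard price).
MIN_DISCOUNT = 0.60
MAX_DISCOUNT = 0.91


def spot_price_range(standard_hourly_usd: float) -> tuple[float, float]:
    """Return (lowest, highest) hourly Spot price for a given standard price."""
    lowest = standard_hourly_usd * (1 - MAX_DISCOUNT)
    highest = standard_hourly_usd * (1 - MIN_DISCOUNT)
    return lowest, highest


if __name__ == "__main__":
    # 0.10 USD/hour is a made-up standard price, not a real quote.
    low, high = spot_price_range(0.10)
    print(f"Spot price: {low:.3f} to {high:.3f} USD/hour")
```

The actual discount depends on the machine type and region, so treat the result only as a rough bound.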

Using Spot VMs for Ray Cluster Worker Nodes

Spot VMs can also be used for Google Kubernetes Engine (GKE) nodes.
Here we define a NodePool that launches Spot VMs in advance, then run the Ray Cluster Worker Node Pods on the nodes of that pool.

Running Ray Cluster Worker Node Pods on predefined nodes

Creating the NodePool

The following is the Terraform resource definition for creating the NodePool (ref: google_container_node_pool).
Setting the node_config.spot field to true launches the nodes as Spot VMs. A taint is also set on the nodes here, so that only Pods carrying the matching toleration can be scheduled onto them.

nodepool.tf

resource "google_container_node_pool" "ray-cluster-node-pool" {
  provider = google-beta
  for_each = { # key: node pool name, value: per-pool settings
    small-worker                   = { machine_type = "e2-standard-2", workload_metadata_config_mode = "GKE_METADATA", spot = true },
    medium-worker                  = { machine_type = "e2-standard-4", workload_metadata_config_mode = "GKE_METADATA", spot = true },
  }

  name               = each.key
  location           = "us-central1"
  node_locations     = try(each.value["node_locations"], null) # list of zones; null falls back to the cluster's node locations
  cluster            = <cluster name>
  initial_node_count = 0 # per zone

  autoscaling {
    min_node_count  = try(each.value["min_node_count"], 0)  # per zone
    max_node_count  = try(each.value["max_node_count"], 50) # per zone
    location_policy = "ANY"
  }

  timeouts {
    create = "60m"
    update = "60m"
    delete = "60m"
  }

  node_config {
    machine_type = each.value["machine_type"]
    spot         = each.value["spot"]
    disk_size_gb = "100"
    metadata = {
      disable-legacy-endpoints = "true"
    }
    workload_metadata_config {
      mode = each.value["workload_metadata_config_mode"]
    }
    taint {
      effect = "NO_SCHEDULE"
      key    = "dedicated"
      value  = each.key
    }
    service_account = google_service_account.gke-ray-node-default.email
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]

    guest_accelerator {
      type  = try(each.value["guest_accelerator_type"], "nvidia-tesla-t4")
      count = try(each.value["guest_accelerator_count"], 0) # gpu per node
    }
  }
  management {
    auto_repair  = true
    auto_upgrade = true
  }
  upgrade_settings {
    max_surge       = 1
    max_unavailable = 0
  }
  lifecycle {
    ignore_changes = [
      managed_instance_group_urls,
      node_config.0.taint,
    ]
  }
}
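To illustrate why the taint keeps other Pods off these nodes, here is a minimal Python sketch of the taint and toleration matching rule. It is a simplified model, not the Kubernetes scheduler's code; the Taint and Toleration classes and the tolerates and can_schedule functions are hypothetical names introduced for this example. The Terraform effect NO_SCHEDULE appears on the node as the Kubernetes effect NoSchedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # e.g. "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key (with Exists) matches every key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the given taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    # "Equal" (the default operator) also requires the values to match.
    return toleration.value == taint.value


def can_schedule(tolerations: list[Toleration], taints: list[Taint]) -> bool:
    """A Pod fits only if every hard taint on the node is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.effect in ("NoSchedule", "NoExecute")
    )


if __name__ == "__main__":
    node_taints = [Taint("dedicated", "small-worker", "NoSchedule")]
    worker = [Toleration("dedicated", "Equal", "small-worker", "NoSchedule")]
    print(can_schedule(worker, node_taints))  # True
    print(can_schedule([], node_taints))      # False
```

Because the taint value is the pool name (each.key), a Pod that tolerates dedicated=small-worker is still rejected by nodes in the medium-worker pool.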

Configuring the Worker Nodes

The settings for the Pods that run Ray Cluster Worker Nodes can be written in the manifest of the RayCluster object.

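The RayCluster resource defines its worker groups with the following type. The Template field (workerGroupSpecs[].template in the manifest) is a corev1.PodTemplateSpec, the same type used when creating an ordinary Pod.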

raycluster_types.go

type WorkerGroupSpec struct {
	// we can have multiple worker groups, we distinguish them by name
	GroupName string `json:"groupName"`
	// Replicas is the number of desired Pods for this worker group. See https://github.com/ray-project/kuberay/pull/1443 for more details about the reason for making this field optional.
	// +kubebuilder:default:=0
	Replicas *int32 `json:"replicas,omitempty"`
	// MinReplicas denotes the minimum number of desired Pods for this worker group.
	// +kubebuilder:default:=0
	MinReplicas *int32 `json:"minReplicas"`
	// MaxReplicas denotes the maximum number of desired Pods for this worker group, and the default value is maxInt32.
	// +kubebuilder:default:=2147483647
	MaxReplicas *int32 `json:"maxReplicas"`
	// IdleTimeoutSeconds denotes the number of seconds to wait before the v2 autoscaler terminates an idle worker pod of this type.
	// This value is only used with the Ray Autoscaler enabled and defaults to the value set by the AutoscalingConfig if not specified for this worker group.
	IdleTimeoutSeconds *int32 `json:"idleTimeoutSeconds,omitempty"`
	// RayStartParams are the params of the start command: address, object-store-memory, ...
	RayStartParams map[string]string `json:"rayStartParams"`
	// Template is a pod template for the worker
	Template corev1.PodTemplateSpec `json:"template"`
	// ScaleStrategy defines which pods to remove
	ScaleStrategy ScaleStrategy `json:"scaleStrategy,omitempty"`
	// NumOfHosts denotes the number of hosts to create per replica. The default value is 1.
	// +kubebuilder:default:=1
	NumOfHosts int32 `json:"numOfHosts,omitempty"`
}
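The +kubebuilder:default markers in the struct mean that several fields can be left out of the manifest. As a rough illustration, the following Python sketch fills in those defaults for a partially specified worker group. WORKER_GROUP_DEFAULTS and with_defaults are hypothetical names for this example; in a real cluster the defaulting is not done by client code like this.

```python
MAX_INT32 = 2**31 - 1  # 2147483647, the default for maxReplicas

# Defaults copied from the +kubebuilder:default markers in WorkerGroupSpec.
WORKER_GROUP_DEFAULTS = {
    "replicas": 0,
    "minReplicas": 0,
    "maxReplicas": MAX_INT32,
    "numOfHosts": 1,
}


def with_defaults(worker_group: dict) -> dict:
    """Return a copy of worker_group with unset defaulted fields filled in."""
    return {**WORKER_GROUP_DEFAULTS, **worker_group}


if __name__ == "__main__":
    # Only groupName and maxReplicas are given; the rest comes from defaults.
    print(with_defaults({"groupName": "small-group", "maxReplicas": 1}))
```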

Writing the same settings as for an ordinary Pod under workerGroupSpecs[].template therefore lets you specify NodeAffinity and Tolerations for the Pods that run Ray Cluster Worker Nodes.
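As a rough illustration of what a required NodeAffinity does, the following Python sketch evaluates nodeSelectorTerms against a node's labels. It is a simplified model rather than the scheduler's implementation, and expression_matches and node_matches are hypothetical names. The labels cloud.google.com/gke-nodepool and cloud.google.com/gke-spot are the ones used in the manifest in this article.

```python
def expression_matches(labels: dict[str, str], expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, operator = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if operator == "In":
        return labels.get(key) in values
    if operator == "NotIn":
        return labels.get(key) not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not covered by this sketch: {operator}")


def node_matches(labels: dict[str, str], node_selector_terms: list[dict]) -> bool:
    """Terms are ORed; the expressions inside one term are ANDed."""
    for term in node_selector_terms:
        expressions = term.get("matchExpressions", [])
        # A term without expressions matches nothing.
        if expressions and all(expression_matches(labels, e) for e in expressions):
            return True
    return False


if __name__ == "__main__":
    terms = [{"matchExpressions": [
        {"key": "cloud.google.com/gke-nodepool", "operator": "In", "values": ["small-worker"]},
        {"key": "cloud.google.com/gke-spot", "operator": "In", "values": ["true"]},
    ]}]
    spot_node = {
        "cloud.google.com/gke-nodepool": "small-worker",
        "cloud.google.com/gke-spot": "true",
    }
    print(node_matches(spot_node, terms))  # True
```

Both expressions sit in the same term, so a node must belong to the named pool and carry the Spot label at the same time.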

Specifying the NodeAffinity and Toleration that correspond to the NodePool created earlier, as follows, launches the Ray Cluster Worker Nodes on Spot VMs.

ray-cluster.yaml

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  ... (omitted)
spec:
  autoscalerOptions:
    ... (omitted)
  enableInTreeAutoscaling: true
  headGroupSpec:
    ... (omitted)
  rayVersion: 2.22.0
  workerGroupSpecs:
  - groupName: small-group
    maxReplicas: 1
    minReplicas: 0
    rayStartParams: {}
    replicas: 0
    template:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cloud.google.com/gke-nodepool
                  operator: In
                  values:
                  - small-worker
                - key: cloud.google.com/gke-spot
                  operator: In
                  values:
                  - "true"
        containers:
        - image: rayproject/ray:2.22.0-py311
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - ray stop
          name: ray-worker
          resources:
            limits:
              cpu: "1"
              memory: 4G
            requests:
              cpu: "1"
              memory: 4G
        tolerations:
        - effect: NoSchedule
          key: dedicated
          operator: Equal
          value: small-worker
  - groupName: medium-group
    maxReplicas: 1
    minReplicas: 0
    rayStartParams: {}
    replicas: 0
    template:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cloud.google.com/gke-nodepool
                  operator: In
                  values:
                  - medium-worker
                - key: cloud.google.com/gke-spot
                  operator: In
                  values:
                  - "true"
        containers:
        - image: rayproject/ray:2.22.0-py311
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - ray stop
          name: ray-worker
          resources:
            limits:
              cpu: "3"
              memory: 12G
            requests:
              cpu: "3"
              memory: 12G
        tolerations:
        - effect: NoSchedule
          key: dedicated
          operator: Equal
          value: medium-worker
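The two worker groups differ only in the group name, the node pool name, and the resource sizes. As one possible way to avoid the duplication, the following Python sketch generates the workerGroupSpecs entries from those parameters. The worker_group helper is a hypothetical name for this example, and the output is JSON, which tools that read YAML generally accept as well.

```python
import json


def worker_group(group_name: str, node_pool: str, cpu: str, memory: str) -> dict:
    """Build one workerGroupSpecs entry pinned to a tainted Spot node pool."""
    resources = {"cpu": cpu, "memory": memory}
    match_expressions = [
        {"key": "cloud.google.com/gke-nodepool", "operator": "In", "values": [node_pool]},
        {"key": "cloud.google.com/gke-spot", "operator": "In", "values": ["true"]},
    ]
    return {
        "groupName": group_name,
        "replicas": 0,
        "minReplicas": 0,
        "maxReplicas": 1,
        "rayStartParams": {},
        "template": {"spec": {
            "affinity": {"nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{"matchExpressions": match_expressions}],
                },
            }},
            "containers": [{
                "name": "ray-worker",
                "image": "rayproject/ray:2.22.0-py311",
                "lifecycle": {"preStop": {"exec": {"command": ["/bin/sh", "-c", "ray stop"]}}},
                "resources": {"limits": resources, "requests": dict(resources)},
            }],
            # The toleration value must equal the taint value, i.e. the pool name.
            "tolerations": [{
                "key": "dedicated",
                "operator": "Equal",
                "value": node_pool,
                "effect": "NoSchedule",
            }],
        }},
    }


if __name__ == "__main__":
    groups = [
        worker_group("small-group", "small-worker", "1", "4G"),
        worker_group("medium-group", "medium-worker", "3", "12G"),
    ]
    print(json.dumps(groups, indent=2))
```

Passing the pool name once keeps the nodeAffinity value and the toleration value in sync with the taint defined on the NodePool.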
