Operating MLOps on AWS with Kubernetes: A Step-by-Step Guide

Posted at 2025-03-18

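1. Introduction

Running machine learning models in production (MLOps) requires scalability and availability, and Kubernetes has become a common way to provide both. On AWS, Amazon EKS (Elastic Kubernetes Service) offers a managed Kubernetes control plane on which MLOps workflows can be managed and operated efficiently.

This article walks through operating MLOps with Kubernetes on AWS at the implementation level.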

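2. Benefits of Kubernetes on AWS (EKS)

✅ Autoscaling - inference APIs scale automatically with load
✅ Flexible deployment through containers - ML models packaged as Docker images can be deployed as-is
✅ Easy CI/CD integration - works with GitHub Actions, ArgoCD, and similar tools
✅ Cost optimization - Spot Instances and Fargate can reduce cost
✅ Secure operation - access control through IAM, VPC, and Kubernetes RBAC on EKS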

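3. Workflow for Building MLOps on AWS with Kubernetes (EKS)

The main steps for building an MLOps workflow on Amazon EKS are as follows.

1. Create the EKS cluster
2. Integrate with S3, ECR, and SageMaker
3. Package the ML model as a Docker container
4. Deploy the inference API with KServe (formerly KFServing)
5. Build the training pipeline with Argo Workflows
6. Monitor with Prometheus + Grafana
7. Automate CI/CD (GitHub Actions + ArgoCD)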

4. Creating the EKS Cluster

An EKS cluster can be set up with either the AWS CLI or Terraform.

🛠 Create the EKS cluster with the AWS CLI

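# Create the EKS control plane (EKS requires subnets in at least two Availability Zones).
# Worker nodes are not created by this command; add a node group separately.
aws eks create-cluster --name mlops-cluster --region us-west-2 \
    --role-arn arn:aws:iam::123456789012:role/EKSRole \
    --resources-vpc-config subnetIds=subnet-abc123,subnet-def456,securityGroupIds=sg-xyz456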

🛠 Create the EKS cluster with Terraform

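provider "aws" {
  region = "us-west-2"
}

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "~> 20.0" # pin the module; input names differ between major versions
  cluster_name    = "mlops-cluster"
  cluster_version = "1.26" # check that this version is still supported by EKS
  vpc_id          = "vpc-0abc123" # placeholder: the VPC that contains the subnets below
  subnet_ids      = ["subnet-abc123", "subnet-def456"]
}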

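Once the cluster is active, update your kubeconfig and verify access with kubectl. Nodes appear in the output only after a node group has been added.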

aws eks update-kubeconfig --region us-west-2 --name mlops-cluster
kubectl get nodes
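
In an automated setup it can help to check node readiness programmatically instead of reading the table by eye. The following is a minimal sketch, assuming `kubectl` is on the PATH and the kubeconfig has been updated as above. The function names and the sample node names are illustrative and do not come from the original procedure.

```python
import json
import subprocess


def ready_node_names(node_list):
    """Return the names of nodes whose "Ready" condition has status "True"."""
    ready = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions):
            ready.append(node["metadata"]["name"])
    return ready


def fetch_node_list():
    """Run kubectl and parse its JSON output (needs kubectl and a valid kubeconfig)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Hand-written sample in the shape of `kubectl get nodes -o json`; node names are made up.
    sample = {
        "items": [
            {"metadata": {"name": "node-a"},
             "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
            {"metadata": {"name": "node-b"},
             "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
        ]
    }
    print(ready_node_names(sample))
```

Against a real cluster, `ready_node_names(fetch_node_list())` would return the nodes that are ready to schedule pods. An empty list right after cluster creation usually means no node group exists yet.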

5. Containerizing the ML Model with Docker & Registering It in ECR

🛠 Create the Dockerfile (PyTorch model example)

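# Python 3.8 has reached end of life; a newer slim image is assumed here
FROM python:3.11-slim
WORKDIR /app
# Install dependencies first so this layer is cached across model updates
RUN pip install --no-cache-dir torch flask
COPY model.pt /app/
COPY server.py /app/
CMD ["python", "server.py"]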

🛠 Push to Amazon ECR

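aws ecr create-repository --repository-name ml-model --region us-west-2
ECR_URL=123456789012.dkr.ecr.us-west-2.amazonaws.com/ml-model

docker build -t ml-model .
docker tag ml-model:latest "$ECR_URL:latest"

# "aws ecr get-login" was removed in AWS CLI v2; use get-login-password instead
aws ecr get-login-password --region us-west-2 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-west-2.amazonaws.com
docker push "$ECR_URL:latest"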
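
The commands above push a mutable `latest` tag. One common alternative, shown here only as a sketch and not as part of the original procedure, is to tag each image with an immutable identifier such as a short Git commit hash so that a deployment can be traced back to a build. The helper name `ecr_image_uri` and the example tag are hypothetical.

```python
import re


def ecr_image_uri(account_id, region, repository, tag):
    """Build a full ECR image reference: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    # Docker tags: letters, digits, '_', '.', '-'; must not start with '.' or '-'; max 128 chars.
    if not re.fullmatch(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}", tag):
        raise ValueError(f"invalid Docker tag: {tag!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


if __name__ == "__main__":
    # "3f2a1c9" stands in for a short Git commit hash (illustrative value).
    print(ecr_image_uri("123456789012", "us-west-2", "ml-model", "3f2a1c9"))
```

The resulting string can be passed to `docker tag` and `docker push` in place of `$ECR_URL:latest`.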

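6. Deploying the Model with KServe (formerly KFServing)

KServe provides scalable inference APIs for ML models on Kubernetes.

🛠 Install KServe on EKS

KServe depends on cert-manager, and its default serverless mode also needs Knative Serving with a networking layer such as Istio, so install those first. The second manifest installs the built-in serving runtimes, including the one used by the tensorflow predictor.

kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.10.0/kserve.yaml
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.10.0/kserve-runtimes.yaml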

🛠 Create the model deployment manifest (TensorFlow model example)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mnist-predictor
spec:
  predictor:
    tensorflow:
      storageUri: "s3://mlops-bucket/models/mnist"

After deploying, check that the InferenceService reports READY and note its URL.

kubectl get inferenceservice
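
Once the service is ready, a prediction can be requested over KServe's V1 inference protocol, which posts a JSON body of the form `{"instances": [...]}` to `/v1/models/<name>:predict`. The sketch below only builds the request. It assumes the ingress gateway has been port-forwarded to `http://localhost:8080` and that the service host is `mnist-predictor.default.example.com`; both values are illustrative, and the real host is the one shown in the URL column of the command above.

```python
import json
import urllib.request


def build_predict_request(ingress_url, service_host, model_name, instances):
    """Build (but do not send) a KServe V1 predict request."""
    url = f"{ingress_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        # The Host header lets the ingress gateway route to the right InferenceService.
        headers={"Content-Type": "application/json", "Host": service_host},
    )


if __name__ == "__main__":
    request = build_predict_request(
        "http://localhost:8080",                # assumed port-forward to the ingress gateway
        "mnist-predictor.default.example.com",  # illustrative host; take it from the service URL
        "mnist-predictor",
        [[0.0] * 784],                          # one flattened 28x28 image; shape depends on the model
    )
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request).read()
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a live endpoint.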

7. Automating the Training Pipeline with Argo Workflows

Argo Workflows lets you schedule training jobs on EKS as container-based workflows.

🛠 Install Argo Workflows

kubectl create namespace argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/stable/manifests/install.yaml

🛠 Define the ML training workflow

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-model-
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: "123456789012.dkr.ecr.us-west-2.amazonaws.com/ml-training"

Run the workflow in the namespace where Argo Workflows was installed.

argo submit -n argo --watch train-model.yaml
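
When each training run should use a specific image build, the manifest can be generated rather than edited by hand. The following sketch renders the same single-step Workflow as JSON, which Kubernetes tooling accepts alongside YAML. The function name, the image tag `3f2a1c9`, and the `python train.py` command are assumptions for illustration; the article does not specify the training container's entrypoint.

```python
import json


def build_training_workflow(image, command, prefix="train-model-"):
    """Return an Argo Workflow manifest (as a dict) with one container step."""
    if not prefix.endswith("-"):
        prefix += "-"  # generateName conventionally ends with a dash before the random suffix
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": prefix},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {"name": "main", "container": {"image": image, "command": command}},
            ],
        },
    }


if __name__ == "__main__":
    workflow = build_training_workflow(
        "123456789012.dkr.ecr.us-west-2.amazonaws.com/ml-training:3f2a1c9",  # illustrative tag
        ["python", "train.py"],  # hypothetical entrypoint
    )
    print(json.dumps(workflow, indent=2))
```

The printed JSON can be piped to `kubectl create -n argo -f -`. `create` is used rather than `apply` because manifests that rely on `generateName` cannot be applied.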

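8. Monitoring (Prometheus + Grafana)

# Install the Prometheus Operator, Prometheus, and Grafana together with the kube-prometheus-stack Helm chart
# (the release name and namespace below are arbitrary)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace

Once Prometheus is configured to scrape the KServe inference services, it collects metrics such as request counts and latency, and Grafana visualizes them on dashboards.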
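
A typical panel for an inference service is 95th-percentile latency. The sketch below builds the PromQL expression for a histogram metric and the corresponding Prometheus HTTP API URL (`/api/v1/query`). The metric name `request_duration_seconds` is a placeholder, since the actual name depends on which component exports latency in your installation, and `http://localhost:9090` assumes a port-forward to the Prometheus service.

```python
from urllib.parse import urlencode


def p95_latency_query(histogram_metric, window="5m"):
    """PromQL for the 95th percentile of a Prometheus histogram over a time window."""
    return (
        f"histogram_quantile(0.95, "
        f"sum(rate({histogram_metric}_bucket[{window}])) by (le))"
    )


def instant_query_url(prometheus_url, promql):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # "request_duration_seconds" is a placeholder metric name, not one defined by the article.
    query = p95_latency_query("request_duration_seconds")
    print(query)
    print(instant_query_url("http://localhost:9090", query))
```

The same PromQL string can be pasted into a Grafana panel that uses Prometheus as its data source.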

9. Automating CI/CD (GitHub Actions + ArgoCD)

🛠 Build the Docker image and push it to ECR with GitHub Actions

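on: push
jobs:
  build:
    runs-on: ubuntu-latest
    env:
      ECR_URL: 123456789012.dkr.ecr.us-west-2.amazonaws.com/ml-model
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4 # assumes these repository secrets exist
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2
      - uses: aws-actions/amazon-ecr-login@v2 # docker login to the account's ECR registry
      - name: Build & Push
        run: |
          docker build -t "$ECR_URL:latest" .
          docker push "$ECR_URL:latest"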

🛠 Automate Kubernetes deployment with ArgoCD

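apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-model-deploy
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/mlops-manifests.git # placeholder: your manifest repository
    path: k8s # placeholder: directory that holds the manifests
    targetRevision: HEAD
  destination:
    namespace: default
    server: https://kubernetes.default.svc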

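10. Summary

✅ Create a Kubernetes cluster with EKS
✅ Deploy ML models scalably with KServe
✅ Automate training with Argo Workflows
✅ Monitor with Prometheus + Grafana
✅ Automate CI/CD with GitHub Actions + ArgoCD

With Amazon EKS, these components combine into an MLOps platform that can scale to large workloads.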
