I tried to run Gemma on Cloud Run using the sidecar pattern, but could not get it to work

Posted at 2026-06-30

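Overview

I tried putting Gemma on Cloud Run to play around with it
- At first I called the model endpoint directly, but the ID token issued by gcloud auth print-identity-token expires after about an hour, and calling the endpoint from a frontend ran into CORS errors, so that setup was not usable
So I decided to put a frontend in front of it with Nginx + Hono
Running multiple services just for that seemed like overkill, so I tried the sidecar pattern instead, and it did not work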

Conclusion

I could not get it to work, so splitting things into separate services is the way to go
I tried both sidecars and Docker Compose, and neither worked

What Cloud Run sidecars and Docker Compose are

These are the two approaches in question

What I did

Asked an AI to deploy Gemma on its own to Cloud Run → it worked
Asked an AI to rework it into the sidecar pattern → could not deploy
Asked an AI to rework it with Docker Compose → could not deploy

Deploying Gemma on its own to Cloud Run

The setup uploads the model files to GCS in advance, then downloads them when the Cloud Run instance starts and serves the model from there. I followed a reference article for this

gcloud beta run deploy "${SERVICE_NAME}" \
  --image="${VLLM_IMAGE}" \
  --set-secrets="HF_TOKEN=${HF_TOKEN_SECRET_NAME}:latest" \
  --set-env-vars="MODEL_ID=${MODEL_NAME},AIP_STORAGE_URI=gs://${MODEL_CACHE_BUCKET}/..." \
  --args="python3,-m,vllm.entrypoints.openai.api_server,--host,0.0.0.0,--port,8080,..." \
  --gpu=1 \
  --gpu-type=nvidia-rtx-pro-6000 \
  --cpu="${CLOUD_RUN_CPU_NUM}" \
  --memory="${CLOUD_RUN_MEMORY_GB}Gi" \
  --min-instances=0 \
  --max-instances=1

This worked without any trouble

Getting stuck after switching to sidecars

I wrote a service.yaml with three containers (Nginx, Hono, vLLM) and tried to deploy it

spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/launch-stage: BETA
        run.googleapis.com/cpu-throttling: 'false'
        # Startup order: vllm → hono → nginx
        run.googleapis.com/container-dependencies: '{"nginx":["hono"],"hono":["vllm"]}'
    spec:
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-rtx-pro-6000
      containers:
        - name: nginx
          image: ${NGINX_IMAGE}
          ports:
            - containerPort: 8080
          resources:
            limits:
              cpu: '20'
              memory: 80Gi

        - name: hono
          image: ${HONO_IMAGE}
          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: ${API_KEY_SECRET_NAME}
                  key: latest

        - name: vllm
          image: ${VLLM_IMAGE}
          resources:
            limits:
              cpu: '1'
              memory: 8Gi
              nvidia.com/gpu: '1'

The first error was this

ERROR: spec.template.spec.containers[2].resources.limits.cpu: Invalid value specified for container cpu.
Must be equal to one of 20.0, 22.0, 24.0, 26.0, 28.0, 30.0

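Apparently, on a GPU instance using nvidia-rtx-pro-6000, the CPU count has to be one of 20, 22, 24, 26, 28, or 30. Fair enough

Next trap (?)

As soon as I started adjusting the CPU allocation, I hit the next trap (?)
Cloud Run automatically assigns a default of 1000m (1 CPU) to every sidecar container that has no limits specified, and those defaults count toward the total. For example, if nginx gets 20 CPUs and the other containers are left unspecified, hono and vllm are each counted at 1 CPU, for 22 CPUs in total. Once the total across all containers goes over the 30000m cap (it reached 41000m in my case), the deploy fails with the error below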

ERROR: spec.template.spec.containers.resources.limits.cpu: Invalid total cpu 41000 across all containers.
Total millicpu may not exceed 30000.
Sidecar containers will get a default cpu limit of 1000 if not specified.
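
To see where a total like 41000 can come from, here is a minimal sketch of the arithmetic. It assumes only what the error message states: a container without a CPU limit is counted at 1000 millicpu, and the total may not exceed 30000. The function name and the example limits (nginx 20, vllm 20, hono unspecified) are assumptions for illustration, not necessarily the config that produced this error.

```typescript
// Illustrative only: both constants are taken from the error message above.
const DEFAULT_SIDECAR_MILLICPU = 1000;
const MAX_TOTAL_MILLICPU = 30000;

// Values are whole CPUs as written in service.yaml; undefined = no limit specified.
// Assumption: every unspecified container is counted at the sidecar default.
function totalMillicpu(cpuLimits: Record<string, number | undefined>): number {
  return Object.values(cpuLimits).reduce<number>(
    (sum, cpu) => sum + (cpu === undefined ? DEFAULT_SIDECAR_MILLICPU : cpu * 1000),
    0,
  );
}

// Hypothetical configuration: two containers at 20 CPUs, one left unspecified.
const total = totalMillicpu({ nginx: 20, vllm: 20, hono: undefined });
console.log(total, total <= MAX_TOTAL_MILLICPU ? "ok" : "exceeds cap"); // 41000 exceeds cap
```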

OK then, let's set limits explicitly on every container and bring the total to 20

This time the error was about memory

ERROR: spec.template.spec.containers[0].resources.limits.memory: Invalid value specified for container memory.
For 20.0 CPU, memory must be between 80Gi and 80Gi inclusive.

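On a 20-CPU GPU instance the memory is fixed at 80GiB, and the deploy is rejected unless the total across all containers matches that value exactly
The CPUs have to add up to 20, the memory has to add up to 80Gi, and the sidecar defaults have to be accounted for. Satisfying all of these at once turned out to be hard: I kept adjusting the config with the AI, but fixing one thing broke another, and the AI ended up going around in circles, so I gave up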
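
The rules that surfaced in the three error messages can be collected into a local pre-deploy check, so that each adjustment is tested before another failed deploy. This is a rough sketch that encodes only what the quoted messages say, read the way this article reads them. It is not Cloud Run's actual validation logic. The ContainerLimits shape and the checkLimits function are hypothetical, and the memory requirement is only known here for the 20-CPU case.

```typescript
// Rules transcribed from the error messages quoted in this article.
// NOT Cloud Run's real validation logic, only a local approximation.
const GPU_CPU_CHOICES = [20, 22, 24, 26, 28, 30];
const MAX_TOTAL_CPU = 30; // "Total millicpu may not exceed 30000"
const DEFAULT_SIDECAR_CPU = 1; // "default cpu limit of 1000 if not specified"
// Only the 20-CPU case appeared in the errors; other tiers are unknown here.
const REQUIRED_MEMORY_GI = new Map<number, number>([[20, 80]]);

interface ContainerLimits {
  name: string;
  cpu?: number; // whole CPUs; undefined = not specified
  memoryGi?: number; // undefined = not specified
  gpu?: boolean;
}

function checkLimits(containers: ContainerLimits[]): string[] {
  const problems: string[] = [];
  // Rule 1: the GPU container's CPU must be one of the listed values.
  for (const c of containers) {
    if (c.gpu && !GPU_CPU_CHOICES.includes(c.cpu ?? DEFAULT_SIDECAR_CPU)) {
      problems.push(`${c.name}: cpu must be one of ${GPU_CPU_CHOICES.join(", ")}`);
    }
  }
  // Rule 2: the total, including defaults for unspecified containers, has a cap.
  const totalCpu = containers.reduce((sum, c) => sum + (c.cpu ?? DEFAULT_SIDECAR_CPU), 0);
  if (totalCpu > MAX_TOTAL_CPU) {
    problems.push(`total cpu ${totalCpu} exceeds ${MAX_TOTAL_CPU}`);
  }
  // Rule 3 (the article's reading): total memory must match the CPU tier exactly.
  // Skipped when the tier is unknown or some container leaves memory unspecified.
  const required = REQUIRED_MEMORY_GI.get(totalCpu);
  const allMemorySet = containers.every((c) => c.memoryGi !== undefined);
  if (required !== undefined && allMemorySet) {
    const totalMemory = containers.reduce((sum, c) => sum + (c.memoryGi ?? 0), 0);
    if (totalMemory !== required) {
      problems.push(`total memory ${totalMemory}Gi must be exactly ${required}Gi for ${totalCpu} CPUs`);
    }
  }
  return problems;
}

// The limits from the first service.yaml attempt.
console.log(checkLimits([
  { name: "nginx", cpu: 20, memoryGi: 80 },
  { name: "hono" },
  { name: "vllm", cpu: 1, memoryGi: 8, gpu: true },
]));
```

For the first attempt this reports only the CPU value of the vllm container, which lines up with the first error. A config that passes this check is not guaranteed to deploy, since the real rules may include conditions that never showed up in these messages.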

Trying Docker Compose as well

Since I was stuck with sidecars, I switched to Docker Compose
But it produced similar errors, so I gave up on that too

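What I ended up doing

I split them into separate Cloud Run services

A Cloud Run service for Nginx + Hono (the proxy)
A Cloud Run service that runs Gemma
Added a proxy route, /ai/gemma/chat/completions, to the API server
The frontend sends an API key in a request header to the API server, and the API server uses google-auth-library to issue a service-to-service ID token for each request and forwards the request to Gemma
- The Gemma service is deployed with --no-allow-unauthenticated, and only the API server's service account is granted run.invoker. The frontend no longer needs to call the Gemma endpoint directly, which also takes care of the CORS problem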

import { GoogleAuth } from 'google-auth-library'

const auth = new GoogleAuth()

export async function forwardToGemma(body: GemmaChatCompletionRequest) {
  const gemmaUrl = process.env.GEMMA_INFERENCE_URL
  if (!gemmaUrl) throw new Error('GEMMA_INFERENCE_URL is not set')

  const targetUrl = `${gemmaUrl}/v1/chat/completions`

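  // Cloud Run service-to-service auth: fetch an ID token for the service account.
  // The audience is the Gemma service URL.
  const client = await auth.getIdTokenClient(gemmaUrl)
  const authHeaders = await client.getRequestHeaders(targetUrl)

  // Depending on the google-auth-library version, getRequestHeaders() may return
  // a plain object or a Headers instance; new Headers() accepts both.
  const headers = new Headers(authHeaders)
  headers.set('Content-Type', 'application/json')

  const response = await fetch(targetUrl, {
    method: 'POST',
    headers,
    body: JSON.stringify(body),
  })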

  if (!response.ok) {
    const text = await response.text()
    throw new Error(`Gemma API error: HTTP ${response.status} - ${text}`)
  }

  return response.json()
}
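
For context, the proxy route that calls forwardToGemma could look roughly like the following. It is a framework-agnostic sketch rather than the actual Hono handler: the x-api-key header name, the ProxyDeps shape, and the handleChatCompletion function are all assumptions for illustration. The forwarding function is injected so that the API key check can be exercised without a real Gemma service.

```typescript
// Hypothetical core of the /ai/gemma/chat/completions proxy route.
interface ProxyDeps {
  expectedApiKey: string; // e.g. the value of the API_KEY secret
  forward: (body: unknown) => Promise<unknown>; // e.g. forwardToGemma
}

interface ProxyResult {
  status: number;
  body: unknown;
}

async function handleChatCompletion(
  headers: Record<string, string | undefined>,
  body: unknown,
  deps: ProxyDeps,
): Promise<ProxyResult> {
  // Reject requests that do not carry the shared API key.
  // The header name is an assumption; the article does not name it.
  if (headers["x-api-key"] !== deps.expectedApiKey) {
    return { status: 401, body: { error: "invalid API key" } };
  }
  try {
    // The browser never sees the Gemma URL or the ID token.
    return { status: 200, body: await deps.forward(body) };
  } catch (err) {
    // Surface upstream failures as a gateway error.
    const message = err instanceof Error ? err.message : String(err);
    return { status: 502, body: { error: message } };
  }
}

// Usage with a stub in place of forwardToGemma.
handleChatCompletion(
  { "x-api-key": "secret" },
  { messages: [{ role: "user", content: "hi" }] },
  { expectedApiKey: "secret", forward: async (b) => ({ echoed: b }) },
).then((res) => console.log(res.status)); // 200
```

In a Hono route, the handler would read the header with c.req.header() and return the result with c.json(), passing forwardToGemma as the forward function.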

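Thoughts

It was a hassle
If only I had known from the start that it would not work. And I still really do not understand the root cause

Bonus

I tried a local LLM and it was fun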

$ brew install ollama
# Starts an inference server locally
$ ollama serve

# In another terminal
$ ollama
-> Launch Pi

 日本語で挨拶してね！


 The user requested a greeting in Japanese. I should respond with a
 natural and friendly Japanese greeting. I'll ensure the tone matches an
 expert assistant persona while being polite (using appropriate
 Keigo/formal Japanese).

 こんにちは！ 😊

 何かお手伝いできることはありますか？ プロジェクトのコードレビュー、機能の
 実装、デバッグなど、何でもお気軽にお申し付けくださいね。

Impressive
