Running a DeepSeek R1 Distilled Model on Cloud Run GPUs

Posted at 2025-01-31

Introduction

Cloud Run, Google Cloud's serverless service, now supports GPUs.
The feature is currently in preview.

With GPU support, local LLMs can now run in a serverless environment.
So I tried running DeepSeek-R1-Distill-Qwen-14B, a distilled model of DeepSeek R1, on Cloud Run.

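Method

Requesting GPU access on Cloud Run

From "Quotas & System Limits", request quota for Total Nvidia L4 GPU allocation, per project per region.
The following three regions appear to be available at the moment.

us-central1 (Iowa)
asia-southeast1 (Singapore)
europe-west4 (Netherlands)

The quota becomes usable within a few days of submitting the request.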

Building the Docker image

Inference is served by Ollama, an LLM inference server.
Set MODEL to deepseek-r1:14b.

Dockerfile

FROM ollama/ollama:latest

# Listen on all interfaces, port 8080
ENV OLLAMA_HOST=0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=deepseek-r1:14b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build the Docker image and push it to Artifact Registry.

See the following for reference.

Deploying to Cloud Run

When using a GPU on Cloud Run, the following resources must be set to at least these minimums.

CPU
- 4
Memory
- 16 GiB

Specify the image pushed to Artifact Registry and deploy it.
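
The article does not show how the deployment was performed, so the following is only a rough sketch of what an equivalent command line might look like. It assumes the `gcloud beta run deploy` command with the `--gpu`, `--gpu-type`, and `--no-cpu-throttling` flags; the service name, image path, and `--max-instances` value are hypothetical placeholders, not values from the original setup.

```python
import shlex


def build_deploy_command(service: str, image: str,
                         region: str = "us-central1") -> list[str]:
    """Assemble a `gcloud beta run deploy` command line for one L4 GPU."""
    return [
        "gcloud", "beta", "run", "deploy", service,
        "--image", image,
        "--region", region,
        "--gpu", "1",
        "--gpu-type", "nvidia-l4",
        "--cpu", "4",            # minimum CPU stated in the article
        "--memory", "16Gi",      # minimum memory stated in the article
        "--no-cpu-throttling",   # assumption: GPU services keep CPU always allocated
        "--max-instances", "1",  # assumption: stay within a small GPU quota
    ]


if __name__ == "__main__":
    # Hypothetical service and image names, for illustration only.
    cmd = build_deploy_command(
        "ollama-deepseek",
        "us-central1-docker.pkg.dev/my-project/my-repo/ollama-deepseek",
    )
    print(shlex.join(cmd))
```

Printing the command rather than executing it makes it easy to review the flags before running them by hand.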

Testing

Use the Cloud Run proxy to make the service reachable locally.

gcloud run services proxy {SERVICE_NAME} --port=9090

Send a request from another terminal.
A request to /api/generate on the Ollama server returns an answer.

curl http://localhost:9090/api/generate -d '{
  "model": "deepseek-r1:14b",
  "stream": false,
  "prompt":"日本語で回答してください。日本の首都は？"
}'
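
The same request can also be sent from Python. This is a minimal sketch using only the standard library, assuming the proxy from the previous step is listening on localhost:9090; the function names `build_generate_request` and `generate` and the 300-second timeout are illustrative choices, not part of the original setup.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "deepseek-r1:14b",
                           base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "stream": False, "prompt": prompt}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 300.0) -> dict:
    """Send the prompt and return the decoded JSON response."""
    req = build_generate_request(prompt)
    # A generous timeout leaves room for a cold start (assumed value).
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    result = generate("日本語で回答してください。日本の首都は？")
    print(result["response"])
```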

Response

"\u003cthink\u003e\nAlright, the user is asking about the capital of Japan in Japanese. 
I know that Tokyo is the correct answer.
\n\nI should respond in Japanese as well to match their request.
\n\nMaybe add a friendly emoji to keep it approachable.
\n\u003c/think\u003e\n\n
日本の首都は東京です！ (東京です！) 😊"
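
The response mixes the model's reasoning, wrapped in `<think>` tags (shown JSON-escaped as `\u003cthink\u003e` above), with the final answer. If only the answer is wanted, the two can be separated after JSON decoding. The helper below is a hypothetical sketch, assuming at most one `<think>` block per response.

```python
import re

# Matches the reasoning block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a DeepSeek R1 style response string."""
    match = THINK_RE.search(response)
    if match is None:
        # No reasoning block: treat the whole text as the answer.
        return "", response.strip()
    reasoning = match.group(1).strip()
    answer = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    text = "<think>\nTokyo is the capital.\n</think>\n\n日本の首都は東京です！"
    reasoning, answer = split_reasoning(text)
    print(answer)
```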

Full output

{
  "model": "deepseek-r1:14b",
  "created_at": {timestamp},
  "response": "\u003cthink\u003e\nAlright, the user is asking about the capital of Japan in Japanese. I know that Tokyo is the correct answer.\n\nI should respond in Japanese as well to match their request.\n\nMaybe add a friendly emoji to keep it approachable.\n\u003c/think\u003e\n\n日本の首都は東京です！ (東京です！) 😊",
  "done": true,
  "done_reason": "stop",
  "context": [
    151644, 101059, 102819, 16161, 102104, 134093, 1773, 131888, 106114, 15322,
    11319, 151645, 151648, 198, 71486, 11, 279, 1196, 374, 10161, 911, 279,
    6722, 315, 6323, 304, 10769, 13, 358, 1414, 429, 26194, 374, 279, 4396,
    4226, 382, 40, 1265, 5889, 304, 10769, 438, 1632, 311, 2432, 862, 1681, 382,
    21390, 912, 264, 11657, 42365, 311, 2506, 432, 5486, 480, 624, 151649, 271,
    131888, 106114, 15322, 102356, 46553, 37541, 6313, 320, 102356, 46553,
    37541, 6313, 8, 26525, 232
  ],
  "total_duration": 2802728103,
  "load_duration": 24308757,
  "prompt_eval_count": 13,
  "prompt_eval_duration": 10000000,
  "eval_count": 66,
  "eval_duration": 2767000000
}
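
The timing fields in this output give a rough throughput figure. Ollama reports durations in nanoseconds, so dividing `eval_count` by `eval_duration` converted to seconds yields generated tokens per second. A small sketch using the values above (the function name is illustrative):

```python
def tokens_per_second(stats: dict) -> float:
    """Generated tokens per second; eval_duration is in nanoseconds."""
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)


if __name__ == "__main__":
    # Values copied from the output above.
    stats = {
        "total_duration": 2802728103,
        "eval_count": 66,
        "eval_duration": 2767000000,
    }
    print(f"{tokens_per_second(stats):.1f} tokens/s")
    print(f"{stats['total_duration'] / 1e9:.2f} s total")
```

For this single request that works out to roughly 24 tokens per second over about 2.8 seconds in total, with the model already loaded.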

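The model sometimes answered in Chinese, but this time it replied in Japanese.

The Cloud Run metrics also confirmed that the GPU was actually being used.

Conclusion

I ran a local LLM with Ollama on Cloud Run GPUs.
Because GPU instances can now scale automatically, building AI inference applications should become easier.
Even as smaller local LLMs appear, latency remains a concern, so I suspect GPUs will still be hard to avoid.
With serverless GPUs available, I expect to use Cloud Run more often.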
