10分でできる！VLLMを使ったgoogle/gemma-3-27b-itのローカル環境構築

Last updated at 2025-03-18Posted at 2025-03-13

はじめに

VLLMは、大規模言語モデル（LLM）を高速かつ効率的に動作させるための軽量なサーバーです。本記事では、google/gemma-3-27b-itという日本語にサポートした高性能な言語モデルを、ローカルPCでVLLMを使って簡単に起動する方法を解説します。手順に従えば、わずか10分でモデルを動作させることが可能です。

環境準備

必要なツール

以下のツールを準備してください：

conda：Python仮想環境の管理に使用します。
pip：Pythonパッケージのインストールに使用します。
VLLM：LLMを高速に動作させるためのサーバーです。
flash-attn：モデルの推論速度を向上させるためのライブラリです。

仮想環境の作成

まず、Python 3.11の仮想環境を作成し、アクティベートします。

conda create -n vllm_main python=3.11 -y
conda activate vllm_main

VLLMと依存ライブラリのインストール

以下のコマンドで、VLLMとflash-attnとflashinferをインストールします。

git clone https://github.com/vllm-project/vllm.git; cd vllm
git checkout 46f98893d
git rev-parse HEAD
export VLLM_COMMIT=46f98893dd0c30365116563ab660c360b29c276b
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
pip install --editable .
pip install flash-attn --no-build-isolation
pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.5
pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

2025/3/18時点では、google/gemma-3-27b-itをサポートするVLLMはまだリリースされていないため、ソースコードからインストールする必要です。それに、最新のmain branchでインストールすると、Out of Memoryが発生して原因を究明できていないため、特定のCommitからインストールしました。

======旧======

以下のコマンドで、VLLMとflash-attnをインストールします。

git clone https://github.com/vllm-project/vllm.git; cd vllm
# VLLM_USE_PRECOMPILED=1 pip install --editable . vllm[audio]
VLLM_USE_PRECOMPILED=1 pip install --editable .
pip install flash-attn --no-build-isolation
pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

2025/3/13時点では、google/gemma-3-27b-itをサポートするVLLMはまだリリースされていないため、ソースコードからインストールする必要です。

=============

モデルのダウンロード

Hugging Face CLIのインストール

モデルをダウンロードするために、Hugging Face CLIをインストールします。

pip install "huggingface_hub[hf_transfer]"

google/gemma-3-27b-itのダウンロード

以下のコマンドで、google/gemma-3-27b-itをダウンロードします。

HF_HUB_ENABLE_HF_TRANSFER=1 \
huggingface-cli download google/gemma-3-27b-it

モデルの起動

起動コマンド

以下のコマンドでモデルを起動します。
（CUDA_VISIBLE_DEVICESで使用するGPUを指定し、--tensor-parallel-sizeでGPUの数を指定します。）

CUDA_VISIBLE_DEVICES=3,1,0,2 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
TRANSFORMERS_OFFLINE=1 \
HF_DATASETS_OFFLINE=1 \
vllm serve google/gemma-3-27b-it --trust-remote-code --served-model-name gpt-4o --gpu-memory-utilization 0.99 --tensor-parallel-size 4 --port 8000 --api-key sk-dummy --max-model-len 32768 --enable-chunked-prefill --limit-mm-per-prompt image=3

起動確認

起動に成功すると、以下のメッセージが表示されます。

INFO 03-13 08:17:14 [api_server.py:958] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-13 08:17:14 [launcher.py:26] Available routes are:
INFO 03-13 08:17:14 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 03-13 08:17:14 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 03-13 08:17:14 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-13 08:17:14 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 03-13 08:17:14 [launcher.py:34] Route: /health, Methods: GET
INFO 03-13 08:17:14 [launcher.py:34] Route: /ping, Methods: POST, GET
INFO 03-13 08:17:14 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 03-13 08:17:14 [launcher.py:34] Route: /version, Methods: GET
INFO 03-13 08:17:14 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /pooling, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /score, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /rerank, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 03-13 08:17:14 [launcher.py:34] Route: /invocations, Methods: POST
INFO:     Started server process [102854]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

モデルの検証

Chatbox AIというツールを使用して、検証してみてみました。

検証１

鯉のぼりを識別してくれました。

検証２

Prompt通りに正しく生成してくれました。

検証３

検証３で生成されたものを画像にして、識別してもらいます。100％正解でした。

検証４

テキストを抽出してもらいました。惜しいところですが、1文字が違いました。

（誤）これは、このリリースの焦点と、リリースされる情報を反映しています。
（正）これは、このリリースの焦点と、リリースされる情勢を反映しています。

注意事項

GPUメモリの設定：--gpu-memory-utilization 0.99はGPUメモリの利用率を設定します。環境に応じて調整してください。
テンソル並列処理：--tensor-parallel-size 4は使用するGPUの数に応じて変更します。
ポート番号：--port 8000はAPIのポート番号です。他のアプリケーションと競合する場合は変更してください。

参考リンク

この手順に従えば、ローカルPCでgoogle/gemma-3-27b-itを簡単に動作させることができます。ぜひお試しください！

そのた

以下は、最新のmain branchでインストールすると、Out of Memoryが発生して原因を究明できていないための操作でした。

git log --oneline

Output

5eeabc2a (HEAD -> main, origin/main, origin/HEAD) [Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights (#14950)
18551e82 [V1] TPU - Fix CI/CD runner (#14974)
e41e1602 [V1] Guard Against Main Thread Usage (#14972)
b89fb2a4 [CI/Build] Use `AutoModelForImageTextToText` to load VLMs in tests (#14945)
5340b0e2 [Bugfix] Fix interface for Olmo2 on V1 (#14976)
37e38061 (tag: v0.8.0rc2) [Bugfix] Make Gemma3 MM V0 only for now (#14971)
c0efdd65 [Fix][Structured Output] using vocab_size to construct matcher (#14868)
aaaec52a [Bugfix][Model] Mixtral: use unused head_dim config argument (#14961)
e1eb45d3 [Bugfix] Fix precommit - line too long in pixtral.py (#14960)
89fca671 [V1] Default MLA to V1 (#14921)
d20b0c13 Add patch merger (#14957)
166a168b [Doc] Fix misleading log during multi-modal profiling (#14955)
2bb0e1a7 [Bugfix][ROCm] running new process using spawn method for rocm in tests. (#14810)
6eaf1e5c [Misc] Add `--seed` option to offline multi-modal examples (#14934)
868a8c5b [Bugfix] Fix Ultravox on V1 (#14929)
b4ad56c1 [V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. (#14846)
69698f25 fix minor miscalled method (#14327)
cd0cd851 [MISC] More AMD unused var clean up (#14926)
0a74bfce setup.py: drop assumption about local `main` branch (#14692)
dd3b8658 [Doc] Add vLLM Beijing meetup slide (#14938)
9b87a579 [Misc][XPU] Use None as device capacity for XPU (#14932)
b539222d [V1] Remove input cache client (#14864)
8d6cf895 (tag: v0.8.0rc1) [V1] [Spec Decode] Support random sampling for spec decode (#13933)
583a9778 [Benchmark] Do not save detailed info to json by default (#14879)
a73e183e [Misc] Replace os environ to monkeypatch in test suite (#14516)
1e799b7e [BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda context (#14910)
7f6c5ee0 [V1][Minor] Add __repr__ to ConstantList (#14907)
faa02757 [V1] Optimize the overhead of rewinding (#14905)
8a5a9b70 [CI/Build] Update defaults for test reproducibility (#14893)
bb3aeddf [CI] Nightly Tests (#14898)
aecc780d [V1] Enable Entrypoints Tests (#14903)
90df7f23 [Doc] Add guidance for using `ccache` with `pip install -e .` in doc (#14901)
b9b5bdfc [Misc] Catching Ray Compiled Graph PP test failures for V1 (#14847)
31060b27 [V1][BugFix] Detect interleaved sliding window attention (#14896)
fc1f6771 [BugFix][V1] Fix overhead related to bad_words sampling when not in use (#14894)
f6137adb Revert "[Bugfix] Limit profiling run sequence length by max_model_len (#14785) (#14892)
e53b1350 [Bugfix] Explicitly disable Phi-4-multimodal in V1 (#14889)
d30aa7e9 [Bugfix] Limit profiling run sequence length by max_model_len (#14785)
d1ad2a57 [V1] [Spec Decode] Fix ngram tests (#14878)
b82662d9 [BugFix] Fix torch distributed stateless PG backend init (#14870)
71c1e071 [Kernel] Add more tuned configs (#14877)
b30c75dd [V1] Remove V0 fallback for mistral-tokenizer (#14873)
def232e1 [VLM] Clean up Phi-4-MM ViT implementation (#14812)
3453b964 [Misc][Doc] Minor benchmark README update (#14874)
61c6a5a7 [VLM] Merged multi-modal processor for Pixtral (#12211)
74bc397b [Core] Expose API endpoint `/is_sleeping` (#14312)
f58aea00 [CI][Intel GPU] refine intel GPU ci docker build (#14860)
3556a414 [VLM] Limit multimodal input cache by memory (#14805)
9ed6ee92 [Bugfix] EAGLE output norm bug (#14464)
ee3778d5 [Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes (#14839)
aaacf173 [Doc] V1 user guide (#13991)
4c7629ca [V1][Structured Output] calculate vocab_size eagerly (#14851)
e0fdfa16 [CI/Build] Delete LoRA bias test (#14849)
5952d8ab [Attention] Get rid of mla cache alignment (#14842)
a2ae4965 [CPU] Support FP8 KV cache (#14741)
877e3522 [Docs] Add new East Coast vLLM Meetup slides to README and meetups.md (#14852)
d4d93db2 [V1] V1 Enablement Oracle  (#13726)
8c0d15d5 [Misc][Easy] Annotate unused vars in the csrc files (#14798)
97ac781c [Misc] Remove misleading message in gemma2 and gemma3 (#14850)
776dcec8 Disable outlines cache by default (#14837)
ccf02fcb Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U… (#14848)
acaea3bb [Bugfix][V1] Fix flashinfer sampling (#14815)
9f374227 [Neuron][CI] update docker run command (#14829)
dd344e03 [Bugfix] Fix torch_xla in V0 which can't handle None seed introduced … (#14844)
54a88044 [Doc] More neutral K8s deployment guide (#14084)
bbd94a19 [Build/CI] Upgrade aiohttp to incldue CVE fix (#14840)
233ffce1 [Build/CI] Move ninja to common deps (#14835)
40677783 [CI] Add TPU v1 test (#14834)
14f301b5 Update to torch==2.6.0 (#12721)
46f98893 [V1] Fix model parameterization for structured output tests (#14833)
...
略
...
c0c25e25 [Model] Add support for Gemma 3 (#14660)
...
略
...
ed6e9075 (tag: v0.7.3) [Bugfix] Fix deepseekv3 grouped topk error (#13474)

まず、gemma-3-27b-itのサポートはed6e9075 (tag: v0.7.3) 以降なので、まずは、ed6e9075 (tag: v0.7.3) のGItログを出力してみました。

次、本記事を作成した時点（2025/3/13）、c0c25e25 [Model] Add support for Gemma 3 (#14660)で試したので、問題はなかったです。

次、14f301b5 Update to torch==2.6.0 (#12721)はtorch==2.6.0への変更で、結構大きいかと思って、その一個前の46f98893 [V1] Fix model parameterization for structured output tests (#14833)をcheckoutして、構築しました。

以下のcommitsの中で、どれがOut of Memoryを引き起こしたのか、まだ要確認です。

5eeabc2a (HEAD -> main, origin/main, origin/HEAD) [Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights (#14950)
18551e82 [V1] TPU - Fix CI/CD runner (#14974)
e41e1602 [V1] Guard Against Main Thread Usage (#14972)
b89fb2a4 [CI/Build] Use `AutoModelForImageTextToText` to load VLMs in tests (#14945)
5340b0e2 [Bugfix] Fix interface for Olmo2 on V1 (#14976)
37e38061 (tag: v0.8.0rc2) [Bugfix] Make Gemma3 MM V0 only for now (#14971)
c0efdd65 [Fix][Structured Output] using vocab_size to construct matcher (#14868)
aaaec52a [Bugfix][Model] Mixtral: use unused head_dim config argument (#14961)
e1eb45d3 [Bugfix] Fix precommit - line too long in pixtral.py (#14960)
89fca671 [V1] Default MLA to V1 (#14921)
d20b0c13 Add patch merger (#14957)
166a168b [Doc] Fix misleading log during multi-modal profiling (#14955)
2bb0e1a7 [Bugfix][ROCm] running new process using spawn method for rocm in tests. (#14810)
6eaf1e5c [Misc] Add `--seed` option to offline multi-modal examples (#14934)
868a8c5b [Bugfix] Fix Ultravox on V1 (#14929)
b4ad56c1 [V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. (#14846)
69698f25 fix minor miscalled method (#14327)
cd0cd851 [MISC] More AMD unused var clean up (#14926)
0a74bfce setup.py: drop assumption about local `main` branch (#14692)
dd3b8658 [Doc] Add vLLM Beijing meetup slide (#14938)
9b87a579 [Misc][XPU] Use None as device capacity for XPU (#14932)
b539222d [V1] Remove input cache client (#14864)
8d6cf895 (tag: v0.8.0rc1) [V1] [Spec Decode] Support random sampling for spec decode (#13933)
583a9778 [Benchmark] Do not save detailed info to json by default (#14879)
a73e183e [Misc] Replace os environ to monkeypatch in test suite (#14516)
1e799b7e [BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda context (#14910)
7f6c5ee0 [V1][Minor] Add __repr__ to ConstantList (#14907)
faa02757 [V1] Optimize the overhead of rewinding (#14905)
8a5a9b70 [CI/Build] Update defaults for test reproducibility (#14893)
bb3aeddf [CI] Nightly Tests (#14898)
aecc780d [V1] Enable Entrypoints Tests (#14903)
90df7f23 [Doc] Add guidance for using `ccache` with `pip install -e .` in doc (#14901)
b9b5bdfc [Misc] Catching Ray Compiled Graph PP test failures for V1 (#14847)
31060b27 [V1][BugFix] Detect interleaved sliding window attention (#14896)
fc1f6771 [BugFix][V1] Fix overhead related to bad_words sampling when not in use (#14894)
f6137adb Revert "[Bugfix] Limit profiling run sequence length by max_model_len (#14785) (#14892)
e53b1350 [Bugfix] Explicitly disable Phi-4-multimodal in V1 (#14889)
d30aa7e9 [Bugfix] Limit profiling run sequence length by max_model_len (#14785)
d1ad2a57 [V1] [Spec Decode] Fix ngram tests (#14878)
b82662d9 [BugFix] Fix torch distributed stateless PG backend init (#14870)
71c1e071 [Kernel] Add more tuned configs (#14877)
b30c75dd [V1] Remove V0 fallback for mistral-tokenizer (#14873)
def232e1 [VLM] Clean up Phi-4-MM ViT implementation (#14812)
3453b964 [Misc][Doc] Minor benchmark README update (#14874)
61c6a5a7 [VLM] Merged multi-modal processor for Pixtral (#12211)
74bc397b [Core] Expose API endpoint `/is_sleeping` (#14312)
f58aea00 [CI][Intel GPU] refine intel GPU ci docker build (#14860)
3556a414 [VLM] Limit multimodal input cache by memory (#14805)
9ed6ee92 [Bugfix] EAGLE output norm bug (#14464)
ee3778d5 [Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes (#14839)
aaacf173 [Doc] V1 user guide (#13991)
4c7629ca [V1][Structured Output] calculate vocab_size eagerly (#14851)
e0fdfa16 [CI/Build] Delete LoRA bias test (#14849)
5952d8ab [Attention] Get rid of mla cache alignment (#14842)
a2ae4965 [CPU] Support FP8 KV cache (#14741)
877e3522 [Docs] Add new East Coast vLLM Meetup slides to README and meetups.md (#14852)
d4d93db2 [V1] V1 Enablement Oracle  (#13726)
8c0d15d5 [Misc][Easy] Annotate unused vars in the csrc files (#14798)
97ac781c [Misc] Remove misleading message in gemma2 and gemma3 (#14850)
776dcec8 Disable outlines cache by default (#14837)
ccf02fcb Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U… (#14848)
acaea3bb [Bugfix][V1] Fix flashinfer sampling (#14815)
9f374227 [Neuron][CI] update docker run command (#14829)
dd344e03 [Bugfix] Fix torch_xla in V0 which can't handle None seed introduced … (#14844)
54a88044 [Doc] More neutral K8s deployment guide (#14084)
bbd94a19 [Build/CI] Upgrade aiohttp to incldue CVE fix (#14840)
233ffce1 [Build/CI] Move ninja to common deps (#14835)
40677783 [CI] Add TPU v1 test (#14834)
14f301b5 Update to torch==2.6.0 (#12721)

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up