How to run DeepSeek-R1-Distill-Qwen-32B with sglang

Posted at 2025-02-25

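Introduction

This article walks through running the large language model DeepSeek-R1-Distill-Qwen-32B with the sglang framework. It covers everything from setting up the environment on a GPU machine to launching the server, with the actual commands used.

Environment setup

1. Create a Conda environment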

conda create -n sglang python=3.11 -y
conda activate sglang

2. Install dependencies

pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]>=0.4.3.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
pip install transformers==4.48.3
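
Because the transformers version is pinned, it can be worth confirming what actually ended up installed. The following is a minimal sketch, not part of the original setup: `installed_version` and `check_pin` are hypothetical helper names, and the sketch assumes the usual `Version:` line in `pip show` output.

```shell
# Read `pip show <package>` output on stdin and print its Version field.
installed_version() {
  awk -F': ' '/^Version:/ { print $2; exit }'
}

# check_pin <expected> <actual>: report whether the installed version matches the pin.
check_pin() {
  expected="$1"
  actual="$2"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $actual"
  else
    echo "MISMATCH: expected $expected, found ${actual:-nothing}"
    return 1
  fi
}

# Usage inside the activated environment:
#   check_pin 4.48.3 "$(pip show transformers | installed_version)"
```

If the check reports a mismatch, rerunning the pinned `pip install transformers==4.48.3` line should bring the environment back in line.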

Server launch script

Launch command

eval "$(conda shell.bash hook)"
conda activate sglang

CUDA_VISIBLE_DEVICES=3,1,0,2 \
TRANSFORMERS_OFFLINE=1 \
HF_DATASETS_OFFLINE=1 \
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --trust-remote-code \
  --served-model-name gpt-4 \
  --tensor-parallel-size 4 \
  --mem-fraction-static 0.9 \
  --api-key sk-xxxxxx \
  --host 0.0.0.0 \
  --port 8000
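
The command above hard-codes both the GPU list and the tensor-parallel size, so the two can drift apart when the GPU list is edited. One way to keep them in sync is to derive the size from the list. This is a minimal sketch under that assumption; `count_visible_gpus` and `TP_SIZE` are hypothetical names introduced here for illustration.

```shell
# Count the GPU IDs in a comma-separated CUDA_VISIBLE_DEVICES-style list.
# Empty entries are ignored; an empty list yields 0.
count_visible_gpus() {
  printf '%s' "$1" | awk -F',' '
    { n = 0; for (i = 1; i <= NF; i++) if ($i != "") n++; print n }
    END { if (NR == 0) print 0 }'
}

# Fall back to the GPU list used in this article if the variable is unset.
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-3,1,0,2}"
TP_SIZE="$(count_visible_gpus "$CUDA_VISIBLE_DEVICES")"
echo "tensor-parallel-size: ${TP_SIZE}"

# The launch command can then take the derived value:
#   python -m sglang.launch_server ... --tensor-parallel-size "$TP_SIZE"
```

This only counts the listed IDs; it does not check that those GPUs exist or that the model supports that degree of parallelism.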

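Key parameters

Parameter	Description
`--tensor-parallel-size 4`	Tensor parallelism across 4 GPUs
`--mem-fraction-static 0.9`	Pre-allocates 90% of GPU memory (model weights and KV cache pool)
`--trust-remote-code`	Allows custom model code shipped with the model to be executed
`--api-key sk-xxxxxx`	API key for simple authentication (must be changed in production)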

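Notes

GPU configuration
Change the GPU IDs in CUDA_VISIBLE_DEVICES to match your environment, and list at least as many IDs as --tensor-parallel-size
Security
--host 0.0.0.0 makes the server listen on all network interfaces, so it is reachable from outside the machine (check your firewall settings)
Model size
VRAM requirement: about 94 GB (total when distributed across 4 GPUs)
Transformers version
Stick to the specified version (4.48.3) to avoid compatibility problems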
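
To put the VRAM figure in context, here is a rough back-of-the-envelope calculation. It assumes about 32 billion parameters (taken from the model name) stored at 2 bytes each (BF16 or FP16), and an even split across GPUs; none of these assumptions is stated in the measurement above, and the helper names are hypothetical.

```shell
# weights_gb <params-in-billions> <bytes-per-param>
# Rough weight footprint in GB (1 GB = 10^9 bytes).
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b }'
}

# per_gpu_gb <total-gb> <num-gpus>
# Share per GPU, assuming the total is split evenly.
per_gpu_gb() {
  awk -v t="$1" -v n="$2" 'BEGIN { printf "%.1f\n", t / n }'
}

weights_gb 32 2    # weights alone: 64.0
per_gpu_gb 94 4    # the 94 GB total over 4 GPUs: 23.5
```

Under these assumptions the weights account for roughly 64 GB of the 94 GB, and each GPU carries about 23.5 GB. Actual usage can be checked on the machine with `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv`.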

Verifying the server

Once the server is running, you can reach the API endpoint as follows:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxxxx" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Name three famous tourist spots in Japan"}]
  }'
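
DeepSeek-R1 distilled models typically write out their reasoning before the final answer and close it with a `</think>` tag, so the returned text can be long. The following is a minimal sketch for keeping only the final answer. It assumes the reasoning arrives inline in the message content, which depends on the server options and chat template in use; `strip_think` is a hypothetical helper name.

```shell
# Read model output on stdin and print only the text after the first
# closing </think> tag. If no tag is present, print the input unchanged
# (apart from leading whitespace, which is trimmed).
strip_think() {
  awk '
    { buf = buf $0 "\n" }
    END {
      idx = index(buf, "</think>")
      if (idx > 0) buf = substr(buf, idx + length("</think>"))
      sub(/^[ \t\n]+/, "", buf)
      printf "%s", buf
    }'
}

# Usage, assuming jq is installed to pull the content field out of the JSON:
#   curl -s http://localhost:8000/v1/chat/completions ... \
#     | jq -r '.choices[0].message.content' | strip_think
```

Matching on the closing tag alone means the sketch still works when the opening `<think>` tag is not part of the returned text.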

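Summary

sglang makes it possible to deploy large language models efficiently. Use the steps in this article as a reference and try building your own inference environment for DeepSeek-R1-Distill-Qwen-32B.

(Note) The settings in this article assume a development environment. For production, review the security settings.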
