Done in 10 Minutes! Setting Up microsoft/Phi-4-multimodal-instruct Locally with vLLM


Introduction

vLLM is an inference and serving engine for running large language models (LLMs) quickly and efficiently. This article explains how to launch microsoft/Phi-4-multimodal-instruct, a high-performance multimodal model from Microsoft that accepts text, image, and audio input, on a local PC using vLLM. If you follow the steps, you can have the model running in about 10 minutes.


Table of Contents

  1. Environment Setup
  2. Downloading the Model
  3. Launching the Model
  4. Verifying the Model
  5. Notes
  6. Reference Links

Environment Setup

Required Tools

Prepare the following tools (a quick check that they are installed follows this list):

  • conda: used to manage Python virtual environments.
  • pip: used to install Python packages.
  • vLLM: the serving engine that runs the LLM at high speed.
  • flash-attn: a library that speeds up model inference.
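
As a minimal sanity check (assuming an NVIDIA GPU with the CUDA driver installed), you can confirm the tools are available before proceeding:

conda --version
pip --version
nvidia-smi    # confirms the GPU and driver are visible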

Creating the Virtual Environment

First, create and activate a Python 3.11 virtual environment.

conda create -n vllm_main python=3.11 -y
conda activate vllm_main
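
To confirm the environment is active and using the expected interpreter:

python --version    # should print Python 3.11.x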

Installing vLLM and Dependencies

Install vLLM and flash-attn with the following commands.

git clone https://github.com/vllm-project/vllm.git; cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable . vllm[audio]
pip install flash-attn --no-build-isolation
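
After the installation finishes, you can confirm that vLLM is importable (the exact version string depends on the commit you built from):

python -c "import vllm; print(vllm.__version__)"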

As of 2025/3/10, no released version of vLLM supports Phi-4-multimodal-instruct yet, so it has to be installed from source.


Downloading the Model

Installing the Hugging Face CLI

Install the Hugging Face CLI, which is used to download the model.

pip install "huggingface_hub[hf_transfer]"

Downloading Phi-4-multimodal-instruct

Download microsoft/Phi-4-multimodal-instruct with the following command.

HF_HUB_ENABLE_HF_TRANSFER=1 \
huggingface-cli download microsoft/Phi-4-multimodal-instruct
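
The model is stored in the Hugging Face cache (by default ~/.cache/huggingface/hub, or wherever HF_HOME points; the launch command below assumes it lives under /root/HuggingFaceCache). To locate the downloaded snapshot on disk, you can list the cache contents:

huggingface-cli scan-cache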

Launching the Model

Launch Command

Launch the model with the following command.
(CUDA_VISIBLE_DEVICES selects which GPUs to use, and --tensor-parallel-size specifies how many GPUs to use.)

CUDA_VISIBLE_DEVICES=3,1,0,2 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
TRANSFORMERS_OFFLINE=1 \
HF_DATASETS_OFFLINE=1 \
python -m vllm.entrypoints.openai.api_server \
  --model /root/HuggingFaceCache/models--microsoft--Phi-4-multimodal-instruct \
  --dtype auto \
  --trust-remote-code \
  --served-model-name gpt-4 \
  --gpu-memory-utilization 0.98 \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --max-seq-len-to-capture=131072 \
  --enable-lora \
  --max-lora-rank 320 \
  --lora-extra-vocab-size 0 \
  --limit-mm-per-prompt audio=3,image=3 \
  --max-loras 2 \
  --lora-modules speech=/root/HuggingFaceCache/models--microsoft--Phi-4-multimodal-instruct/speech-lora vision=/root/HuggingFaceCache/models--microsoft--Phi-4-multimodal-instruct/vision-lora \
  --port 8000 \
  --api-key sk-dummy \
  --disable-sliding-window
  • --model: adjust this path to match your environment.
  • speech=: adjust this path to match your environment.
  • vision=: adjust this path to match your environment.

Checking the Startup

If the server starts successfully, messages like the following are displayed.

INFO 03-10 12:06:49 [api_server.py:958] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-10 12:06:49 [launcher.py:26] Available routes are:
INFO 03-10 12:06:49 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
INFO 03-10 12:06:49 [launcher.py:34] Route: /docs, Methods: GET, HEAD
INFO 03-10 12:06:49 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 03-10 12:06:49 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
INFO 03-10 12:06:49 [launcher.py:34] Route: /health, Methods: GET
INFO 03-10 12:06:49 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 03-10 12:06:49 [launcher.py:34] Route: /version, Methods: GET
INFO 03-10 12:06:49 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /pooling, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /score, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /rerank, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 03-10 12:06:49 [launcher.py:34] Route: /invocations, Methods: POST
INFO:     Started server process [79418]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
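
You can also confirm the server is reachable from another terminal. A minimal check against the OpenAI-compatible endpoint, using the dummy API key passed at launch (replace localhost with the server's address if you are connecting remotely):

curl http://localhost:8000/v1/models -H "Authorization: Bearer sk-dummy"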

Verifying the Model

Sending API Requests

The following Python program checks that the server responds correctly.

import base64

import requests
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "sk-dummy"
openai_api_base = "http://xxx.xxx.xxx.xxx:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def encode_base64_content_from_url(content_url: str) -> str:
    """Encode a content retrieved from a remote url to base64 format."""

    with requests.get(content_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode('utf-8')

    return result


# Text-only inference
def run_text_only() -> None:
    chat_completion = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": "What's the capital of France?"
        }],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion.choices[0].message.content
    print("Chat completion output:", result)


# Single-image input inference
def run_single_image() -> None:
    ## Use image url in the payload
    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    chat_completion_from_url = client.chat.completions.create(
        messages=[{
            "role":
                "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url
                    },
                },
            ],
        }],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_url.choices[0].message.content
    print("Chat completion output from image url:", result)

    ## Use base64 encoded image in the payload
    image_base64 = encode_base64_content_from_url(image_url)
    chat_completion_from_base64 = client.chat.completions.create(
        messages=[{
            "role":
                "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_base64}"
                    },
                },
            ],
        }],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_base64.choices[0].message.content
    print("Chat completion output from base64 encoded image:", result)


# Multi-image input inference
def run_multi_image() -> None:
    image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
    image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
    chat_completion_from_url = client.chat.completions.create(
        messages=[{
            "role":
                "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are the animals in these images?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url_duck
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url_lion
                    },
                },
            ],
        }],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_url.choices[0].message.content
    print("Chat completion output:", result)


# Audio input inference
def run_audio() -> None:
    # from vllm.assets.audio import AudioAsset
    #
    # audio_url = AudioAsset("winning_call").url
    # audio_base64 = encode_base64_content_from_url(audio_url)
    # Load local audio file
    with open("/tmp/abc.wav", "rb") as audio_file:
        audio_content = audio_file.read()

    # Encode the audio content to base64
    audio_base64 = base64.b64encode(audio_content).decode('utf-8')

    # OpenAI-compatible schema (`input_audio`)
    chat_completion_from_base64 = client.chat.completions.create(
        messages=[{
            "role":
                "user",
            "content": [
                {
                    "type": "text",
                    "text": "Transcribe the audio clip into text."
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        # Any format supported by librosa is supported
                        "data": audio_base64,
                        "format": "wav"
                    },
                },
            ],
        }],
        model=model,
        max_completion_tokens=2000,
    )

    result = chat_completion_from_base64.choices[0].message.content
    print("Chat completion output from input audio:", result)

    # base64 URL
    chat_completion_from_base64 = client.chat.completions.create(
        messages=[{
            "role":
                "user",
            "content": [
                {
                    "type": "text",
                    "text": "Transcribe the audio clip into text."
                },
                {
                    "type": "audio_url",
                    "audio_url": {
                        # Any format supported by librosa is supported
                        "url": f"data:audio/ogg;base64,{audio_base64}"
                    },
                },
            ],
        }],
        model=model,
        max_completion_tokens=2000,
    )

    result = chat_completion_from_base64.choices[0].message.content
    print("Chat completion output from base64 encoded audio:", result)


example_function_map = {
    "text-only": run_text_only,
    "single-image": run_single_image,
    "multi-image": run_multi_image,
    "audio": run_audio,
}


def main(chat_type) -> None:
    example_function_map[chat_type]()


if __name__ == "__main__":
    main('text-only')
    main('single-image')
    main('multi-image')
    main('audio')
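
Save the script under any name you like (check_phi4.py below is purely an example), point openai_api_base at your server, and run it. If the openai client library is not already present in the environment, install it first.

pip install openai requests
python check_phi4.py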

Sample Output

Responses like the following are returned.

Chat completion output: The capital of France is Paris. Paris is not only the largest city of France but also one of the most iconic and well-known cities globally, famous for its art, fashion, gastronomy, and culture. It is often referred to as "The City of Light" (Latin: "Lutetia") for being
Chat completion output from image url: The image depicts a scene in a vast, open grassland. A weathered wooden path, or boardwalk, stretches down the center of the scene, inviting the viewer to imagine walking through this serene and expansive setting. The lush, tall grass on either side gives the scene a dreamlike quality, as if the viewer
Chat completion output from base64 encoded image: This image is a beautifully captured photograph of a picturesque scene set in a vast, open landscape. It showcases a wooden boardwalk (also known as a path or causeway) that runs straight across the middle of an expansive grassy field. The grassy area is tall and lush with tall, green grass and grasses that reach high
Chat completion output: The animals in the first image, "a 160x160 square of green field with a dead sea turtle on a flat rocky beach," appear to be a sea turtle. Sea turtles are marine reptiles known to nest on beaches to lay their eggs, and sometimes, particularly young or injured individuals, may end up washed up
Chat completion output from input audio: Sure, here is the transcribed text from the audio clip:

"Do you think I should quit my job? If so then my income will drop to zero. I used to think I could do anything on my own, but now I find myself in need of help from my coworkers. Wow, couldn't believe it, thought they were just joking. Besides, it's not like it's the only thing I need help with."

---

"Do you think I should quit my job? If so then my income will drop to zero. I used to think I could do anything on my own, but now I find myself in need of help from my coworkers. Wow, couldn't believe it, thought they were just joking. Besides, it's not like it's the only thing I need help with."
Chat completion output from base64 encoded audio: So, I quit my job recently, right? What happened then is my income has dropped to zero. You know, even though I thought I could do anything with myself until now. Now, my friend, who is like a close colleague, helped me out. I'm very shocked and haven't been that shocked before with someone like that. Another thing, it's just that thing, nothing else affects me so much. Should I quit my job?

Notes

  • GPU memory: --gpu-memory-utilization 0.98 sets the fraction of GPU memory to use. Adjust it for your environment (a scaled-down launch example follows this list).
  • Tensor parallelism: --tensor-parallel-size 4 should match the number of GPUs you actually use.
  • Port number: --port 8000 is the API port. Change it if it conflicts with another application.
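
As an illustration only (the context length and memory utilization below are assumptions, not recommendations), a single-GPU launch might look like this:

CUDA_VISIBLE_DEVICES=0 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
TRANSFORMERS_OFFLINE=1 \
HF_DATASETS_OFFLINE=1 \
python -m vllm.entrypoints.openai.api_server \
  --model /root/HuggingFaceCache/models--microsoft--Phi-4-multimodal-instruct \
  --dtype auto \
  --trust-remote-code \
  --served-model-name gpt-4 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --enable-lora \
  --max-lora-rank 320 \
  --lora-extra-vocab-size 0 \
  --limit-mm-per-prompt audio=3,image=3 \
  --max-loras 2 \
  --lora-modules speech=/root/HuggingFaceCache/models--microsoft--Phi-4-multimodal-instruct/speech-lora vision=/root/HuggingFaceCache/models--microsoft--Phi-4-multimodal-instruct/vision-lora \
  --port 8000 \
  --api-key sk-dummy \
  --disable-sliding-window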

Reference Links


If you follow these steps, you can easily run microsoft/Phi-4-multimodal-instruct on your local PC. Give it a try!

