Visualizing and Verifying an LLM with Instana + vLLM + WatsonX AI Model

Last updated at 2025-02-25 / Posted at 2025-02-25

Introduction

WatsonX AI is IBM's platform for working with AI models, and it offers LLMs (large language models) such as ibm-granite/granite-3.0-2b-instruct. This article walks through combining Instana, vLLM, and a WatsonX AI model to visualize and analyze traces.

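What is vLLM?

vLLM is an inference engine optimized for serving LLMs efficiently. It is particularly strong at fast token generation and low latency, which makes it a practical serving layer for models such as those offered through WatsonX AI.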

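vLLM features

Continuous Batching: dynamically batches multiple requests to maximize throughput
Memory-efficient KV Cache Management: efficient cache management that supports large-scale inference workloads
Flexible API server: inference is exposed through an API, which makes integration with a wide range of applications straightforward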
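To make the throughput claim for Continuous Batching concrete, here is a toy simulation. It is only a sketch of the scheduling idea, not vLLM's actual scheduler: each request is reduced to a number of decode steps, and the function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each batch must finish before the next one starts."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        # A static batch runs as long as its longest request.
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Steps needed when a freed slot is refilled immediately from the queue."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request produces one token; finished ones leave the batch.
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


# Hypothetical decode lengths (in tokens) for four requests, two slots.
lengths = [2, 8, 3, 1]
print("static:", static_batching_steps(lengths, batch_size=2))          # static: 11
print("continuous:", continuous_batching_steps(lengths, batch_size=2))  # continuous: 8
```

In this toy case the short requests no longer wait for the 8-token request to finish, so the same work completes in 8 steps instead of 11.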

Benefits of monitoring and tracing vLLM with Instana

With Instana, you can monitor the vLLM inference process in detail and use what you observe to tune performance.

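1. Visualizing vLLM bottlenecks

Instana monitoring collects performance metrics such as LLM inference latency and GPU/CPU utilization in real time. This makes it possible to identify bottlenecks such as the following.

Analysis of what causes inference latency to increase
Detection of excessive resource consumption and memory shortages
Understanding scaling problems when the number of requests grows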
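When analyzing latency, averages tend to hide the slow requests that users actually notice, so percentiles are usually more informative. The following is a minimal sketch using the nearest-rank method; the latency samples are made up for illustration and are not measurements from this verification.

```python
import math


def nearest_rank_percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical inference latencies in milliseconds.
latencies_ms = [120, 135, 150, 180, 200, 220, 260, 300, 900, 2400]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = nearest_rank_percentile(latencies_ms, 50)
p95_ms = nearest_rank_percentile(latencies_ms, 95)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p95={p95_ms} ms")
# mean=486.5 ms, p50=200 ms, p95=2400 ms
```

Here the median request takes 200 ms, while the mean is pulled up to 486.5 ms by two slow outliers that only the p95 value exposes directly.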

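2. End-to-end tracing with OpenTelemetry

Instana supports OpenTelemetry, so trace information from the client (API request) through vLLM (inference) can be managed in one place. This lets you:

Visualize the processing time of each API request and identify bottlenecks
Correlate client and server traces to find the cause of slow LLM responses
Troubleshoot more efficiently when an anomaly occurs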
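Client and server spans end up in the same trace because the client passes its trace context in a W3C `traceparent` HTTP header, which is what `TraceContextTextMapPropagator` writes in the client code in section 5. The sketch below builds and parses that header by hand using only the standard library. It is a simplified illustration (version `00` only, no `tracestate`, no rejection of all-zero IDs), and the helper names are hypothetical.

```python
import re

# traceparent layout: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def build_traceparent(trace_id, span_id, sampled=True):
    """Format a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Split a traceparent header value into its fields."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# IDs taken from the client span output shown in section 5.2.
header = build_traceparent("fed1bcae5dee00da58c0383a2f633b3e", "7b7c4f12281bbca0")
print(header)
# 00-fed1bcae5dee00da58c0383a2f633b3e-7b7c4f12281bbca0-01
print(parse_traceparent(header)["trace_id"])
# fed1bcae5dee00da58c0383a2f633b3e
```

The server side reads the same header and creates its spans under that trace ID, which is why a backend can show the request as a single end-to-end trace.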

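3. Reducing operational load and optimizing scaling

When operating a large LLM, keeping the system stable and efficient is essential. With Instana, you can:

Configure dynamic scaling appropriately so that resources are allocated according to request volume
Respond quickly when errors occur, minimizing downtime
Use anomaly-detection alerts to catch problems before they become outages, enabling proactive operations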

Verification environment

Component	Version / Spec
OS	Ubuntu 20.04
GPU	RTX 4090 (16GB VRAM)
Python	3.10.12
vLLM	0.7.1
Instana Agent	Latest
WatsonX Model	`ibm-granite/granite-3.0-2b-instruct`

1. Setting up the Instana Agent

First, install the Instana Agent and enable the OpenTelemetry port (4317).

1.1 Installing the Instana Agent

See the Instana installation documentation for Linux environments.

1.2 Enabling OpenTelemetry

1.2.1 Edit [instana_installation_dir]/etc/instana/configuration.yaml

com.instana.plugin.opentelemetry:
  enabled: true

1.2.2 Restart the Instana Agent

sudo systemctl restart instana-agent.service
netstat -ano | grep 4317

✅ If 0.0.0.0:4317 appears in the output, the agent is listening on the OpenTelemetry port.
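If `netstat` is not installed on the host, the same check can be done from Python. This is a minimal sketch that only tests whether a TCP connection can be opened; it does not verify that the listener actually speaks OTLP. The helper name `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    status = "listening" if is_port_open("localhost", 4317) else "not reachable"
    print(f"Instana Agent OTLP port 4317: {status}")
```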

2. Setting up the Python environment

Create a virtual environment and install the libraries needed to run vLLM and OpenTelemetry.

# Create a Python virtual environment
python3 -m venv vllm_v0.7.1
source vllm_v0.7.1/bin/activate

# Install the required packages
pip install vllm fastapi uvicorn
pip install 'opentelemetry-sdk>=1.30.0' \
            'opentelemetry-api>=1.30.0' \
            'opentelemetry-exporter-otlp>=1.30.0' \
            'opentelemetry-instrumentation-fastapi>=0.51b0' \
            'opentelemetry-instrumentation-asgi>=0.51b0' \
            'opentelemetry-semantic-conventions-ai>=0.5.0'

3. Setting the OpenTelemetry environment variables

In this setup, the information collected through OpenTelemetry is sent to the Instana SaaS backend via the Agent.

export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://localhost:4317"
export OTEL_SERVICE_NAME="vllmservicejacky"
export OTEL_EXPORTER_OTLP_INSECURE=true
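A typo in these variables usually shows up only as missing traces, so it can help to validate them before starting anything. The sketch below reads the three variables and checks that the endpoint is a well-formed URL with an explicit port. The function name `read_otel_settings` and the returned dictionary layout are hypothetical; `unknown_service` is the default service name the OpenTelemetry SDK falls back to.

```python
import os
from urllib.parse import urlparse


def read_otel_settings(env=None):
    """Read and validate the OTLP settings used in this article."""
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", "")
    parsed = urlparse(endpoint)
    if parsed.scheme not in ("http", "https") or not parsed.hostname or parsed.port is None:
        raise ValueError(f"endpoint must look like http://host:port, got {endpoint!r}")
    return {
        "host": parsed.hostname,
        "port": parsed.port,
        "service_name": env.get("OTEL_SERVICE_NAME", "unknown_service"),
        "insecure": env.get("OTEL_EXPORTER_OTLP_INSECURE", "false").lower() == "true",
    }


# Same values as the export commands above, passed explicitly for illustration.
settings = read_otel_settings({
    "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": "http://localhost:4317",
    "OTEL_SERVICE_NAME": "vllmservicejacky",
    "OTEL_EXPORTER_OTLP_INSECURE": "true",
})
print(settings)
```

Calling `read_otel_settings()` with no argument reads the current process environment instead.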

4. Starting vLLM + WatsonX AI

4.1 Start vLLM as an API server and load the IBM WatsonX Granite model (ibm-granite/granite-3.0-2b-instruct).

vllm serve ibm-granite/granite-3.0-2b-instruct --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"

......
INFO 02-25 13:15:55 cuda.py:229] Using Flash Attention backend.
INFO 02-25 13:15:55 model_runner.py:1110] Starting to load model ibm-granite/granite-3.0-2b-instruct...
INFO 02-25 13:15:56 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:02<00:02,  2.92s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.51s/it]

INFO 02-25 13:16:00 model_runner.py:1115] Loading model weights took 4.7199 GB
INFO 02-25 13:16:01 worker.py:267] Memory profiling takes 0.70 seconds
INFO 02-25 13:16:01 worker.py:267] the current vLLM instance can use total_gpu_memory (15.99GiB) x gpu_memory_utilization (0.90) = 14.39GiB
INFO 02-25 13:16:01 worker.py:267] model weights take 4.72GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 0.47GiB; the rest of the memory reserved for KV Cache is 9.15GiB.
INFO 02-25 13:16:01 executor_base.py:111] # cuda blocks: 7498, # CPU blocks: 3276
INFO 02-25 13:16:01 executor_base.py:116] Maximum concurrency for 4096 tokens per request: 29.29x
INFO 02-25 13:16:01 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|███████████| 35/35 [00:13<00:00,  2.53it/s]
INFO 02-25 13:16:15 model_runner.py:1562] Graph capturing finished in 14 secs, took 0.30 GiB
INFO 02-25 13:16:15 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 15.55 seconds
INFO 02-25 13:16:16 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8000
....
INFO:     Started server process [1125356]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
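The memory figures in this startup log can be cross-checked with simple arithmetic. The sketch below reproduces the KV cache budget and the "Maximum concurrency" value from the numbers printed above. One value is an assumption rather than something the log shows: a KV cache block size of 16 tokens, which I believe is vLLM's default on CUDA.

```python
def kv_cache_budget_gib(total_gib, utilization, weights_gib, non_torch_gib, activation_gib):
    """Memory left for the KV cache after weights and runtime overhead."""
    return total_gib * utilization - weights_gib - non_torch_gib - activation_gib


def max_concurrency(num_gpu_blocks, block_size_tokens, tokens_per_request):
    """How many full-length requests fit in the KV cache at once."""
    return num_gpu_blocks * block_size_tokens / tokens_per_request


# Values copied from the log lines above.
budget = kv_cache_budget_gib(
    total_gib=15.99,
    utilization=0.90,
    weights_gib=4.72,
    non_torch_gib=0.05,
    activation_gib=0.47,
)
print(f"KV cache budget: {budget:.2f} GiB")  # KV cache budget: 9.15 GiB

# block_size_tokens=16 is an assumption (not printed in the log).
concurrency = max_concurrency(num_gpu_blocks=7498, block_size_tokens=16, tokens_per_request=4096)
print(f"Maximum concurrency: {concurrency:.2f}x")  # Maximum concurrency: 29.29x
```

Both results match the log (9.15 GiB and 29.29x), which suggests the concurrency figure is simply the token capacity of the KV cache divided by the 4096-token request length.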

4.2 Log in to the Instana UI and confirm that vllmservicejacky has been detected in the service list on the Applications screen.

4.3 Create a new application perspective (for example, watsonx3_vllm_jacky) from the detected service.

5. Testing with an AI application

The sample AI application calls the WatsonX AI model and records a trace with OpenTelemetry.

5.1 Sample AI application code
File: ./trainingtest/client_watsonx_otel.py

import requests
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import SpanKind, set_tracer_provider
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# ✅ OpenTelemetry setup
trace_provider = TracerProvider()
set_tracer_provider(trace_provider)
trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace_provider.get_tracer("dummy-client")

# ✅ vLLM API endpoint
vllm_url = "http://localhost:8000/v1/completions"

# ✅ Start the trace
with tracer.start_as_current_span("client-span", kind=SpanKind.CLIENT) as span:
    prompt = "What do you think about the future of AI Agents and their impact on society?"
    span.set_attribute("prompt", prompt)
    headers = {}
    TraceContextTextMapPropagator().inject(headers)
    
    payload = {
        "model": "ibm-granite/granite-3.0-2b-instruct",
        "prompt": prompt,
        "max_tokens": 256,
        "n": 1,
        "best_of": 1,
        "use_beam_search": False,  # ❌ Disable beam search (a real boolean, not the string "false")
        "top_k": 50,  # ✅ Top-K sampling
        "top_p": 0.9,  # ✅ Nucleus sampling
        "temperature": 0.7,  # ✅ Add randomness
    }

    response = requests.post(vllm_url, headers=headers, json=payload)
    response_json = response.json()
    
    if "choices" in response_json and len(response_json["choices"]) > 0:
        print("Generated text:", response_json["choices"][0]["text"])
    else:
        print("❌ No generated text. Full response:", response_json)

5.2 Running the sample application (from a console on the sample application host)

export OTEL_SERVICE_NAME="vllmclient1jacky"
python ./trainingtest/client_watsonx_otel.py

✅ Output on success:

Generated text:

The future of AI agents is indeed promising and holds immense potential for transforming various aspects of society. 
Here are some key points to consider:

1. **Advancements in AI Capabilities**: AI agents are expected to become more sophisticated, with improved NLP and ML.
2. **Integration into Daily Life**: AI agents will play a key role in healthcare, education, and personalized assistance.
3. **Economic and Social Impact**: AI will reshape industries, but ethical challenges must be addressed.
4. **Ethical and Regulatory Considerations**: Fairness, transparency, and AI safety will be crucial.
{
    "name": "client-span",
    "context": {
        "trace_id": "0xfed1bcae5dee00da58c0383a2f633b3e",
        "span_id": "0x7b7c4f12281bbca0",
        "trace_state": "[]"
    },
    "kind": "SpanKind.CLIENT",
    "parent_id": null,
    "start_time": "2025-02-25T04:27:03.790260Z",
    "end_time": "2025-02-25T04:27:07.492964Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "prompt": "What do you think about the AI Agent's future?"
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.30.0",
            "service.name": "vllmclient1jacky"
        },
        "schema_url": ""
    }
}
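The `start_time` and `end_time` fields in the console span give the request latency as seen by the client, which is a useful number to compare against what Instana shows for the same trace. A minimal sketch of that calculation follows; `strptime` is used instead of `datetime.fromisoformat` because Python 3.10 (the version in this environment) does not accept the trailing `Z`.

```python
from datetime import datetime

# Timestamp layout used by ConsoleSpanExporter in the output above.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def span_duration_seconds(start_time, end_time):
    """Return the span duration in seconds from two exporter timestamps."""
    start = datetime.strptime(start_time, TIMESTAMP_FORMAT)
    end = datetime.strptime(end_time, TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


duration = span_duration_seconds(
    "2025-02-25T04:27:03.790260Z",
    "2025-02-25T04:27:07.492964Z",
)
print(f"client-span duration: {duration:.3f} s")  # client-span duration: 3.703 s
```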

6. Checking the service visualization in the Instana UI

6.1. Go to Applications > watsonx3_vllm_jacky

6.2. Check the calls under Analytics > Applications > Calls

6.3. Review the call and trace details

Summary

✅ Deployed a WatsonX AI model through vLLM
✅ Visualized LLM traces with Instana + OpenTelemetry

🚀 This demonstrates how to monitor a WatsonX AI model with Instana + vLLM.

References

Instana vLLM visualization documentation
