[Part 4] Fast Inference with ONNX Runtime

Posted at 2025-12-13

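Aim

Learn to run inference on ONNX models quickly using ONNX Runtime.

Audience

Readers who learned to create ONNX models in Part 3
Readers who want to improve model inference speed
Readers who are planning a production deployment

Goals

Master the basic usage of ONNX Runtime
Switch between CPU and GPU inference
Understand the basics of performance optimization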

TL;DR

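ONNX Runtime is a fast inference engine developed by Microsoft
Load a model with InferenceSession and run inference with run()
Use CUDAExecutionProvider for GPU inference
Use IOBinding to cut down on data transfers between CPU and GPU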

What is ONNX Runtime?

Let's start with the official description.

ONNX Runtime is a cross-platform machine-learning model accelerator, with a flexible interface to integrate hardware-specific libraries.

Source: ONNX Runtime Documentation

In short, it is an engine for running ONNX models fast.

Why use ONNX Runtime?

PyTorch and TensorFlow can run inference too. However, they are general-purpose frameworks that handle both training and inference.

ONNX Runtime is specialized for inference, which is why it tends to be faster.

High-scale Microsoft services such as Bing, Office, and Azure AI use ONNX Runtime. Although performance gains depend on many factors, these Microsoft services average a 2x performance gain on CPU because they use ONNX.

Source: Microsoft Learn - ONNX

An average 2x speedup on CPU across Microsoft's flagship services is an attractive figure, though, as the quote itself notes, actual gains depend on many factors.

Basic usage

Installation

# CPU build
pip install onnxruntime

# GPU build (CUDA support)
pip install onnxruntime-gpu

Note: the GPU build requires a compatible CUDA version. Check the official installation guide.
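A frequent source of confusion is ending up with both builds in the same environment; as far as I know, the official install guide advises keeping only one of them per environment. The following is a minimal sketch, using only the standard library, that reports which of the two distributions are installed. The helper name installed_ort_packages is my own and not part of ONNX Runtime.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
ORT_PACKAGES = ("onnxruntime", "onnxruntime-gpu")


def installed_ort_packages():
    """Return {distribution name: version} for each ONNX Runtime build found."""
    found = {}
    for name in ORT_PACKAGES:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this build is not installed in the current environment
    return found


if __name__ == "__main__":
    found = installed_ort_packages()
    print(found or "ONNX Runtime is not installed")
    if len(found) > 1:
        print("Both builds are installed; consider keeping only one.")
```

Running this before debugging a "GPU not used" problem can rule out a mixed install early.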

The simplest inference

import onnxruntime as ort
import numpy as np

# Create a session (this loads the model)
session = ort.InferenceSession("model.onnx")

# Prepare the input data (a NumPy array)
input_data = np.random.randn(1, 784).astype(np.float32)

# Look up the input name
input_name = session.get_inputs()[0].name

# Run inference
outputs = session.run(None, {input_name: input_data})

# Read the result
result = outputs[0]
print(f"Output shape: {result.shape}")

That is all it takes. Simple.

InferenceSession in detail

Inspecting inputs and outputs

session = ort.InferenceSession("model.onnx")

# Input metadata (named inp to avoid shadowing the built-in input())
for inp in session.get_inputs():
    print(f"Input: {inp.name}")
    print(f"  Shape: {inp.shape}")
    print(f"  Type: {inp.type}")

# Output metadata
for output in session.get_outputs():
    print(f"Output: {output.name}")
    print(f"  Shape: {output.shape}")
    print(f"  Type: {output.type}")
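The metadata printed above is enough to build a dummy input for a smoke test. One thing to watch for: dynamic dimensions show up in the shape as a string (the symbolic name) or None rather than an integer, and the type is a string such as 'tensor(float)'. Below is a rough sketch of two helpers for that; the names concrete_shape and numpy_dtype_name are hypothetical, and the type table is deliberately partial.

```python
# Partial mapping from ONNX Runtime type strings to NumPy dtype names.
ORT_TO_NUMPY = {
    "tensor(float)": "float32",
    "tensor(double)": "float64",
    "tensor(float16)": "float16",
    "tensor(int64)": "int64",
    "tensor(int32)": "int32",
    "tensor(uint8)": "uint8",
    "tensor(bool)": "bool",
}


def concrete_shape(shape, dynamic_size=1):
    """Replace symbolic or unknown dimensions (str or None) with a fixed size."""
    return [d if isinstance(d, int) else dynamic_size for d in shape]


def numpy_dtype_name(ort_type):
    """Translate a type string such as 'tensor(float)' to a NumPy dtype name."""
    return ORT_TO_NUMPY[ort_type]


# Usage with a real session (requires numpy and onnxruntime):
#   meta = session.get_inputs()[0]
#   dummy = np.zeros(concrete_shape(meta.shape), dtype=numpy_dtype_name(meta.type))
#   session.run(None, {meta.name: dummy})
```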

Fetching only specific outputs

# Fetch all outputs
outputs = session.run(None, {input_name: input_data})

# Fetch only a specific output
outputs = session.run(["output_name"], {input_name: input_data})

To perform inferencing on your model, use run and pass in the list of outputs you want returned and a map of the input values. Leave the output list empty if you want all of the outputs.

Source: Microsoft Learn - ONNX

In the Python API, passing None as the first argument to run() returns every output.
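Because run() returns a plain list, it is easy to lose track of which element is which once a model has several outputs. Here is a small sketch that pairs the values with their names. run_named is a hypothetical helper, not an ONNX Runtime function; it only relies on get_outputs() and run() as used above.

```python
def run_named(session, feeds, output_names=None):
    """Run inference and return {output name: value} instead of a bare list."""
    if output_names is None:
        # Default to every output the model declares.
        output_names = [o.name for o in session.get_outputs()]
    values = session.run(output_names, feeds)
    return dict(zip(output_names, values))


# Usage with a real session:
#   results = run_named(session, {input_name: input_data})
#   for name, value in results.items():
#       print(name, value.shape)
```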

Enabling GPU inference

What is an Execution Provider?

ONNX Runtime supports a wide range of hardware through a mechanism called Execution Providers (EPs).

ONNX Runtime also provides an abstraction layer for hardware accelerators, such as Nvidia CUDA and TensorRT, Intel OpenVINO, Windows DirectML, and others.

Source: Microsoft Open Source Blog

Main Execution Providers:

EP	Hardware
CPUExecutionProvider	CPU (default)
CUDAExecutionProvider	NVIDIA GPU (CUDA)
TensorrtExecutionProvider	NVIDIA GPU (TensorRT)
OpenVINOExecutionProvider	Intel CPU/GPU
DmlExecutionProvider	Windows GPU (DirectML)
CoreMLExecutionProvider	Apple devices (Core ML)

Implementing GPU inference

import onnxruntime as ort
import numpy as np

# Check which providers are available
print(ort.get_available_providers())
# Example output: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']

# Enable GPU inference (CUDA)
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Check which providers the session actually uses
print(session.get_providers())

providers is given as a list, and entries earlier in the list take priority. If CUDA cannot be used, the session falls back to the CPU automatically.
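If the same script has to run on machines with and without a GPU, one option is to filter the preferred list against what the installed build actually offers before creating the session. A minimal sketch under that assumption; select_providers is a hypothetical helper name.

```python
def select_providers(preferred, available):
    """Keep the preferred providers that exist in this build, in priority order."""
    chosen = [p for p in preferred if p in available]
    # Keep the CPU provider as the last resort so the session can always run.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Usage with onnxruntime installed:
#   providers = select_providers(
#       ["TensorrtExecutionProvider", "CUDAExecutionProvider"],
#       ort.get_available_providers(),
#   )
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

Passing an explicit, already-filtered list also makes the chosen order visible in logs, which helps when comparing environments.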

Selecting a GPU

When several GPUs are present, you can choose which one to use.

session = ort.InferenceSession(
    "model.onnx",
    providers=[
        ('CUDAExecutionProvider', {
            'device_id': 0,  # use GPU 0
            'arena_extend_strategy': 'kNextPowerOfTwo',
            'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2 GB limit
        }),
        'CPUExecutionProvider'
    ]
)
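Each entry in the list above is either a provider name or a (name, options) tuple, so the configuration can be built programmatically, for example from a command-line flag. A small sketch; cuda_provider and providers_for_gpu are hypothetical helper names, and only the device_id and gpu_mem_limit options from the example above are used.

```python
GIB = 1024 ** 3  # bytes per GiB


def cuda_provider(device_id=0, mem_limit_gib=None):
    """Build a ('CUDAExecutionProvider', options) entry for the providers list."""
    options = {"device_id": device_id}
    if mem_limit_gib is not None:
        # gpu_mem_limit is expressed in bytes.
        options["gpu_mem_limit"] = int(mem_limit_gib * GIB)
    return ("CUDAExecutionProvider", options)


def providers_for_gpu(device_id, mem_limit_gib=None):
    """CUDA on the chosen GPU first, with the CPU provider as a fallback."""
    return [cuda_provider(device_id, mem_limit_gib), "CPUExecutionProvider"]


# Usage: ort.InferenceSession("model.onnx", providers=providers_for_gpu(1, 2))
```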

Performance optimization

1. Session options

# Configure the session options
sess_options = ort.SessionOptions()

# Graph optimization level
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Number of threads used within an operator (CPU inference)
sess_options.intra_op_num_threads = 4

# Execution mode (sequential or parallel)
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# Create the session
session = ort.InferenceSession("model.onnx", sess_options)

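Graph optimization levels

Level	Description
ORT_DISABLE_ALL	No optimization
ORT_ENABLE_BASIC	Basic optimizations
ORT_ENABLE_EXTENDED	Extended optimizations
ORT_ENABLE_ALL	All optimizations (recommended)

2. IOBinding (the key to GPU optimization)

For GPU inference, this is one of the most important optimization techniques.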

By default, ONNX Runtime will copy the input from the CPU (even if the tensors are already copied to the targeted device), and assume that outputs also need to be copied back to the CPU from GPUs after the run. These data copying overheads between the host and devices are expensive.

Source: Hugging Face - ONNX Runtime GPU

In other words, transferring data between CPU and GPU on every call is slow. IOBinding lets you avoid this.

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider']
)

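# Create an IOBinding
io_binding = session.io_binding()

# Transfer the input data to the GPU
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
input_tensor = ort.OrtValue.ortvalue_from_numpy(input_data, 'cuda', 0)

# Bind the input ('input' must match the model's actual input name)
io_binding.bind_ortvalue_input('input', input_tensor)

# Let ONNX Runtime allocate the output buffer on the GPU
# ('output' must match the model's actual output name)
io_binding.bind_output('output', 'cuda')

# Run inference
session.run_with_iobinding(io_binding)

# Fetch the result (leave it on the GPU, or copy it to the CPU)
output = io_binding.get_outputs()[0].numpy()  # copies to the CPU

Across consecutive inferences, the data can stay on the GPU between steps, which makes this fast.

3. Batching

Processing inputs together as a batch is more efficient than processing many small inputs one at a time.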

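# Inefficient: one item at a time
# (assumes each item has no batch dimension, so one is added per call)
for item in data:
    result = session.run(None, {"input": item[np.newaxis, ...]})

# Efficient: one batched call
# (requires a model exported with a dynamic batch dimension)
batch_data = np.stack(data)
results = session.run(None, {"input": batch_data})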
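Stacking the whole dataset into a single batch may not fit in memory, so in practice the data is usually processed in fixed-size chunks. A minimal sketch of that step; iter_batches is a hypothetical helper name, and the batch size in the usage comment is an arbitrary example value.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Usage with the session above (the last batch may be smaller):
#   for chunk in iter_batches(data, 32):
#       results = session.run(None, {"input": np.stack(chunk)})
```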

Benchmark: PyTorch vs ONNX Runtime

Let's measure how much faster it actually gets.

import time
import torch
import numpy as np
import onnxruntime as ort
import torchvision.models as models

# Prepare the models
# (pretrained=True is deprecated in recent torchvision; weights= replaces it)
torch_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

ort_session = ort.InferenceSession(
    "resnet18.onnx",
    providers=['CPUExecutionProvider']
)
# Look up the input name instead of hard-coding it
input_name = ort_session.get_inputs()[0].name

# Test data
np_input = np.random.randn(1, 3, 224, 224).astype(np.float32)
torch_input = torch.from_numpy(np_input)

# Warm-up
for _ in range(10):
    with torch.no_grad():
        _ = torch_model(torch_input)
    _ = ort_session.run(None, {input_name: np_input})

# PyTorch benchmark (perf_counter is a monotonic, high-resolution timer)
n_iterations = 100
start = time.perf_counter()
for _ in range(n_iterations):
    with torch.no_grad():
        _ = torch_model(torch_input)
pytorch_time = (time.perf_counter() - start) / n_iterations

# ONNX Runtime benchmark
start = time.perf_counter()
for _ in range(n_iterations):
    _ = ort_session.run(None, {input_name: np_input})
ort_time = (time.perf_counter() - start) / n_iterations

print(f"PyTorch: {pytorch_time*1000:.2f} ms/inference")
print(f"ONNX Runtime: {ort_time*1000:.2f} ms/inference")
print(f"Speedup: {pytorch_time/ort_time:.2f}x")

Typical results

Results vary by environment, but a speedup of roughly 1.5x to 2x can be expected for CPU inference.

PyTorch: 45.23 ms/inference
ONNX Runtime: 28.56 ms/inference
Speedup: 1.58x

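For GPU inference, the gap depends heavily on the model and the batch size. Small models with small batches show little difference, while large models with large batches benefit the most.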
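A single mean can hide outliers, so when comparing runtimes it can help to look at the median and a tail percentile as well. Below is a sketch of a reusable timing helper built on the standard library; benchmark is a hypothetical name, the p95 value is a simple index-based approximation, and the callable passed in is whatever you want to measure.

```python
import statistics
import time


def benchmark(fn, warmup=10, iterations=100):
    """Time `fn()` and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Approximate 95th percentile: the sample at 95% of the sorted list.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


# Usage with the objects from the benchmark script:
#   stats = benchmark(lambda: ort_session.run(None, {input_name: np_input}))
#   print(stats)
```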

Speeding up with quantization

Dynamic quantization

Quantizing the model's weights to INT8 can make inference faster still.

from onnxruntime.quantization import quantize_dynamic, QuantType

# Apply dynamic quantization
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quantized.onnx",
    weight_type=QuantType.QInt8
)

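Effects of quantization

Model size: reduced to about 1/4 (an FP32 weight takes 4 bytes, an INT8 weight takes 1)
Inference speed: roughly 1.5x to 3x faster (CPU)
Accuracy: slightly lower (within acceptable limits in many cases)

That said, always verify the impact on accuracy.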
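One simple way to check the impact is to run the same inputs through the original and the quantized model and compare the outputs. The sketch below does that for classification logits using plain Python lists; compare_outputs and argmax are hypothetical helpers, and which thresholds count as acceptable is something you have to decide for your own task.

```python
def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=row.__getitem__)


def compare_outputs(reference, candidate):
    """Compare two batches of logits (lists of equal-length rows)."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("batches must be non-empty and the same length")
    pairs = list(zip(reference, candidate))
    # Fraction of rows where both models predict the same class.
    agree = sum(argmax(r) == argmax(c) for r, c in pairs)
    # Largest element-wise difference across the whole batch.
    max_diff = max(abs(a - b) for r, c in pairs for a, b in zip(r, c))
    return {"top1_agreement": agree / len(pairs), "max_abs_diff": max_diff}


# Usage with two sessions (NumPy outputs converted with .tolist()):
#   ref = fp32_session.run(None, feeds)[0].tolist()
#   quant = int8_session.run(None, feeds)[0].tolist()
#   print(compare_outputs(ref, quant))
```

For a real evaluation, measuring task accuracy on a labeled validation set is still the more reliable check; this comparison is only a quick first pass.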

Deploying to production

Languages other than Python

ONNX Runtime supports multiple languages.

ONNX Runtime provides a consistent API across platforms and architectures with APIs in Python, C++, C#, Java, and more.

Source: Microsoft Open Source Blog

Inference example in C++

#include <onnxruntime_cxx_api.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");
Ort::SessionOptions session_options;
Ort::Session session(env, "model.onnx", session_options);

// Prepare the inputs and outputs, then run inference...

Inference on the web (ONNX Runtime Web)

Inference also works in the browser.

import * as ort from 'onnxruntime-web';

// inputData: a Float32Array with 784 values.
// 'input' and 'output' must match the model's actual tensor names.
async function runInference(inputData) {
    const session = await ort.InferenceSession.create('model.onnx');
    const feeds = { input: new ort.Tensor('float32', inputData, [1, 784]) };
    const results = await session.run(feeds);
    console.log(results.output.data);
}

Troubleshooting

The GPU is not being used

print(session.get_providers())
# ['CPUExecutionProvider'] <- CUDA is missing!

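Cause: onnxruntime-gpu is not installed, or the CUDA version does not match.

What to do:

Confirm that onnxruntime-gpu is installed (pip install onnxruntime-gpu)
Check the CUDA version (nvcc --version)
Check the official CUDA compatibility table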
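Since the fallback to CPU is silent apart from a log message, a deployment that depends on the GPU may want to fail at startup instead. A minimal sketch of such a guard, based only on get_providers() as shown above; require_provider is a hypothetical helper name.

```python
def require_provider(session, provider="CUDAExecutionProvider"):
    """Raise if `provider` is not active on the session; return the active list."""
    active = session.get_providers()
    if provider not in active:
        raise RuntimeError(f"{provider} is not active; session is using {active}")
    return active


# Usage right after creating the session:
#   session = ort.InferenceSession("model.onnx", providers=[...])
#   require_provider(session)  # stops here if CUDA could not be enabled
```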

Out of memory

RuntimeError: CUDA out of memory

What to do:

Reduce the batch size
Set gpu_mem_limit
Spread the work across multiple GPUs

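Inference results look wrong

Cause: the input data has the wrong type or shape.

# Check the expected type and shape
input_info = session.get_inputs()[0]
print(f"Expected type: {input_info.type}")
print(f"Expected shape: {input_info.shape}")

# Convert the input data accordingly
input_data = input_data.astype(np.float32)  # match the expected type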
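The shape check can be automated too. The sketch below compares the shape reported by the model with the shape of the array about to be fed in, treating dynamic dimensions (reported as a string or None) as wildcards. shape_problems is a hypothetical helper; it catches layout mix-ups such as NHWC data fed to an NCHW model, but it cannot detect wrong preprocessing such as missing normalization.

```python
def shape_problems(expected, actual):
    """List mismatches between a model input shape and an array shape.

    `expected` may contain str or None entries for dynamic dimensions,
    which match any size.
    """
    if len(expected) != len(actual):
        return [f"rank mismatch: expected {len(expected)} dims, got {len(actual)}"]
    return [
        f"dim {i}: expected {e}, got {a}"
        for i, (e, a) in enumerate(zip(expected, actual))
        if isinstance(e, int) and e != a
    ]


# Usage with a real session:
#   problems = shape_problems(session.get_inputs()[0].shape, input_data.shape)
#   if problems:
#       raise ValueError("; ".join(problems))
```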

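Summary

ONNX Runtime = a fast inference engine dedicated to ONNX models
Load a model with InferenceSession and run inference with run()
Choose the hardware with Execution Providers (CPU, CUDA, TensorRT, and others)
Optimize GPU inference with IOBinding
Quantization can speed things up further
Also usable from languages other than Python and in the browser

Coming up next

Part 5, the final installment, sorts out ONNX's limitations and where it fits, so that you can judge when to use ONNX and when not to.

References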

ONNX Runtime Documentation: https://onnxruntime.ai/docs/
ONNX Runtime Performance: https://onnxruntime.ai/docs/performance/
Hugging Face - ONNX Runtime GPU: https://huggingface.co/docs/optimum-onnx/en/onnxruntime/usage_guides/gpu
Microsoft Open Source Blog: https://cloudblogs.microsoft.com/opensource/
