Playing with AI Models Advent Calendar 2024

Using the Multimodal Features of Apple's AIMv2, Part 5: Speeding Up YOLO and AIMv2 on the GPU Toward Real-Time Processing

Last updated at 2024-12-19, posted at 2024-12-19

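1. Introduction

Hi, this is Shun!

This time I took the object search system that combines YOLO and AIMv2 and sped it up by running it on the GPU, with real-time processing as the goal. The approach cuts AIMv2 inference time from about 7.49 s to about 0.195 s without quantization, so efficiency improves while the model's accuracy stays intact.

This article covers:

- The process of speeding up AIMv2 with the GPU
- The key code optimizations
- Experimental results and discussion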

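2. Background and Goals

Existing problems

YOLO: Fast, accurate object detection. However, it can only assign labels from a fixed set of classes, so it is a poor fit for flexible, condition-based search.
AIMv2: Can score how well a text matches an image, which makes it good at condition-based search, but inference is too slow for real-time use.

Goals for this article

- Detect objects with YOLO and filter them by text condition with AIMv2.
- Use the GPU to speed up AIMv2 inference and make real-time processing feasible.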

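3. Approach: Speeding Up with the GPU

Code changes

Unified GPU processing:
- Both YOLO and AIMv2 now run on the GPU.
- This reduces data-transfer overhead.
Preparing for batch processing:
- Reworked the per-region sequential processing to lay the groundwork for parallel processing.
Optimized drawing:
- Moved the post-processing drawing code into functions to make it reusable.
Removed unnecessary work:
- Dropped resizing and other redundant steps.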
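
As a sketch of where the batch-processing groundwork could lead, the crops from all YOLO boxes can be handed to the processor in one call, so that AIMv2 runs a single forward pass per image instead of one per region. This is not the code measured in this article. `score_regions` is a hypothetical helper, and it assumes that the processor accepts a list of images and that `logits_per_image` has shape (regions, queries), as in CLIP-style models.

```python
import torch


def score_regions(model, processor, regions, query_text, threshold=0.3, device="cuda"):
    """Score every cropped region against every query in one forward pass.

    Returns one list of (label, score) pairs per region, keeping only the
    pairs whose softmax score exceeds the threshold.
    """
    if not regions:
        return []

    # One processor call builds a batch of N images and T text prompts.
    inputs = processor(
        images=regions, text=query_text, return_tensors="pt", padding=True
    ).to(device)

    with torch.inference_mode():  # no autograd bookkeeping during inference
        outputs = model(**inputs)

    # Assumed shape (N, T): softmax over the queries, then one copy to the CPU.
    probs = outputs.logits_per_image.softmax(dim=-1).cpu()

    results = []
    for row in probs:
        results.append(
            [(query_text[i], float(p)) for i, p in enumerate(row) if p > threshold]
        )
    return results
```

In the script further down, `regions` would be the list of `image.crop(...)` results collected from the YOLO boxes, and the returned lists line up with those boxes by index.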

4. Experiment and Results

Setup

GPU: NVIDIA RTX 3080
Image size: 384x640
Text conditions: ["cola can", "pepsi can", "sprite can", "fanta can"]
Similarity threshold: 0.3
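
The threshold is applied to a softmax over the text conditions rather than to a raw similarity, so the four scores for a region always sum to 1. The minimal sketch below reproduces that step in plain Python. The logit values are made up for illustration, and `labels_above_threshold` is a hypothetical helper, not part of AIMv2.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def labels_above_threshold(logits, labels, threshold=0.3):
    """Return the (label, probability) pairs whose probability exceeds the threshold."""
    probs = softmax(logits)
    return [(label, p) for label, p in zip(labels, probs) if p > threshold]


labels = ["cola can", "pepsi can", "sprite can", "fanta can"]
logits = [24.0, 18.0, 17.5, 16.0]  # illustrative values, not measured output

for label, prob in labels_above_threshold(logits, labels):
    print(f"{label}: {prob:.2f}")
```

One consequence is that a region matching none of the four conditions still gets a distribution that sums to 1, so a high score does not by itself prove the object is one of the listed cans. With four conditions, a perfectly uniform result is 0.25 per label, which falls just under the 0.3 threshold.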

Inference comparison

Item	Before	After
AIMv2 total inference time	approx. 7.49 s	approx. 0.195 s
FPS (estimated)	under 1 FPS	approx. 5 FPS
Memory usage	High	Medium

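main.py: https://qiita.com/syun88/items/8bb8eb68a873f8373854
Comparison

Aspect	Previous code	Current code
Device	Sequential processing on the CPU	Parallel processing on the GPU (regions are still looped one at a time)
Drawing	Drawn region by region	Drawn in one pass
Resizing	Resized each region	Resizing omitted
Model loading	Inefficient caching	Better use of the cache at load time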

Results

(.venv) syun@syun:/media/syun/ssd02/python_learning/apple/qiita_project_AIMv2$ python3 aimv2-large-patch14-224-lit/main_try_speedUP.py 
Loading AIMv2 model...
Model loaded in 5.84 seconds

image 1/1 /media/syun/ssd02/python_learning/apple/qiita_project_AIMv2/test_search_image/cola4.jpg: 384x640 4 vases, 42.2ms
Speed: 2.3ms preprocess, 42.2ms inference, 91.0ms postprocess per image at shape (1, 3, 384, 640)
Processing AIMv2 detections...
Refined Results:
Region (492, 83, 631, 323) - cola can: 0.97
Region (352, 82, 477, 322) - fanta can: 1.00
Region (43, 82, 182, 322) - sprite can: 1.00
Region (199, 82, 326, 322) - pepsi can: 1.00
AIMv2 Total Inference Time: 0.1959 seconds
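
The figures in the table and the log can be turned into a speedup and a frame rate with a little arithmetic. The sketch below assumes that the stages run strictly one after another with no overlap, that every frame contains four regions like this test image, and that drawing and display time are excluded.

```python
# Timings taken from the comparison table and the log above.
aimv2_before_s = 7.49
aimv2_after_s = 0.1959
yolo_ms = {"preprocess": 2.3, "inference": 42.2, "postprocess": 91.0}

# Speedup of the AIMv2 stage alone.
speedup = aimv2_before_s / aimv2_after_s      # roughly 38x

# Frame rate if only the AIMv2 stage is counted (the table's estimate).
aimv2_fps = 1 / aimv2_after_s                 # roughly 5.1 FPS

# Frame rate when the YOLO stages reported in the log are added.
yolo_s = sum(yolo_ms.values()) / 1000
pipeline_fps = 1 / (yolo_s + aimv2_after_s)   # roughly 3.0 FPS

print(f"AIMv2 speedup: {speedup:.1f}x")
print(f"AIMv2-only FPS: {aimv2_fps:.1f}")
print(f"YOLO + AIMv2 FPS: {pipeline_fps:.1f}")
```

Under these assumptions the 5 FPS estimate describes the AIMv2 stage by itself, and the end-to-end rate for this image is closer to 3 FPS.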

Code

from ultralytics import YOLO
from PIL import Image, ImageDraw, ImageFont
from transformers import AutoProcessor, AutoModel
import torch
import time

# Load the YOLO model
yolo_model = YOLO("yolov8n.pt")

# Input image path
image_path = "/media/syun/ssd02/python_learning/apple/qiita_project_AIMv2/test_search_image/cola4.jpg"
font_path = "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf"  # font used for the labels
font_size = 17
font = ImageFont.truetype(font_path, font_size)

# Load the AIMv2 model and move it to the GPU
print("Loading AIMv2 model...")
start_load = time.time()
processor = AutoProcessor.from_pretrained("apple/aimv2-large-patch14-224-lit")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224-lit", trust_remote_code=True).to("cuda")
end_load = time.time()
print(f"Model loaded in {end_load - start_load:.2f} seconds")

# Text conditions
query_text = ["pepsi can", "cola can", "sprite can", "fanta can"]
threshold = 0.3  # similarity threshold

# Load the image and run YOLO object detection
image = Image.open(image_path).convert("RGB")
width, height = image.size
results = yolo_model(image_path)
detections = results[0].boxes

# Draw the YOLO results
def draw_yolo_results(image, detections):
    draw = ImageDraw.Draw(image)
    for detection in detections:
        x1, y1, x2, y2 = map(int, detection.xyxy[0])
        label = yolo_model.names[int(detection.cls)]
        confidence = detection.conf.item()
        draw.rectangle([(x1, y1), (x2, y2)], outline="green", width=2)
        draw.text((x1, y1 - 20), f"{label}: {confidence:.2f}", fill="green", font=font)
    return image

draw_yolo = draw_yolo_results(image.copy(), detections)

# Use AIMv2 to find the regions that match the text conditions
print("Processing AIMv2 detections...")
refined_results = []
aim_start = time.time()

for detection in detections:
    x1, y1, x2, y2 = map(int, detection.xyxy[0])
    region = image.crop((x1, y1, x2, y2))

    # Evaluate the cropped region with AIMv2 (no gradients needed for inference)
    inputs = processor(images=region, text=query_text, return_tensors="pt", padding=True).to("cuda")
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax over the text conditions, copied to the CPU once per region
    probs = outputs.logits_per_image.softmax(dim=-1)[0].tolist()

    for i, score in enumerate(probs):
        if score > threshold:
            refined_results.append((x1, y1, x2, y2, query_text[i], score))

aim_end = time.time()

# Draw the AIMv2 results
def draw_aimv2_results(image, results):
    draw = ImageDraw.Draw(image)
    for x1, y1, x2, y2, label, score in results:
        draw.rectangle([(x1, y1), (x2, y2)], outline="blue", width=2)
        draw.text((x1, y1 - 20), f"{label}: {score:.2f}", fill="blue", font=font)
    return image

draw_aimv2 = draw_aimv2_results(image.copy(), refined_results)

# Print the refined results
print("Refined Results:")
for x1, y1, x2, y2, label, score in refined_results:
    print(f"Region ({x1}, {y1}, {x2}, {y2}) - {label}: {score:.2f}")

draw_yolo.show()  # YOLO results
draw_aimv2.show()  # AIMv2 results

# Print the timing result
print(f"AIMv2 Total Inference Time: {aim_end - aim_start:.4f} seconds")
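
CUDA work is launched asynchronously, and the first call on a GPU usually carries one-time warm-up cost, so a single wall-clock measurement can be misleading. The helper below is a minimal sketch of a more careful measurement: it runs a warm-up call, synchronizes before reading the clock, and averages several repeats. `timed` and its parameters are hypothetical names, and the block is kept free of GPU dependencies by taking the synchronization function as an argument.

```python
import time


def timed(fn, sync=None, warmup=1, repeats=5):
    """Return (last_result, mean_seconds) for fn(), excluding warm-up calls.

    sync, if given, is called before the clock starts and before it stops,
    so that pending asynchronous work is included in the measurement.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    start = time.perf_counter()
    for _ in range(repeats):
        result = fn()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start

    return result, elapsed / repeats


# CPU-only demonstration; no synchronization is needed here.
value, mean_s = timed(lambda: sum(range(100_000)))
print(f"result={value}, mean time={mean_s:.6f} s")
```

For the script above, the AIMv2 loop could be wrapped in a function and measured with `timed(run_aimv2, sync=torch.cuda.synchronize)`, where `run_aimv2` is a placeholder name for that wrapper.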

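5. Discussion

Achievements

Speed: Running on the GPU cut AIMv2 inference time substantially, from about 7.49 s to about 0.195 s.
Accuracy preserved: No quantization is applied, so there is no quantization-related loss of accuracy.

Remaining issues

Further speedups: Introduce batch processing and asynchronous processing to get closer to real-time performance.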
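
As one possible shape for the asynchronous processing mentioned above, detection and refinement can be placed in separate threads connected by a bounded queue, so that the detector works on the next frame while the previous one is being refined. This is a sketch using only the standard library. `run_pipeline`, `detect`, and `refine` are hypothetical names standing in for the YOLO and AIMv2 steps, and how much overlap this gives in practice depends on how much time each stage spends outside the Python interpreter.

```python
import queue
import threading


def run_pipeline(frames, detect, refine, max_pending=2):
    """Overlap detect() and refine() using a producer thread and a bounded queue.

    Results are returned in the same order as the input frames.
    """
    pending = queue.Queue(maxsize=max_pending)  # bounded to cap memory use
    done = object()                             # marks the end of the stream

    def producer():
        try:
            for frame in frames:
                pending.put((frame, detect(frame)))
        finally:
            pending.put(done)  # always unblock the consumer

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        item = pending.get()
        if item is done:
            break
        frame, detections = item
        results.append(refine(frame, detections))

    worker.join()
    return results


# Stand-in stages for demonstration only.
demo = run_pipeline(
    frames=["frame0", "frame1", "frame2"],
    detect=lambda frame: [f"{frame}-box"],
    refine=lambda frame, boxes: (frame, len(boxes)),
)
print(demo)
```

The bounded queue keeps the detector from running far ahead of the refinement stage, which matters when frames arrive faster than AIMv2 can process them.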

6. Summary

I ran the combined YOLO and AIMv2 system on the GPU and made it much faster. This is an important step toward an object detection system that can run in real time.

7. Reference Links

GitHub repository: syun88
Official AIMv2 repository: AIMv2 GitHub
Paper: arXiv: AIMv2
