Playing with AI Models Advent Calendar 2024

Using the multimodal features of Apple's AIMv2, Part 4: Making YOLO + AIMv2 more efficient with quantization

Last updated at 2024-12-19, posted at 2024-12-12

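1. Introduction

Hi, this is Shun!

This time I tried quantizing the AIMv2 model to make an object search system that combines YOLO and AIMv2 more efficient. The aim is to speed up object detection under specific conditions and move closer to real-time use.

This article covers:

The AIMv2 quantization process
Layer analysis and optimization with GPT-4
Experimental results and discussion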

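2. Background and goals

Problems with the existing approaches

YOLO:
- Fast, accurate object detection.
- However, it can only assign simple class labels and struggles with searches that involve complex conditions.
AIMv2:
- Can score how well an image matches a text, which makes it good at flexible, condition-based search.
- On the other hand, inference is slow, which makes it a poor fit for real-time processing in particular.

Goals for this article

Detect objects with YOLO, then filter the detections with AIMv2 and keep the ones that match the conditions.
Quantize AIMv2 to improve computational efficiency and address the speed problem.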

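3. Quantizing AIMv2

Code analysis with GPT-4

Choosing what to quantize

Analyzed file: layers.py of the AIMv2 model

Layers to quantize:
- nn.Linear layers inside SwiGLUFFN:
  - They carry a heavy computational load, so quantizing them yields a large efficiency gain.
  - The impact on accuracy is also relatively small.
Layers to leave unquantized:
- nn.Embedding layer in TextPreprocessor:
  - It readily affects the model's embedding quality, so it is left unquantized.
- RMSNorm and Attention layers:
  - Excluded because quantizing them has a large impact on accuracy.

Quantization procedure

Method: PyTorch's torch.quantization.quantize_dynamic.
Target: nn.Linear layers only (dtype=torch.qint8). Note that passing {torch.nn.Linear} applies to every nn.Linear module in the model, so any linear projections inside the Attention blocks are quantized as well, not only those in SwiGLUFFN.
Expected effect: lower computational cost with little loss of accuracy.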
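
To make concrete what storing weights as qint8 means, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is a simplified illustration, not PyTorch's actual implementation (which uses its own observers and optimized kernels); the helper names `quantize_int8` and `dequantize` and the example weights are made up for this sketch.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of a list of floats to int8 codes."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range.
    codes = [max(-128, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.42, -1.27, 0.003, 0.9, -0.55]  # made-up example weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

Each weight is stored in 1 byte instead of the 4 bytes of a float32, and the round-trip error per weight is at most half the scale. Small weights such as 0.003 collapse to 0, which is one way quantization can cost accuracy.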

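4. Experiment and results

Setup

Image size: 384x640
Text conditions: ["cola can", "pepsi can", "sprite can", "fanta can"]
Similarity threshold: 0.3
Quantization: nn.Linear layers quantized (qint8).

Comparison of inference results

Item	Before quantization	After quantization
Total inference time (AIMv2)	approx. 7.68 s	approx. 2.81 s (varies between runs; the run logged below was about 0.1 s faster)
Accuracy	High (threshold 0.8)	Medium (threshold 0.3)
Memory usage	High	Low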

image 1/1 /media/syun/ssd02/python_learning/apple/qiita_project_AIMv2/test_search_image/cola4.jpg: 384x640 4 vases, 39.1ms
Speed: 2.1ms preprocess, 39.1ms inference, 82.0ms postprocess per image at shape (1, 3, 384, 640)
Refined Results:
Region (492, 83, 631, 323) Score: 0.93
Region (352, 82, 477, 322) Score: 0.74
Region (43, 82, 182, 322) Score: 0.61
Region (199, 82, 326, 322) Score: 0.45
AIMv2 Total Inference Time: 2.7286 seconds
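
Because the measured time changes from run to run, a single measurement is a noisy basis for comparison. The following is a minimal sketch of a timing helper that discards warm-up runs and reports the median of several repeats. The names `time_median` and `dummy_inference` are hypothetical and not part of the original script; `dummy_inference` merely stands in for the AIMv2 loop.

```python
import statistics
import time


def time_median(fn, repeats=5, warmup=1):
    """Return the median wall-clock time of fn() in seconds over several runs."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed (lazy initialization, caches)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def dummy_inference():
    # Stand-in workload; replace with the real AIMv2 scoring loop.
    sum(i * i for i in range(10_000))


print(f"median: {time_median(dummy_inference):.6f} s")
```

Timing the unquantized and quantized models with the same helper, on the same image and queries, would make the before/after numbers easier to compare.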

Visualization

AIMv2 conditional search results (blue boxes)

Final code

main_quantization.py


from ultralytics import YOLO
from PIL import Image, ImageDraw, ImageFont
from transformers import AutoProcessor, AutoModel
from torch.quantization import quantize_dynamic
import torch
import time

# Load the YOLO model
yolo_model = YOLO("yolov8n.pt")

# Input image path
image_path = "/media/syun/ssd02/python_learning/apple/qiita_project_AIMv2/test_search_image/cola4.jpg"
font_path = "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf"  # Font path
font_size = 17
font = ImageFont.truetype(font_path, font_size)

# Load and quantize the AIMv2 model
print("Loading and quantizing AIMv2 model...")
start_load = time.time()
processor = AutoProcessor.from_pretrained("apple/aimv2-large-patch14-224-lit")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224-lit", trust_remote_code=True)

# Quantize the model
quantized_model = quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # Layer types to quantize
)
end_load = time.time()
print(f"Model loaded and quantized in {end_load - start_load:.2f} seconds")

# Query texts
query_text = ["pepsi can", "cola can", "sprite can", "fanta can"]
threshold = 0.3  # Similarity threshold

# Detect objects with YOLO
image = Image.open(image_path).convert("RGB")
width, height = image.size
results = yolo_model(image_path)
detections = results[0].boxes

# Draw the YOLO results
draw_yolo = image.copy()
draw = ImageDraw.Draw(draw_yolo)

for detection in detections:
    x1, y1, x2, y2 = map(int, detection.xyxy[0])
    label = yolo_model.names[int(detection.cls)]
    confidence = detection.conf.item()

    draw.rectangle([(x1, y1), (x2, y2)], outline="green", width=2)
    draw.text((x1, y1 - 20), f"{label}: {confidence:.2f}", fill="green", font=font)

# Use AIMv2 to find regions that match the query texts
refined_results = []
aim_start = time.time()

for detection in detections:
    x1, y1, x2, y2 = map(int, detection.xyxy[0])
    region = image.crop((x1, y1, x2, y2))

    # Score the cropped region against every query text
    inputs = processor(images=region, text=query_text, return_tensors="pt", padding=True)
    with torch.no_grad():  # Inference only, so skip gradient tracking
        outputs = quantized_model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)

    for i, score in enumerate(probs[0]):
        if score > threshold:
            refined_results.append((x1, y1, x2, y2, query_text[i], score.item()))

aim_end = time.time()

# Draw the AIMv2 results
draw_aimv2 = image.copy()
draw = ImageDraw.Draw(draw_aimv2)

print("Refined Results:")
for x1, y1, x2, y2, label, score in refined_results:
    draw.rectangle([(x1, y1), (x2, y2)], outline="blue", width=2)
    draw.text((x1, y1 - 20), f"{label}: {score:.2f}", fill="blue", font=font)
    print(f"Region ({x1}, {y1}, {x2}, {y2}) Score: {score:.2f}")

# Show the results
draw_yolo.show()  # YOLO results
draw_aimv2.show()  # AIMv2 results

# Timing result
print(f"AIMv2 Total Inference Time: {aim_end - aim_start:.4f} seconds")
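
The filtering step in the script applies the threshold to softmax probabilities over the query texts, so the scores for one region always sum to 1. The sketch below reproduces just that step in plain Python to show how the threshold behaves. The logits are made-up values, not real AIMv2 outputs, and `labels_above` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def labels_above(logits, labels, threshold):
    """Return (label, probability) pairs whose softmax probability exceeds threshold."""
    probs = softmax(logits)
    return [(label, p) for label, p in zip(labels, probs) if p > threshold]


labels = ["pepsi can", "cola can", "sprite can", "fanta can"]
logits = [0.0, 2.0, 1.9, 0.0]  # made-up values, not real AIMv2 outputs

print(labels_above(logits, labels, threshold=0.3))
print(labels_above(logits, labels, threshold=0.8))
```

With a threshold of 0.3, up to three labels can pass for the same region, so one box may appear more than once in the results; with 0.8, at most one can. Because the probabilities are relative to the given queries, a region that matches none of them well can still get a high score for the least bad label.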

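5. Discussion

Results

Speedup:
- Quantizing AIMv2's linear layers cut inference time by about 63% (from about 7.68 s to about 2.81 s).
Accuracy/speed trade-off:
- Lowering the threshold keeps the search flexible while remaining practical. (Quantization is known to reduce accuracy, and I still wanted to see results, so I lowered it.)

Open issues

Better real-time performance:
- Further speedups are needed, for example through parallelization and optimized batch processing.
Refining what gets quantized:
- Making the Attention layers more efficient may be worth pursuing.

6. Summary

In this article I quantized AIMv2 to make a system that combines YOLO and AIMv2 more efficient. Inference speed improved substantially, but the accuracy trade-off remains an open issue.

Next, I plan to try applying this to video and adding further optimizations, aiming for more improvement.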
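
The roughly 63% figure follows directly from the two times in the comparison table. As a quick check, the arithmetic can be written out as below; the function names are illustrative only.

```python
def reduction_percent(before_s, after_s):
    """Percentage of time saved relative to the original duration."""
    return (before_s - after_s) / before_s * 100


def speedup(before_s, after_s):
    """How many times faster the new duration is."""
    return before_s / after_s


before_s, after_s = 7.68, 2.81  # seconds, from the comparison table

print(f"{reduction_percent(before_s, after_s):.1f}% less time")  # 63.4% less time
print(f"{speedup(before_s, after_s):.2f}x faster")  # 2.73x faster
```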

Reference links

GitHub repository: syun88
Official AIMv2 repository: AIMv2 GitHub
Paper: arXiv: AIMv2
