Playing with AI Models Advent Calendar 2024

Using the Multimodal Capabilities of Apple's AIMv2, Part 6: Attempting Real-Time Webcam Inference

Posted at 2024-12-19, last updated at 2024-12-19

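1. Introduction

Hello, this is Shun!

In this article I build a real-time object detection system that combines YOLO and AIMv2, and I was able to speed it up considerably. Using a webcam as the input, I explore how feasible real-time processing is. I measured the FPS to evaluate both the processing speed and the practicality of the system.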

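2. Background and Goals

Conventional image recognition systems take image files as input. In scenarios that demand real-time processing (surveillance cameras, robot control, and so on), however, the video stream from the camera has to be processed directly.

Goals for this article

Check real-time performance: measure what FPS the system can reach with webcam input.
Maintain accuracy: combine YOLO object detection with AIMv2 condition filtering so that real-time processing is achieved without sacrificing accuracy.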

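3. Implementation

Code overview

The code implements the following:

Object detection with YOLO: detects objects in the webcam video.
Conditional filtering with AIMv2: scores each detected region against the text conditions.
FPS measurement and display: shows the FPS on screen to evaluate real-time performance.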

from ultralytics import YOLO
from PIL import Image, ImageDraw, ImageFont
from transformers import AutoProcessor, AutoModel
import torch
import time
import cv2
import numpy as np

# Load the YOLO model
yolo_model = YOLO("yolov8n.pt")

# Font settings
font_path = "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf"  # font path
font_size = 17
font = ImageFont.truetype(font_path, font_size)

# Load the AIMv2 model
print("Loading AIMv2 model...")
start_load = time.time()
processor = AutoProcessor.from_pretrained("apple/aimv2-large-patch14-224-lit")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224-lit", trust_remote_code=True).to("cuda")
end_load = time.time()
print(f"Model loaded in {end_load - start_load:.2f} seconds")

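# Query texts (the conditions each detected region is matched against)
# query_text = ["pepsi can", "cola can", "sprite can", "fanta can"]
query_text = ["iphone", "ipad", "headphone", "Apple watch", "white book", "green book"]
threshold = 0.3  # score threshold, applied to the softmax probability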

# Webcam setup (the device index depends on the machine; 0 is the usual default)
# cap = cv2.VideoCapture(0)
cap = cv2.VideoCapture(2)
if not cap.isOpened():
    raise SystemExit("Could not open the webcam.")

# Counters for the FPS calculation
frame_count = 0
start_time = time.time()

# Draw the YOLO results
def draw_yolo_results(image, detections):
    draw = ImageDraw.Draw(image)
    for detection in detections:
        x1, y1, x2, y2 = map(int, detection.xyxy[0])
        label = yolo_model.names[int(detection.cls)]
        confidence = detection.conf.item()
        draw.rectangle([(x1, y1), (x2, y2)], outline="green", width=2)
        draw.text((x1, y1 - 20), f"{label}: {confidence:.2f}", fill="green", font=font)
    return image

# Draw the AIMv2 results
def draw_aimv2_results(image, results):
    draw = ImageDraw.Draw(image)
    for x1, y1, x2, y2, label, score in results:
        draw.rectangle([(x1, y1), (x2, y2)], outline="blue", width=2)
        draw.text((x1, y1 - 20), f"{label}: {score:.2f}", fill="blue", font=font)
    return image

print("Starting video stream...")
try:
    while True:
        ret, frame = cap.read()
        if not ret:
            print("Could not read a frame.")
            break

        frame_count += 1

        # Convert the OpenCV BGR image to a Pillow RGB image
        pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        # Object detection with YOLO
        results = yolo_model(frame)
        detections = results[0].boxes
        draw_yolo = draw_yolo_results(pil_image.copy(), detections)

        # Use AIMv2 to find the regions that match the query texts
        refined_results = []
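        for detection in detections:
            x1, y1, x2, y2 = map(int, detection.xyxy[0])
            # Skip degenerate boxes, which would produce an empty crop
            if x2 <= x1 or y2 <= y1:
                continue
            region = pil_image.crop((x1, y1, x2, y2))

            # Evaluate the crop against the query texts with AIMv2
            inputs = processor(images=region, text=query_text, return_tensors="pt", padding=True).to("cuda")
            with torch.no_grad():  # inference only, so skip gradient tracking
                outputs = model(**inputs)
            probs = outputs.logits_per_image.softmax(dim=-1)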

            for i, score in enumerate(probs[0]):
                if score > threshold:
                    refined_results.append((x1, y1, x2, y2, query_text[i], score.item()))

        draw_aimv2 = draw_aimv2_results(pil_image.copy(), refined_results)

        # Convert back to BGR for display with OpenCV
        frame_with_yolo = cv2.cvtColor(np.array(draw_yolo), cv2.COLOR_RGB2BGR)
        frame_with_aimv2 = cv2.cvtColor(np.array(draw_aimv2), cv2.COLOR_RGB2BGR)

        # Compute the average FPS since startup and overlay it on the AIMv2 frame
        elapsed_time = time.time() - start_time
        fps = frame_count / elapsed_time
        cv2.putText(frame_with_aimv2, f"FPS: {fps:.2f}", (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

        # Show each window once per frame, after the overlay has been drawn
        cv2.imshow("YOLO Results", frame_with_yolo)
        cv2.imshow("AIMv2 Results", frame_with_aimv2)
        print(fps)

        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
finally:
    cap.release()
    cv2.destroyAllWindows()
    print("Video stream ended.")
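
To make the filtering step concrete, here is a small plain-Python sketch of what the softmax and threshold in the loop above do. The logits are made-up numbers, not real AIMv2 outputs, and `softmax` and `filter_labels` are hypothetical helper names used only for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_labels(logits, labels, threshold):
    """Keep every (label, probability) pair whose probability exceeds the threshold."""
    probs = softmax(logits)
    return [(label, p) for label, p in zip(labels, probs) if p > threshold]

labels = ["iphone", "ipad", "headphone"]
logits = [4.0, 1.0, 0.5]  # made-up image-text scores for one cropped region
matches = filter_labels(logits, labels, threshold=0.3)
for label, p in matches:
    print(f"{label}: {p:.2f}")
```

One thing this makes visible: the softmax normalizes over the query texts, so the scores for a crop always sum to 1. A crop that matches none of the queries well still spreads its full probability across them, which is worth keeping in mind when choosing the threshold.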

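Key improvements

Real-time processing:
- Used OpenCV to handle webcam input.
- Combined YOLO and AIMv2 to run inference frame by frame.
Performance measurement:
- Measured processing time and displayed the FPS on screen.
Visualization of results:
- Displayed the YOLO and AIMv2 inference results in separate windows.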
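
The script computes FPS as the total frame count divided by the total elapsed time, so the value is an average since startup and reacts slowly when the scene changes. As an alternative, a sliding-window counter could be sketched as follows. `SlidingFPS` is a hypothetical helper, not part of the original code, and the window size of 30 frames is an arbitrary assumption.

```python
import time
from collections import deque

class SlidingFPS:
    """Estimate FPS from the timestamps of the most recent frames."""

    def __init__(self, window=30):
        # Only the last `window` timestamps are kept
        self.timestamps = deque(maxlen=window)

    def tick(self, now=None):
        """Record one frame and return the FPS over the current window."""
        self.timestamps.append(time.perf_counter() if now is None else now)
        if len(self.timestamps) < 2:
            return 0.0
        span = self.timestamps[-1] - self.timestamps[0]
        # N timestamps span N - 1 frame intervals
        return (len(self.timestamps) - 1) / span if span > 0 else 0.0

# Synthetic timestamps spaced 0.1 s apart, i.e. roughly 10 frames per second
fps_meter = SlidingFPS(window=3)
for t in (0.0, 0.1, 0.2, 0.3):
    fps = fps_meter.tick(now=t)
print(f"{fps:.2f}")
```

In the main loop, calling `fps_meter.tick()` once per frame (without the `now` argument) would replace the `frame_count / elapsed_time` calculation.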

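4. Experiments and Results

FPS measurements

The FPS varied with the state of the system and the content of the input.

Condition	FPS range
Simple camera scene	approx. 15–18 FPS
Scene with many complex objects	approx. 5–7 FPS

Stability

Confirmed that the system runs stably in the test environment.
Processing delays can occur under heavy load, so further optimization is needed.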
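
To see where the time goes when the system slows down, one option is to time each stage separately. The sketch below is a minimal example using only the standard library. `StageTimer` is a hypothetical helper, and the `time.sleep` calls are stand-ins for the YOLO call and the AIMv2 loop. Note that CUDA kernels run asynchronously, so on a GPU the measured regions would need a `torch.cuda.synchronize()` before the timer stops to give meaningful numbers.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Accumulate wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def measure(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised an exception
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def average_ms(self):
        """Return the mean duration of each stage in milliseconds."""
        return {name: 1000.0 * self.totals[name] / self.counts[name] for name in self.totals}

timer = StageTimer()
for _ in range(3):
    with timer.measure("yolo"):
        time.sleep(0.001)  # stand-in for yolo_model(frame)
    with timer.measure("aimv2"):
        time.sleep(0.002)  # stand-in for the per-detection AIMv2 loop
report = timer.average_ms()
for name, ms in report.items():
    print(f"{name}: {ms:.2f} ms")
```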

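5. Discussion

Achievements

Real-time processing: reached a practical FPS (up to 18 FPS) with webcam input.
Better visualization: showing the object detection results and the condition-matched results separately makes the system's behavior easy to understand at a glance.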

[Embedded video]

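Future work

Further speed improvements:
- Introduce batch processing to make better use of parallel computation.
- Make the model lighter (for example, by quantizing AIMv2).
Accuracy tuning:
- Consider a mechanism that adjusts the condition-matching threshold dynamically.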
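
As a rough idea of what the batching item could look like, the sketch below scores all crops from one frame in a single call instead of one call per detection. The model is abstracted behind `score_batch`, a hypothetical callable that returns one row of probabilities per crop. The commented-out wrapper shows how it might be built from the `processor` and `model` in this article, assuming the processor accepts a list of images. I have not measured whether this actually improves the FPS.

```python
def filter_detections_batched(boxes, crops, labels, score_batch, threshold=0.3):
    """Score every crop of a frame in one call and keep the labels above the threshold."""
    if not crops:
        return []
    probs = score_batch(crops)  # expected shape: len(crops) x len(labels)
    results = []
    for box, row in zip(boxes, probs):
        for label, p in zip(labels, row):
            if p > threshold:
                results.append((*box, label, float(p)))
    return results

# Possible AIMv2-backed scorer (assumption: the processor accepts a list of images):
# def score_batch(crops):
#     inputs = processor(images=crops, text=query_text, return_tensors="pt", padding=True).to("cuda")
#     with torch.no_grad():
#         return model(**inputs).logits_per_image.softmax(dim=-1).tolist()

# Demo with a fake scorer so the sketch runs without a GPU or model download
calls = []
def fake_score_batch(crops):
    calls.append(len(crops))  # track how many crops each call receives
    return [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

boxes = [(0, 0, 10, 10), (5, 5, 20, 20)]
out = filter_detections_batched(
    boxes, ["crop_a", "crop_b"], ["iphone", "ipad", "headphone"], fake_score_batch
)
print(out)
```

The returned tuples have the same `(x1, y1, x2, y2, label, score)` layout as `refined_results`, so they could be passed to `draw_aimv2_results` unchanged.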

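6. Summary

In this article I built a system that processes webcam video in real time using YOLO and AIMv2. The system ran stably and reached up to 18 FPS.

Going forward, I plan to keep improving it with the aim of balancing processing speed and accuracy.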

7. Reference Links

GitHub repository: syun88
Official AIMv2 repository: AIMv2 GitHub
Paper: arXiv: AIMv2
