Playing with AI Models Advent Calendar 2024

Using the Multimodal Features of Apple's AIMv2, Part 3: Real-Time Text Search for Specific Objects with YOLO and a Webcam

Posted at 2024-12-07

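Hello, this is Shun!
In the previous article, I introduced text search for specific objects by combining Apple's AIMv2 with YOLO. This time, as an application of that idea, I explain a system that performs object detection and conditional search in real time. In this new approach, YOLO detects objects first, and AIMv2 then filters the detections against more detailed text conditions. I have also prepared a video, so please read through to the end!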

See here for the official repository and paper:

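Background and Goal

Conventional object detectors such as YOLO are fast and accurate, but they struggle with searches by condition such as "green Pringles" or "tomato soup can", since they can only report the classes they were trained on. AIMv2, on the other hand, is good at measuring how well a text matches an image, which allows flexible search.
So I wondered whether running AIMv2 as a filter after YOLO detection would make searching by specific conditions both fast and accurate, and I set out to build a real-time system.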

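System Overview

The system works in the following three steps:

1. Object detection with YOLO
- Each image captured from the webcam is processed with YOLOv8 to obtain the bounding boxes of the objects.
2. Conditional filtering with AIMv2
- The regions detected by YOLO are passed to AIMv2, which computes a relevance score against the specified text conditions (e.g. "green Pringles Chips can").
- Only regions whose score exceeds the threshold (0.8 in this article) are kept as results.
3. Visualizing the results
- The YOLO detections are drawn with green boxes.
- The regions that AIMv2 matched to a condition are drawn with blue boxes.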
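The filtering in step 2 can be sketched without loading any model. The following is a minimal illustration, assuming made-up logits in place of the `logits_per_image` values AIMv2 would return; `softmax` and `filter_regions` are hypothetical helper names, not part of the script in this article or of any library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_regions(logits_per_region, queries, threshold=0.8):
    """Return (region_index, query, score) for every score above threshold."""
    matches = []
    for region_idx, logits in enumerate(logits_per_region):
        for query, score in zip(queries, softmax(logits)):
            if score > threshold:
                matches.append((region_idx, query, score))
    return matches


# Hypothetical logits for 2 detected regions x 3 query texts
queries = ["iphone", "ipad", "headphone"]
logits = [
    [9.0, 2.0, 1.0],  # region 0: one query clearly dominates
    [3.0, 3.2, 2.9],  # region 1: ambiguous, no query stands out
]
matches = filter_regions(logits, queries)
for region_idx, query, score in matches:
    print(f"region {region_idx}: {query} ({score:.2f})")
```

Because the softmax is taken across the query texts, each score is relative to the other queries rather than an absolute match probability. With only one query, every region scores 1.0 and passes the threshold, so it seems safer to always supply several contrasting queries.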

Code

Below is the Python code I used. It processes the video captured from the webcam and combines YOLO and AIMv2 to search for objects in real time.

main_web_camera.py

import cv2
from ultralytics import YOLO
from PIL import Image, ImageDraw, ImageFont
from transformers import AutoProcessor, AutoModel
import torch
import numpy as np

# Load the YOLO model
yolo_model = YOLO("yolov8n.pt")  # a lightweight model is recommended

# Load the AIMv2 model and its processor
processor = AutoProcessor.from_pretrained("apple/aimv2-large-patch14-224-lit")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224-lit", trust_remote_code=True)

# Font used for the labels drawn on each frame
font_path = "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf"  # default font path on Ubuntu
font_size = 17  # text size
font = ImageFont.truetype(font_path, font_size)

# Query texts
# query_text = ["baseball", "green Pringles Chips can", "red Pringles Chips can", "Tomato Soup can", "Tomato"]
query_text = ["iphone", "ipad", "headphone", "Apple watch", "white book", "green book"]

# Similarity threshold
threshold = 0.8

# Open the webcam
cap = cv2.VideoCapture(0)  # set the device index as needed
if not cap.isOpened():
    print("Could not open the webcam.")
    exit()

try:
    while True:
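        # Grab a frame from the webcam
        ret, frame = cap.read()
        if not ret:
            print("Could not read a frame.")
            break

        # Convert the OpenCV BGR frame to a Pillow RGB image
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        width, height = image.size

        # Detect objects with YOLO
        results = yolo_model(frame)
        detections = results[0].boxes

        # Compute the scale between the Pillow image and the frame given to YOLO.
        # orig_shape is the (height, width) of that input frame, so both factors
        # are 1.0 here; they only matter if the image is resized before drawing.
        inference_shape = results[0].orig_shape
        scale_x = width / inference_shape[1]
        scale_y = height / inference_shape[0]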

        # Draw the YOLO results
        draw_yolo = image.copy()
        draw = ImageDraw.Draw(draw_yolo)

        refined_regions = []
        regions = []

        for detection in detections:
            x1, y1, x2, y2 = map(int, detection.xyxy[0])
            x1 = int(x1 * scale_x)
            y1 = int(y1 * scale_y)
            x2 = int(x2 * scale_x)
            y2 = int(y2 * scale_y)

            label = yolo_model.names[int(detection.cls)]
            confidence = detection.conf.item()

            draw.rectangle([(x1, y1), (x2, y2)], outline="green", width=3)
            draw.text((x1, y1 - 20), f"{label}: {confidence:.2f}", fill="green", font=font)

            # Crop the region detected by YOLO (no filtering yet)
            regions.append(image.crop((x1, y1, x2, y2)))

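        # Pass all crops to AIMv2 as one batch
        if regions:
            inputs = processor(images=regions, text=query_text, return_tensors="pt", padding=True)
            with torch.no_grad():  # inference only, so no gradients are needed
                outputs = model(**inputs)
            # Softmax over the query texts: the scores of each region sum to 1
            probs = outputs.logits_per_image.softmax(dim=-1)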

            for region_idx, region_probs in enumerate(probs):
                for i, score in enumerate(region_probs):
                    if score > threshold:
                        x1, y1, x2, y2 = map(int, detections[region_idx].xyxy[0])
                        x1 = int(x1 * scale_x)
                        y1 = int(y1 * scale_y)
                        x2 = int(x2 * scale_x)
                        y2 = int(y2 * scale_y)
                        refined_regions.append((x1, y1, x2, y2, query_text[i], score.item()))

        # Draw the AIMv2 results
        draw_aimv2 = image.copy()
        draw = ImageDraw.Draw(draw_aimv2)

        print("Refined Results:")
        for x1, y1, x2, y2, label, score in refined_regions:
            draw.rectangle([(x1, y1), (x2, y2)], outline="blue", width=3)
            draw.text((x1, y1 - 20), f"{label}: {score:.2f}", fill="blue", font=font)
            print(f"Region ({x1}, {y1}, {x2}, {y2}) Label: {label} Score: {score:.2f}")

        # Convert the Pillow RGB images back to OpenCV BGR frames
        yolo_frame = cv2.cvtColor(np.array(draw_yolo), cv2.COLOR_RGB2BGR)
        aimv2_frame = cv2.cvtColor(np.array(draw_aimv2), cv2.COLOR_RGB2BGR)

        # Show the frames with OpenCV
        cv2.imshow("YOLO Results", yolo_frame)
        cv2.imshow("AIMv2 Results", aimv2_frame)

        # Press 'q' to quit
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

finally:
    cap.release()
    cv2.destroyAllWindows()

Result Video

Summary

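In this project, I combined YOLO and AIMv2 to achieve real-time object detection with search by detailed conditions. This approach reduces the labeling effort and makes it possible to build a flexible, widely applicable system. Improving the processing speed further remains future work.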
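Before tuning the processing speed, it helps to measure it. The following is a minimal sketch of a rolling frames-per-second counter that could be called once per loop iteration; `FpsMeter` and its `window` and `clock` parameters are hypothetical names introduced for illustration and are not part of the script above.

```python
import time
from collections import deque


class FpsMeter:
    """Rolling frames-per-second estimate over the last `window` frames."""

    def __init__(self, window=30, clock=time.perf_counter):
        self._clock = clock  # injectable so the meter can be tested
        self._stamps = deque(maxlen=window)

    def tick(self):
        """Record one processed frame and return the current FPS estimate."""
        self._stamps.append(self._clock())
        if len(self._stamps) < 2:
            return 0.0  # not enough frames to measure an interval yet
        elapsed = self._stamps[-1] - self._stamps[0]
        if elapsed <= 0:
            return 0.0
        return (len(self._stamps) - 1) / elapsed


# Usage: call tick() once per processed frame
meter = FpsMeter(window=30)
for _ in range(3):
    time.sleep(0.01)  # stands in for the YOLO + AIMv2 work on one frame
    fps = meter.tick()
print(f"about {fps:.1f} frames per second")
```

Calling `tick()` at the end of each loop iteration would show how much of the frame budget the combined YOLO and AIMv2 pass consumes, which gives a baseline to compare any later optimization against.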
Thank you for reading to the end again! I will keep writing up new experiments, so please follow me on X too!

Please subscribe to my YouTube channel as well!