Playing with AI Models Advent Calendar 2024

Using the Multimodal Capabilities of Apple's AIMv2, Part 2: "Searching for Specific Objects by Text with YOLO"

Last updated at 2024-12-07, posted at 2024-12-04


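Hello, this is Shun!
In the previous article, I explained how to extract features from image regions with Apple's AIMv2 and how to visualize their relevance to text. This time I go one step further and combine AIMv2 with YOLO to search for specific objects by text.
The results turned out to be quite interesting, so please read to the end.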

See here for the official repository and the paper:

Background and Purpose

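YOLO, a widely used object detection model, is good at detecting objects by class, but it struggles with searches based on fine-grained conditions (e.g. "red bottle" or "green apple"). To support such conditions you would have to prepare a dataset in advance, label it, and train the model, which takes a great deal of time and effort.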

Using OpenCV is a possible alternative, but that also requires elaborate configuration and additional work.

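AIMv2 is good at measuring the relevance between text and images, and I verified this in the previous article.
That led to the following idea: "After detecting objects with YOLO, could we resize each detected region and score it with AIMv2 against the text we want to search for, and thereby handle fine-grained conditions flexibly?"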

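In this approach, a pretrained YOLO model performs the object detection, and AIMv2 then narrows the detections down further, which makes it possible to search for specific objects or conditions.

In other words, this article re-evaluates the object regions detected by YOLO with AIMv2 in order to search for objects under more detailed conditions.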
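The idea above can be sketched independently of the concrete models. The following is a minimal sketch, not the implementation used in this article: `refine_detections`, `score_fn`, and `fake_score_fn` are hypothetical names, and `score_fn` stands in for whatever returns one probability per query text for a region (AIMv2 in this article).

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


def refine_detections(
    boxes: Sequence[Box],
    queries: Sequence[str],
    score_fn: Callable[[Box, Sequence[str]], Sequence[float]],
    threshold: float = 0.8,
) -> List[Tuple[Box, str, float]]:
    """Keep (box, query, score) triples whose score exceeds the threshold."""
    matches = []
    for box in boxes:
        scores = score_fn(box, queries)  # one score per query text
        for query, score in zip(queries, scores):
            if score > threshold:
                matches.append((box, query, score))
    return matches


# Stand-in scorer for illustration only: it pretends that wide boxes
# match the first query strongly and narrow boxes do not.
def fake_score_fn(box: Box, queries: Sequence[str]) -> List[float]:
    x1, y1, x2, y2 = box
    return [0.9, 0.1] if (x2 - x1) > 100 else [0.3, 0.7]


if __name__ == "__main__":
    found = refine_detections(
        boxes=[(0, 0, 150, 200), (10, 10, 50, 200)],
        queries=["cola can", "pepsi can"],
        score_fn=fake_score_fn,
    )
    print(found)
```

Keeping the scorer as a parameter means the filtering logic can be checked without loading either model.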

System Overview

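Object detection with YOLO
- Detect the objects in the input image with yolov8n and obtain their bounding boxes (region information).
Conditional re-evaluation with AIMv2
- Resize each region obtained from YOLO to 224x224, pass it to AIMv2, and compute relevance scores against the specified list of texts (I also found that accuracy improves when the list contains at least two texts).
- Extract the regions whose score exceeds the threshold, which is set to 0.8.
Visualizing the results
- Draw the YOLO detections with green boxes.
- Show the regions that matched the AIMv2 condition with blue boxes.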
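One plausible reason the text list needs at least two entries: the scores come from a softmax over the query texts, so for each region they always sum to 1. With a single text the score is 1.0 regardless of the image, and every region passes the 0.8 threshold. The pure-Python sketch below illustrates this with made-up logits (the numbers are illustrative assumptions, not AIMv2 outputs).

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


threshold = 0.8

# A single query: the probability is 1.0 whatever the logit is.
single = softmax([-3.2])

# Several queries compete, so only a clearly better match passes.
several = softmax([5.0, 1.0, 0.5, 0.2])
passing = [i for i, p in enumerate(several) if p > threshold]

print(single, several, passing)
```

Because the probabilities sum to 1, at most one query can exceed a threshold of 0.8 for a given region.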

Implementation Code

Set image_path to a suitable location for your environment.
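Instead of editing the path in the source for each image, the path and the threshold could be read from the command line. This is a minimal standard-library sketch; the function name `parse_cli` and the option names are assumptions for illustration and are not part of the original script.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def parse_cli(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the image path and the similarity threshold from the command line."""
    parser = argparse.ArgumentParser(description="Text search over YOLO detections")
    parser.add_argument("image_path", type=Path, help="path to the input image")
    parser.add_argument(
        "--threshold",
        type=float,
        default=0.8,  # same default as the threshold used in this article
        help="minimum score for a region to count as a match",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_cli(["cola4.jpg", "--threshold", "0.9"])
    print(args.image_path, args.threshold)
```

In the script, `image_path = str(parse_cli().image_path)` would then replace the hard-coded assignment.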

from ultralytics import YOLO
from PIL import Image, ImageDraw
from transformers import AutoProcessor, AutoModel
import torch

# Load the YOLO model
yolo_model = YOLO("yolov8n.pt")
# image_path = "/media/syun/ssd02/python_learning/apple/qiita_project_AIMv2/test_search_image/apple.jpg"
# image_path = "/media/syun/ssd02/python_learning/apple/qiita_project_AIMv2/test_search_image/cola2.jpg"
# image_path = "/media/syun/ssd02/python_learning/apple/qiita_project_AIMv2/test_search_image/cola3.jpg"
image_path = "/media/syun/ssd02/python_learning/apple/qiita_project_AIMv2/test_search_image/cola4.jpg"

# Load the AIMv2 model
processor = AutoProcessor.from_pretrained("apple/aimv2-large-patch14-224-lit")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224-lit", trust_remote_code=True)

# Condition texts
# query_text = ["red apple","green apple"]
# query_text = ["cola bottle", "cola can", "cola glass"]
query_text = ["pepsi can", "cola can", "sprite can", "fanta can"]
# List to store the similarity results
high_score_regions = []
threshold = 0.8  # similarity threshold
# Detect objects with YOLO
image = Image.open(image_path).convert("RGB")
width, height = image.size  # get the original image size
results = yolo_model(image_path)
detections = results[0].boxes

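# Compute the scale factors
# Note: orig_shape is the original image shape (height, width), and boxes.xyxy is
# already expressed in original-image pixels, so both factors evaluate to 1.0.
inference_shape = results[0].orig_shape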
scale_x = width / inference_shape[1]
scale_y = height / inference_shape[0]

# Draw the YOLO results
draw_yolo = image.copy()
draw = ImageDraw.Draw(draw_yolo)

for detection in detections:
    x1, y1, x2, y2 = map(int, detection.xyxy[0])
    x1 = int(x1 * scale_x)
    y1 = int(y1 * scale_y)
    x2 = int(x2 * scale_x)
    y2 = int(y2 * scale_y)

    label = yolo_model.names[int(detection.cls)]
    confidence = detection.conf.item()

    draw.rectangle([(x1, y1), (x2, y2)], outline="green", width=2)
    draw.text((x1, y1 - 10), f"{label}: {confidence:.2f}", fill="green")

# Identify the regions that match the condition with AIMv2
refined_results = []

for detection in detections:
    x1, y1, x2, y2 = map(int, detection.xyxy[0])
    x1 = int(x1 * scale_x)
    y1 = int(y1 * scale_y)
    x2 = int(x2 * scale_x)
    y2 = int(y2 * scale_y)

    # Crop the region detected by YOLO
    region = image.crop((x1, y1, x2, y2))
    
    # Resize to the AIMv2 input size
    region_resized = region.resize((224, 224), Image.Resampling.LANCZOS)

    # For debugging, check the size of the resized image
    print(f"Region ({x1}, {y1}, {x2}, {y2}) Resized to: {region_resized.size}")

    # Score the region with AIMv2
    inputs = processor(images=region_resized, text=query_text, return_tensors="pt", padding=True)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)
    # Record the regions that exceed the threshold
    for i, score in enumerate(probs[0]):
        if score > threshold:
            refined_results.append((x1, y1, x2, y2, query_text[i], score.item()))
        # print(f"Region ({x1}, {y1}, {x2}, {y2}) Score: {score:.2f}")

# Draw the AIMv2 results
draw_aimv2 = image.copy()
draw = ImageDraw.Draw(draw_aimv2)

print("Refined Results:")
for x1, y1, x2, y2, label, score in refined_results:
    draw.rectangle([(x1, y1), (x2, y2)], outline="blue", width=2)
    draw.text((x1, y1 - 10), f"{label}: {score:.2f}", fill="blue")
    print(f"Region ({x1}, {y1}, {x2}, {y2}) Score: {score:.2f}")

# Display the results (call .save() instead to write them to disk)
draw_yolo.show()  # YOLO results (green boxes)
draw_aimv2.show()  # AIMv2 results (blue boxes)

Execution Results

(.venv) syun@syun:/media/syun/ssd02/python_learning/apple/qiita_project_AIMv2$ python3 aimv2-large-patch14-224-lit/main.py 

image 1/1 /media/syun/ssd02/python_learning/apple/qiita_project_AIMv2/test_search_image/cola4.jpg: 384x640 4 vases, 38.2ms
Speed: 2.0ms preprocess, 38.2ms inference, 77.3ms postprocess per image at shape (1, 3, 384, 640)
Region (492, 83, 631, 323) Resized to: (224, 224)
Region (352, 82, 477, 322) Resized to: (224, 224)
Region (43, 82, 182, 322) Resized to: (224, 224)
Region (199, 82, 326, 322) Resized to: (224, 224)
Refined Results:
Region (492, 83, 631, 323) Score: 0.98
Region (352, 82, 477, 322) Score: 0.99
Region (43, 82, 182, 322) Score: 1.00
Region (199, 82, 326, 322) Score: 1.00

Image: YOLO results (green boxes)
Image: AIMv2 results (blue boxes)

Video

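Summary

In this article, I combined YOLO and AIMv2 to implement text-based object search. This makes it possible to refine object detection results with text conditions, enabling more advanced image analysis.
That said, resizing and scoring the regions can take time in some cases, so further speed-ups and efficiency improvements remain future work.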
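Before optimizing, it helps to measure which stage actually dominates. The following is a minimal standard-library sketch; `timed` and the stage names are hypothetical, and the `sum(...)` calls merely stand in for real work such as resizing or scoring a region.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_seconds = defaultdict(float)  # accumulated wall-clock time per stage


@contextmanager
def timed(stage: str):
    """Add the elapsed time of the with-block to stage_seconds[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


for _ in range(4):  # e.g. one iteration per detected region
    with timed("resize"):
        sum(range(10_000))   # placeholder for region.resize(...)
    with timed("score"):
        sum(range(100_000))  # placeholder for the AIMv2 forward pass

for stage, seconds in stage_seconds.items():
    print(f"{stage}: {seconds * 1000:.2f} ms")
```

Wrapping the crop, resize, and scoring steps of the loop in `with timed(...)` blocks would show where the time goes before deciding what to speed up.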

Thank you for reading to the end once again!
Please give it a try!

Please subscribe to my YouTube channel as well!
