Gemini 2.0 Flashでの2D空間認識(物体検出)を試してみた

Last updated at 2024-12-24Posted at 2024-12-24

Gemini 2.0 Flash とは

12/11にGoogleから発表されたGeminiの最新バージョンです。
様々な新機能が追加されて、大きな話題を呼びました。

Googleの公式cookbook

GoogleがGemini 2.0 Flashを使用したcookbookを公開しています。

中身を見ていて、2Dの空間認識(物体検出)ができるとあったので実際に試してみました。

方法

物体検出をしたい画像を撮影しました。
写っているものは、ペン、USBメモリ、電卓になります。

from PIL import Image

image = "/content/unnamed.jpg" 

im = Image.open(image)

バウンディングボックスをJSONで返したり、オブジェクトが複数あった場合の条件などをシステムインストラクションに記載しています。
そして、プロンプトで今回はペンを検出するように指示しました。

from io import BytesIO

from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

bounding_box_system_instructions = """
境界ボックスをラベル付きの JSON 配列として返します。マスクやコードフェンスは絶対に返品しないでください。オブジェクトは 25 個に制限します。
オブジェクトが複数存在する場合は、その固有の特性 (色、サイズ、位置、固有の特徴など) に従って名前を付けます。
"""

safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]

prompt = "ペンの2次元バウンディングボックスを検出する。"

# Load and resize image
img = Image.open(BytesIO(open(image, "rb").read()))
im = Image.open(image).resize((1024, int(1024 * img.size[1] / img.size[0])), Image.Resampling.LANCZOS)
   
response = client.models.generate_content(
    model='gemini-2.0-flash-exp', 
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction=bounding_box_system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    )
)

print(response.text)

Geminiからの出力として、以下のバウンディングボックスが返ってきました。

[
  {"box_2d": [181, 16, 716, 189], "label": "red pen"},
  {"box_2d": [219, 256, 671, 504], "label": "transparent pen"}
]

バウンディングボックス描画コード

バウンディングボックス描画

import json
import random
import io
from PIL import Image, ImageDraw, ImageFont
from PIL import ImageColor

additional_colors = [colorname for (colorname, colorcode) in ImageColor.colormap.items()]

def plot_bounding_boxes(im, bounding_boxes):
    """
    Plots bounding boxes on an image with markers for each a name, using PIL, normalized coordinates, and different colors.

    Args:
        img_path: The path to the image file.
        bounding_boxes: A list of bounding boxes containing the name of the object
         and their positions in normalized [y1 x1 y2 x2] format.
    """

    # Load the image
    img = im
    width, height = img.size
    print(img.size)
    # Create a drawing object
    draw = ImageDraw.Draw(img)

    # Define a list of colors
    colors = [
    'red',
    'green',
    'blue',
    'yellow',
    'orange',
    'pink',
    'purple',
    'brown',
    'gray',
    'beige',
    'turquoise',
    'cyan',
    'magenta',
    'lime',
    'navy',
    'maroon',
    'teal',
    'olive',
    'coral',
    'lavender',
    'violet',
    'gold',
    'silver',
    ] + additional_colors

    # Parsing out the markdown fencing
    bounding_boxes = parse_json(bounding_boxes)

    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", size=14)

    # Iterate over the bounding boxes
    for i, bounding_box in enumerate(json.loads(bounding_boxes)):
      # Select a color from the list
      color = colors[i % len(colors)]

      # Convert normalized coordinates to absolute coordinates
      abs_y1 = int(bounding_box["box_2d"][0]/1000 * height)
      abs_x1 = int(bounding_box["box_2d"][1]/1000 * width)
      abs_y2 = int(bounding_box["box_2d"][2]/1000 * height)
      abs_x2 = int(bounding_box["box_2d"][3]/1000 * width)

      if abs_x1 > abs_x2:
        abs_x1, abs_x2 = abs_x2, abs_x1

      if abs_y1 > abs_y2:
        abs_y1, abs_y2 = abs_y2, abs_y1

      # Draw the bounding box
      draw.rectangle(
          ((abs_x1, abs_y1), (abs_x2, abs_y2)), outline=color, width=4
      )

      # Draw the text
      if "label" in bounding_box:
        draw.text((abs_x1 + 8, abs_y1 + 6), bounding_box["label"], fill=color, font=font)

    # Display the image
    img.show()

JSON文字列のパース

def parse_json(json_output):
    # Parsing out the markdown fencing
    lines = json_output.splitlines()
    for i, line in enumerate(lines):
        if line == "```json":
            json_output = "\n".join(lines[i+1:])  # Remove everything before "```json"
            json_output = json_output.split("```")[0]  # Remove everything after the closing "```"
            break  # Exit the loop once "```json" is found
    return json_output

結果

物体検出結果を出力する。

plot_bounding_boxes(im, response.text)

写真内の2本のペンをしっかり検出できていました！
物体検出といえば、CNNやYOLOなどのモデルが有名ですが、LLMでもここまでできるようになるとは。
座標までも出力できるとは思っていなかったので、LLMの可能性がより広がったような気がしました。

全ての物体検出をしてみる

プロンプトを以下に変更してみました。

prompt = "画像内の物体の2次元バウンディングボックスを検出する"

ペン以外にも、電卓やUSBメモリも検出してくれました！

おわりに

Gemini 2.0 Flashで物体検出を行ってみました。
今回はペン、電卓など一般的なものの検出でしたので、Geminiが検出できない物体には何があるのか、色々試してみたくなりました！

Geminiは、3D空間認識もできるらしいので今度試してみたい。
https://github.com/google-gemini/cookbook/blob/main/gemini-2/spatial_understanding_3d.ipynb

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up