Trying Object Detection with GPT-4o Vision
GPT-4o Vision has been getting a lot of attention, so I tested whether it can do bounding-box object detection.
1. Trying out GPT-4o Vision
First, I call GPT-4o Vision through the API, following the official API reference.
import base64
import requests

image_path = "HERE_IS_YOUR_IMAGE_PATH"
api_key = "HERE_IS_YOUR_OPENAI_API_KEY"

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Getting the base64 string
base64_image = encode_image(image_path)

prompt_message = "How many people are in this picture?"
print(prompt_message)

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"
}

payload = {
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt_message,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ],
    "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
Question: 'How many people are in this picture?'
Answer: 'There are 12 people in the picture.'
Perfect. Simple people counting was difficult for GPT-4 Turbo Vision, but GPT-4o now gets it right.
That said, this level of task is well within reach of existing object detection models. Running the same image through DETR Object Detection also correctly counts 12 people.
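For reference, a minimal sketch of that DETR baseline using the Hugging Face Transformers object-detection pipeline (my sketch, not the exact code used for the comparison above; assumes transformers, torch, and timm are installed, and the 0.9 confidence threshold is an arbitrary choice):

# Minimal DETR baseline via the Hugging Face object-detection pipeline.
# facebook/detr-resnet-50 is the standard DETR checkpoint; the 0.9
# confidence threshold is an arbitrary choice for this sketch.
from PIL import Image
from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")
image = Image.open("HERE_IS_YOUR_IMAGE_PATH")

results = detector(image)  # list of {"score", "label", "box"} dicts
people = [r for r in results if r["label"] == "person" and r["score"] > 0.9]
print(f"DETR counted {len(people)} people")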
2. How much can it actually see?
Next, using the same image, I ask it to count the people wearing hats. The result is below.
Question: 'How many people in this picture are wearing hats and how many are not?'
Answer: 'In the picture, there are 12 people in total. Out of these, 9 people are wearing hats and 3 people are not wearing hats.'
By eye, the answer is "10 wearing hats, 2 not", but whether the woman at the back right is wearing one is debatable, so I will treat this as correct. The finest details are shaky, but it does seem able to recognize larger objects such as people.
3. Making it do object detection
First, I give it an image and a prompt telling it to behave as an object detector.
import base64
import requests
import json
from PIL import Image

image_path = "HERE_IS_YOUR_IMAGE_PATH"
api_key = "HERE_IS_YOUR_OPENAI_API_KEY"

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Open the image to read its size for the prompt
image = Image.open(image_path)

# Getting the base64 string
base64_image = encode_image(image_path)

template_text = """
You are an AI model that performs object detection. You are trained by MS COCO dataset.
Please perform object detection and return the coordinates of the bounding box in json format.
Only json is required for the answer, please act as an API. Please use the following format.
{
    "detections": [
        {
            "category": "person",
            "bbox": [50, 30, 100, 200],
            "score": 0.98
        },
        {
            "category": "dog",
            "bbox": [200, 150, 120, 180],
            "score": 0.92
        },
        {
            "category": "cat",
            "bbox": [300, 100, 80, 160],
            "score": 0.85
        }
    ]
}
note: bbox format is [x left coordinate, y top coordinates, x right coordinates, y bottom coordinates]. score is confidence.
"""

additional_info = f"This image size is ({image.size[0]}, {image.size[1]})."
prompt_message = template_text + additional_info
print(prompt_message)

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"
}

payload = {
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt_message,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ],
    "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
data = json.loads(response.json()["choices"][0]["message"]["content"])
data
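One caveat: json.loads will fail if the model wraps its reply in Markdown code fences. Two hedged mitigations, neither of which I tested in this post: request strict JSON output via the response_format parameter (valid here because the prompt already contains the word "json"), and strip fences defensively before parsing.

# Optional hardening (untested here): force a pure-JSON reply so the
# content never arrives wrapped in ```json fences. This mode requires
# the word "json" to appear in the prompt, which it does above.
payload["response_format"] = {"type": "json_object"}

# Defensive parse: strip Markdown fences if they still appear
# (str.removeprefix/removesuffix need Python 3.9+).
content = response.json()["choices"][0]["message"]["content"]
data = json.loads(content.strip().removeprefix("```json").removesuffix("```").strip())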
The model returned coordinate data.
{
    "detections": [
        {"category": "person", "bbox": [30, 50, 100, 240], "score": 0.95},
        {"category": "person", "bbox": [120, 40, 190, 260], "score": 0.93},
        {"category": "person", "bbox": [220, 60, 290, 250], "score": 0.91},
        {"category": "person", "bbox": [320, 40, 390, 240], "score": 0.94},
        {"category": "person", "bbox": [420, 50, 490, 250], "score": 0.92},
        {"category": "person", "bbox": [520, 60, 590, 260], "score": 0.9}
    ]
}
Let's draw the boxes.
from PIL import Image, ImageDraw
import matplotlib.pyplot as plt

image = Image.open(image_path)
draw = ImageDraw.Draw(image)

for detection in data['detections']:
    bbox = detection['bbox']
    category = detection['category']
    score = detection['score']
    left, top, right, bottom = bbox
    box = [(left, top), (right, bottom)]
    draw.rectangle(box, outline="red", width=2)
    text = f"{category}: {score:.2f}"
    draw.text((left, top), text, fill="red")

plt.figure(figsize=(10, 10))
plt.imshow(image)
plt.axis('off')
plt.show()
The boxes are not drawn where they should be. Let's try another image.
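As a quick diagnostic (my addition, not part of the original run), one can first check whether the returned boxes are even geometrically plausible for this image before blaming the drawing code:

# Quick diagnostic: flag boxes that fall outside the image bounds or
# have non-positive width/height. (My addition, not in the original run.)
width, height = image.size
for detection in data['detections']:
    left, top, right, bottom = detection['bbox']
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        print("Implausible bbox:", detection)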
The second image did not come out correctly either, so let's try a different approach.
4. Object detection with GPT-4o Vision + few-shot prompting
This time I feed in three images: for the first two I supply the ground-truth coordinates, and I ask the model to produce the result for the third. In other words, a few-shot prompt.
from openai import OpenAI

template_text = """
You are an AI model that performs object detection. You are trained by MS COCO dataset.
Please perform object detection and return the coordinates of the bounding box in json format.
Only json is required for the answer, please act as an API. Please use the following format.
{
    "detections": [
        {
            "category": "person",
            "bbox": [50, 30, 100, 200],
            "score": 0.98
        },
        {
            "category": "dog",
            "bbox": [200, 150, 120, 180],
            "score": 0.92
        },
        {
            "category": "cat",
            "bbox": [300, 100, 80, 160],
            "score": 0.85
        }
    ]
}
note: bbox format is [x left coordinate, y top coordinates, x right coordinates, y bottom coordinates]. score is confidence.
For example, first image's result is below.
{
    "detections": [
        {
            "category": "person",
            "bbox": [280, 44, 218, 346],
            "score": 0.9
        },
        {
            "category": "skis",
            "bbox": [205, 362, 409, 38],
            "score": 0.8
        }
    ]
}
Second image's result is below.
{
    "detections": [
        {
            "category": "sports ball",
            "bbox": [408, 172, 19, 16],
            "score": 0.9
        },
        {
            "category": "person",
            "bbox": [145, 100, 291, 457],
            "score": 0.9
        },
        {
            "category": "person",
            "bbox": [163, 126, 265, 480],
            "score": 0.9
        },
        {
            "category": "baseball glove",
            "bbox": [368, 157, 57, 45],
            "score": 0.8
        }
    ]
}
Third image's result is below.
"""
print(template_text)

client = OpenAI(
    api_key=api_key,
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": template_text,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://images.cocodataset.org/val2017/000000000785.jpg",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://images.cocodataset.org/val2017/000000000872.jpg",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://images.cocodataset.org/val2017/000000001000.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
Coordinate data came back.
{
"detections": [
{ "category": "person", "bbox": [112, 10, 435, 482], "score": 0.98 },
{ "category": "skis", "bbox": [114, 255, 329, 471], "score": 0.95 },
{ "category": "person", "bbox": [150, 108, 366, 478], "score": 0.97 },
{ "category": "sports ball", "bbox": [241, 134, 374, 186], "score": 0.92 },
{ "category": "person", "bbox": [0, 84, 358, 482], "score": 0.99 }
]
}
Let's draw the boxes.
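The reply can be parsed and drawn with the same loop as in Section 3; a small sketch, assuming the third COCO image (000000001000.jpg) has been downloaded locally:

# Parse the few-shot reply and reuse the Section 3 drawing loop.
# Assumes 000000001000.jpg was saved locally beforehand.
import json
from PIL import Image, ImageDraw

data = json.loads(response.choices[0].message.content)

image = Image.open("000000001000.jpg")
draw = ImageDraw.Draw(image)
for detection in data["detections"]:
    # Caveat: the few-shot examples in the prompt are actually in COCO
    # [x, y, width, height] format, so reading the reply as
    # [left, top, right, bottom] here is itself an assumption.
    left, top, right, bottom = detection["bbox"]
    draw.rectangle([(left, top), (right, bottom)], outline="red", width=2)
    draw.text((left, top), f"{detection['category']}: {detection['score']:.2f}", fill="red")
image.show()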
The model clearly seems to know how many people there are, but outputting where they are as coordinate data is still difficult for it.
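To put a number on how far off the boxes are, one could compute IoU against the COCO ground truth. A minimal sketch, where the ground-truth box is a placeholder that would really come from the val2017 annotations:

# Minimal IoU helper for scoring a predicted box against ground truth.
# Boxes are [left, top, right, bottom]; the GT box below is a placeholder.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou([112, 10, 435, 482], [100, 20, 400, 470]))  # placeholder GT box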
Perhaps with better prompting or fine-tuning the results could come out correctly. I will keep investigating.