はじめに
Gemini 3 Flash に Agentic Vision の機能が追加されました。
これは、視覚的な推論とコード実行を組み合わせ、視覚的な証拠に基づいて回答を導き出す機能となります。
この機能を利用すると、ほとんどのビジョンベンチマークで一貫して 5 ~ 10% の品質向上が実現しています。
Agentic Vision は、画像理解タスクにエージェントの「考える(Think)、行動する(Act)、観察する(Observe)」ループを導入しています。
- Think
1.モデルはユーザーのクエリと初期イメージを分析し、複数ステップのプランを策定 - Act
2. モデルは、画像を分析 するための Python コードを生成して実行 - Observe
3. 変換された画像はモデルのコンテキストウィンドウに追加され、モデルは最終的なレスポンスを生成する前に、より適切なコンテキストで新しいデータを検査
https://blog.google/innovation-and-ai/technology/developers-tools/agentic-vision-gemini-3-flash/
ユースケースとして以下が挙げられています。
| ユースケース | 内容の詳細 |
|---|---|
| ズームして検査する | 詳細が小さすぎて認識できない場合、モデルが自律的に判断して特定領域をクロップ(切り抜き) |
| ビジュアル数学 | 画像内の視覚情報に基づき、コードを実行して多段階の計算を行う |
| 画像アノテーション | 画像上に視覚的な印を付けることで、質問に対する回答や説明を補足 |
今回は Agentic Vision を試してみました!
実行
コード
Agentic Vision を使用するには、tools で ToolCodeExecution を有効にします。
import PIL.Image
import io
from google import genai
from google.genai import types
with open('picture.png', 'rb') as f:
image_bytes = f.read()
client = genai.Client()
image = types.Part.from_bytes(
data=image_bytes,
mime_type="image/png",
)
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents=[image, "指示文"],
config=types.GenerateContentConfig(
tools=[types.Tool(code_execution=types.ToolCodeExecution)]
),
)
for part in response.candidates[0].content.parts:
if part.text is not None:
print(part.text)
if part.executable_code is not None:
print(part.executable_code.code)
if part.code_execution_result is not None:
print(part.code_execution_result.output)
if part.as_image() is not None:
# display() is a standard function in Jupyter/Colab notebooks
display(PIL.Image.open(io.BytesIO(part.as_image().image_bytes)))
対象画像
架空の会社の売上帳票の画像を使用します。
ズームして検査する
2月第1週を切り取ってください
実際にコードを生成して実行して、何回か繰り返して目的の画像を生成しています。
生成されたコード
import cv2
import numpy as np
from PIL import Image
# Load the image to get its dimensions
img = cv2.imread('input_file_0.png')
height, width, _ = img.shape
# Define the area for the header and the first week (Feb 1 to Feb 7)
# Based on the image structure, it's a table.
# Header is at the top.
# Rows are roughly equal height.
# Let's try to detect rows.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Find horizontal lines to separate rows
# Actually, it's better to just estimate based on the image content.
# The table starts from the top.
# Header row height is roughly 40-50 pixels.
# Data rows are also roughly 35-40 pixels.
# Let's perform some detection or just manual cropping if the layout is clear.
# Header row is the first row.
# Feb 1 to Feb 7 are the next 7 rows.
# Looking at the image, there are 28 days in Feb 2026.
# Header row + 28 rows.
# Total height is around 1000 pixels.
# Header height ~ 45 pixels.
# Row height ~ (1000 - 45) / 28 ~ 34 pixels.
# Let's try to more accurately find row boundaries.
edges = cv2.Canny(gray, 50, 150, apertureSize=3)
lines = cv2.HoughLinesP(edges, 1, np.pi/180, 100, minLineLength=width*0.5, maxLineGap=10)
y_coords = []
if lines is not None:
for line in lines:
x1, y1, x2, y2 = line[0]
if abs(y1 - y2) < 5: # horizontal line
y_coords.append(y1)
y_coords = sorted(list(set(y_coords)))
# Clean up y_coords - keep those that are at least some distance apart
distinct_y = []
if y_coords:
distinct_y.append(y_coords[0])
for y in y_coords[1:]:
if y - distinct_y[-1] > 20:
distinct_y.append(y)
# Let's print distinct_y to see the row boundaries
print(f"Distinct horizontal lines: {distinct_y}")
# Based on distinct_y, we can find the header and the first 7 rows.
# The header is above distinct_y[0] if distinct_y[0] is the first separator.
# Actually, the first line is usually the top of the header.
# Let's look at the image. The very top has a header.
# Header row seems to be the area above the first data row.
# In the image, the header has a dark blue background.
# Let's just crop based on detected lines.
# If distinct_y[0] is the top of the table, then:
# row 1 (header): distinct_y[0] to distinct_y[1]
# row 2 (Feb 1): distinct_y[1] to distinct_y[2]
# ...
# row 8 (Feb 7): distinct_y[7] to distinct_y[8]
# Wait, the first row of data is Feb 1st.
# Let's refine the y_coords.
# I'll just use the first 8-9 horizontal lines if they look correct.
# Actually, the table starts from the very top? No, there is some margin.
# Let's re-examine.
# Top blue bar starts at y=0 or close to it.
# Let's find the header.
# Header has blue color.
# Sum of intensities might show where rows are.
row_sums = np.sum(gray, axis=1) / width
# Look for local minima in row_sums to find lines between rows
# (assuming lines are darker or different)
# Or just use the Hough lines.
# Crop: Header + Feb 1-7
# If distinct_y is correct:
# First week would be from distinct_y[0] to distinct_y[8]
# Let's verify with an image save.
crop_y_start = distinct_y[0] if len(distinct_y) > 0 else 0
crop_y_end = distinct_y[8] if len(distinct_y) > 8 else height
# Note: header is usually between distinct_y[0] and distinct_y[1]
# data rows follow.
# Let's also find the left and right boundaries of the table.
# Sum along columns.
col_sums = np.sum(gray, axis=0) / height
# Look for where the table starts and ends horizontally.
# Usually there's a margin.
left = 0
right = width
# Find non-white areas
threshold = 250
is_not_white = (gray < threshold).any(axis=0)
if is_not_white.any():
left = np.argmax(is_not_white)
right = width - np.argmax(is_not_white[::-1])
# Just to be safe, I'll take a bit wider crop vertically first to see if I got the right rows.
# Actually, let's just use the indices.
# Row 0: Header
# Row 1: Feb 1
# Row 2: Feb 2
# Row 3: Feb 3
# Row 4: Feb 4
# Row 5: Feb 5
# Row 6: Feb 6
# Row 7: Feb 7
# We need these 8 rows.
# Re-inspecting distinct_y:
# They might not be exactly row separators if some lines are missing.
# Let's try to detect the blue header.
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
lower_blue = np.array([100, 50, 50])
upper_blue = np.array([130, 255, 255])
mask = cv2.inRange(hsv, lower_blue, upper_blue)
blue_rows = np.where(np.any(mask, axis=1))[0]
if len(blue_rows) > 0:
header_top = blue_rows[0]
header_bottom = blue_rows[-1]
# Header height
h_height = header_bottom - header_top
# Estimating row height based on Feb 1-28
# There are 28 rows of data.
# Let's find the bottom of the table.
# The table has a border.
# Let's just crop based on the found blue header and then 7 rows.
# Average data row height can be estimated.
# Or just use the horizontal lines again.
# Let's try a simple crop first.
# From the image, the header is at the top.
# Total image height is 1021.
# Header ends around y=44.
# Feb 1 ends around y=78.
# Feb 2 ends around y=112.
# So row height is ~34 pixels.
# Header height is ~44 pixels.
# Crop for header + 7 days: 44 + 7 * 34 = 44 + 238 = 282.
# Let's use the detected distinct_y if it makes sense.
# I will output the crop.
# I will use distinct_y[0] to distinct_y[8] if they exist and seem reasonable.
if len(distinct_y) >= 9:
crop_img = img[distinct_y[0]:distinct_y[8]+2, :, :]
cv2.imwrite('first_week.png', crop_img)
else:
# fallback to estimated
cv2.imwrite('first_week.png', img[0:300, :, :])
Distinct horizontal lines: [np.int32(13), np.int32(60), np.int32(1327)]
# Let's check the distinct_y values and refine.
print(distinct_y)
# I'll also try to detect more lines by lowering the threshold.
edges = cv2.Canny(gray, 30, 100, apertureSize=3)
lines = cv2.HoughLinesP(edges, 1, np.pi/180, 50, minLineLength=width*0.3, maxLineGap=20)
y_coords = []
if lines is not None:
for line in lines:
x1, y1, x2, y2 = line[0]
if abs(y1 - y2) < 3:
y_coords.append(y1)
y_coords = sorted(list(set(y_coords)))
distinct_y = []
if y_coords:
distinct_y.append(y_coords[0])
for y in y_coords[1:]:
if y - distinct_y[-1] > 25: # Row height is ~34
distinct_y.append(y)
print(f"Refined distinct horizontal lines: {distinct_y}")
# We need Header + 7 rows.
# If distinct_y[0] is top of header:
# Header: [0, 1]
# Feb 1: [1, 2]
# Feb 2: [2, 3]
# Feb 3: [3, 4]
# Feb 4: [4, 5]
# Feb 5: [5, 6]
# Feb 6: [6, 7]
# Feb 7: [7, 8]
# So we need up to distinct_y[8].
# Wait, if Feb 5 was the last one, maybe I need up to distinct_y[10]?
# Let's just crop a larger area to be sure and then refine if needed.
# Or better, just find all rows and take the first 8 (header + 7 data).
# Let's use a more robust way to find row separators: horizontal projections of edges.
h_proj = np.sum(edges, axis=1)
# peaks in h_proj are lines
from scipy.signal import find_peaks
peaks, _ = find_peaks(h_proj, height=1000, distance=25)
print(f"Peaks: {peaks}")
# If the first peak is the top border:
# peak 0: top border
# peak 1: bottom of header
# peak 2: bottom of Feb 1
# ...
# peak 8: bottom of Feb 7
if len(peaks) >= 9:
final_y = peaks[8] + 2
crop_img = img[peaks[0]:final_y, :, :]
cv2.imwrite('first_week_refined.png', crop_img)
else:
# fallback
cv2.imwrite('first_week_refined.png', img[0:400, :, :])
[np.int32(13), np.int32(60), np.int32(1327)]
Refined distinct horizontal lines: [np.int32(13), np.int32(60), np.int32(1327)]
Peaks: [ 13 60 87 120 165 212 256 301 346 392 437 482 528 573
618 664 709 754 800 845 893 936 981 1027 1072 1117 1163 1208
1253 1299 1329]
# Recalculate and crop.
# Let's see the peaks list from the previous run.
# Refined distinct horizontal lines: [9, 43, 77, 111, 145, 179, 213, 247, 281, 315, 349, ...]
# If [0]=9, [1]=43, [2]=77, [3]=111, [4]=145, [5]=179, [6]=213, [7]=247, [8]=281, [9]=315
# Row 1 (Feb 1): 43 to 77
# Row 2 (Feb 2): 77 to 111
# Row 3 (Feb 3): 111 to 145
# Row 4 (Feb 4): 145 to 179
# Row 5 (Feb 5): 179 to 213
# Row 6 (Feb 6): 213 to 247
# Row 7 (Feb 7): 247 to 281
# Wait, if row 7 is 247 to 281, then peaks[8]=281 should be the bottom of Feb 7.
# Let's re-examine first_week_refined.png.
# It ends at Feb 6. Oh, Feb 1 is the second row of data? No.
# Header is row 0. Feb 1 is row 1.
# Row 0: 9 to 43 (Header)
# Row 1: 43 to 77 (Feb 1)
# Row 2: 77 to 111 (Feb 2)
# Row 3: 111 to 145 (Feb 3)
# Row 4: 145 to 179 (Feb 4)
# Row 5: 179 to 213 (Feb 5)
# Row 6: 213 to 247 (Feb 6)
# Row 7: 247 to 281 (Feb 7)
# If first_week_refined.png used peaks[8], it should have included up to 281.
# Let's check why Feb 7 might be missing.
# Ah, maybe Row 1 starts from index 1.
# Let's just crop up to peaks[9] or peaks[8]+34.
# Let's use peaks[9] to be safe.
final_y = peaks[9] + 2 if len(peaks) > 9 else 320
crop_img = img[peaks[0]-2:final_y, :, :]
cv2.imwrite('feb_first_week.png', crop_img)
None
# Final crop up to peaks[8]
final_y = peaks[8] + 2
crop_img = img[peaks[0]:final_y, :, :]
cv2.imwrite('february_first_week.png', crop_img)
None
# Load feb_first_week.png and trim it a bit
img_fw = cv2.imread('feb_first_week.png')
# The row height is about 34.
# Feb 8 was just starting.
h_fw, w_fw, _ = img_fw.shape
# Crop the last 30 pixels or so to remove Feb 8
final_crop = img_fw[0:h_fw-34, :, :]
cv2.imwrite('february_first_week_final.png', final_crop)
None
2月の第1週(2月1日から2月7日まで)を切り取った画像を作成しました。以下のリンクから確認できます。
[2月第1週の切り取り画像 (february_first_week_final.png)](sandbox:/mnt/data/february_first_week_final.png)
指示に従った画像を切り取ってくれました!
ビジュアル数学
製品Aと製品Bの折れ線グラフを作成して
ご提供いただいた画像から製品Aと製品Bの売上データを抽出し、折れ線グラフを作成しました。
**グラフの概要:**
- **青色の線:** 製品Aの売上推移
- **オレンジ色の線:** 製品Bの売上推移
- **横軸:** 日付(2026年2月1日〜2月28日)
- **縦軸:** 売上金額(単位:円)
全体的に製品Aの売上が製品Bを上回っており、両製品ともに2月中旬から下旬にかけて売上が大きく変動している様子が確認できます。特に2月21日には製品Aが最高売上(220,000円)を記録しています。
生成されたコードを見ると、pandas や matplotlib を使ってグラフを作成していました!
画像アノテーション
金曜日の行のみ赤枠で囲った画像を出力して
生成されたコード
import cv2
import numpy as np
# 画像の読み込み
img = cv2.imread('input_file_0.png')
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
# 金曜日の「金」ラベルの背景色(緑色)を特定するためのHSV範囲を設定
# 抽出した色に基づき、色相(H)が約78、彩度(S)が約124、明度(V)が約203付近をターゲットにします
lower_green = np.array([73, 100, 150])
upper_green = np.array([83, 255, 255])
# マスクを作成して「金」のラベル部分を抽出
mask = cv2.inRange(hsv, lower_green, upper_green)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
output_img = img.copy()
for cnt in contours:
area = cv2.contourArea(cnt)
# ラベルのサイズ(面積)と位置(曜日列のx座標範囲)でフィルタリング
if 500 < area < 3000:
x, y, w, h = cv2.boundingRect(cnt)
if 120 < x < 200:
# 見つかった金曜日の行全体を赤枠で囲む
row_y1 = y - 8
row_y2 = y + h + 8
cv2.rectangle(output_img, (5, row_y1), (img.shape[1] - 5, row_y2), (0, 0, 255), 3)
# 結果の保存
cv2.imwrite('output.png', output_img)
金曜日の行のみアノテーションを付けてくれました!
おわりに
Gemini 3 Flash の Agentic Vision を試してみました。
視覚的な推論とコード実行により、「細かい部分を拡大して確認する」「表データをグラフに直す」といった工程を、モデル自らが判断して実行している様子を確認できました。
これまで OCR や画像解析で苦労していた、あと一歩届かなかった精度を Agentic Vision で改善できる可能性を感じました!




