Notes on Fine-Tuning Qwen 2.5 VL for Grounding Tasks

Posted at 2025-07-12

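Introduction

Qwen 2.5 VL is a powerful Vision-Language Model (VLM) developed by Alibaba. It handles a wide range of tasks, including image and video understanding, object localization (grounding), and document parsing. This article summarizes what to watch out for when fine-tuning it on grounding tasks (locating objects in an image with bounding boxes), along with key points for data preprocessing and augmentation.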

1. Data preprocessing notes for grounding tasks

1.1 Converting bounding box (bbox) coordinates

Qwen 2.5 VL expects bboxes in absolute pixel coordinates of the image after resizing.
If the source data is in [x_1, y_1, w, h] format, first convert it to [x_1, y_1, x_2, y_2], then scale it using the official resize function.
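
The first step above, converting [x_1, y_1, w, h] to corner format, can be sketched as follows. The helper name xywh_to_xyxy is hypothetical (it is not taken from the official notebook), and it assumes that x_1, y_1 is the top-left corner, as in COCO-style annotations.

```python
def xywh_to_xyxy(bbox):
    """Convert [x1, y1, w, h] (top-left corner plus size) to [x1, y1, x2, y2]."""
    x1, y1, w, h = bbox
    return [x1, y1, x1 + w, y1 + h]


converted = xywh_to_xyxy([100, 120, 200, 240])
print(converted)  # [100, 120, 300, 360]
```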

Official bbox conversion notebook:
process_bbox.ipynb

Official resize function (example Python implementation)

import math

def smart_resize(
    height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
):
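    """
    Rescale the image so that:
    1. Both height and width are multiples of factor.
    2. The total pixel count is within [min_pixels, max_pixels].
    3. The aspect ratio is preserved as closely as possible.
    """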
    if height < factor or width < factor:
        raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
    elif max(height, width) / min(height, width) > 200:
        raise ValueError(
            f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
        )
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

def convert_to_qwen25vl_format(bbox, orig_height, orig_width, factor=28, min_pixels=56*56, max_pixels=14*14*4*1280):
    new_height, new_width = smart_resize(orig_height, orig_width, factor, min_pixels, max_pixels)
    scale_w = new_width / orig_width
    scale_h = new_height / orig_height

    x1, y1, x2, y2 = bbox
    x1_new = round(x1 * scale_w)
    y1_new = round(y1 * scale_h)
    x2_new = round(x2 * scale_w)
    y2_new = round(y2 * scale_h)

    x1_new = max(0, min(x1_new, new_width - 1))
    y1_new = max(0, min(y1_new, new_height - 1))
    x2_new = max(0, min(x2_new, new_width - 1))
    y2_new = max(0, min(y2_new, new_height - 1))

    return [x1_new, y1_new, x2_new, y2_new]

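Key points:

Qwen 2.5 VL automatically resizes input images before they reach the vision encoder.
Unless the same scaling is applied to the bboxes, the coordinates will be off at inference time.
Always follow the official conversion procedure.
The official config supports very large images (3584×3584).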
preprocessor_config.json
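
To make the scaling concrete, here is a worked example for a 640×480 image. It is a simplified sketch: it assumes the pixel count already lies within [min_pixels, max_pixels], so only the round-to-a-multiple-of-28 step of smart_resize applies. The helper names rounded_size and scale_bbox are hypothetical.

```python
FACTOR = 28  # same default factor as smart_resize above


def rounded_size(height, width, factor=FACTOR):
    """Round each side to the nearest multiple of factor (no pixel-count clamping)."""
    return round(height / factor) * factor, round(width / factor) * factor


def scale_bbox(bbox, orig_height, orig_width):
    """Scale an [x1, y1, x2, y2] bbox from the original image to the rounded size."""
    new_height, new_width = rounded_size(orig_height, orig_width)
    scale_w = new_width / orig_width
    scale_h = new_height / orig_height
    x1, y1, x2, y2 = bbox
    return [
        round(x1 * scale_w),
        round(y1 * scale_h),
        round(x2 * scale_w),
        round(y2 * scale_h),
    ]


new_size = rounded_size(480, 640)
scaled = scale_bbox([100, 120, 300, 360], orig_height=480, orig_width=640)
print(new_size)  # (476, 644)
print(scaled)    # [101, 119, 302, 357]
```

Even for an ordinary image the resized dimensions differ from the original by a few pixels, so the bbox shifts slightly; skipping the conversion bakes that offset into every training label.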

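2. Augmentation

Once you know that preprocessing aligns the data to a 28×28 grid, you can also augment the data.
For example, with a library such as Albumentations, you can augment detection data together with its bboxes and then pad with the divisor set to 28. This lets you train with bboxes that are just as accurate as above.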
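
The idea can be illustrated without any library. The sketch below is not the Albumentations API; it uses hypothetical helpers to show a horizontal flip applied to an [x1, y1, x2, y2] bbox, followed by the padded size when each side is rounded up to a multiple of 28. It assumes padding is added only on the right and bottom, so the padding itself leaves bbox coordinates unchanged (a centered pad would shift them).

```python
def hflip_bbox(bbox, width):
    """Mirror an [x1, y1, x2, y2] bbox horizontally within an image of the given width."""
    x1, y1, x2, y2 = bbox
    return [width - x2, y1, width - x1, y2]


def padded_size(height, width, divisor=28):
    """Round each side up to the next multiple of divisor."""
    pad_h = -(-height // divisor) * divisor  # integer ceiling division
    pad_w = -(-width // divisor) * divisor
    return pad_h, pad_w


flipped = hflip_bbox([100, 120, 300, 360], width=640)
size = padded_size(480, 640)
print(flipped)  # [340, 120, 540, 360]
print(size)     # (504, 644)
```

Because both padded sides are already multiples of 28, the rounding step of smart_resize leaves the size unchanged as long as the pixel count stays within [min_pixels, max_pixels], so the augmented bboxes need no further scaling.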

3. Example dataset format for SFT

For Qwen2.5-VL SFT (Supervised Fine-Tuning), the recommended dataset format is JSONL, with records like the following.

{
  "image": "demo/COCO_train2014_000000580957.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nLocate house in this image and output the bbox coordinates in JSON format."
    },
    {
      "from": "gpt",
      "value": "{\n\"bbox_2d\": [135, 114, 1016, 672]\n}"
    }
  ]
}

Refer to the official grounding example:
Qwen2.5-VL-finetune/README.md
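
A record in this format could be generated as below. This is a minimal sketch: the function name make_grounding_record is hypothetical, and the prompt wording simply mirrors the example above.

```python
import json


def make_grounding_record(image_path, label, bbox):
    """Build one SFT record; bbox is [x1, y1, x2, y2] in resized-image coordinates."""
    # The answer is itself a small JSON string, matching the example record.
    answer = "{\n\"bbox_2d\": " + json.dumps(bbox) + "\n}"
    return {
        "image": image_path,
        "conversations": [
            {
                "from": "human",
                "value": f"<image>\nLocate {label} in this image and output the bbox coordinates in JSON format.",
            },
            {"from": "gpt", "value": answer},
        ],
    }


record = make_grounding_record(
    "demo/COCO_train2014_000000580957.jpg", "house", [135, 114, 1016, 672]
)
# json.dumps escapes the embedded newlines, so the record fits on one jsonl line.
line = json.dumps(record, ensure_ascii=False)
```

Writing each such line followed by a newline to a file yields a JSONL dataset with one record per line.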

4. References and official resources

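Summary

For grounding with Qwen 2.5 VL, bboxes must be given in absolute coordinates of the resized image.
Always use the official resize function and conversion procedure.
When augmenting, account for bbox scaling as well.
Follow the official example for the SFT dataset format.

Questions and corrections are welcome in the comments!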
