Grounded Segment Anythingを動かしてみた

Posted at 2024-11-23

まえがき

自然言語でSegmentationをする方法はないものかと調べたところ、Grounded-Segment-Anythingというものを見つけました。

面白そうなので動かしてみようと思います。

この記事では

Grounded-Segment-Anythingを動かすための環境構築
Grounded-Segment-Anythingの動かし方

についてまとめようと思います。

Grounded-Segment-Anythingとは

自然言語で画像のSegmentationをするモデルのようです。
基本はGrounding DINOとSegment Anythingを上手く組み合わせて実現しているようです。
これ一つ動かすだけでGrounding DINOとSegment Anythingの両方を学べそうです。

Grounding DINOとは

テキストプロンプトを使って任意の物体を検出できる物体検出モデルです。

例えば、「青いリンゴ」を検出する場合、通常の物体検出モデルでは「青いリンゴ」を学習する必要があります。「リンゴ」を学習していたとしても、それが「青いリンゴ」なのか「赤いリンゴ」なのか区別できません。区別するためには「青いリンゴ」と「赤いリンゴ」を学習している必要があります。
Grounding DINOはそういった学習の必要がなく「青いリンゴ」と「赤いリンゴ」を区別して検出できるようです。

Segment Anthing(SAM)とは

Meta社が開発したゼロショットのセグメンテーションモデルです。

任意のプロンプトで物体のセグメンテーションができます。
任意のプロンプトとは前景/背景の点、BoundingBox、セグメンテーションマスク、テキストプロンプトなど、何をセグメンテーションしたいのかの情報を指します。
(テキストプロンプトも受け付けているならSAMだけで充分では…？)

環境構築

Dockerによる環境構築の仕組みが提供されているようですが、makeが必要だったりと面倒なので、自前でDockerコンテナを作成して環境を作ります。

使用したDockerイメージ
- nvidia/cuda:12.6.2-cudnn-devel-ubuntu20.04

nvccが必要になるのでruntimeの方は使えませんでした。

pythonとpytorchのインストール

$ apt update
$ apt upgrade -y
$ apt install -y python3 python3-pip
$ pip3 install torch torchvision torchaudio

モデルのダウンロードに必要なのでwgetもインストールしておきます。

$ apt install -y wget

opencvを動かすためのおまじない

$ apt install -y libgl1-mesa-dev libglib2.0-0

Grounded Segment Anythingのclone

$ git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git
$ cd Grounded-Segment-Anything

環境変数の設定
```
$ export AM_I_DOCKER=False
$ export BUILD_WITH_CUDA=True
$ export CUDA_HOME=/usr/local/cuda/
```
CUDA_HOMEは自分の環境に合わせて書き換える必要があります。nvccの場所を調べて設定してください。
```
$ which nvcc
/usr/local/cuda/bin/nvcc
```
Segment Anythingのインストール
```
$ pip3 install -e segment_anything
```

Grounding DINOのインストール

$ pip3 install --no-build-isolation -e GroundingDINO

Diffusersのインストール

$ pip3 install --upgrade diffusers[torch]

その他必要なパッケージのインストール

$ pip3 install opencv-python pycocotools matplotlib onnxruntime onnx ipykernel

Grounded Segment Anythingを動かす

GroundingDinoを動かす

学習済みのモデルをダウンロードする

$ cd Grounded-Segment-Anything
$ wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth

デモスクリプトを実行する

$ python3 grounding_dino_demo.py

僕の場合このようなエラーが出ました。

Traceback (most recent call last):
  File "grounding_dino_demo.py", line 30, in <module>
    annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
  File "/workspace/Grounded-Segment-Anything/GroundingDINO/groundingdino/util/inference.py", line 102, in annotate
    annotated_frame = box_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
  File "/usr/local/lib/python3.8/dist-packages/supervision/utils/conversion.py", line 23, in wrapper
    return annotate_func(self, scene, *args, **kwargs)
TypeError: annotate() got an unexpected keyword argument 'labels'

調べるとsupervisionのバージョンを変更すれが解決するようです。
https://github.com/IDEA-Research/Grounded-Segment-Anything/issues/515
ということで変更します。

$ pip3 install supervision==0.21.0

再度デモを実行するとannotated_image.jpgが出力されました。

以下のように変更して白馬だけ検出できるか試してみます。

grounding_dino_demo.py

# TEXT_PROMPT = "Horse. Clouds. Grasses. Sky. Hill."
TEXT_PROMPT = "White Horse."

白馬以外の馬も検出していますが、スコアは白馬が一番高くなっています。
以下のように閾値も変更してみます。

python:grounding_dino_demo.py

# BOX_TRESHOLD = 0.35
BOX_TRESHOLD = 0.69

白馬だけ検出出来ました！

Grounded-SAMを動かす

学習済みモデルをダウンロードする

$ cd Grounded-Segment-Anything
$ wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

デモスクリプトを実行する

$ python3 grounded_sam_demo.py \
  --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --grounded_checkpoint groundingdino_swint_ogc.pth \
  --sam_checkpoint sam_vit_h_4b8939.pth \
  --input_image assets/demo1.jpg \
  --output_dir "outputs" \
  --box_threshold 0.3 \
  --text_threshold 0.25 \
  --text_prompt "bear" \
  --device "cuda"

outputsディレクトリが作成され、その中に結果が出力されます。

先ほどと同じように白馬だけ検出してみます。

$ python3 grounded_sam_demo.py \
  --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --grounded_checkpoint groundingdino_swint_ogc.pth \
  --sam_checkpoint sam_vit_h_4b8939.pth \
  --input_image assets/demo7.jpg \
  --output_dir "outputs" \
  --box_threshold 0.69 \
  --text_threshold 0.25 \
  --text_prompt "White Horse." \
  --device "cuda"

問題なく白馬だけセグメンテーション出来ました！

Grounded-SAM with Inpaintingを動かす

※ 必要な学習済みモデルはダウンロード済みだと思うので省略します。

python3 grounded_sam_inpainting_demo.py \
  --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --grounded_checkpoint groundingdino_swint_ogc.pth \
  --sam_checkpoint sam_vit_h_4b8939.pth \
  --input_image assets/inpaint_demo.jpg \
  --output_dir "outputs" \
  --box_threshold 0.3 \
  --text_threshold 0.25 \
  --det_prompt "bench" \
  --inpaint_prompt "A sofa, high quality, detailed" \
  --device "cuda"

Grounded-SAMと同様にoutputsディレクトリが作成され、その中に結果が出力されます。

入力	GroundingDinoの出力	Inpainting結果

ベンチがソファに変わりました。
犬の尻尾が増えているのはご愛敬。

白馬を普通の馬に変えてみます。

python3 grounded_sam_inpainting_demo.py \
  --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --grounded_checkpoint groundingdino_swint_ogc.pth \
  --sam_checkpoint sam_vit_h_4b8939.pth \
  --input_image assets/inpaint_demo.jpg \
  --output_dir "outputs" \
  --box_threshold 0.69 \
  --text_threshold 0.25 \
  --det_prompt "White Horse." \
  --inpaint_prompt "A Horse, high quality, detailed" \
  --device "cuda"

入力	GroundingDinoの出力	Inpainting結果

周りと同じような見た目になりましたが、ちょっと崩れてしまっています。
diffusersのInpainting機能を使っているようなのでそのあたりを調べればもう少しましな結果を作れるかもしれません。

おわりに

本記事では

Grounded Segment Anythingの概要
Grounded Segment Anythingの動作環境の構築
Grounded Setment ANythingの動かし方

について紹介しました。

最後のInpaintingについてはもう少し品質を上げる方法を探したいと思います。
SAMはテキストプロンプトも受け付けているようなので、そのあたりも調査してみたいです。

参考文献

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up