Trying out showlab/ShowUI

Last updated at 2025-06-07 / Posted at 2025-06-01

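Overview

I installed showlab/ShowUI and tried "operating a PC with natural language (via an LLM)".
Doing this on my everyday PC would be risky, so I bought a mini PC and installed Ubuntu Desktop on it.

I regret not buying a somewhat better PC.
Just using ChatGPT in the browser drives CPU usage up, and the output is quite slow.

Figure: CPU usage while ChatGPT is displaying AI-generated text

(Even just writing this Qiita article makes the machine stutter.)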

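Brief explanation

It works as follows.

With the X11 display server, the PC can be operated through commands.
A screen capture is sent to the AI, which returns the commands to execute.
A program executes those commands.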
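The three steps above can be sketched as a single cycle. This is a minimal illustration, not the code used later in this article: `run_once`, `capture`, `ask_model`, and `execute` are hypothetical names, and the three callables are passed in so the flow can be tried without a display or an API key.

```python
from typing import Callable, Dict, List

Action = Dict[str, object]


def run_once(
    capture: Callable[[], str],
    ask_model: Callable[[str, str], List[Action]],
    execute: Callable[[Action], None],
    instruction: str,
) -> int:
    """Run one capture -> ask -> execute cycle; return the number of actions run."""
    screenshot_path = capture()                        # step 1: grab the current screen
    actions = ask_model(instruction, screenshot_path)  # step 2: ask the AI what to do
    for action in actions:                             # step 3: run each returned command
        execute(action)
    return len(actions)


if __name__ == "__main__":
    # Stub callables stand in for the real screenshot, API call, and xdotool.
    executed: List[Action] = []
    count = run_once(
        capture=lambda: "screenshot.png",
        ask_model=lambda text, path: [{"type": "KEY", "key": "Return"}],
        execute=executed.append,
        instruction="Press Enter",
    )
    print(count, executed)
```

Keeping the three steps as separate callables makes it possible to swap in a dry-run `execute` while checking what the model returns.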

Development steps

This guide assumes Python is managed with pyenv and libraries with pipenv. If you do not use them, adjust the steps accordingly.

1. Installing Ubuntu

1-1. Installing Ubuntu on the mini PC

Install Ubuntu on the mini PC.

Notes

Select "Try or Install Ubuntu"
On the "Welcome to Ubuntu" screen, skip the installer update
Select "Install Ubuntu"

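1-2. Changing the display server

To operate the PC from a program, the display server must be X11.
Set it up with the following steps.

Reboot Ubuntu
Click your user name
Click the gear icon at the bottom right and select "Ubuntu on Xorg"

Run the following in a terminal; if it prints x11, you are set.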

echo $XDG_SESSION_TYPE
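The same check can also be made from Python before any xdotool command is sent. This is a minimal sketch, assuming the session type is exposed through the XDG_SESSION_TYPE environment variable as in the command above; `is_x11_session` is a hypothetical helper name, not part of ShowUI.

```python
import os
from typing import Mapping, Optional


def is_x11_session(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when XDG_SESSION_TYPE says the session runs on X11."""
    env = os.environ if env is None else env
    return env.get("XDG_SESSION_TYPE", "").lower() == "x11"


if __name__ == "__main__":
    if is_x11_session():
        print("X11 session detected.")
    else:
        print("Not an X11 session; select 'Ubuntu on Xorg' at login.")
```

Passing the environment as an argument keeps the function easy to test without changing the real session.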

1-3. Key mapping settings

Set this up however you like!

2. Python setup

2-1.

sudo apt update
sudo apt install git

2-2. Python setup

Any setup you like is fine here.
Below is one way to install pyenv, for reference.

Run the command

sudo apt install -y \
curl \
build-essential \
libssl-dev \
zlib1g-dev \
libreadline-dev \
libbz2-dev \
libsqlite3-dev \
libncurses5-dev \
libncursesw5-dev \
libffi-dev \
liblzma-dev \
libgdbm-dev \
tk-dev

git clone https://github.com/anyenv/anyenv ~/.anyenv

Append to ~/.bashrc

export ANYENV_ROOT="$HOME/.anyenv"
export PATH="$ANYENV_ROOT/bin:$PATH"
eval "$(anyenv init -)"

Run the commands

anyenv install --init
source ~/.bashrc
mkdir -p $(anyenv root)/plugins
git clone https://github.com/znz/anyenv-update.git $(anyenv root)/plugins/anyenv-update
anyenv update
anyenv install pyenv
source ~/.bashrc
pyenv install 3.12.0

3. Installing showlab/ShowUI

3-1.

If you use pipenv,
append to ~/.bashrc

export PIPENV_VENV_IN_PROJECT=1

3-2.

cd <directory of your choice>
git clone https://github.com/showlab/ShowUI.git
cd ShowUI

If you use pyenv,
run the commands

pyenv local 3.12.0
pip install --upgrade pip
pip install pipenv
pipenv install -r requirements.txt

If you use pyenv and pipenv,
add to .gitignore

.venv
.python-version

4. Code changes and automation

By default, ShowUI is configured to use a local AI model.
We change this to use the OpenAI API.

4-1. Installing python-dotenv

If you use pipenv,
run the command

pipenv install python-dotenv

Add to .gitignore

.env

4-2. Creating .env

.env

OPENAI_API_KEY=<your OpenAI API key>

4-3. Creating a folder for saving images

mkdir images

images/.gitignore

*
!.gitignore

4-4. Implementing the automation

Use OpenAI
Save image files to the images folder
Operate the PC based on the AI's response

Rewrite app.py

import os
import base64
import json
import subprocess
import time
from datetime import datetime
import gradio as gr
import numpy as np
from PIL import Image, ImageDraw
from dotenv import load_dotenv

# ─────────────────────────────────────────────────────────
# Step 0: Create the "images" folder if it does not exist
# ─────────────────────────────────────────────────────────
ROOT_DIR = os.path.dirname(__file__)
IMAGES_DIR = os.path.join(ROOT_DIR, "images")
os.makedirs(IMAGES_DIR, exist_ok=True)

# ─────────────────────────────────────────────────────────
# Step 1: Load .env (needed when using the API key)
# ─────────────────────────────────────────────────────────
load_dotenv(dotenv_path=os.path.join(ROOT_DIR, ".env"))

# Import ShowUIProvider from api.py
from api import ShowUIProvider

# Constants and description text (edit as needed)
DESCRIPTION = "[ShowUI Demo](https://huggingface.co/showlab/ShowUI-2B)"
_SYSTEM = (
    "Based on the screenshot of the page, I give a text description "
    "and you give its corresponding location. The coordinate represents "
    "a clickable location [x, y] for an element, which is a relative coordinate "
    "on the screenshot, scaled from 0 to 1."
)

# Create the instance globally (for caching)
provider = ShowUIProvider()


def draw_point(image_input, pixel_point=None, radius=5):
    """
    image_input: str (file path) or numpy array
    pixel_point: (x, y) tuple (pixel coordinates)
    Returns: a PIL.Image object
    """
    if isinstance(image_input, str):
        image = Image.open(image_input)
    else:
        image = Image.fromarray(np.uint8(image_input))

    if pixel_point:
        x, y = pixel_point
        ImageDraw.Draw(image).ellipse((x - radius, y - radius, x + radius, y + radius), fill='red')
    return image


def array_to_image_path(image_array):
    """
    Save the numpy array passed from Gradio into the images/ folder
    and return the file path.
    """
    if image_array is None:
        raise ValueError("No image provided. Please upload an image before submitting.")
    img = Image.fromarray(np.uint8(image_array))
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"input_{timestamp}.png"
    image_path = os.path.join(IMAGES_DIR, filename)
    img.save(image_path)
    return image_path


def execute_actions_from_json(raw_json_text):
    """
    Take the raw JSON returned by the AI and call xdotool
    for each action, in order, to actually perform the operations.
    Returns: a success message or an error message
    """
    try:
        data = json.loads(raw_json_text)
        actions = data.get("actions", [])
    except Exception as e:
        return f"JSON parse error: {e}"

    # Get the screen resolution (via xdotool getdisplaygeometry)
    try:
        geom = subprocess.check_output(["xdotool", "getdisplaygeometry"]).decode().strip()
        screen_w, screen_h = map(int, geom.split())
    except Exception as e:
        return f"Failed to get the screen resolution: {e}"

    for action in actions:
        a_type = action.get("type")
        if a_type == "CLICK":
            xr = action.get("x_ratio", 0.0)
            yr = action.get("y_ratio", 0.0)
            # Convert normalized coordinates to absolute pixel coordinates
            x = int(xr * screen_w)
            y = int(yr * screen_h)
            print(f"[Action] CLICK -> moving to ({x},{y})")
            # Move the mouse and click
            subprocess.run(["xdotool", "mousemove", str(x), str(y)])
            time.sleep(0.05)
            subprocess.run(["xdotool", "click", "1"])
            time.sleep(0.2)

        elif a_type == "TYPE":
            text = action.get("text", "")
            print(f"[Action] TYPE -> typing '{text}'")
            # Type the string
            subprocess.run(["xdotool", "type", "--delay", "5", text])
            time.sleep(0.2)

        elif a_type == "KEY":
            key_name = action.get("key", "")
            print(f"[Action] KEY -> pressing the {key_name} key")
            subprocess.run(["xdotool", "key", key_name])
            time.sleep(0.2)

        else:
            print(f"[Action] Unknown action type: {action}")
            # Unknown actions are logged and skipped
            continue

    return "Actions execution completed."


def run_showui(image_array, query):
    """
    Take an image (numpy array) and a query,
    call ShowUIProvider.call() to get the raw JSON and coordinates,
    then perform the actual operations with xdotool.
    Returns:
      - result_img: PIL.Image with a red dot at the click position (for debugging)
      - coord_text: coordinate string (e.g. "123, 456")
      - exec_msg:   result message after the xdotool operations
    """
    # 1) Save the numpy array to images/ and get its path
    image_path = array_to_image_path(image_array)

    # 2) Query GPT-4V (OpenAI) and get raw_json (the raw JSON),
    #    pixel_xy (pixel position of the first CLICK), and actions (a list)
    raw_json, pixel_xy, actions = provider.call(query, image_path)

    if pixel_xy is None and not actions:
        # Neither CLICK, TYPE, nor KEY was returned
        return None, "No click coordinates were returned.", ""

    # 3) Draw a red dot at the returned coordinates (visualization for debugging only)
    result_img = None
    coord_text = ""
    if pixel_xy is not None:
        result_img = draw_point(image_path, pixel_xy, radius=10)
        coord_text = f"{int(pixel_xy[0])}, {int(pixel_xy[1])}"
    else:
        # If there was no CLICK, return the original image unchanged
        result_img = Image.open(image_path)
        coord_text = "No CLICK action was included."

    # 4) Execute the list of actions
    exec_msg = execute_actions_from_json(raw_json)

    return result_img, coord_text, exec_msg


# Function to record votes (reused as-is)
def record_vote(vote_type, image_path, query, action_generated):
    vote_data = {
        "vote_type": vote_type,
        "image_path": image_path,
        "query": query,
        "action_generated": action_generated,
        "timestamp": datetime.now().isoformat()
    }
    with open("votes.json", "a") as f:
        f.write(json.dumps(vote_data) + "\n")
    return f"Your {vote_type} has been recorded. Thank you!"


def handle_vote(vote_type, image_path, query, action_generated):
    if image_path is None:
        return "No image uploaded. Please upload an image before voting."
    return record_vote(vote_type, image_path, query, action_generated)


# Load the logo, etc. (only if needed)
logo_path = os.path.join(ROOT_DIR, "assets", "showui.jpg")
base64_image = ""
if os.path.exists(logo_path):
    with open(logo_path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode("utf-8")


def build_demo(embed_mode=False, concurrency_count=1):
    """
    Gradio UI definition. The only change is the part that calls run_showui().
    """
    with gr.Blocks(title="ShowUI Demo (OpenAI + xdotool execution)", theme=gr.themes.Default()) as demo:
        state_image_path = gr.State(value=None)

        if not embed_mode:
            # Show the logo and description
            html_content = f"""
            <div style="text-align: center; margin-bottom: 20px;">
              <div style="display: flex; justify-content: center;">
                <img src="data:image/png;base64,{base64_image}"
                     alt="ShowUI" width="320" style="margin-bottom: 10px;"/>
              </div>
              <p>ShowUI is a lightweight vision-language-action model for GUI agents (OpenAI-backed).</p>
              <div style="display: flex; justify-content: center; gap: 15px; font-size: 20px;">
                <a href="https://huggingface.co/showlab/ShowUI-2B" target="_blank">
                  <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-ShowUI--2B-blue" alt="model"/>
                </a>
                <a href="https://arxiv.org/abs/2411.17465" target="_blank">
                  <img src="https://img.shields.io/badge/arXiv%20paper-2411.17465-b31b1b.svg" alt="arXiv"/>
                </a>
                <a href="https://github.com/showlab/ShowUI" target="_blank">
                  <img src="https://img.shields.io/badge/GitHub-ShowUI-black" alt="GitHub"/>
                </a>
              </div>
            </div>
            """
            gr.HTML(html_content)

        with gr.Row():
            with gr.Column(scale=3):
                # Input components: image and text
                imagebox  = gr.Image(type="numpy", label="Input Screenshot")
                textbox   = gr.Textbox(
                    show_label=True,
                    placeholder="Enter a query (e.g., 'Click Nahant')",
                    label="Query",
                )
                submit_btn = gr.Button(value="Submit", variant="primary")

                # Examples
                gr.Examples(
                    examples=[
                        ["./examples/app_store.png", "Download Kindle."],
                        ["./examples/ios_setting.png", "Turn off Do not disturb."],
                        ["./examples/apple_music.png", "Star to favorite."],
                        ["./examples/map.png", "Boston."],
                        ["./examples/wallet.png", "Scan a QR code."],
                        ["./examples/word.png", "More shapes."],
                        ["./examples/web_shopping.png", "Proceed to checkout."],
                        ["./examples/web_forum.png", "Post my comment."],
                        ["./examples/safari_google.png", "Click on search bar."],
                    ],
                    inputs=[imagebox, textbox],
                    examples_per_page=3
                )

            with gr.Column(scale=8):
                # Output components: result image, coordinates, and execution message
                output_img    = gr.Image(type="pil", label="Output Image")
                gr.HTML(
                    """
                    <p><strong>Note:</strong> The <span style="color: red;">red point</span> on the output image represents the predicted clickable coordinates.</p>
                    """
                )
                output_coords = gr.Textbox(label="Clickable Coordinates")
                output_msg    = gr.Textbox(label="Execution Log/Message")

                # Vote, regenerate, and clear buttons
                with gr.Row(elem_id="action-buttons", equal_height=True):
                    vote_btn       = gr.Button(value="👍 Vote",   variant="secondary")
                    downvote_btn   = gr.Button(value="👎 Downvote", variant="secondary")
                    flag_btn       = gr.Button(value="🚩 Flag",    variant="secondary")
                    regenerate_btn = gr.Button(value="🔄 Regenerate", variant="secondary")
                    clear_btn      = gr.Button(value="🗑️ Clear",   interactive=True)

            # Handler for the "Submit" button
            def on_submit(image, query):
                if image is None:
                    raise ValueError("No image provided. Please upload an image before submitting.")
                # 1) Save the image to images/ and get its path
                img_path = array_to_image_path(image)
                # 2) Get (result image, coordinates, execution message) from run_showui()
                img_out, coords, exec_msg = run_showui(image, query)
                # 3) On submit, return (image, coordinates, execution message, image path)
                return img_out, coords, exec_msg, img_path

            submit_btn.click(
                on_submit,
                inputs=[imagebox, textbox],
                outputs=[output_img, output_coords, output_msg, state_image_path],
            )

            # "Clear" button (one None for each of the six output components)
            clear_btn.click(
                lambda: (None, None, None, None, None, None),
                inputs=None,
                outputs=[imagebox, textbox, output_img, output_coords, output_msg, state_image_path],
                queue=False
            )

            # "Regenerate" button
            regenerate_btn.click(
                lambda image, query, state_image_path: run_showui(image, query),
                inputs=[imagebox, textbox, state_image_path],
                outputs=[output_img, output_coords, output_msg],
            )

            # "Vote", "Downvote", and "Flag" buttons
            vote_btn.click(
                lambda image_path, query, action_generated: handle_vote(
                    "upvote", image_path, query, action_generated
                ),
                inputs=[state_image_path, textbox, output_coords],
                outputs=[],
                queue=False
            )
            downvote_btn.click(
                lambda image_path, query, action_generated: handle_vote(
                    "downvote", image_path, query, action_generated
                ),
                inputs=[state_image_path, textbox, output_coords],
                outputs=[],
                queue=False
            )
            flag_btn.click(
                lambda image_path, query, action_generated: handle_vote(
                    "flag", image_path, query, action_generated
                ),
                inputs=[state_image_path, textbox, output_coords],
                outputs=[],
                queue=False
            )

    return demo


if __name__ == "__main__":
    demo = build_demo(embed_mode=False)
    demo.queue(api_open=False).launch(
        server_name="0.0.0.0",
        server_port=7860,
        ssr_mode=False,
        debug=True,
    )
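The conversion inside `execute_actions_from_json` trusts whatever ratios the model returns. As a hedged sketch, the mapping from one action to xdotool argument lists can be pulled into a pure function that clamps ratios to the 0 to 1 range, so it can be checked without moving the mouse. `ratio_to_pixel` and `build_xdotool_commands` are hypothetical helper names; the xdotool subcommands are the ones already used in app.py above.

```python
from typing import Dict, List


def ratio_to_pixel(ratio: float, size: int) -> int:
    """Clamp a ratio to [0, 1] and map it to a pixel index inside the screen."""
    clamped = min(max(float(ratio), 0.0), 1.0)
    return min(int(clamped * size), size - 1)


def build_xdotool_commands(action: Dict[str, object], screen_w: int, screen_h: int) -> List[List[str]]:
    """Return the xdotool argument lists for one action (empty if the type is unknown)."""
    a_type = action.get("type")
    if a_type == "CLICK":
        x = ratio_to_pixel(action.get("x_ratio", 0.0), screen_w)
        y = ratio_to_pixel(action.get("y_ratio", 0.0), screen_h)
        return [["xdotool", "mousemove", str(x), str(y)], ["xdotool", "click", "1"]]
    if a_type == "TYPE":
        return [["xdotool", "type", "--delay", "5", str(action.get("text", ""))]]
    if a_type == "KEY":
        return [["xdotool", "key", str(action.get("key", ""))]]
    return []


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    sample = {"type": "CLICK", "x_ratio": 0.5, "y_ratio": 0.5}
    for argv in build_xdotool_commands(sample, 1920, 1080):
        print(" ".join(argv))
```

Each returned list could be passed to `subprocess.run` as-is, and printing the lists first gives a dry run of what would be executed.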

Rewrite api.py

import os
import json
import ast
import base64
import subprocess
from PIL import Image
from dotenv import load_dotenv

# ─────────────────────────────────────────────────────────
# Step 0: Create the "images" folder if it does not exist
# ─────────────────────────────────────────────────────────
ROOT_DIR = os.path.dirname(__file__)
IMAGES_DIR = os.path.join(ROOT_DIR, "images")
os.makedirs(IMAGES_DIR, exist_ok=True)

# ─────────────────────────────────────────────────────────
# Step 1: Load the OpenAI API key from .env
# ─────────────────────────────────────────────────────────
load_dotenv(dotenv_path=os.path.join(ROOT_DIR, ".env"))
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("ERROR: The OPENAI_API_KEY environment variable is not set. Check `.env`.")

import openai
openai.api_key = OPENAI_API_KEY

# Define the GPT-4V model name to use (change as needed)
AI_MODEL = "gpt-4.1"  # A GPT-4V-class model as an example. Change it to match your plan.


class ShowUIProvider:
    """
    ShowUIProvider takes an image and a query,
    sends them to GPT-4V with the image embedded as Base64,
    returns the JSON it gets back unchanged,
    and also parses and returns the list of CLICK/TYPE/KEY actions.
    """

    def __init__(self):
        print("[ShowUIProvider] Initialized – ready to call OpenAI API.")

    def extract_actions(self, response_json):
        """
        Parse the JSON returned by GPT-4V and return the list under
        the `"actions"` key as a Python list.
        Returns an empty list on failure.
        """
        try:
            data = json.loads(response_json)
            return data.get("actions", [])
        except Exception:
            return []

    def extract_norm_point(self, response_json, image_path=None):
        """
        Parse the JSON returned by GPT-4V and return (x_ratio, y_ratio)
        of the first `"type":"CLICK"` action.
        If image_path is given, return pixel coordinates (x, y) instead.
        """
        actions = self.extract_actions(response_json)
        for action in actions:
            if action.get("type") == "CLICK":
                xr = action.get("x_ratio", 0.0)
                yr = action.get("y_ratio", 0.0)
                if image_path:
                    img = Image.open(image_path)
                    return (xr * img.width, yr * img.height)
                return (xr, yr)
        return None

    def call(self, prompt, image_data):
        """
        prompt:     the user's natural-language instruction
        image_data: str (file path) or numpy array
        Returns:
          - raw_json: the raw JSON string returned by GPT-4V
          - pixel_xy: (x, y) pixel coordinate tuple, or None
          - actions:  list of CLICK/TYPE/KEY action dictionaries
        """

        # ─────────────────────────────────────────────────────
        # Step 2: Save the image temporarily under images/ and Base64-encode it
        # ─────────────────────────────────────────────────────
        if isinstance(image_data, str):
            # A file path was passed, so use it as-is
            image_path = image_data
        else:
            # A numpy array was passed, so save it under images/
            img = Image.fromarray(image_data.astype("uint8"))
            timestamp = f"{os.getpid()}_{int(os.path.getmtime(__file__))}"
            filename = f"showui_{timestamp}.png"
            image_path = os.path.join(IMAGES_DIR, filename)
            img.save(image_path)

        # Read the image as binary and Base64-encode it
        with open(image_path, "rb") as f:
            raw_bytes = f.read()
        b64 = base64.b64encode(raw_bytes).decode("utf-8")
        b64_datauri = f"data:image/png;base64,{b64}"

        # ─────────────────────────────────────────────────────
        # Step 3: Build the prompt for the GPT-4V request
        # ─────────────────────────────────────────────────────
        system_prompt = (
            "You are an agent that operates the GUI of an Ubuntu desktop. "
            "Return only JSON in the following format:\n"
            "{\n"
            "  \"actions\": [\n"
            "    {\"type\":\"CLICK\", \"x_ratio\":<0 to 1>, \"y_ratio\":<0 to 1>},\n"
            "    {\"type\":\"TYPE\",  \"text\":\"<text to type>\"},\n"
            "    {\"type\":\"KEY\",   \"key\":\"<key name>\"}\n"
            "  ]\n"
            "}\n"
            "Do not include any extra explanation or comments."
        )
        user_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_datauri
                    }
                }
            ]
        }

        print("[ShowUIProvider] Sending the request to OpenAI...")
        response = openai.chat.completions.create(
            model=AI_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                user_message
            ],
            temperature=0.0
        )
        print("[ShowUIProvider] Received the response from OpenAI.")

        # ─────────────────────────────────────────────────────
        # Step 4: Get the raw response text from the ChatCompletion object
        # ─────────────────────────────────────────────────────
        raw_text = response.choices[0].message.content.strip()
        print(f"[ShowUIProvider] raw_text: {raw_text}")

        # ─────────────────────────────────────────────────────
        # Step 5: Get the list of actions from the JSON
        # ─────────────────────────────────────────────────────
        actions = self.extract_actions(raw_text)

        # ─────────────────────────────────────────────────────
        # Step 6: Convert normalized (x_ratio, y_ratio) to pixel (x, y)
        #          (returns None if no CLICK is included)
        # ─────────────────────────────────────────────────────
        pixel_xy = self.extract_norm_point(raw_text, image_path)

        return raw_text, pixel_xy, actions


if __name__ == "__main__":
    showuiprovider = ShowUIProvider()
    # Example: assumes images/examples/chrome.png exists
    img_path = os.path.join(IMAGES_DIR, "examples", "chrome.png")
    query = "Please click where the search box is displayed"
    raw_json, result, actions = showuiprovider.call(query, img_path)
    print(f"[Test] raw_json: {raw_json}")
    print(f"[Test] Returned click coordinates (pixels): {result}")
    print(f"[Test] Action list: {actions}")
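`extract_actions` returns an empty list whenever `json.loads` fails on the raw response. One case where that could happen, stated here as an assumption rather than something observed in this article, is the model wrapping its JSON in a Markdown code fence or adding a sentence around it. The sketch below is a more tolerant parser; `parse_actions` is a hypothetical name, not part of ShowUI or the OpenAI library.

```python
import json
from typing import Any, Dict, List


def parse_actions(raw_text: str) -> List[Dict[str, Any]]:
    """Extract the "actions" list from a response that may contain extra text."""
    # Keep only the span from the first "{" to the last "}".
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end < start:
        return []
    try:
        data = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    actions = data.get("actions", [])
    if not isinstance(actions, list):
        return []
    # Drop entries that are not JSON objects so callers can use .get() safely.
    return [a for a in actions if isinstance(a, dict)]


if __name__ == "__main__":
    reply = 'Sure: {"actions": [{"type": "KEY", "key": "Return"}]}'
    print(parse_actions(reply))
```

Slicing between the outermost braces is a simple heuristic; it does not handle replies that contain more than one JSON object.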

4-5. Launch

If you use pipenv

pipenv install gradio

pipenv run python app.py

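Open http://0.0.0.0:7860 in a browser (http://localhost:7860 should work as well)

If the screen appears, upload some image, enter a prompt, and press Submit; if the PC gets operated, it works.

When I entered "Close the browser tab", VS Code was closed.
It's a bit off.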

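Impressions

The accuracy is fairly poor, and I felt there is still a lot of room for improvement.
I ended up implementing most of the source code myself; if you are going to implement this much, it may be better to build it with Streamlit or something similar rather than using showlab/ShowUI.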
