Windows PC (SikuliX / ChatGPT Vision / 音声コマンド) でマウス＆キーボードをソフト制御する方法

Last updated at 2025-01-10Posted at 2025-01-10

はじめに

先の例では、別端末を操作するために Raspberry Pi Zero 2 W (USB HID) などを使うシナリオを取り上げました。しかし、まずは1台のWindows PCだけでローカルに画面をキャプチャし、自分自身のマウス・キーボードをソフト的に操縦してみたい、という要望も多いです。これは、物理的な外部デバイスや別PCを用意する前に開発・デバッグをスムーズに行う目的などで便利です。

本記事では、SikuliXやChatGPT Visionによる画面解析と、Google Cloud Speech-to-Textでの音声コマンドを取り入れ、自分自身のOS上のマウス＆キーボードを直接制御するサンプルをまとめます。Pi Zero不要で、すべて1台のPCで完結する構成です。

想定構成

Windows PC
- メイン/2ndモニタ等をSikuliXがキャプチャし、GUI要素を解析
- ChatGPT Vision（クラウドAPI）に画像を送ってボタン座標を取得したり、多言語UIを解析
- Google Cloud Speech-to-Text で「コンピュータログイン」など音声コマンドを受け取り、SikuliXスクリプトを呼び出す
- マウス＆キーボードをソフトウェアで直接注入（SikuliXのclick(), type() など）
ハードウェア構成
- 特に追加デバイス不要。Pi Zero 2 Wは使わない
- もし2ndモニタがあるなら、そちらを解析対象にすると本番を想定したテストがしやすい
- HDMIキャプチャデバイスを噛ませるより、OBSのソフトキャプチャやSikuliXマルチモニタ機能で済む場合が多い
自分自身への制御
- ローカルOSに対して“乗っ取り”のようにマウス移動・クリック・キー入力が行われる
- 誤操作で重要なウィンドウを閉じたりファイル操作をするリスクもあるので、テスト環境やユーザーアカウントを分けるなど安全策があると望ましい

ファイル構成の例

self_autopilot/
  ├─ sikuli_main.py         // SikuliX (Pythonモード)で動かすメインスクリプト
  ├─ voice_command.py        // 音声入力(Google STT)でコマンド解析
  ├─ analyze_chatgpt.py      // (任意) ChatGPT Vision連携
  ├─ requirements.txt        // Python依存パッケージ (pyaudio, google-cloud-speech, requestsなど)
  └─ ...

sikuli_main.py: メインロジック(音声コマンド→SikuliX操作→ChatGPT Vision解析→ローカルclick/type)
voice_command.py: Google Cloud Speech-to-Textストリーミングを扱うモジュール

1. Google Cloud Speech-to-Textで音声指令を検知 (voice_command.py)

# voice_command.py
"""
- pyaudioでマイク音声を取得
- Google Cloud Speech-to-Text (streaming_recognize)に送信
- ウェイクワード「コンピュータ」と、各種コマンド(ログイン/ログアウト/シャットダウンなど)を解析
"""
import os
import pyaudio
import queue
import threading
from google.cloud import speech

# 必要に応じて
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/gcp_credentials.json"

SUPPORTED_COMMANDS = {
    "ログイン": "LOGIN",
    "発進": "ENGAGE",
    "被害報告": "REPORT",
    "ログアウト": "LOGOUT",
    "シャットダウン": "SHUTDOWN"
}

class VoiceCommander:
    def __init__(self):
        self.client = speech.SpeechClient()
        self.config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="ja-JP"
        )
        self.streaming_config = speech.StreamingRecognitionConfig(
            config=self.config,
            interim_results=True
        )
        self._audio_queue = queue.Queue()
        self._stop_event = threading.Event()

    def _audio_generator(self):
        while not self._stop_event.is_set():
            chunk = self._audio_queue.get()
            if chunk is None:
                break
            yield speech.StreamingRecognizeRequest(audio_content=chunk)

    def start_listening(self, callback):
        """
        callback(command_key) が、音声解析確定時に呼ばれる
        """
        mic = pyaudio.PyAudio()
        stream = mic.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=16000,
            input=True,
            frames_per_buffer=1024,
            stream_callback=self._audio_callback
        )
        stream.start_stream()

        requests = self._audio_generator()
        responses = self.client.streaming_recognize(self.streaming_config, requests)

        def process_responses():
            for response in responses:
                if not response.results:
                    continue
                result = response.results[0]
                if not result.alternatives:
                    continue
                transcript = result.alternatives[0].transcript
                # is_final=True で確定
                if result.is_final:
                    print("[Voice] Final transcript:", transcript)
                    self._parse_command(transcript, callback)

        resp_thread = threading.Thread(target=process_responses)
        resp_thread.start()

    def _parse_command(self, transcript, callback):
        if WAKE_WORD in transcript:
            for jp_word, cmd_key in SUPPORTED_COMMANDS.items():
                if jp_word in transcript:
                    print(f"[Voice] Matched command: {jp_word} -> {cmd_key}")
                    callback(cmd_key)
                    break

    def _audio_callback(self, in_data, frame_count, time_info, status_flags):
        self._audio_queue.put(in_data)
        return (None, pyaudio.paContinue)

    def stop(self):
        self._stop_event.set()
        self._audio_queue.put(None)

ポイント

コマンドをどんどん追加するには SUPPORTED_COMMANDS に "他の日本語":"COMMAND_KEY" を増やすだけ
_parse_command() で “コンピュータ” が含まれているかも判定

2. ローカル操作のメインスクリプト (sikuli_main.py)

# sikuli_main.py (SikuliX Pythonモード)
"""
自分のPCを自動制御する例:
 1) 音声コマンド (コンピュータ ログインなど)
 2) ChatGPT Visionでボタン等解析 (任意)
 3) SikuliXの click() / type() などでローカルOSに入力注入
 4) 複数コマンド(ログイン/ログアウト/シャットダウン...)を容易に追加
"""
from sikuli import *
import time
import requests
import os
import sys
import threading

# Google Cloud Speech音声認識
import voice_command

CHATGPT_VISION_ENDPOINT = "https://api.openai.com/v1/images/analyze"
CHATGPT_VISION_API_KEY = os.environ.get("CHATGPT_API_KEY", "YOUR_API_KEY")

def send_to_chatgpt_vision(image_path, prompt_text):
    """
    画像+プロンプトをChatGPT Visionへ送る。
    JSON形式を想定。要パース強化する余地あり。
    """
    headers = {
        "Authorization": f"Bearer {CHATGPT_VISION_API_KEY}"
    }
    files = {"file": open(image_path, "rb")}
    data = {"prompt": prompt_text}
    resp = requests.post(CHATGPT_VISION_ENDPOINT, headers=headers, files=files, data=data)
    if resp.status_code == 200:
        return resp.json()
    else:
        print("[ChatGPT Vision Error]", resp.status_code, resp.text)
        return None

def gradual_move_mouse(target_x, target_y, steps=10):
    """
    段階的にマウスを動かす(ローカルOS)。
    Pi Zero不要でSikuliX内のローカルMouse.move()を使う。
    """
    current_loc = Env.getMouseLocation()
    start_x, start_y = current_loc.x, current_loc.y
    dx = (target_x - start_x) / float(steps)
    dy = (target_y - start_y) / float(steps)

    for i in range(steps):
        new_x = int(start_x + dx*(i+1))
        new_y = int(start_y + dy*(i+1))
        Mouse.move(Location(new_x, new_y))
        time.sleep(0.05)

def process_login_flow():
    """
    例: ChatGPT Vision解析で"login button"探してclick()
    """
    print("[Flow] Processing LOGIN scenario.")
    screen_img = capture(Screen())  # 1) ローカル画面キャプチャ(メインor2ndモニタ)
    image_path = str(screen_img)

    prompt = "Find the login button. Return coords as JSON: {\"x\":..., \"y\":...}."
    result = send_to_chatgpt_vision(image_path, prompt)
    if not result:
        print("[Flow] No result from ChatGPT Vision.")
        return

    coords = result.get("coords", {})
    x = coords.get("x", 250)
    y = coords.get("y", 400)
    print("[Flow] ChatGPT suggests coords:", x, y)

    # 2) 段階的マウス移動
    gradual_move_mouse(x, y, steps=20)

    # 3) click() はSikuliXがローカルOSにクリック入力
    click(Location(x,y))
    print("[Flow] Done clicking login button.")

def process_engage_flow():
    """
    "発進" => ENGAGE
    例: SikuliXパターンマッチ "engage_button.png" などを探す
    """
    print("[Flow] ENGAGE scenario.")
    try:
        region = Screen().find("engage_button.png")
        center_pt = region.getCenter()
        x, y = center_pt.x, center_pt.y
        gradual_move_mouse(x, y, steps=10)
        click(Location(x, y))
    except FindFailed:
        print("[Flow] engage_button.png not found or no target pattern.")

def process_report_flow():
    """
    "被害報告" => REPORT
    例: type() でテキストを入力 or ChatGPT Visionでsome "report button"
    """
    print("[Flow] REPORT scenario.")
    type("Damage report: minimal casualties.")
    # Or use ChatGPT Vision: parse screenshot -> find coordinate -> click

def process_logout_flow():
    """
    例: "ログアウト"コマンド => SikuliXパターンマッチ "logout_button.png"
    """
    print("[Flow] Processing LOGOUT scenario.")
    try:
        region = Screen().find("logout_button.png")  # 2ndモニタは Screen(1), etc
        center_pt = region.getCenter()
        x, y = center_pt.x, center_pt.y

        gradual_move_mouse(x, y, steps=10)
        click(Location(x, y))
    except FindFailed:
        print("[Flow] logout_button not found.")

def process_shutdown_flow():
    """
    "シャットダウン"コマンド => OS終了操作 etc.
    """
    print("[Flow] Processing SHUTDOWN scenario.")
    # 例: type("cmd"), type(Key.ENTER), ...
    # or type("shutdown /s") => Must handle carefully to avoid losing session
    pass

def voice_callback(cmd_key):
    """
    音声コマンドkey:
     - "LOGIN"
     - "ENGAGE"
     - "REPORT"
     - "LOGOUT"
     - "SHUTDOWN"
    """
    if cmd_key == "LOGIN":
        process_login_flow()
    elif cmd_key == "ENGAGE":
        process_engage_flow()
    elif cmd_key == "REPORT":
        process_report_flow()
    elif cmd_key == "LOGOUT":
        process_logout_flow()
    elif cmd_key == "SHUTDOWN":
        process_shutdown_flow()
    else:
        print("[Voice] Unknown command key:", cmd_key)

def main_loop():
    commander = voice_command.VoiceCommander()

    # 別スレッドで音声認識スタート
    voice_thread = threading.Thread(target=commander.start_listening, args=(voice_callback,))
    voice_thread.start()

    print("[Main] Voice recognition started. Commands: ログイン, 発進(ENGAGE), 被害報告, ログアウト, シャットダウン.")
    print("Say 'コンピュータ 発進' etc. Press Ctrl+C to stop.")

    try:
        while True:
            # ここで他のSikuliX操作や定時処理
            # e.g. find("some_popup.png") -> if found: click()
            time.sleep(3)
    except KeyboardInterrupt:
        print("[Main] Stopping voice recognition.")
        commander.stop()
        voice_thread.join()

def main():
    main_loop()

if __name__ == "__main__":
    main()

解説:

音声コマンドが "ログイン", "ログアウト", "シャットダウン" 等にマッピングされたら対応するフロー呼び出し
process_*_flow() 内でSikuliXの画面キャプチャ→ChatGPT Vision解析→ローカルマウス移動/クリック
Pi Zeroなど外部デバイス不要。マウス＆キーボードはSikuliXがOSに対して直接イベント注入する。

Q&A

Q1. ローカルOSを誤操作してしまうリスクは？

A1. もちろんリスクがあります。特に「シャットダウン」コマンドなどは実行すると自分の環境が落ちてしまう。安全策として仮想マシンやテスト用ユーザーアカウントを使うのが望ましい。

Q2. HDMIキャプチャデバイスを挟むメリットは？

A2. 自分自身の画面を物理ループバックするメリットはあまりなく、OBSやSikuliXのScreen() でソフト取得すれば十分。HDMIキャプチャは通常「別PCの画面を取り込みたい」場合に使います。

Q3. ChatGPT Visionに依存しない方法は？

A3. もちろんSikuliXの画像パターンマッチだけでも自動化可能。ChatGPT Visionは多言語UIや文字解析が必要な場合などに便利です。

Q4. 音声認識が動かない場合は？

A4. pyaudioのインストール、Google Cloud Credentials設定、マイクデバイスの選択などを確認してください。interim_resultsが多く出る場合は is_final判定をしっかり見る必要があります。

Q5. 将来Pi Zeroを使って別PCを操作したくなったら？

A5. コードの大部分(音声コマンド解析, ChatGPT Vision, etc.)は再利用でき、マウス/キーボード操作部分をPi ZeroのgRPCやHIDエミュレーションに切り替えるだけで済みます。

まとめ

自分のWindows PCを同じPC内で完全自動化（音声コマンド→SikuliX / ChatGPT Vision→マウスクリック/キー入力）するには、Pi Zeroなど外部USB HID不要。
SikuliXの click(), type() がOSに対してローカル操作を注入できるため、画面解析+操作が一体で実装可能。
音声コマンド（Google Cloud Speech）で「コンピュータログイン」などの発話をトリガーに実行すれば、ハンズフリーで自分のPCを操縦できる。
誤操作のリスクに注意しながら、テスト用アカウントや仮想マシンで試すと安全。本番運用に向けて機能を拡張しやすい構成です。

これで、ローカル環境でGUI自動化フロー（音声→SikuliX→ChatGPT Vision→マウス/キーボード制御）が完成します。ぜひ自動ログイン、ログアウト処理やシャットダウンシーケンスなどを試しながら、自分のPCの操作を自動化してみてください。

参考リンク

SikuliX公式サイト
PyAutoGUI (GitHub) (SikuliX代替案)
ChatGPT Vision (OpenAI公式)
Google Cloud Speech-to-Text
その他: Win32 API / UIAutomation / Power Automate Desktop などもローカル自動化の選択肢

以上

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up