Analyzing how Browser-Use lets AI agents operate a web browser

Last updated at 2025-01-01. Posted at 2025-01-01.

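Introduction

These are my notes on the points I found important while reading the Browser Use source code.

What is Browser Use

Browser Use is a Python library that lets AI agents operate a web browser automatically. This makes it possible to gather information from web pages and to automate specific tasks.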

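Main features:

Web element extraction: automatically recognizes and operates elements such as buttons, links, and forms
Integration of visual information and HTML structure: processes a page's visual information together with its HTML structure so that complex operations are carried out accurately
Automatic multi-tab management: opens multiple tabs at once and operates them in parallel
Custom actions: lets users define their own actions, such as saving files, database operations, and sending notifications
Self-correction: automatically corrects and retries when an operation fails or an element is not found
Parallel execution of multiple agents: runs multiple agents at the same time for efficient task processing

Supported large language models (LLMs):

Can be combined with multiple LLMs supported by LangChain, such as GPT-4 and Claude 3.5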

Introductory articles on Browser Use

The articles below cover Browser Use and how to use it in detail

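Source code layout

Folder / file name	Role
agent	Manages the agent logic. Handles task processing, prompt generation, and message-history management, and is responsible for communication with the LLM
browser	Abstracts browser operations. Manages browser instances and sessions (contexts), and supports headless mode among others
controller	Bridges the agent and the browser. Manages actions, issues concrete operation instructions to the browser, and provides registration of custom actions
dom	Handles DOM manipulation and analysis. Builds the page's DOM tree, compares history, and identifies matching elements
telemetry	Records various metrics to support performance analysis, efficiency improvements, and problem identification
tests	Contains code for whole-system checks and unit tests, including the Pytest configuration file
utils.py	Provides general-purpose functions used across the project, implementing time measurement, data conversion, file operations, and so on
logging_config.py	Configures log output and adds a custom log level, "RESULT". Supports operation tracing and debugging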

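Overall processing flow

User
Asks the Agent, in natural language, for what they want done
Agent
- Builds the prompt (SystemMessage, HumanMessage) for sending the natural-language task to the LLM via MessageManager
- Parses the LLM output as JSON (AgentOutput), extracts the actions (go_to_url, click_element, etc.), and passes them to the Controller
Controller
- Issues the received actions to BrowserContext one by one
- Concrete operations such as clicking and typing are executed by BrowserContext
BrowserContext / Browser
- Operates the page through Playwright
- Packages results and errors as ActionResult and returns them to the Agent
- Can return the page state (URL, tab list, screenshot, etc.) to the Agent again
Agent (again)
- Uses the returned ActionResult (extracted text, errors, etc.) as the next LLM input
- Repeats as needed, and reports completion to the user with the done action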
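
The loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `fake_llm`, `fake_controller`, and `run_agent` are hypothetical stand-ins, not Browser Use APIs, and no real browser or LLM is involved. The simplified `ActionResult` here only mirrors the fields discussed in this article.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ActionResult:
    """Simplified stand-in for the ActionResult described in this article."""
    is_done: bool = False
    extracted_content: Optional[str] = None
    error: Optional[str] = None


def fake_llm(history: list[str]) -> str:
    """Hypothetical LLM stub: navigates first, then reports completion."""
    if len(history) == 1:  # only the task is in the history so far
        return json.dumps({"action": [{"go_to_url": {"url": "https://example.com"}}]})
    return json.dumps({"action": [{"done": {"text": "Task completed."}}]})


def fake_controller(action: dict) -> ActionResult:
    """Hypothetical controller stub: pretends to drive the browser."""
    name, params = next(iter(action.items()))
    if name == "done":
        return ActionResult(is_done=True, extracted_content=params["text"])
    return ActionResult(extracted_content=f"Executed {name} with {params}")


def run_agent(task: str, llm: Callable[[list[str]], str], max_steps: int = 5) -> list[ActionResult]:
    history = [f"Task: {task}"]
    results: list[ActionResult] = []
    for _ in range(max_steps):
        output = json.loads(llm(history))      # parse the LLM's JSON reply
        for action in output["action"]:        # hand each action to the controller
            result = fake_controller(action)
            results.append(result)
            # feed the outcome back so the next LLM call can see it
            history.append(f"Result: {result.extracted_content or result.error}")
            if result.is_done:
                return results
    return results


results = run_agent("Open example.com", fake_llm)
for r in results:
    print(r)
```

The point of the sketch is the feedback edge: every result is appended to the history before the next LLM call, which is what lets the model react to errors or extracted text.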

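State transitions

(1) to (3) After receiving a task from the user, the Agent fetches the page state (URL, element list, screenshot, etc.) from BrowserContext as needed
(4) to (5) Based on that state, the Agent formats the chat history to send to the LLM through Prompts / MessageManager
(6) to (7) The LLM returns the "next action to execute" (go_to_url, click_element, etc.) in JSON (AgentOutput) form
(8) to (12) The Agent asks the Controller to execute the action, and the actual browser operation is performed via BrowserContext → Browser
The execution result is returned to the Agent as an ActionResult. If needed, the Agent fetches the state again and keeps querying the LLM, until the task completes or the result is reported to the user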

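Sequence

The User makes a request to the Agent in natural language
The Agent manages the chat history internally with MessageManager and sends it to the LLM, including the browser state (obtained from BrowserContext) and past action results (ActionResult)
The LLM outputs, in JSON form, the next actions to perform (go_to_url, click_element, input_text, …) together with its "memory" and "next goal"
The Agent parses those instructions and passes them to the Controller, and the actual operation is performed through BrowserContext → Browser
The execution result returns to the Agent as an ActionResult and is passed to the LLM again via MessageManager. This cycle repeats until the done action reports task completion and the run ends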

Message exchange

Let's look at the messages exchanged among the Agent, the LLM, and the Controller

From Agent to LLM

MessageManager prepares a list of messages like the following and sends it to the LLM

[
  SystemMessage(
    content="""
    You are a precise browser automation agent...
    (omitted: important rules such as responding in JSON format)
    """
  ),
  HumanMessage(
    content="""
    Current url: https://www.google.com
    Interactive elements:
    1[:]<input>Search Query</input>
    2[:]<button>Search</button>

    ...
    """
  )
]

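SystemMessage: "rules", "required format", "list of available actions"
HumanMessage: "task", "current state", "result of the previous action"

SystemMessage

A SystemMessage defines the "important rules" and "response format" imposed on the agent. For example, it contains content like the following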

SystemMessage

You are a precise browser automation agent that interacts with websites through structured commands. 
Your responses must be valid JSON matching the specified format.

IMPORTANT RULES:
1. Always respond with valid JSON:
   {
     "current_state": {
       "evaluation_previous_goal": "...",
       "memory": "...",
       "next_goal": "..."
     },
     "action": [
       {
         "go_to_url": { "url": "https://..." }
       },
       ...
     ]
   }

2. Only use actions that are described below:
   - go_to_url, click_element, input_text, extract_content, ...
3. If the task is done, use "done" action:
   { "done": { "text": "Task completed." } }
...

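It contains information that strongly pushes the LLM to reply in the JSON structure (AgentOutput)
In the Agent implementation, the prompt is generated by calls such as system_prompt_class.get_system_message() and placed at the head of the chat history (as the system message)

HumanMessage

The Agent assembles the current browser state and the result of the previous step into a HumanMessage and passes it to the LLM. For example, it is written like this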

Current step: 3/10
Current url: https://www.google.com
Available tabs:
 - [0] https://www.google.com  (title="Google")
Interactive elements:
1[:]<input>Search terms</input>
2[:]<button>Google Search</button>
...

Result of action: 
 - "Successfully navigated to google.com"

Error of action: 
 - (none)

In some cases, a base64-encoded image is also sent as part of the chat message (for vision support)
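
To make the shape of that state text concrete, here is a minimal sketch that renders a page state into the same layout as the example above. `Element`, `PageState`, and `build_state_message` are hypothetical names introduced for illustration, and the formatting is inferred from the sample message rather than copied from the Browser Use source.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    index: int  # number the LLM uses to refer to this element
    tag: str
    text: str


@dataclass
class PageState:
    url: str
    tabs: list[tuple[int, str, str]]  # (tab id, url, title)
    elements: list[Element] = field(default_factory=list)


def build_state_message(state: PageState, step: int, max_steps: int,
                        last_result: Optional[str] = None,
                        last_error: Optional[str] = None) -> str:
    """Render the browser state as the text body of a human message."""
    lines = [
        f"Current step: {step}/{max_steps}",
        f"Current url: {state.url}",
        "Available tabs:",
    ]
    lines += [f' - [{tab_id}] {url}  (title="{title}")' for tab_id, url, title in state.tabs]
    lines.append("Interactive elements:")
    lines += [f"{e.index}[:]<{e.tag}>{e.text}</{e.tag}>" for e in state.elements]
    lines.append("Result of action:")
    lines.append(f' - "{last_result}"' if last_result else " - (none)")
    lines.append("Error of action:")
    lines.append(f' - "{last_error}"' if last_error else " - (none)")
    return "\n".join(lines)


state = PageState(
    url="https://www.google.com",
    tabs=[(0, "https://www.google.com", "Google")],
    elements=[Element(1, "input", "Search terms"), Element(2, "button", "Google Search")],
)
message = build_state_message(state, step=3, max_steps=10,
                              last_result="Successfully navigated to google.com")
print(message)
```

Numbering the interactive elements is what allows the LLM to answer with a compact reference such as an index instead of a selector it would have to invent.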

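From LLM to Agent

The Agent queries the LLM to "decide the next action", receives the reply as structured data, and parses it into AgentOutput (a pydantic model)
self.llm (for example a LangChain chat model) is configured so that its response comes back in the form of AgentOutput (a model defined with pydantic)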

agent/service.py

	@time_execution_async('--get_next_action')
	async def get_next_action(self, input_messages: list[BaseMessage]) -> AgentOutput:
		"""Get next action from LLM based on current state"""

		structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True)
		response: dict[str, Any] = await structured_llm.ainvoke(input_messages)  # type: ignore

		parsed: AgentOutput = response['parsed']
		# cut the number of actions to max_actions_per_step
		parsed.action = parsed.action[: self.max_actions_per_step]
		self._log_response(parsed)
		self.n_steps += 1

		return parsed

AgentOutput holds current_state and a list of actions ([ActionModel, ...])

Example

{
  "current_state": {
    "evaluation_previous_goal": "Success",
    "memory": "Just typed into the search box",
    "next_goal": "Press the search button and check the results"
  },
  "action": [
    {
      "click_element": {
        "index": 17,
        "xpath": "//button[@id='search-button']"
      }
    },
    {
      "extract_content": {
        "value": "text"
      }
    }
  ]
}
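
As a rough illustration of the parsing step, the sketch below turns such a JSON reply into typed objects and trims the action list, mirroring the `max_actions_per_step` slice in `get_next_action`. It uses standard-library dataclasses as a stand-in; the real AgentOutput is a pydantic model, and `parse_agent_output` is a hypothetical helper, not part of Browser Use.

```python
import json
from dataclasses import dataclass


@dataclass
class CurrentState:
    evaluation_previous_goal: str
    memory: str
    next_goal: str


@dataclass
class AgentOutput:
    """Dataclass stand-in for the pydantic AgentOutput model."""
    current_state: CurrentState
    action: list[dict]


def parse_agent_output(raw: str, max_actions_per_step: int = 10) -> AgentOutput:
    data = json.loads(raw)
    state = CurrentState(**data["current_state"])   # fails loudly on missing keys
    actions = data["action"][:max_actions_per_step]  # cap the number of actions per step
    return AgentOutput(current_state=state, action=actions)


raw = json.dumps({
    "current_state": {
        "evaluation_previous_goal": "Success",
        "memory": "Just typed into the search box",
        "next_goal": "Press the search button and check the results",
    },
    "action": [
        {"click_element": {"index": 17}},
        {"extract_content": {"value": "text"}},
    ],
})

parsed = parse_agent_output(raw, max_actions_per_step=1)
print(parsed.current_state.next_goal)
print(parsed.action)
```

Capping the list keeps a single LLM reply from queuing a long chain of actions against a page state that may already be stale after the first click.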

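From Agent to Controller (action execution)

When the Agent calls the Controller, it passes a list of ActionModel objects: the Agent's parsed form of the LLM's instructions on which actions to call with which parameters. The whole list is handed to the Controller at once.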

agent/service.py

async def step(self, step_info: Optional[AgentStepInfo] = None) -> None:
...
			result: list[ActionResult] = await self.controller.multi_act(
				model_output.action, self.browser_context
			)
...
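
A sequential executor of this kind might look like the sketch below. It is not the library's `multi_act`: `act` and `multi_act_sketch` are hypothetical, and the rule of stopping the batch at the first error or done result is an assumption made here for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionResult:
    """Simplified stand-in for the ActionResult described in this article."""
    is_done: bool = False
    extracted_content: Optional[str] = None
    error: Optional[str] = None


async def act(action: dict) -> ActionResult:
    """Hypothetical single-action executor; no real browser is involved."""
    name, params = next(iter(action.items()))
    if name == "done":
        return ActionResult(is_done=True, extracted_content=params["text"])
    if name == "fail":  # made-up action used to simulate an error
        return ActionResult(error="simulated failure")
    return ActionResult(extracted_content=f"{name} ok")


async def multi_act_sketch(actions: list[dict]) -> list[ActionResult]:
    results: list[ActionResult] = []
    for action in actions:                 # strictly one after another
        result = await act(action)
        results.append(result)
        if result.is_done or result.error:
            break                          # assumption: abandon the rest of the batch
    return results


results = asyncio.run(multi_act_sketch([
    {"click_element": {"index": 1}},
    {"fail": {}},
    {"input_text": {"index": 2, "text": "OpenAI"}},
]))
for r in results:
    print(r)
```

Stopping early is a reasonable design for browser work because later actions in a batch usually assume the earlier ones succeeded.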

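From Controller to Agent (action results)

What the Controller returns to the Agent is an ActionResult object

Field name	Type	Description
is_done	bool	Flag indicating task completion. If `True`, the whole process ends
extracted_content	str or None	String or other information obtained from an extraction or operation. Often holds page text, a URL, a summary, etc.
error	str or None	Error message, if an error occurred
include_in_memory	bool	Flag for whether the result is additionally sent to the LLM. If `True`, it is included as a "hint" in subsequent prompts

For go_to_url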

{
  "is_done": false,
  "extracted_content": "🔗  Navigated to https://example.com",
  "error": null,
  "include_in_memory": true
}

For click_element

{
  "is_done": false,
  "extracted_content": "🖱️  Clicked index 10",
  "error": null,
  "include_in_memory": true
}
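
The role of `include_in_memory` can be illustrated with a small filter that decides which results reach the next prompt. This is a sketch, not the library's code: the dataclass defaults and the `memory_lines` helper are assumptions, as is the choice to always surface errors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionResult:
    """Field names follow the table above; the defaults are assumptions."""
    is_done: bool = False
    extracted_content: Optional[str] = None
    error: Optional[str] = None
    include_in_memory: bool = False


def memory_lines(results: list[ActionResult]) -> list[str]:
    """Pick the result text that should be shown to the LLM on the next step."""
    lines: list[str] = []
    for r in results:
        if r.error:
            # assumption: errors are always surfaced so the LLM can retry
            lines.append(f"Error of action: {r.error}")
        elif r.include_in_memory and r.extracted_content:
            lines.append(f"Result of action: {r.extracted_content}")
    return lines


results = [
    ActionResult(extracted_content="Navigated to https://example.com", include_in_memory=True),
    ActionResult(extracted_content="Internal detail", include_in_memory=False),
    ActionResult(error="Element with index 10 not found"),
]
for line in memory_lines(results):
    print(line)
```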

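Execution example: search Google for 'OpenAI' → get the first result

The Agent runs a Google search with the search_google action → extracts the result with extract_content (in the run logged below, the model did not call extract_content and instead reported the first result directly through the done action)
An ActionResult comes back and the first search result is displayed

Source code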

search_openai_by_google.py

from langchain_openai import ChatOpenAI
from browser_use import Agent
import asyncio

async def main():
    agent = Agent(
        # Task (in Japanese): "Search Google for 'OpenAI' and get the first result"
        task="Googleで ‘OpenAI’ を検索して最初の結果を取得",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    result = await agent.run()
    print(result)

asyncio.run(main())

Execution result

$ python3 search_openai_by_google.py
INFO     [browser_use] BrowserUse logging setup complete with level info
INFO     [root] Anonymized telemetry enabled. See https://github.com/gregpr07/browser-use for more information.
INFO     [agent] 🚀 Starting task: Googleで ‘OpenAI’ を検索して最初の結果を取得
INFO     [agent] 
📍 Step 1
INFO     [agent] 🤷 Eval: Unknown - The current page is blank, so no previous goals were being pursued.
INFO     [agent] 🧠 Memory: Starting a Google search for 'OpenAI'.
INFO     [agent] 🎯 Next goal: Perform a Google search for 'OpenAI' and extract the first search result.
INFO     [agent] 🛠️  Action 1/1: {"search_google":{"query":"OpenAI"}}
INFO     [controller] 🔍  Searched for "OpenAI" in Google
INFO     [agent] 
📍 Step 2
INFO     [agent] 👍 Eval: Success - The search for 'OpenAI' has been completed and results are visible.
INFO     [agent] 🧠 Memory: Google search for 'OpenAI' completed; extract the first result.
INFO     [agent] 🎯 Next goal: Extract the first search result for 'OpenAI'.
INFO     [agent] 🛠️  Action 1/1: {"done":{"text":"The first search result for 'OpenAI' is:\n\nTitle: ChatGPT - OpenAI\nURL: https://openai.com/ja-JP/chatgpt/overview"}}
INFO     [agent] 📄 Result: The first search result for 'OpenAI' is:

Title: ChatGPT - OpenAI
URL: https://openai.com/ja-JP/chatgpt/overview
INFO     [agent] ✅ Task completed successfully
AgentHistoryList(all_results=[ActionResult(is_done=False, extracted_content='🔍  Searched for "OpenAI" in Google', error=None, include_in_memory=True), ActionResult(is_done=True, extracted_content="The first search result for 'OpenAI' is:\n\nTitle: ChatGPT - OpenAI\nURL: https://openai.com/ja-JP/chatgpt/overview", error=None, include_in_memory=False)], all_model_outputs=[{'search_google': {'query': 'OpenAI'}}, {'done': {'text': "The first search result for 'OpenAI' is:\n\nTitle: ChatGPT - OpenAI\nURL: https://openai.com/ja-JP/chatgpt/overview"}}])

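What Browser-Use can do and the role of the Controller

The Agent has the LLM break the input task down into the actions that the Controller provides. The Controller's role is to execute the resulting list of actions one after another. In other words, Browser-Use cannot do anything beyond the actions the Controller has registered in advance.

The actions included in the Controller fall into the following groups.

Navigation: provides URL navigation, tab switching, and going back
Element operations: handles operations on DOM elements such as clicking, typing, and dropdown selection
Content extraction: retrieves the page's main content and option lists
Scrolling: scrolls to a specific position within the page
Search: runs a simple Google search
Task completion notice: tells the system that the task has finished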

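Action list

Category	Action name	Role
Search	search_google	Runs a Google search for the given keywords
	extract_content	Extracts the main content of the current page
Navigation	go_to_url	Navigates to the given URL
	go_back	Performs a "back" operation in the current tab
	switch_tab	Switches to the tab with the given tab ID
	open_tab	Opens a new tab and displays the given URL
Element operations	click_element	Clicks the given element
	input_text	Types text into the given input field
Scrolling	scroll_down	Scrolls the page down by the given number of pixels
	scroll_up	Scrolls the page up by the given number of pixels
	scroll_to_text	Scrolls to the position where the given text is displayed
Dropdowns	get_dropdown_options	Gets all options of a dropdown element
	select_dropdown_option	Selects the given option in a dropdown
Completion	done	Indicates that the task is complete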
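
The idea that only registered actions are executable can be sketched with a tiny decorator-based registry. `ActionRegistry` and the `save_note` action are hypothetical and much simpler than the real Controller. The sketch only shows why an unregistered action name cannot run and how registering a function extends the set.

```python
from typing import Callable


class ActionRegistry:
    """Hypothetical, minimal registry; not the Browser Use Controller."""

    def __init__(self) -> None:
        self._actions: dict[str, Callable[..., str]] = {}
        self._descriptions: dict[str, str] = {}

    def action(self, description: str):
        """Decorator that registers a function under its own name."""
        def decorator(func: Callable[..., str]) -> Callable[..., str]:
            self._actions[func.__name__] = func
            self._descriptions[func.__name__] = description
            return func
        return decorator

    def names(self) -> list[str]:
        return sorted(self._actions)

    def execute(self, name: str, **params) -> str:
        if name not in self._actions:
            raise KeyError(f"Unknown action: {name}")
        return self._actions[name](**params)


registry = ActionRegistry()


@registry.action("Navigate to a URL")
def go_to_url(url: str) -> str:
    return f"Navigated to {url}"


try:
    registry.execute("save_note", text="hi")   # not registered yet
except KeyError as exc:
    print(exc)


@registry.action("Save a note (custom action)")
def save_note(text: str) -> str:
    return f"Saved note: {text}"


print(registry.names())
print(registry.execute("save_note", text="hi"))
```

The registered names and descriptions are also what a prompt builder would list for the LLM, so registering an action both makes it executable and makes it known to the model.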

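Adding features to the Controller

The Controller is customizable, and new actions can be added to it. That flexibility lets it support a wide range of web-automation needs.

As an action that saves the search result to a JSON file ("search_result.json"), let's add a save_search_result() function and run it.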

search_openai_by_google_and_save.py

import asyncio
import logging
import json
from langchain_openai import ChatOpenAI
from browser_use import Agent, Controller
from browser_use.browser.context import BrowserContext  
from browser_use.agent.views import ActionResult 

# Logging setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("controller")

controller = Controller()

@controller.action("Save search result")
async def save_search_result(params: dict, browser: BrowserContext = None):
    """Save the search result to a JSON file."""
    try:
        msg = f'🔍  Saving search result to file: {params}'
        logger.info(msg)  # log when saving starts
        with open("search_result.json", "w") as file:
            json.dump(params, file)
        msg = f'✅ Search result saved successfully: {params}'
        logger.info(msg)  # log on success
        return ActionResult(extracted_content=msg, include_in_memory=True)
    except Exception as e:
        msg = f'❌ Error saving search result: {str(e)}'
        logger.error(msg)  # log on error
        return ActionResult(error=msg)

async def main():
    agent = Agent(
        # Task (in Japanese): "Search Google for 'OpenAI', get the first result, and save it."
        task="Googleで 'OpenAI' を検索して最初の結果を取得し、保存してください。",
        llm=ChatOpenAI(model="gpt-4o"),
        controller=controller
    )
    result = await agent.run()
    print("タスク完了:", result)

asyncio.run(main())

Run the custom action we created to search Google for 'OpenAI' and save the first result to a JSON file.

$ python3 search_openai_by_google_and_save.py
INFO     [browser_use] BrowserUse logging setup complete with level info
INFO     [root] Anonymized telemetry enabled. See https://github.com/gregpr07/browser-use for more information.
INFO     [agent] 🚀 Starting task: Googleで 'OpenAI' を検索して最初の結果を取得し、保存してください。
INFO     [agent] 
📍 Step 1
INFO     [agent] 🤷 Eval: Unknown - Starting new action sequence for the Google search task.
INFO     [agent] 🧠 Memory: Need to search 'OpenAI' on Google and save the first result.
INFO     [agent] 🎯 Next goal: Perform a Google search for 'OpenAI'.
INFO     [agent] 🛠️  Action 1/1: {"search_google":{"query":"OpenAI"}}
INFO     [controller] 🔍  Searched for "OpenAI" in Google
INFO     [agent] 
📍 Step 2
INFO     [agent] 👍 Eval: Success - The search for 'OpenAI' was executed and the first result is visible.
INFO     [agent] 🧠 Memory: Need to save the first result from the Google search for 'OpenAI'.
INFO     [agent] 🎯 Next goal: Extract and save the first search result link and description.
INFO     [agent] 🛠️  Action 1/2: {"save_search_result":{"params":{"title":"ChatGPT - OpenAI","url":"https://openai.com/ja-JP/chatgpt/overview"}}}
INFO     [agent] 🛠️  Action 2/2: {"done":{"text":"The first search result for 'OpenAI' has been saved: Title - 'ChatGPT - OpenAI', URL - 'https://openai.com/ja-JP/chatgpt/overview'."}}
INFO     [controller] 🔍  Saving search result to file: {'title': 'ChatGPT - OpenAI', 'url': 'https://openai.com/ja-JP/chatgpt/overview'}
INFO     [controller] ✅ Search result saved successfully: {'title': 'ChatGPT - OpenAI', 'url': 'https://openai.com/ja-JP/chatgpt/overview'}
INFO     [agent] 📄 Result: The first search result for 'OpenAI' has been saved: Title - 'ChatGPT - OpenAI', URL - 'https://openai.com/ja-JP/chatgpt/overview'.
INFO     [agent] ✅ Task completed successfully
タスク完了: AgentHistoryList(all_results=[ActionResult(is_done=False, extracted_content='🔍  Searched for "OpenAI" in Google', error=None, include_in_memory=True), ActionResult(is_done=False, extracted_content="✅ Search result saved successfully: {'title': 'ChatGPT - OpenAI', 'url': 'https://openai.com/ja-JP/chatgpt/overview'}", error=None, include_in_memory=True), ActionResult(is_done=True, extracted_content="The first search result for 'OpenAI' has been saved: Title - 'ChatGPT - OpenAI', URL - 'https://openai.com/ja-JP/chatgpt/overview'.", error=None, include_in_memory=False)], all_model_outputs=[{'search_google': {'query': 'OpenAI'}}, {'save_search_result': {'params': {'title': 'ChatGPT - OpenAI', 'url': 'https://openai.com/ja-JP/chatgpt/overview'}}}, {'done': {'text': "The first search result for 'OpenAI' has been saved: Title - 'ChatGPT - OpenAI', URL - 'https://openai.com/ja-JP/chatgpt/overview'."}}])

We can confirm that the search result was saved to search_result.json.

$ cat search_result.json
{"title": "ChatGPT - OpenAI", "url": "https://openai.com/ja-JP/chatgpt/overview"}%

Browser-Use can only operate within the actions the Controller provides, but adding actions extends the range of what it "can do".
