Trying Out BrowserUse, an AI WebAgent Tool

Last updated at 2025-01-13 / Posted at 2025-01-13

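Overview

What is browser-use?

LLMs and WebAgents

Research is advancing on using large language models to operate the real world (more precisely, the part of it that can be reached through a browser). As of autumn 2024, such systems seem to be called WebAgents or AI agents.¹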

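[Research goal] Performing web operations on behalf of humans

All of this research aims for the same thing: a person speaks to a computer, and the computer understands the request and carries out the web operations on that person's behalf. To borrow a car analogy, it is something like "self-driving for web operations".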

browser-use

browser-use is one such AI agent implementation. It is provided, together with its repository, under the MIT License.

GitHub repository

https://github.com/browser-use/browser-use

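How does it work?

As an AI agent

Decide what needs to be done on the specified site and generate a sequence of operation steps
For each step in that sequence:
- Analyze the screen of the target site
- Generate a concrete operation for Playwright to perform (in the run log in this article, these appear as JSON actions such as click_element rather than as Playwright source code)
- If an operation fails, regenerate it and try again
  - If it still fails, revise the operation steps
- Once a result is obtained, output it as Markdown, screenshots, and so on as appropriate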
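
The retry-then-replan flow in the list above can be sketched in plain Python. This is only an illustration of the control flow, not browser-use's actual implementation: `run_steps`, `fake_execute`, and `fake_replan` are hypothetical names, and the stand-in functions simulate a browser instead of driving one.

```python
from typing import Callable


def run_steps(
    steps: list[str],
    execute: Callable[[str, int], bool],
    replan: Callable[[str], list[str]],
    max_retries: int = 2,
    max_replans: int = 3,
) -> list[str]:
    """Run steps in order; retry a failing step, then fall back to re-planning."""
    history: list[str] = []
    queue = list(steps)
    replans_left = max_replans
    while queue:
        step = queue.pop(0)
        for attempt in range(1, max_retries + 1):
            if execute(step, attempt):
                history.append(f"ok: {step} (attempt {attempt})")
                break
        else:
            # Every retry failed, so revise the plan for this step.
            if replans_left == 0:
                raise RuntimeError(f"giving up on step: {step}")
            replans_left -= 1
            history.append(f"replan: {step}")
            queue = replan(step) + queue
    return history


# Hypothetical stand-ins for the browser: "type query" succeeds on the
# second attempt, and "click element 10" never succeeds.
def fake_execute(step: str, attempt: int) -> bool:
    if step == "click element 10":
        return False
    if step == "type query":
        return attempt >= 2
    return True


def fake_replan(step: str) -> list[str]:
    # Replace the failing step with an alternative.
    return ["click element 12"]


history = run_steps(
    ["open reddit", "type query", "click element 10"],
    fake_execute,
    fake_replan,
)
for entry in history:
    print(entry)
```

In the real tool the LLM plays the role of both `execute` planning and `replan`; the sketch only shows why a failed click can lead to a different element being tried on the next step.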

Using LLMs other than OpenAI

Using an LLM other than OpenAI is also easy. The examples directory in the repository shows how to connect to Qwen, Ollama, Gemini, and others.
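
As a rough sketch of what switching providers might look like, the helper below returns a LangChain chat model by provider name. It assumes that the installed browser-use version accepts LangChain chat models as its `llm` argument and that the `langchain-openai`, `langchain-ollama`, and `langchain-google-genai` packages are installed. `build_llm` is a hypothetical helper, and the model names are placeholders; the repository's examples directory is the authoritative reference.

```python
def build_llm(provider: str):
    """Return a LangChain chat model for the given provider name (hypothetical helper)."""
    # Each import is local so that only the chosen provider's package is required.
    if provider == "openai":
        from langchain_openai import ChatOpenAI

        return ChatOpenAI(model="gpt-4o")  # placeholder model name
    if provider == "ollama":
        from langchain_ollama import ChatOllama

        return ChatOllama(model="qwen2.5")  # placeholder: any locally pulled model
    if provider == "gemini":
        from langchain_google_genai import ChatGoogleGenerativeAI

        return ChatGoogleGenerativeAI(model="gemini-1.5-flash")  # placeholder
    raise ValueError(f"unknown provider: {provider!r}")
```

The returned object would then be passed where the demo script passes its OpenAI model, for example `Agent(task=..., llm=build_llm("ollama"))`.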

Browser operations use Playwright

Browser operations are performed with Playwright, an open-source tool provided by Microsoft.

Similar research

Voyager (2023) feels like the closest match. It is explained in the article "Voyager, which plays Minecraft automatically".

Trying it out

Preparation

Prepare a directory

Create a working directory of your choice and move into it in a terminal.

Installation

# Create and activate a Python environment (it does not have to be venv)
$ python3 -m venv venv
$ source venv/bin/activate

# Install BrowserUse
$ pip install browser-use
$ playwright install

Save the OpenAI API credential (API key)

Save your OpenAI API credential (API key) in a .env file.
Options for browser-use can also be set there.
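
To make the expected shape of the .env file concrete, here is a small standard-library sketch that writes an example file to a temporary directory, parses it, and reports any missing key. `OPENAI_API_KEY` is the variable name the OpenAI client libraries read. `ANONYMIZED_TELEMETRY` is assumed here to be the browser-use option that controls the telemetry mentioned in the run log, so check the repository before relying on it. `read_env_file` is a hypothetical helper and the key value is a placeholder.

```python
import tempfile
from pathlib import Path

EXAMPLE_ENV = """\
# Placeholder value: replace with your own key.
OPENAI_API_KEY=sk-your-key-here
# Assumed browser-use option name; verify it in the repository.
ANONYMIZED_TELEMETRY=false
"""


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Write the example to a temporary directory so no real .env is touched.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(EXAMPLE_ENV, encoding="utf-8")
    settings = read_env_file(env_path)

missing = [name for name in ("OPENAI_API_KEY",) if not settings.get(name)]
print("missing keys:", missing)
```

In an actual script, a library such as python-dotenv would normally load the file into the environment; the parser here only shows the file format.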

Prepare the run script

Save the demo script from the repository's README.md to a file (00-helloworld.py in this article).
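
For reference, a script along the following lines matches the run log in this article. It is a sketch based on the README as of early 2025, not a verbatim copy: the task text is taken from the log, while the `gpt-4o` model name, the use of python-dotenv, and the `credentials_available` check are assumptions added here. The README in the repository should be treated as the authoritative version.

```python
import asyncio
import os

# Task text copied from the run log in this article.
TASK = (
    "Go to Reddit, search for 'browser-use' in the search bar, "
    "click on the first post and return the first comment."
)


def credentials_available() -> bool:
    """True if a key is already exported or a .env file exists in the current directory."""
    return bool(os.environ.get("OPENAI_API_KEY")) or os.path.exists(".env")


async def main() -> None:
    # Imported inside main() so the credential check at the bottom runs
    # before the heavier dependencies are loaded.
    from browser_use import Agent
    from dotenv import load_dotenv
    from langchain_openai import ChatOpenAI

    load_dotenv()  # loads OPENAI_API_KEY from .env into the environment
    agent = Agent(task=TASK, llm=ChatOpenAI(model="gpt-4o"))  # assumed model name
    result = await agent.run()
    print(result)


if __name__ == "__main__":
    if credentials_available():
        asyncio.run(main())
    else:
        print("No OPENAI_API_KEY found; create a .env file first.")
```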

Run

$ python3 ./00-helloworld.py

INFO     [browser_use] BrowserUse logging setup complete with level info
INFO     [root] Anonymized telemetry enabled. See https://github.com/gregpr07/browser-use for more information.
INFO     [agent] 🚀 Starting task: Go to Reddit, search for 'browser-use' in the search bar, click on the first post and return the first comment.
INFO     [agent] 
📍 Step 1
INFO     [agent] 👍 Eval: Success - The task has started on a blank page and I need to navigate to Reddit.
INFO     [agent] 🧠 Memory: Begin task on Reddit.
INFO     [agent] 🎯 Next goal: Navigate to Reddit's homepage and perform a search for 'browser-use'.
INFO     [agent] 🛠️  Action 1/1: {"search_google":{"query":"Reddit website homepage"},"go_to_url":{"url":"https://www.reddit.com/"},"go_back":{},"click_element":{"index":1,"xpath":null},"input_text":{"index":5,"text":"browser use","xpath":null},"switch_tab":{"page_id":0},"open_tab":{"url":"https://www.reddit.com/"},"extract_content":{"value":"text"},"done":{"text":"Extract first comment from the first Reddit post related to 'browser-use'. \nFirst comment: [Extracted text] "},"scroll_down":{"amount":50},"scroll_up":{"amount":null},"send_keys":{"keys":"Enter"},"scroll_to_text":{"text":"Comments"},"get_dropdown_options":{"index":0},"select_dropdown_option":{"index":0,"text":"browser use"}}
INFO     [controller] 🔍  Searched for "Reddit website homepage" in Google
INFO     [agent] 
📍 Step 2
INFO     [agent] ⚠ Eval: Failed - I mistakenly searched Google for Reddit instead of navigating to the Reddit homepage directly.
INFO     [agent] 🧠 Memory: The task is to find and navigate to Reddit's homepage.
INFO     [agent] 🎯 Next goal: Click on the Reddit homepage link to access the website.
INFO     [agent] 🛠️  Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":{"index":18,"xpath":null},"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO     [controller] 🖱️  Clicked index 18
INFO     [agent] 
📍 Step 3
INFO     [agent] 👍 Eval: Success - Navigated to Reddit's homepage successfully.
INFO     [agent] 🧠 Memory: Currently on Reddit. Need to search for 'browser-use'.
INFO     [agent] 🎯 Next goal: Input 'browser-use' into the search bar on Reddit and initiate the search.
INFO     [agent] 🛠️  Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":null,"input_text":{"index":2,"text":"browser-use","xpath":null},"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":{"keys":"Enter"},"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO     [controller] ⌨️  Input "browser-use" into index 2
INFO     [agent] 
📍 Step 4
INFO     [agent] 👍 Eval: Success - Input was entered into the search bar for 'browser-use'.
INFO     [agent] 🧠 Memory: Search for 'browser-use' initiated on Reddit.
INFO     [agent] 🎯 Next goal: Click on the option to search for 'browser-use'.
INFO     [agent] 🛠️  Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":{"index":15,"xpath":null},"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO     [controller] 🖱️  Clicked index 15
INFO     [agent] 
📍 Step 5
INFO     [agent] 👍 Eval: Success - Search results for 'browser-use' are displayed.
INFO     [agent] 🧠 Memory: Search results loaded. Next step is to click on the first post.
INFO     [agent] 🎯 Next goal: Click on the first post in the search results to view its details.
INFO     [agent] 🛠️  Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":{"index":10,"xpath":null},"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
WARNING  [controller] Element no longer available with index 10 - most likely the page changed
INFO     [agent] 
📍 Step 6
INFO     [agent] ⚠ Eval: Failed - Attempted to click on an incorrect element. First post was not selected.
INFO     [agent] 🧠 Memory: Need to correctly select the first post related to 'browser-use'.
INFO     [agent] 🎯 Next goal: Click the first post titled 'Browser Use' under 'r/LocalLLaMA'.
INFO     [agent] 🛠️  Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":{"index":12,"xpath":null},"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO     [controller] 🖱️  Clicked index 12
INFO     [agent] 
📍 Step 7
INFO     [agent] 👍 Eval: Success - Successfully navigated to the post page 'Browser Use'.
INFO     [agent] 🧠 Memory: Post details accessed. Ready to view comments.
INFO     [agent] 🎯 Next goal: Scroll to and click the 'Go to comments' button to view the comments.
INFO     [agent] 🛠️  Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":{"index":11,"xpath":null},"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO     [controller] 🖱️  Clicked index 11
INFO     [agent] 
📍 Step 8
INFO     [agent] 👍 Eval: Success - First comment is visible on the post page.
INFO     [agent] 🧠 Memory: Accessed the first comment of the post.
INFO     [agent] 🎯 Next goal: Retrieve the first comment.
INFO     [agent] 🛠️  Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":null,"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":{"text":"The first comment under the post:\n\n\"The readme says it supports Llama 405B but no examples are provided :( It seems a model with multiple images and tool calling is required\"\n\nComment by: grigio"},"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO     [agent] 📄 Result: The first comment under the post:

"The readme says it supports Llama 405B but no examples are provided :( It seems a model with multiple images and tool calling is required"

Comment by: grigio
INFO     [agent] ✅ Task completed successfully
INFO     [agent] Created GIF at agent_history.gif
AgentHistoryList(all_results=[ActionResult(is_done=False, extracted_content='🔍  Searched for "Reddit website homepage" in Google', error=None, include_in_memory=True), ActionResult(is_done=False, extracted_content='🖱️  Clicked index 18', error=None, include_in_memory=True), ActionResult(is_done=False, extracted_content='⌨️  Input "browser-use" into index 2', error=None, include_in_memory=True), ActionResult(is_done=False, extracted_content='🖱️  Clicked index 15', error=None, include_in_memory=True), ActionResult(is_done=False, extracted_content=None, error='Failed to click element: <img src="https://b.thumbs.redditmedia.com/5CjwrCLiYs0_bivakTc5BmgQWT-J6-x0aaJSB99IPEc.png" srcset="" sizes="" alt="" browser-user-highlight-id="playwright-highlight-10"> [interactive, top, highlight:10]. Error: Element: <img src="https://b.thumbs.redditmedia.com/5CjwrCLiYs0_bivakTc5BmgQWT-J6-x0aaJSB99IPEc.png" srcset="" sizes="" alt="" browser-user-highlight-id="playwright-highlight-10"> [interactive, top, highlight:10] not found', include_in_memory=False), ActionResult(is_done=False, extracted_content='🖱️  Clicked index 12', error=None, include_in_memory=True), ActionResult(is_done=False, extracted_content='🖱️  Clicked index 11', error=None, include_in_memory=True), ActionResult(is_done=True, extracted_content='The first comment under the post:\n\n"The readme says it supports Llama 405B but no examples are provided :( It seems a model with multiple images and tool calling is required"\n\nComment by: grigio', error=None, include_in_memory=False)], all_model_outputs=[{'search_google': {'query': 'Reddit website homepage'}, 'go_to_url': {'url': 'https://www.reddit.com/'}, 'go_back': {}, 'click_element': {'index': 1}, 'input_text': {'index': 5, 'text': 'browser use'}, 'switch_tab': {'page_id': 0}, 'open_tab': {'url': 'https://www.reddit.com/'}, 'extract_content': {'value': 'text'}, 'done': {'text': "Extract first comment from the first Reddit post related to 'browser-use'. 
\nFirst comment: [Extracted text] "}, 'scroll_down': {'amount': 50}, 'scroll_up': {}, 'send_keys': {'keys': 'Enter'}, 'scroll_to_text': {'text': 'Comments'}, 'get_dropdown_options': {'index': 0}, 'select_dropdown_option': {'index': 0, 'text': 'browser use'}}, {'click_element': {'index': 18}}, {'input_text': {'index': 2, 'text': 'browser-use'}, 'send_keys': {'keys': 'Enter'}}, {'click_element': {'index': 15}}, {'click_element': {'index': 10}}, {'click_element': {'index': 12}}, {'click_element': {'index': 11}}, {'done': {'text': 'The first comment under the post:\n\n"The readme says it supports Llama 405B but no examples are provided :( It seems a model with multiple images and tool calling is required"\n\nComment by: grigio'}}])

$
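
Each "Action" line in the log is a JSON object that lists every available action, with null for the ones the model did not choose. The sketch below, using only the standard library, filters one such payload down to the actions that were filled in. The payload is copied from Step 3 of the log; `chosen_actions` is a hypothetical helper, not part of browser-use.

```python
import json

# Action payload copied from Step 3 of the log.
STEP3_ACTION = (
    '{"search_google":null,"go_to_url":null,"go_back":null,"click_element":null,'
    '"input_text":{"index":2,"text":"browser-use","xpath":null},'
    '"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,'
    '"scroll_down":null,"scroll_up":null,"send_keys":{"keys":"Enter"},'
    '"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}'
)


def chosen_actions(payload: str) -> dict:
    """Keep only the actions the model filled in (non-null values)."""
    return {
        name: args
        for name, args in json.loads(payload).items()
        if args is not None
    }


selected = chosen_actions(STEP3_ACTION)
for name, args in selected.items():
    print(name, args)
```

For Step 3 this leaves `input_text` and `send_keys`. The controller line that follows in the log reports only the text input, which is consistent with the "Action 1/1" label on that step.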

Screen operation log

By default, a recording of the screen operations is saved as agent_history.gif in the directory where the script was run.

Closing thoughts

This was just a quick look at how it behaves.
It is quite an interesting project, so I plan to dig deeper and post new findings.

Reference links

https://note.com/shi3zblog/n/n960fc72b36e9?sub_rt=share_b

¹ A certain manufacturer has registered a trademark that uses "WebAgent", so the naming may change in the future.
