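Overview
What is browser-use?
LLMs and WebAgents
Research is advancing on using large language models to operate the real world (or at least the part of it that is reachable through a browser). As of autumn 2024, such systems seem to be called WebAgents or AI agents (see footnote 1).
[Research goal] Performing web operations on behalf of humans
All of this research shares one goal: a person tells the computer what they want, and the computer understands the request and carries out the web operations on their behalf. To borrow a car analogy, it is something like "self-driving for web operations."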
browser-use
browser-use is one such AI agent implementation. Its repository is published under the MIT License.
GitHub repository
https://github.com/browser-use/browser-use
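How does it work?
As an AI agent, it:
- Decides what needs to be done on the specified site and generates a sequence of operation steps
- For each operation step:
  - Analyzes the screen of the target site
  - Generates the concrete operation to perform through Playwright (in the run log below, this appears as a JSON action such as click_element)
  - If the operation fails, generates it again and retries
- If it still fails, revises the operation steps
- Once a result is obtained, outputs it as Markdown, screenshots, and so on as appropriate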
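The plan, act, retry, and re-plan flow described above can be sketched in a few lines of Python. This is only a conceptual illustration: run_steps, make_plan, and try_step are hypothetical names invented for this sketch, and none of it is taken from the browser-use source code.

```python
from typing import Callable, List

# All names here (run_steps, make_plan, try_step) are hypothetical and only
# illustrate the control flow; they are not browser-use APIs.


def run_steps(
    make_plan: Callable[[str, int], List[str]],
    try_step: Callable[[str, int], bool],
    task: str,
    max_retries: int = 2,
    max_replans: int = 1,
) -> List[str]:
    """Return a log of what happened while working through the task."""
    log: List[str] = []
    for plan_no in range(max_replans + 1):
        steps = make_plan(task, plan_no)  # decide what to do on the site
        log.append(f"plan {plan_no}: {steps}")
        all_ok = True
        for step in steps:
            # Retry the same step a few times before giving up on this plan.
            for attempt in range(max_retries + 1):
                if try_step(step, attempt):
                    log.append(f"ok: {step} (attempt {attempt})")
                    break
                log.append(f"failed: {step} (attempt {attempt})")
            else:
                all_ok = False  # retries exhausted, so revise the plan
                break
        if all_ok:
            log.append("done")
            return log
    log.append("gave up")
    return log


# Stub planner and executor: the first plan always fails, the revised one works.
def make_plan(task: str, plan_no: int) -> List[str]:
    return ["search google"] if plan_no == 0 else ["open reddit.com"]


def try_step(step: str, attempt: int) -> bool:
    return step == "open reddit.com"


log = run_steps(make_plan, try_step, "visit Reddit")
for entry in log:
    print(entry)
```

In the real tool, an LLM plays the role of both stubs, which is why the run log later in this article shows an evaluation, a memory note, and a next goal at every step.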
Using LLMs other than OpenAI
Using an LLM other than OpenAI's is also easy. The examples directory of the repository shows how to connect to Qwen, Ollama, Gemini, and others.
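As a rough illustration of what switching providers can look like, here is a minimal sketch. It assumes a browser-use release that accepts LangChain chat models and that the matching LangChain package (langchain-openai, langchain-ollama, or langchain-google-genai) is installed. The helper make_llm and the model names are assumptions chosen for this example, so check the examples directory for the form that matches your installed version.

```python
def make_llm(provider: str):
    """Return a LangChain chat model for the given provider name.

    make_llm is a hypothetical helper for this article, and the model names
    below are illustrative assumptions.
    """
    # Imports are local so that only the package for the chosen provider
    # has to be installed.
    if provider == "openai":
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-4o")
    if provider == "ollama":
        from langchain_ollama import ChatOllama
        return ChatOllama(model="qwen2.5")
    if provider == "gemini":
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model="gemini-1.5-flash")
    raise ValueError(f"unknown provider: {provider}")
```

The returned object would then be passed as the llm argument when constructing the agent, in place of the OpenAI model.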
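Browser operation uses Playwright
Browser operation is handled by Playwright (open-source software from Microsoft).
Similar research
Voyager (2023), which plays Minecraft automatically, seems to be the closest comparison; it is explained in the article "Voyager, which plays Minecraft automatically."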
Trying it out
Preparation
Prepare a directory
Create a working directory anywhere you like and move into it in a terminal.
Installation
# Create a Python environment (it does not have to be venv)
$ python3 -m venv venv
$ source venv/bin/activate
# Install browser-use and the browsers Playwright needs
$ pip install browser-use
$ playwright install
Save the OpenAI API credential (API key)
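A common approach, assumed here rather than taken from the original steps, is to export the key as the OPENAI_API_KEY environment variable (for example from a .env file), since ChatOpenAI from langchain-openai reads that variable by default. The small check below is a sketch for confirming the key is visible to Python before starting the agent; require_api_key is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def require_api_key(
    name: str = "OPENAI_API_KEY",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the API key from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or load it from a "
            ".env file before running the agent."
        )
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        # Never print the key itself; only report that it was found.
        print("OPENAI_API_KEY found.")
    except RuntimeError as exc:
        print(exc)
```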
Prepare the script to run
Save the demo script from the repository's README.md to a file.
Run
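For reference, here is a sketch of what such a script might look like, saved as 00-helloworld.py. The task text is copied from the run log in this article. Everything else is an assumption: that the installed browser-use release accepts LangChain chat models, that langchain-openai is installed, and that gpt-4o is the model in use. The README in the repository remains the authoritative version.

```python
"""00-helloworld.py: minimal browser-use demo (a sketch, not the official script)."""
import asyncio
import os

# Task text copied from the run log shown in this article.
TASK = (
    "Go to Reddit, search for 'browser-use' in the search bar, "
    "click on the first post and return the first comment."
)


async def main():
    # Imported lazily so the file can be loaded without the packages installed.
    # Assumption: a browser-use release that accepts LangChain chat models.
    from browser_use import Agent
    from langchain_openai import ChatOpenAI

    # The model name is an assumption; use whichever model your key can access.
    agent = Agent(task=TASK, llm=ChatOpenAI(model="gpt-4o"))
    result = await agent.run()
    print(result)


if __name__ == "__main__":
    if os.environ.get("OPENAI_API_KEY"):
        asyncio.run(main())
    else:
        print("OPENAI_API_KEY is not set; skipping the run.")
```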
$ python3 ./00-helloworld.py
INFO [browser_use] BrowserUse logging setup complete with level info
INFO [root] Anonymized telemetry enabled. See https://github.com/gregpr07/browser-use for more information.
INFO [agent] 🚀 Starting task: Go to Reddit, search for 'browser-use' in the search bar, click on the first post and return the first comment.
INFO [agent]
📍 Step 1
INFO [agent] 👍 Eval: Success - The task has started on a blank page and I need to navigate to Reddit.
INFO [agent] 🧠 Memory: Begin task on Reddit.
INFO [agent] 🎯 Next goal: Navigate to Reddit's homepage and perform a search for 'browser-use'.
INFO [agent] 🛠️ Action 1/1: {"search_google":{"query":"Reddit website homepage"},"go_to_url":{"url":"https://www.reddit.com/"},"go_back":{},"click_element":{"index":1,"xpath":null},"input_text":{"index":5,"text":"browser use","xpath":null},"switch_tab":{"page_id":0},"open_tab":{"url":"https://www.reddit.com/"},"extract_content":{"value":"text"},"done":{"text":"Extract first comment from the first Reddit post related to 'browser-use'. \nFirst comment: [Extracted text] "},"scroll_down":{"amount":50},"scroll_up":{"amount":null},"send_keys":{"keys":"Enter"},"scroll_to_text":{"text":"Comments"},"get_dropdown_options":{"index":0},"select_dropdown_option":{"index":0,"text":"browser use"}}
INFO [controller] 🔍 Searched for "Reddit website homepage" in Google
INFO [agent]
📍 Step 2
INFO [agent] ⚠ Eval: Failed - I mistakenly searched Google for Reddit instead of navigating to the Reddit homepage directly.
INFO [agent] 🧠 Memory: The task is to find and navigate to Reddit's homepage.
INFO [agent] 🎯 Next goal: Click on the Reddit homepage link to access the website.
INFO [agent] 🛠️ Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":{"index":18,"xpath":null},"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO [controller] 🖱️ Clicked index 18
INFO [agent]
📍 Step 3
INFO [agent] 👍 Eval: Success - Navigated to Reddit's homepage successfully.
INFO [agent] 🧠 Memory: Currently on Reddit. Need to search for 'browser-use'.
INFO [agent] 🎯 Next goal: Input 'browser-use' into the search bar on Reddit and initiate the search.
INFO [agent] 🛠️ Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":null,"input_text":{"index":2,"text":"browser-use","xpath":null},"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":{"keys":"Enter"},"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO [controller] ⌨️ Input "browser-use" into index 2
INFO [agent]
📍 Step 4
INFO [agent] 👍 Eval: Success - Input was entered into the search bar for 'browser-use'.
INFO [agent] 🧠 Memory: Search for 'browser-use' initiated on Reddit.
INFO [agent] 🎯 Next goal: Click on the option to search for 'browser-use'.
INFO [agent] 🛠️ Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":{"index":15,"xpath":null},"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO [controller] 🖱️ Clicked index 15
INFO [agent]
📍 Step 5
INFO [agent] 👍 Eval: Success - Search results for 'browser-use' are displayed.
INFO [agent] 🧠 Memory: Search results loaded. Next step is to click on the first post.
INFO [agent] 🎯 Next goal: Click on the first post in the search results to view its details.
INFO [agent] 🛠️ Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":{"index":10,"xpath":null},"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
WARNING [controller] Element no longer available with index 10 - most likely the page changed
INFO [agent]
📍 Step 6
INFO [agent] ⚠ Eval: Failed - Attempted to click on an incorrect element. First post was not selected.
INFO [agent] 🧠 Memory: Need to correctly select the first post related to 'browser-use'.
INFO [agent] 🎯 Next goal: Click the first post titled 'Browser Use' under 'r/LocalLLaMA'.
INFO [agent] 🛠️ Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":{"index":12,"xpath":null},"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO [controller] 🖱️ Clicked index 12
INFO [agent]
📍 Step 7
INFO [agent] 👍 Eval: Success - Successfully navigated to the post page 'Browser Use'.
INFO [agent] 🧠 Memory: Post details accessed. Ready to view comments.
INFO [agent] 🎯 Next goal: Scroll to and click the 'Go to comments' button to view the comments.
INFO [agent] 🛠️ Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":{"index":11,"xpath":null},"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":null,"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO [controller] 🖱️ Clicked index 11
INFO [agent]
📍 Step 8
INFO [agent] 👍 Eval: Success - First comment is visible on the post page.
INFO [agent] 🧠 Memory: Accessed the first comment of the post.
INFO [agent] 🎯 Next goal: Retrieve the first comment.
INFO [agent] 🛠️ Action 1/1: {"search_google":null,"go_to_url":null,"go_back":null,"click_element":null,"input_text":null,"switch_tab":null,"open_tab":null,"extract_content":null,"done":{"text":"The first comment under the post:\n\n\"The readme says it supports Llama 405B but no examples are provided :( It seems a model with multiple images and tool calling is required\"\n\nComment by: grigio"},"scroll_down":null,"scroll_up":null,"send_keys":null,"scroll_to_text":null,"get_dropdown_options":null,"select_dropdown_option":null}
INFO [agent] 📄 Result: The first comment under the post:
"The readme says it supports Llama 405B but no examples are provided :( It seems a model with multiple images and tool calling is required"
Comment by: grigio
INFO [agent] ✅ Task completed successfully
INFO [agent] Created GIF at agent_history.gif
AgentHistoryList(all_results=[ActionResult(is_done=False, extracted_content='🔍 Searched for "Reddit website homepage" in Google', error=None, include_in_memory=True), ActionResult(is_done=False, extracted_content='🖱️ Clicked index 18', error=None, include_in_memory=True), ActionResult(is_done=False, extracted_content='⌨️ Input "browser-use" into index 2', error=None, include_in_memory=True), ActionResult(is_done=False, extracted_content='🖱️ Clicked index 15', error=None, include_in_memory=True), ActionResult(is_done=False, extracted_content=None, error='Failed to click element: <img src="https://b.thumbs.redditmedia.com/5CjwrCLiYs0_bivakTc5BmgQWT-J6-x0aaJSB99IPEc.png" srcset="" sizes="" alt="" browser-user-highlight-id="playwright-highlight-10"> [interactive, top, highlight:10]. Error: Element: <img src="https://b.thumbs.redditmedia.com/5CjwrCLiYs0_bivakTc5BmgQWT-J6-x0aaJSB99IPEc.png" srcset="" sizes="" alt="" browser-user-highlight-id="playwright-highlight-10"> [interactive, top, highlight:10] not found', include_in_memory=False), ActionResult(is_done=False, extracted_content='🖱️ Clicked index 12', error=None, include_in_memory=True), ActionResult(is_done=False, extracted_content='🖱️ Clicked index 11', error=None, include_in_memory=True), ActionResult(is_done=True, extracted_content='The first comment under the post:\n\n"The readme says it supports Llama 405B but no examples are provided :( It seems a model with multiple images and tool calling is required"\n\nComment by: grigio', error=None, include_in_memory=False)], all_model_outputs=[{'search_google': {'query': 'Reddit website homepage'}, 'go_to_url': {'url': 'https://www.reddit.com/'}, 'go_back': {}, 'click_element': {'index': 1}, 'input_text': {'index': 5, 'text': 'browser use'}, 'switch_tab': {'page_id': 0}, 'open_tab': {'url': 'https://www.reddit.com/'}, 'extract_content': {'value': 'text'}, 'done': {'text': "Extract first comment from the first Reddit post related to 'browser-use'. 
\nFirst comment: [Extracted text] "}, 'scroll_down': {'amount': 50}, 'scroll_up': {}, 'send_keys': {'keys': 'Enter'}, 'scroll_to_text': {'text': 'Comments'}, 'get_dropdown_options': {'index': 0}, 'select_dropdown_option': {'index': 0, 'text': 'browser use'}}, {'click_element': {'index': 18}}, {'input_text': {'index': 2, 'text': 'browser-use'}, 'send_keys': {'keys': 'Enter'}}, {'click_element': {'index': 15}}, {'click_element': {'index': 10}}, {'click_element': {'index': 12}}, {'click_element': {'index': 11}}, {'done': {'text': 'The first comment under the post:\n\n"The readme says it supports Llama 405B but no examples are provided :( It seems a model with multiple images and tool calling is required"\n\nComment by: grigio'}}])
$
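In the log above, each Action line is a JSON object that lists every available action, with null for the ones the model did not choose. The final AgentHistoryList shows the same data with the null entries dropped. The following sketch reproduces that filtering for a single log line, which can help when reading long logs. chosen_actions is a hypothetical helper written for this article, not part of browser-use.

```python
import json


def chosen_actions(action_json: str) -> dict:
    """Keep only the actions (and parameters) that are not null."""
    raw = json.loads(action_json)
    return {
        name: {k: v for k, v in params.items() if v is not None}
        for name, params in raw.items()
        if params is not None
    }


# Abbreviated from the Step 3 log line above (most null entries omitted).
line = (
    '{"search_google":null,"click_element":null,'
    '"input_text":{"index":2,"text":"browser-use","xpath":null},'
    '"send_keys":{"keys":"Enter"},"done":null}'
)
print(chosen_actions(line))
```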
Screen operation log
By default, a recording of the screen operations is saved as agent_history.gif in the directory where the script was run.
Closing thoughts
This was only a quick look at how it runs.
It is a genuinely interesting project, so I plan to dig deeper and post what I find.
Reference links
-
1. A certain manufacturer has obtained a trademark that uses "WebAgent," so the terminology may change in the future.