Scraping Dynamic Sites with Selenium

Posted at 2024-06-05

Overview

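One way to scrape a website that requires a login is to use Selenium, which automates a web browser.
The general flow is to get past the login with Selenium and then parse the retrieved HTML with BeautifulSoup.
As a test, this article retrieves the results of searching Google for "Python".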
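
The login flow itself is not demonstrated in this article, so the following is only a rough sketch of what it might look like. The function name, the form field names (username, password), and the submit-button selector are all hypothetical placeholders; a real site will need its own locators.

```python
def fetch_html_after_login(login_url, target_url, username, password, timeout=10):
    """Log in with Selenium and return the HTML of target_url.

    Hypothetical sketch: the field names and selector below are assumptions.
    """
    # Imported inside the function so that this module still loads
    # on machines where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(login_url)
        # Remember the URL actually shown, in case login_url redirected
        start_url = driver.current_url
        # Assumed form layout: <input name="username">, <input name="password">
        driver.find_element(By.NAME, "username").send_keys(username)
        driver.find_element(By.NAME, "password").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
        # Wait until the browser has left the login page
        WebDriverWait(driver, timeout).until(EC.url_changes(start_url))
        driver.get(target_url)
        return driver.page_source
    finally:
        # Always close the browser, even if a step above fails
        driver.quit()
```

The returned HTML string can then be handed to BeautifulSoup(html, "html.parser") for parsing.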

Environment

Tested with

Python 3.9.9
Windows 11

This article assumes that Python already runs on your Windows machine.

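Also, with Selenium 4.6 or later, it seems you no longer need to download ChromeDriver manually (the bundled Selenium Manager resolves a matching driver for you).
See below for details.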
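
To check whether the installed Selenium is new enough for this, its version can be compared against 4.6. The sketch below uses only the standard library; the helper names are made up for illustration.

```python
from importlib import metadata


def has_selenium_manager(version):
    """Return True if the version string is 4.6 or later."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (4, 6)


def installed_selenium_version():
    """Return the installed selenium version string, or None if absent."""
    try:
        return metadata.version("selenium")
    except metadata.PackageNotFoundError:
        return None


version = installed_selenium_version()
if version is None:
    print("selenium is not installed in this environment")
elif has_selenium_manager(version):
    print(f"selenium {version}: ChromeDriver should be fetched automatically")
else:
    print(f"selenium {version}: install ChromeDriver manually or upgrade")
```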

1. Create a virtual environment

Move to your working directory and run the following command.

$ python -m venv .venv

This creates a directory named .venv directly under the working directory.

2. Activate the virtual environment

$ .venv\Scripts\activate

If it works, the Command Prompt prefix should change from C:\Users\xxx> to (.venv) C:\Users\xxx>.
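
Besides looking at the prompt, activation can also be confirmed from Python itself. This is a small sketch using only the standard library; the function name is just for illustration.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # In a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

When the environment is active, the interpreter path printed on the second line should sit inside the .venv directory.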

3. Install the required libraries

Update pip, then install selenium, requests, and bs4 (the bs4 package on PyPI is a thin wrapper that installs beautifulsoup4).

$ python -m pip install -U pip
$ python -m pip install selenium requests bs4

The following command lists the installed libraries.

$ python -m pip list

Running the program

Parts of the script are quoted from this article.

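Copy and paste the Python script below (scraping.py) and run it. If the result looks like the output shown after the script, it worked! (Because search rankings change, it may not match exactly.)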

scraping.py

# Import libraries
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Set up the WebDriver
driver = webdriver.Chrome()

# Open the Google page
driver.get("https://google.com")

# Locate the search box
elm = driver.find_element(By.NAME, "q")

# Type "Python" into the search box
elm.send_keys("Python")

# Press the Enter key
elm.send_keys(Keys.ENTER)

# Wait briefly for the results page to load
sleep(1)

# Get the HTML of the search results
# (page_source is already a str, so no manual encoding is needed)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

# Extract and print the result titles
# The elements are specified with a CSS selector
# (these class names are generated by Google and may change)
print(*soup.select(".LC20lb.MBeuO.DKV0Md"), sep="\n")

# Quit the WebDriver
driver.quit()
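
A fixed sleep(1) can be too short on a slow connection and wasteful on a fast one. Selenium's explicit waits block only until a condition holds. The sketch below is one way the script could be adapted, not the original author's code; the helper name and the choice to wait for any h3 element are assumptions.

```python
def wait_for_result_titles(driver, timeout=10):
    """Block until at least one <h3> is present, then return the page HTML."""
    # Imported inside the function so the module loads without selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Raises selenium.common.exceptions.TimeoutException if no <h3>
    # appears within `timeout` seconds.
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h3"))
    )
    return driver.page_source
```

In scraping.py, a call such as html = wait_for_result_titles(driver) would take the place of the sleep(1) line and the page_source line.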

output

<h3 class="LC20lb MBeuO DKV0Md">python.jp: プログラミング言語 Python 総合情報サイト</h3>
<h3 class="LC20lb MBeuO DKV0Md">Welcome to Python.org</h3>
<h3 class="LC20lb MBeuO DKV0Md">初心者からPythonを実務で使えるためのスキル習得時間は？お勧めの勉強 ...</h3>
<h3 class="LC20lb MBeuO DKV0Md">Pythonでできることとは？機械学習を使ったAI開発をはじめとした実例を ...</h3>
<h3 class="LC20lb MBeuO DKV0Md">Pythonとは？3分で分かる人気の理由と基礎知識 - かっこ株式会社</h3>
<h3 class="LC20lb MBeuO DKV0Md">Pythonでできること6選！仕事への活用方法から学習方法まで解説</h3>
<h3 class="LC20lb MBeuO DKV0Md">Python</h3>
<h3 class="LC20lb MBeuO DKV0Md">【入門】Pythonとは｜活用事例やメリット、できること、学習方法 ...</h3>
<h3 class="LC20lb MBeuO DKV0Md">Pythonの開発環境を用意しよう！（Windows）</h3>
<h3 class="LC20lb MBeuO DKV0Md">Pythonとは？プログラミング言語の用途を初心者向けに解説</h3>
<h3 class="LC20lb MBeuO DKV0Md">Python Release Python 3.12.3</h3>
<h3 class="LC20lb MBeuO DKV0Md">O'Reilly Japan - Pythonクイックリファレンス 第4版</h3>
<h3 class="LC20lb MBeuO DKV0Md">Pythonとは？開発に役立つ使い方、トレンド記事やtips</h3>
<h3 class="LC20lb MBeuO DKV0Md">新・Python入門編 | プログラミング学習サイト【paiza ...</h3>
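
The output above contains whole h3 tags. If only the title text is needed, get_text() can be called on each element. The sketch below uses two of the titles above as static input so that it runs without a browser; the surrounding sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Static sample standing in for driver.page_source
html = """
<div>
  <h3 class="LC20lb MBeuO DKV0Md">Welcome to Python.org</h3>
  <h3 class="LC20lb MBeuO DKV0Md">Python Release Python 3.12.3</h3>
  <h3 class="other">Not a search result title</h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Same CSS selector as scraping.py: elements carrying all three classes
titles = [h3.get_text(strip=True) for h3 in soup.select(".LC20lb.MBeuO.DKV0Md")]

for title in titles:
    print(title)
```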

Summary

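As a way to scrape dynamic sites, this article tried combining Selenium with BeautifulSoup.
Incidentally, this particular case can also be handled like a static site: set the query parameter and send a request to https://www.google.com/search?q=Python.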
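
As a sketch of that alternative, the query string can be built with the standard library. The /search path and the helper names are assumptions for illustration, and the sketch uses urllib rather than requests to stay dependency-free. Google may answer non-browser clients with a consent or block page, so the fetch is not guaranteed to return the same HTML that Selenium sees.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_url(query):
    """Return a Google search URL with the query passed as the q parameter."""
    return "https://www.google.com/search?" + urlencode({"q": query})


def fetch_html(url, timeout=10):
    """Fetch a page as text. Not called here, since it needs network access."""
    # A browser-like User-Agent; many sites reject the default urllib one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(build_search_url("Python"))
print(build_search_url("Python selenium"))
```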
