[Python] Scraping an HTML table from a JavaScript-rendered page (Playwright→BeautifulSoup4→pandas)

Last updated at 2022-12-03, posted at 2022-03-12

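1. Introduction

When you scrape the web using only pandas' read_html() in Python, JavaScript-rendered pages inevitably get in the way.
This article shows a way around that: leave the data processing to pandas, and use Playwright and BeautifulSoup4 only as much as is needed to fetch the HTML table data.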
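To illustrate why read_html() alone falls short, here is a minimal sketch that compares a page as the server sends it with the same page after a browser has rendered it. Both HTML strings and the helper name count_tables are made up for this illustration. flavor="lxml" is passed so that only the lxml parser is required.

```python
from io import StringIO

import pandas as pd

# Hypothetical page source as served, before any JavaScript has run:
# the table is built by a script, so no <table> tag exists yet.
RAW_HTML = """
<html><body>
<div id="app"></div>
<script>/* builds the table in the browser */</script>
</body></html>
"""

# The same hypothetical page after a browser has rendered it.
RENDERED_HTML = """
<html><body>
<div id="app">
<table>
<tr><th>name</th><th>price</th></tr>
<tr><td>apple</td><td>120</td></tr>
</table>
</div>
</body></html>
"""


def count_tables(html: str) -> int:
    """Return how many tables pandas can read from an HTML string."""
    try:
        return len(pd.read_html(StringIO(html), flavor="lxml"))
    except ValueError:
        # read_html raises ValueError when it finds no table at all.
        return 0


print(count_tables(RAW_HTML))       # 0
print(count_tables(RENDERED_HTML))  # 1
```

The browser step exists only to turn the first kind of HTML into the second. Everything after that is ordinary pandas.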

2. Environment

Windows 10 Pro: 21H2
Python: 3.10.5

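3. Libraries to install

playwright: browser automation (I went from Selenium to puppeteer to Playwright, and this one has been the most stable for me)
beautifulsoup4: HTML parser
lxml: HTML parser (used by beautifulsoup4)
pandas: data analysis library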

pip install pandas 
pip install beautifulsoup4
pip install lxml
pip install playwright
playwright install
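After running the commands above, a quick check like the following can confirm that the packages are importable. This is a small sketch using only the standard library. The helper name missing_modules is made up here, and the check does not verify that playwright install actually downloaded the browsers.

```python
from importlib.util import find_spec

# Import names of the packages installed above
# (beautifulsoup4 is imported as "bs4").
REQUIRED_MODULES = ["pandas", "bs4", "lxml", "playwright"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required modules are importable.")
```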

4. Sample code

GitHub

get_html.py

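from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

# Read HTML tables via Playwright -> BeautifulSoup4 -> pandas
def get_html(playwright):
    browser = None
    try:
        # Playwright: open the page and get the HTML after JavaScript has run
        browser = playwright.chromium.launch()
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://URL-of-the-website-you-want-to-scrape")
        html = page.content()
        # BeautifulSoup4: extract only the table elements
        soup = BeautifulSoup(html, "lxml")
        tables = soup.find_all("table")
        # pandas: convert the tables to a list of DataFrames
        # (wrapped in StringIO because passing literal HTML is deprecated since pandas 2.1)
        dfs = pd.read_html(StringIO(str(tables)))

    except Exception as e:
        # Write error handling here (HTTP errors, no table tag, etc.)
        print("Error:", e)

    else:
        # Write the pandas DataFrame processing here
        print(dfs)

    finally:
        # Closing the browser also closes its contexts and pages
        if browser is not None:
            browser.close()

# Call get_html
with sync_playwright() as playwright:
    get_html(playwright)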
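One possible variation, not part of the original sample, is to split fetching from parsing and to wait until the table has actually been rendered before reading the HTML. The sketch below assumes the target table matches the CSS selector "table". The function names fetch_rendered_html and parse_tables and the sample HTML are made up for this illustration. page.wait_for_selector() is Playwright's call for waiting on an element.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def fetch_rendered_html(url: str, selector: str = "table") -> str:
    """Open url in headless Chromium and return the HTML once selector appears."""
    # Imported here so that parse_tables() can be used without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            # Wait for the script-generated element to show up
            # (raises a Playwright timeout error if it never does).
            page.wait_for_selector(selector)
            return page.content()
        finally:
            browser.close()


def parse_tables(html: str) -> list[pd.DataFrame]:
    """Return every <table> in html as a DataFrame, or [] if none can be read."""
    tables = BeautifulSoup(html, "lxml").find_all("table")
    if not tables:
        return []
    try:
        markup = "".join(str(table) for table in tables)
        return pd.read_html(StringIO(markup), flavor="lxml")
    except ValueError:
        # <table> tags exist, but pandas found no readable cells in them.
        return []


# Offline check of the parsing step with a hand-written sample.
SAMPLE_HTML = (
    "<table><tr><th>id</th><th>name</th></tr>"
    "<tr><td>1</td><td>foo</td></tr></table>"
)
dfs = parse_tables(SAMPLE_HTML)
print(dfs[0])
```

For a real page, the two steps would be chained as parse_tables(fetch_rendered_html(url)), with url being the page to scrape. Keeping the parsing step separate makes it easy to test without launching a browser.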

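5. Supplement

After I switched from Puppeteer to Playwright and stopped using asynchronous processing, the following error no longer occurred.

Warning
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.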

6. Sites I referred to
