Fetching Table Data from the Web with Pandas

Posted at 2025-05-27

Have you ever struggled to scrape a table?

If a website contains a table like the one below, you can easily load it as a DataFrame with Pandas, a Python library.

Name	Age	Hometown
Sato	24	Saitama
Suzuki	53	Kanagawa

This post shows how to use Pandas' read_html function to quickly fetch HTML tables from the web.

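🎯 What is `read_html`?

read_html is a Pandas function that reads the contents of `<table>` tags in an HTML document and returns them as a list of DataFrames, one per table.

It takes only a few lines of code.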

import pandas as pd

tables = pd.read_html("{URL of the target website}")
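As a minimal sketch that runs without network access, the example below feeds the sample table from the introduction to read_html through `io.StringIO`. The HTML string is a hypothetical stand-in for a fetched page, and the code assumes the lxml parser is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML standing in for a page fetched from the web.
html = """
<table>
  <tr><th>Name</th><th>Age</th><th>Hometown</th></tr>
  <tr><td>Sato</td><td>24</td><td>Saitama</td></tr>
  <tr><td>Suzuki</td><td>53</td><td>Kanagawa</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(html))

# Pick the first (and here, only) table.
df = tables[0]
print(df)
```

Because the result is always a list, you index into it even when the page has a single table.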

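💡 Basic usage

The main arguments of read_html are as follows.

pd.read_html(
    io,                 # URL, path to a local HTML file, or file-like object (wrap raw HTML strings in io.StringIO)
    match=".+",         # keep only tables whose text matches this string or regex
    flavor=None,        # parser to use: "lxml", "bs4", or "html5lib"
    header=None,        # row(s) to use as the column header
    index_col=None,     # column(s) to use as the index
    attrs=None,         # keep only tables that have these HTML attributes
)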
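To illustrate `header` and `index_col`, here is a small sketch on a hypothetical table whose first row uses plain `<td>` cells, so Pandas cannot infer the header by itself. It assumes lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: the header row is written with <td>, not <th>.
html = """
<table>
  <tr><td>Name</td><td>Age</td><td>Hometown</td></tr>
  <tr><td>Sato</td><td>24</td><td>Saitama</td></tr>
  <tr><td>Suzuki</td><td>53</td><td>Kanagawa</td></tr>
</table>
"""

# Without header=, the first row is kept as data and columns are numbered.
raw = pd.read_html(StringIO(html))[0]

# header=0 promotes the first row to column labels;
# index_col=0 then uses the "Name" column as the index.
df = pd.read_html(StringIO(html), header=0, index_col=0)[0]

print(raw)
print(df)
```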

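🌸 Other things you can do (a few examples)

Fetching only a specific table

For example, if you only want tables that contain the string "Population", use the match argument. It accepts a plain string or a regular expression.

tables = pd.read_html(url, match="Population")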
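The effect of `match` is easiest to see on a page with more than one table. The sketch below uses a hypothetical two-table HTML string with placeholder values and assumes lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; all values are placeholders.
html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>A</td><td>100</td></tr>
</table>
<table>
  <tr><th>City</th><th>Area</th></tr>
  <tr><td>A</td><td>10</td></tr>
</table>
"""

# Without match, every table on the page is returned.
all_tables = pd.read_html(StringIO(html))

# With match, only tables whose text matches the pattern are kept.
matched = pd.read_html(StringIO(html), match="Population")

print(len(all_tables), len(matched))
print(matched[0])
```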

Reading from a local HTML file

Just pass the path to the local file.

tables = pd.read_html("sample.html")
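To try this without preparing a file by hand, the sketch below writes a hypothetical `sample.html` into a temporary directory and reads it back. It assumes lxml is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical file contents, matching the sample table in this post.
html = """
<table>
  <tr><th>Name</th><th>Age</th><th>Hometown</th></tr>
  <tr><td>Sato</td><td>24</td><td>Saitama</td></tr>
  <tr><td>Suzuki</td><td>53</td><td>Kanagawa</td></tr>
</table>
"""

with tempfile.TemporaryDirectory() as tmp:
    # Write the HTML to disk, then pass the file path to read_html.
    path = Path(tmp) / "sample.html"
    path.write_text(html, encoding="utf-8")
    tables = pd.read_html(str(path))

print(tables[0])
```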

Filtering by `<table>` attributes

You can extract tables by an HTML attribute such as class or id.

tables = pd.read_html(url, attrs={"class": "wikitable"})
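A small sketch of attribute filtering, using a hypothetical page where one table has `class="wikitable"` and another has `id="other"`. The attribute values and table contents are made up for illustration, and lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page: two tables distinguished only by their attributes.
html = """
<table class="wikitable">
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Sato</td><td>24</td></tr>
</table>
<table id="other">
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>x</td><td>1</td></tr>
</table>
"""

# Keep only the table whose class attribute is "wikitable".
by_class = pd.read_html(StringIO(html), attrs={"class": "wikitable"})

# Keep only the table whose id attribute is "other".
by_id = pd.read_html(StringIO(html), attrs={"id": "other"})

print(by_class[0])
print(by_id[0])
```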

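🐛 Common errors and how to fix them

Error	Cause	Fix
`ImportError: lxml not found`	lxml is not installed	`pip install lxml`, or `pip install beautifulsoup4 html5lib` and pass `flavor="bs4"`
`ValueError: No tables found`	The target HTML has no table, or none matches the filters	Check that the URL is correct / review `attrs` and `match`
`ParserError: could not parse...`	The parser cannot handle the page's HTML	Try a different parser by passing `bs4` or `lxml` to the `flavor` argument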
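One way to handle the "No tables found" case is to catch the `ValueError` and fall back to an empty list. The helper name `read_tables_or_empty` below is hypothetical, not part of Pandas. The sketch pins `flavor="lxml"` (assumed to be installed) so that the result does not depend on which other parsers happen to be available.

```python
from io import StringIO

import pandas as pd


def read_tables_or_empty(html: str, **kwargs) -> list[pd.DataFrame]:
    """Hypothetical helper: return [] instead of raising when no table is found."""
    try:
        # flavor="lxml" restricts parsing to a single, known parser.
        return pd.read_html(StringIO(html), flavor="lxml", **kwargs)
    except ValueError:
        # Raised when the document has no table, or none passes match/attrs.
        return []


no_table = read_tables_or_empty("<p>no table here</p>")
one_table = read_tables_or_empty(
    "<table><tr><th>Name</th></tr><tr><td>Sato</td></tr></table>"
)
print(len(no_table), len(one_table))
```

Catching the exception this broadly also hides other `ValueError` causes, so in real code you may prefer to log the message before returning.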
