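Introduction

When you want to load a table from a website or an HTML file into a pandas DataFrame, you do not need to write an elaborate script by hand: pandas can do most of the work for you.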

Requirements

pip install pandas beautifulsoup4 requests lxml

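Scraping a clean table

This example uses the Wikipedia article on Python, but tables on most ordinary sites should be just as clean. (What "clean" means here is explained later.)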

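from io import StringIO

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml') # lxml or html.parser

# Convert every table into a DataFrame
# (wrapped in StringIO because recent pandas versions deprecate passing a literal HTML string)
df_list = [pd.read_html(StringIO(str(table)))[0] for table in soup.find_all('table')]

print('Number of tables: %d' % len(df_list))  # Number of tables: 8 (at the time of writing)

df_list[2]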

[out]


	Type	mutable	Description	Syntax example
0	bool	immutable	Boolean value	TrueFalse
1	bytearray	mutable	Sequence of bytes	bytearray(b'Some ASCII')bytearray(b"Some ASCII...
2	bytes	immutable	Sequence of bytes	b'Some ASCII'b"Some ASCII"bytes([119, 105, 107...
3	complex	immutable	Complex number with real and imaginary parts	3+2.7j
4	dict	mutable	Associative array (or dictionary) of key and v...	{'key1': 1.0, 3: False}
5	ellipsisa	immutable	An ellipsis placeholder to be used as an index...	...Ellipsis
6	float	immutable	Floating point number, system-defined precision	3.1415927
7	frozenset	immutable	Unordered set, contains no duplicates; can con...	frozenset([4.0, 'string', True])
8	int	immutable	Integer of unlimited magnitude[82]	42
9	list	mutable	List, can contain mixed types	[4.0, 'string', True]
10	NoneTypea	immutable	An object representing the absence of a value.	None
11	NotImplementedTypea	immutable	A placeholder that can be returned from overlo...	NotImplemented
12	set	mutable	Unordered set, contains no duplicates; can con...	{4.0, 'string', True}
13	str	immutable	A character string: sequence of Unicode codepo...	'Wikipedia'"Wikipedia""""Spanningmultiplelines"""
14	tuple	immutable	Can contain mixed types	(4.0, 'string', True)

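Parsing the table really is that easy.

Note

You can also pass the whole HTML document to pd.read_html() to build the list of DataFrames, but when I tried it, calling find_all('table') first and passing each table's string to pd.read_html() was far faster.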
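
If you want to check this on your own pages, a rough timing comparison might look like the sketch below. The page here is a made-up in-memory document (the filler paragraphs and the two tiny tables are assumptions for illustration, not the Wikipedia article), so the numbers will not match a real site, and which approach wins may depend on the page and on your pandas and lxml versions.

```python
import time
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample page: two small tables surrounded by unrelated markup.
filler = "<p>filler paragraph</p>" * 200
page = (
    "<html><body>" + filler
    + "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
    + filler
    + "<table><tr><th>c</th></tr><tr><td>3</td></tr></table>"
    + "</body></html>"
)

# Approach 1: hand the whole document to pandas.
start = time.perf_counter()
whole = pd.read_html(StringIO(page))
t_whole = time.perf_counter() - start

# Approach 2: extract each <table> with BeautifulSoup, then parse them one by one.
start = time.perf_counter()
soup = BeautifulSoup(page, "lxml")
per_table = [pd.read_html(StringIO(str(t)))[0] for t in soup.find_all("table")]
t_per_table = time.perf_counter() - start

print(f"whole document: {t_whole:.4f}s, per table: {t_per_table:.4f}s")
print(len(whole), len(per_table))
```

Both approaches should find the same tables; only the elapsed time differs.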

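Scraping a table that is not clean

When a PDF or similar document has been forcibly converted to HTML with a conversion tool, calling pd.read_html() on it as-is gives odd results.

Example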

html =\
"""
<html>
    <body>
        <table style="border-collapse:collapse" cellspacing="0">
            <tbody>
                <tr style="height:13pt">
                    <td style="width:22pt;border-top-style:solid;border-top-width:1pt;border-left-style:solid;border-left-width:1pt;border-bottom-style:solid;border-bottom-width:1pt">
                        <p class="s5" style="text-indent: 0pt;line-height: 12pt;text-align: center;">特</p>
                    </td>
                    <td style="width:52pt;border-top-style:solid;border-top-width:1pt;border-bottom-style:solid;border-bottom-width:1pt">
                        <p class="s5" style="padding-left: 6pt;text-indent: 0pt;line-height: 12pt;text-align: left;">定 資</p>
                    </td>
                    <td style="width:62pt;border-top-style:solid;border-top-width:1pt;border-bottom-style:solid;border-bottom-width:1pt">
                        <p class="s5" style="text-indent: 0pt;line-height: 12pt;text-align: left;">産 の 種</p>
                    </td>
                    <td style="width:22pt;border-top-style:solid;border-top-width:1pt;border-bottom-style:solid;border-bottom-width:1pt;border-right-style:solid;border-right-width:1pt">
                        <p class="s5" style="padding-right: 5pt;text-indent: 0pt;line-height: 12pt;text-align: right;">類</p>
                    </td>
                    <td style="width:261pt;border-top-style:solid;border-top-width:1pt;border-left-style:solid;border-left-width:1pt;border-bottom-style:solid;border-bottom-width:1pt;border-right-style:solid;border-right-width:1pt">
                        <p class="s5" style="padding-left: 5pt;text-indent: 0pt;line-height: 12pt;text-align: left;">不動産信託受益権</p>
                    </td>
                </tr>
                <tr style="height:13pt">
                    <td style="width:22pt;border-top-style:solid;border-top-width:1pt;border-left-style:solid;border-left-width:1pt;border-bottom-style:solid;border-bottom-width:1pt">
                        <p class="s5" style="text-indent: 0pt;line-height: 12pt;text-align: center;">取</p>
                    </td>
                    <td style="width:52pt;border-top-style:solid;border-top-width:1pt;border-bottom-style:solid;border-bottom-width:1pt">
                        <p class="s5" style="padding-left: 6pt;text-indent: 0pt;line-height: 12pt;text-align: left;">得 予</p>
                    </td>
                    <td style="width:62pt;border-top-style:solid;border-top-width:1pt;border-bottom-style:solid;border-bottom-width:1pt">
                        <p class="s5" style="text-indent: 0pt;line-height: 12pt;text-align: left;">定 年 月</p>
                    </td>
                    <td style="width:22pt;border-top-style:solid;border-top-width:1pt;border-bottom-style:solid;border-bottom-width:1pt;border-right-style:solid;border-right-width:1pt">
                        <p class="s5" style="padding-right: 5pt;text-indent: 0pt;line-height: 12pt;text-align: right;">日</p>
                    </td>
                    <td style="width:261pt;border-top-style:solid;border-top-width:1pt;border-left-style:solid;border-left-width:1pt;border-bottom-style:solid;border-bottom-width:1pt;border-right-style:solid;border-right-width:1pt">
                        <p class="s5" style="padding-left: 5pt;text-indent: 0pt;line-height: 12pt;text-align: left;">平成 <span class="s6">** </span>年 <span class="s6">** </span>月 <span class="s6">** </span>日（予定）</p>
                    </td>
                </tr>
            </tbody>
        </table>
    </body>
</html>
"""
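from io import StringIO  # recent pandas versions deprecate passing a literal HTML string

soup = BeautifulSoup(html, "lxml") # lxml or html.parser
# Convert every table into a DataFrame
df_list = [pd.read_html(StringIO(str(table)))[0] for table in soup.find_all('table')]

print('Number of tables: %d' % len(df_list))  # Number of tables: 1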

df_list[0]

[out]


	0	1	2	3	4
0	特	定資	産の種	類	不動産信託受益権
1	取	得予	定年月	日	平成 ** 年 ** 月 ** 日（予定）

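Converting to HTML with a conversion tool often produces this kind of string-splitting bug(?): a single label ends up spread across several cells.

However, if you open the original HTML in a browser, you can confirm that the ruled lines are drawn in the correct positions. (The link to the file probably will not display correctly unless you are logged in to Dropbox.)

Looking at the HTML more closely, a cell such as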

<td style="width:22pt;border-top-style:solid;border-top-width:1pt;border-bottom-style:solid;border-bottom-width:1pt;border-right-style:solid;border-right-width:1pt">

contains strings of the form 'border-hogehoge-style'.

If hogehoge is top, the cell has a ruled line on its top edge; if it is left, the cell has one on its left edge.
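
To make this concrete, here is a small sketch that lists which sides of a cell have a ruled line, given its style attribute. The helper name border_sides is my own invention for illustration, and it only understands the per-side longhand properties (border-top-style and so on) that appear in this converted HTML; shorthand such as border: or border-style: is not handled.

```python
import re

# Matches the per-side longhand properties, e.g. "border-left-style"
_BORDER_STYLE = re.compile(r"^border-(top|right|bottom|left)-style$")


def border_sides(style: str) -> set:
    """Return the sides ('top', 'right', 'bottom', 'left') that have a visible border."""
    sides = set()
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skip empty or malformed declarations
        prop, value = declaration.split(":", 1)
        m = _BORDER_STYLE.match(prop.strip().lower())
        # "none" and "hidden" mean that no line is drawn on that side
        if m and value.strip().lower() not in ("none", "hidden"):
            sides.add(m.group(1))
    return sides


style = ("width:22pt;border-top-style:solid;border-top-width:1pt;"
         "border-bottom-style:solid;border-bottom-width:1pt;"
         "border-right-style:solid;border-right-width:1pt")
print(sorted(border_sides(style)))    # ['bottom', 'right', 'top']
print("left" in border_sides(style))  # False
```

The cell shown above therefore has no ruled line on its left, which is the signal that it is a continuation of the cell before it.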

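Working around the bug

With this in mind, the logic is:

If a td's style does not contain 'border-left-style', merge its text into the td on its left and rewrite the HTML.

Written as a script, it looks like this: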

from io import StringIO

import bs4
from bs4 import BeautifulSoup
import pandas as pd

def correct_table(table: bs4.element.Tag) -> None:
    """
    [Destructive function: modifies `table` in place]
    Only has an effect when the table contains cells without a left ruled line.
        * Such a cell's text is appended to the cell on its left, and the cell itself is removed.
    """
    while True:
        merged = False
        for row in table.find_all("tr"):
            cells = row.find_all(["td", "th"])
            # The leftmost cell of each row always has a left ruled line, so start from the second cell
            for j in range(1, len(cells)):
                cell, left_cell = cells[j], cells[j - 1]
                # If there is no style attribute, assume the cell has a left ruled line
                style = cell.get("style") or "border-left-style"
                if "border-left-style" in style:
                    continue
                # No left ruled line: merge this cell into the cell on its left
                colspan = int(cell.get("colspan") or 1)
                left_colspan = int(left_cell.get("colspan") or 1)
                left_cell["colspan"] = str(colspan + left_colspan)
                left_cell.string = left_cell.text + cell.text
                cell.extract()  # remove the current cell
                merged = True
                break  # this row changed, so rescan on the next pass
        if not merged:
            # Exit the while loop once no cell lacks a left ruled line
            break

def read_table_from_soup(soup: BeautifulSoup) -> list:
    """ Parse the tables in soup and return a list of DataFrames """
    tables = soup.find_all('table')
    for table in tables:
        correct_table(table)
    return [pd.read_html(StringIO(str(table)))[0] for table in tables]

if __name__ == '__main__':
    # Reuse the html variable from the earlier example
    soup = BeautifulSoup(html, "lxml") # lxml or html.parser
    df_list = read_table_from_soup(soup)
    print('Number of tables: %d' % len(df_list))  # Number of tables: 1

    print(df_list[0])

[out]


	0	1	2	3	4
0	特定資産の種類	特定資産の種類	特定資産の種類	特定資産の種類	不動産信託受益権
1	取得予定年月日	取得予定年月日	取得予定年月日	取得予定年月日	平成 ** 年 ** 月 ** 日（予定）

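In this way, the table can be parsed into a pandas.DataFrame based on where the ruled lines are. Because the merged cell carries a colspan, pandas repeats its text across every column it spans.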
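
If you would rather not rewrite the HTML at all, one possible alternative is to build the rows directly while walking the cells. The sketch below is only an illustration of that idea, not the approach used above: the function name rows_merged_by_left_border and the tiny sample table are made up, and unlike the colspan-based version it produces one column per merged cell instead of repeating the text. Rows with different numbers of merged cells would be padded with None by pandas.

```python
import pandas as pd
from bs4 import BeautifulSoup


def rows_merged_by_left_border(table):
    """Return the table as a list of rows, joining each cell that lacks
    a left border onto the cell before it. The table is not modified."""
    rows = []
    for tr in table.find_all("tr"):
        merged = []
        for j, cell in enumerate(tr.find_all(["td", "th"])):
            text = cell.get_text(strip=True)
            # A cell without a style attribute is treated as having a left border
            style = cell.get("style") or "border-left-style"
            if j == 0 or "border-left-style" in style:
                merged.append(text)   # start a new logical cell
            else:
                merged[-1] += text    # continuation of the previous cell
        rows.append(merged)
    return rows


# Hypothetical, heavily shortened version of the converted table
sample = """
<table>
  <tr>
    <td style="border-left-style:solid"><p>特</p></td>
    <td style="border-top-style:solid"><p>定資産</p></td>
    <td style="border-left-style:solid;border-right-style:solid"><p>不動産信託受益権</p></td>
  </tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")
df = pd.DataFrame(rows_merged_by_left_border(soup.find("table")))
print(df)
```

Here the second cell has no left border, so its text is joined onto the first cell and the resulting DataFrame has two columns.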

Conclusion

There is probably a smarter way to write this, so if you leave a comment I will update the article.

[Python 3] Scraping tables (taking ruled-line positions into account)
