Fetching weekly COVID-19 patient report counts from the Ehime Prefectural Infectious Disease Information Center with Python

Last updated at 2024-11-08, Posted at 2024-07-23

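This post fetches the weekly trend in COVID-19 and influenza patient reports published by the Ehime Prefectural Infectious Disease Information Center (愛媛県感染症情報センター).

The table is simple, so pandas' read_html can convert it directly.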

import pandas as pd

url = "https://www.pref.ehime.jp/site/kanjyo/39800.html"

df = pd.read_html(url, match="患者報告数の週推移")[0]
df

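Checking the result:

For some reason the header 保健所名 (health center name) is repeated as 保健所名, 保健所名.1, 保健所名.2, and so on, and the month appears in multiple columns.

In a browser, however,

the table renders without any problem.

Looking at the HTML: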

<tr>
<th colspan="37" style="height:auto; text-align:center; width:auto">
<p>保健所名</p>
</th>
<th style="height:auto; text-align:center; width:auto">
<p>愛媛県</p>
</th>
<th style="height:auto; text-align:center; width:auto">
<p>四国中央</p>
</th>
<th style="height:auto; text-align:center; width:auto">
<p>西　条</p>
</th>
<th style="height:auto; text-align:center; width:auto">
<p>今　治</p>
</th>
<th style="height:auto; text-align:center; width:auto">
<p>松山市</p>
</th>
<th style="height:auto; text-align:center; width:auto">
<p>中　予</p>
</th>
<th style="height:auto; text-align:center; width:auto">
<p>八幡浜</p>
</th>
<th style="height:auto; text-align:center; width:auto">
<p>宇和島</p>
</th>
</tr>
<tr>
<td colspan="36" style="height:auto; text-align:center; width:auto">7月</td>
<td style="height:auto; text-align:center; width:auto">第28週</td>
<td style="height:auto; text-align:center; width:auto">781</td>
<td style="height:auto; text-align:center; width:auto">73</td>
<td style="height:auto; text-align:center; width:auto">112</td>
<td style="height:auto; text-align:center; width:auto">75</td>
<td style="height:auto; text-align:center; width:auto">263</td>
<td style="height:auto; text-align:center; width:auto">75</td>
<td style="height:auto; text-align:center; width:auto">60</td>
<td style="height:auto; text-align:center; width:auto">123</td>
</tr>

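The th has colspan="37" and the month td has colspan="36", so both cells are merged across many columns.
On screen the th looks like a single merged cell, and each data row shows just two cells, the month and the week.
Nothing looks wrong in a browser, but pandas expands each colspan into that many columns when it converts the table, which is why 保健所名 shows up multiple times.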
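The duplication can be reproduced on a scaled-down table using only the standard library. The sketch below is not pandas' actual implementation; it only mimics the idea of repeating a cell once per spanned column and then numbering the repeated header names. `GridParser`, `dedupe`, and the sample HTML (colspan 3 and 2 instead of 37 and 36) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class GridParser(HTMLParser):
    """Collect table rows, repeating each cell's text once per spanned column."""

    def __init__(self):
        super().__init__()
        self.rows = []    # finished rows, each a list of cell texts
        self._row = None  # row currently being built
        self._span = 0    # colspan of the open cell (0 means no cell is open)
        self._text = []   # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            span = dict(attrs).get("colspan") or "1"
            self._span = int(span) if span.isdigit() else 1
            self._text = []

    def handle_data(self, data):
        if self._span:
            self._text.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._span and self._row is not None:
            # A cell with colspan=N occupies N grid columns.
            self._row.extend(["".join(self._text)] * self._span)
            self._span = 0
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def dedupe(names):
    """Number repeated names in the style x, x.1, x.2."""
    seen = {}
    out = []
    for name in names:
        n = seen.get(name, 0)
        out.append(name if n == 0 else f"{name}.{n}")
        seen[name] = n + 1
    return out


SAMPLE = """
<table>
<tr><th colspan="3"><p>保健所名</p></th><th><p>愛媛県</p></th></tr>
<tr><td colspan="2">7月</td><td>第28週</td><td>781</td></tr>
</table>
"""

parser = GridParser()
parser.feed(SAMPLE)
header, body = parser.rows
columns = dedupe(header)

print(columns)  # ['保健所名', '保健所名.1', '保健所名.2', '愛媛県']
print(body)     # ['7月', '7月', '第28週', '781']
```

With the real values of 37 and 36, the same mechanism would yield 37 header columns named 保健所名 through 保健所名.36, with the month repeated in the first 36 of them.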

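This has actually been going on for a long time. The colspan was 17 for the 2012/2013 season and stayed at 35 from 2013/2014 through 2019/2020.
There were no pages for 2020 and 2021, because influenza declined during the COVID-19 pandemic.
After that it was 36 for 2022/2023 and 2023/2024, then 37 for 2024, so the colspan has been steadily counting up.

Here is a summary: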

Title	URL	colspan	Notes
Influenza patient reports (2009/2010 season)	https://www.pref.ehime.jp/site/kanjyo/6710.html	2
Influenza patient reports (2010/2011 season)	https://www.pref.ehime.jp/site/kanjyo/6713.html	2
Influenza patient reports (2011/2012 season)	https://www.pref.ehime.jp/site/kanjyo/6717.html	2
Influenza patient reports (2012/2013 season)	https://www.pref.ehime.jp/site/kanjyo/6721.html	17	Format changed
Influenza patient reports (2013/2014 season)	https://www.pref.ehime.jp/site/kanjyo/6725.html	35
Influenza patient reports (2014/2015 season)	https://www.pref.ehime.jp/site/kanjyo/6729.html	35
Influenza patient reports (2015/2016 season)	https://www.pref.ehime.jp/site/kanjyo/6733.html	35
Influenza patient reports (2016/2017 season)	https://www.pref.ehime.jp/site/kanjyo/6737.html	35
Influenza patient reports (2017/2018 season)	https://www.pref.ehime.jp/site/kanjyo/6741.html	35
Influenza patient reports (2018/2019 season)	https://www.pref.ehime.jp/site/kanjyo/6745.html	35
Influenza patient reports (2019/2020 season)	https://www.pref.ehime.jp/site/kanjyo/6749.html	35
Influenza patient reports (2022/2023 season)	https://www.pref.ehime.jp/site/kanjyo/6753.html	36
Influenza patient reports (2023/2024 season)	https://www.pref.ehime.jp/site/kanjyo/6757.html	36
COVID-19 patient reports (2024)	https://www.pref.ehime.jp/site/kanjyo/39800.html	37
Influenza patient reports from sentinel sites (2024/2025 season)	https://www.pref.ehime.jp/site/kanjyo/91360.html	2
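A survey like this could be scripted roughly as follows. This is a minimal sketch, not the method used to build the table: `MaxColspan`, `max_header_colspan`, and `fetch_html` are hypothetical names, the sketch assumes the largest colspan on any th in the page is the one of interest, and it assumes the pages decode as UTF-8. It uses urllib rather than requests only to stay within the standard library.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class MaxColspan(HTMLParser):
    """Track the largest colspan seen on any th element."""

    def __init__(self):
        super().__init__()
        self.value = 1  # a th without colspan spans one column

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            span = dict(attrs).get("colspan")
            if span and span.isdigit():
                self.value = max(self.value, int(span))


def max_header_colspan(html):
    """Return the largest th colspan found in an HTML string."""
    parser = MaxColspan()
    parser.feed(html)
    return parser.value


def fetch_html(url):
    """Download a page as text (assumption: the page is UTF-8 encoded)."""
    with urlopen(url) as res:
        return res.read().decode("utf-8", errors="replace")


# Offline check against a fragment shaped like the header row shown earlier.
fragment = '<table><tr><th colspan="37"><p>保健所名</p></th><th><p>愛媛県</p></th></tr></table>'
print(max_header_colspan(fragment))  # 37
```

Calling `max_header_colspan(fetch_html(url))` for each URL in the table would give one value per page, which can then be compared against the colspan column.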

Left as is, the unneeded columns are converted along with everything else.

So use BeautifulSoup to set the th colspan to 2 and remove the td colspan, then read the table with pandas.

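from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.pref.ehime.jp/site/kanjyo/39800.html"

r = requests.get(url)
r.raise_for_status()

soup = BeautifulSoup(r.content, "html.parser")

# Find the caption paragraph, then climb to the table that contains it
tag_table = soup.find("p", string="患者報告数の週推移").find_parent("table")

# Shrink the header cell to the two real columns (month and week)
tag_table.select_one('th[colspan="37"]')["colspan"] = "2"

# Drop the colspan from the month cells
for td in tag_table.select('td[colspan="36"]'):
    del td["colspan"]

# Wrap the HTML in StringIO; passing a literal HTML string is deprecated since pandas 2.1
df = pd.read_html(StringIO(tag_table.prettify()))[0]

# Strip whitespace, including full-width spaces, from the column names
# (raw string avoids the invalid escape sequence warning for "\s")
df.columns = df.columns.str.replace(r"\s", "", regex=True)

# 月 = month, 週 = week
df.rename(columns={"保健所名": "月", "保健所名.1": "週"}, inplace=True)

df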
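Because the colspan value changes from year to year, hard-coding 37 and 36 ties the selectors to this one page. The following is a hedged sketch of a variant that does not depend on the exact numbers. `normalize_colspans` and its `header_span` parameter are hypothetical names, and the sketch assumes every td colspan in the table is spurious, which holds for the rows shown in this post but may not hold for other tables.

```python
from bs4 import BeautifulSoup


def normalize_colspans(table, header_span=2):
    """Shrink inflated colspans in place and return the table.

    Header cells wider than header_span are cut down to header_span;
    data cells lose their colspan entirely.
    """
    for th in table.select("th[colspan]"):
        if int(th["colspan"]) > header_span:
            th["colspan"] = str(header_span)
    for td in table.select("td[colspan]"):
        del td["colspan"]
    return table


# Offline check against a fragment shaped like the HTML in this post.
SAMPLE = """
<table>
<tr><th colspan="37"><p>保健所名</p></th><th><p>愛媛県</p></th></tr>
<tr><td colspan="36">7月</td><td>第28週</td><td>781</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
table = normalize_colspans(soup.find("table"))

print(table.th["colspan"])  # 2
print(table.td.attrs)       # {}
```

In the script above, `normalize_colspans(tag_table)` could stand in for the two colspan-specific steps before the table is handed to `pd.read_html`.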

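So that was the story of the 患者報告数の週推移 (weekly trend in patient reports) table, whose colspan keeps growing for some reason.

Addendum, 2024/11/08

Starting with the page for influenza patient reports from sentinel sites (2024/2025 season), the colspan has been corrected to 2.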
