
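Converting to CSV the list of labor analgesia (painless delivery) facilities that asked to be listed on the Ministry of Health, Labour and Welfare (厚生労働省) website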

Last updated at 2021-09-06. Posted at 2021-09-05.

Converted output

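The data was only converted from the PDFs and its content has not been verified. The number of facilities per prefecture has been checked.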
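One way to run the per-prefecture count check is to tally rows by the 都道府県 (prefecture) column. The following is a minimal sketch, not the procedure actually used for this article. It relies only on the standard library, and the sample rows and the 施設名 column name are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_prefecture(csv_file, column="都道府県"):
    """Return a Counter mapping each prefecture name to its number of rows."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Made-up sample standing in for the converted CSV (facility names are placeholders).
sample = io.StringIO(
    ",施設名,都道府県\n"
    "0,Facility A,東京都\n"
    "1,Facility B,東京都\n"
    "2,Facility C,大阪府\n"
)

counts = count_by_prefecture(sample)
for pref, n in counts.items():
    print(pref, n)
```

For the real file, an object such as `open("list.csv", encoding="utf_8_sig", newline="")` could be passed in place of the sample, and the resulting counts compared with the number of rows in each prefecture's PDF.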

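After converting with the program below, the following were fixed by hand:

The URL for St. Luke's International Hospital (聖路加国際病院) partly fell outside the table cell, so it was corrected manually
The row directly below St. Luke's International Hospital was empty, so it was deleted
The sequence numbers were reassigned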
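The last two fixes were made by hand, but they could also be scripted. The following is a minimal sketch under assumptions: rows are plain dicts, the sequence-number column is called `No` (a hypothetical name; the real header may differ), and the sample data are placeholders. The URL correction is specific to one entry and is left out.

```python
def drop_empty_rows_and_renumber(rows, number_key="No"):
    """Drop rows whose fields (other than the number) are all blank, then renumber from 1."""
    kept = [
        row
        for row in rows
        if any(str(value).strip() for key, value in row.items() if key != number_key)
    ]
    # Build new dicts so the input rows are left unmodified.
    return [{**row, number_key: i} for i, row in enumerate(kept, start=1)]


# Placeholder data: the second row is blank apart from its number.
rows = [
    {"No": 1, "name": "Facility A", "url": "http://example.com/a"},
    {"No": 2, "name": "", "url": ""},
    {"No": 3, "name": "Facility B", "url": "http://example.com/b"},
]

cleaned = drop_empty_rows_and_renumber(rows)
for row in cleaned:
    print(row)
```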

# Notebook cells (e.g. Google Colab): install camelot and its system dependencies
!apt install python3-tk ghostscript
!pip install "camelot-py[cv]"

import pathlib
from urllib.parse import urljoin

import camelot
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
}


def fetch_soup(url, parser="html.parser"):
    """Download a page and return it parsed as a BeautifulSoup document."""
    r = requests.get(url, headers=headers, timeout=30)
    r.raise_for_status()

    return BeautifulSoup(r.content, parser)


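def fetch_file(url, name, dir="."):
    """Download a PDF to <dir>/<name>.pdf and return the saved path."""
    p = pathlib.Path(dir, f"{name}.pdf")
    p.parent.mkdir(parents=True, exist_ok=True)

    # Send the same User-Agent header as fetch_soup for consistency
    r = requests.get(url, headers=headers, timeout=30)
    r.raise_for_status()

    p.write_bytes(r.content)
    return p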


url = "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/0000186912.html"

soup = fetch_soup(url)

dfs = []

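# Each link in the list points to one PDF; its link text is stored as the prefecture (都道府県)
for i in soup.select("ul.m-listLink--hCol2 > li > a"):

    pref = i.get_text()

    link = urljoin(url, i.get("href"))

    p = fetch_file(link, pref)

    # split_text splits strings that span several cells; strip_text removes spaces and newlines
    tables = camelot.read_pdf(str(p), pages="all", split_text=True, strip_text=" \n")

    for table in tables:

        # Row 0 is skipped, row 1 is used as the header, the remaining rows are data
        tmp = pd.DataFrame(table.data[2:], columns=table.data[1])
        tmp["都道府県"] = pref

        dfs.append(tmp)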

df0 = pd.concat(dfs).reset_index(drop=True)
df0

df1 = df0.copy()

df1.columns

# Show rows whose 医師の人数 (number of physicians) is not purely numeric
df1[~df1["医師の人数"].str.isnumeric()]

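# Prepend a scheme to URLs that start with "www."
df1["ウェブサイトURL"] = df1["ウェブサイトURL"].mask(
    df1["ウェブサイトURL"].str.startswith("www."), "http://" + df1["ウェブサイトURL"]
)

# Replace entries ending with なし ("none") by NaN
df1["ウェブサイトURL"] = df1["ウェブサイトURL"].mask(df1["ウェブサイトURL"].str.endswith("なし"))

# Fix commas that should be dots (literal replacement, not a regex)
df1["ウェブサイトURL"] = df1["ウェブサイトURL"].str.replace(",", ".", regex=False)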

# utf_8_sig writes a BOM so that Excel detects the file as UTF-8
df1.to_csv("list.csv", encoding="utf_8_sig")
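The three URL clean-up rules above can also be written as a plain function, which makes them easy to try on individual strings. This is an illustrative restatement, not part of the original notebook; `normalize_url` and the sample URLs are hypothetical.

```python
def normalize_url(raw):
    """Apply the three clean-up rules to one string; return None when there is no website."""
    url = raw
    # Rule 1: add a scheme to URLs that start with "www."
    if url.startswith("www."):
        url = "http://" + url
    # Rule 2: entries ending with なし ("none") mean there is no website
    if url.endswith("なし"):
        return None
    # Rule 3: commas in the extracted text are treated as mistyped dots
    return url.replace(",", ".")


# Placeholder inputs covering each rule.
for raw in ["www.example,jp", "なし", "https://example.com/"]:
    print(repr(raw), "->", repr(normalize_url(raw)))
```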
