
More than 3 years have passed since last update.

Converting the PDF list of products containing surfactants effective against the novel coronavirus to CSV


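This converts the PDF list of products containing surfactants effective against the novel coronavirus, published by the National Institute of Technology and Evaluation (NITE), to CSV.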

apt install python3-tk ghostscript
pip install camelot-py[cv]

Scraping

  • Scrape the link to the latest PDF
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://www.nite.go.jp/information/osirasedetergentlist.html"

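# Fetch the listing page and stop early on an HTTP error
r = requests.get(url)
r.raise_for_status()

soup = BeautifulSoup(r.content, "html.parser")

# The first link in the list inside the main content area is taken as the latest PDF
tag = soup.select_one("div.main div.cf ul > li > a")
if tag is None:
    raise RuntimeError("PDF link not found; the page layout may have changed")

# The href may be relative, so resolve it against the page URL
link = urljoin(url, tag.get("href"))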
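
The CSS selector above depends on the page layout at the time of writing. As a rough sketch of a more layout-independent fallback, the snippet below resolves each candidate href against the page URL and returns the first one whose path ends in `.pdf`. The helper name `first_pdf_link` and the sample hrefs are hypothetical and only for illustration; the real page's links will differ.

```python
from urllib.parse import urljoin, urlparse


def first_pdf_link(page_url, hrefs):
    """Return the first href that resolves to a .pdf URL, or None."""
    for href in hrefs:
        if not href:
            continue  # skip anchors that have no href attribute
        # Resolve relative links against the URL of the page they came from
        absolute = urljoin(page_url, href)
        if urlparse(absolute).path.lower().endswith(".pdf"):
            return absolute
    return None


page_url = "https://www.nite.go.jp/information/osirasedetergentlist.html"
# Hypothetical hrefs, not taken from the real page
sample_hrefs = [None, "#top", "../data/sample.pdf", "/other/list.pdf"]
pdf_link = first_pdf_link(page_url, sample_hrefs)
print(pdf_link)
```

With the `soup` object from the scraping step, the candidate list could be built as `[a.get("href") for a in soup.select("a")]`. Note that this picks the first PDF on the page, which is only the latest one if the page lists it first.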

Data wrangling

import camelot
import pandas as pd

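# Extract tables from every page using camelot's default lattice (ruled-line) mode.
# split_text splits text that spans several cells, line_scale=40 helps detect
# shorter lines, and copy_text=["v"] copies text vertically into spanning cells.
tables = camelot.read_pdf(
    link, pages="all", split_text=True, line_scale=40, copy_text=["v"]
)

# All tables except the last make up the first list, so concatenate them
df_tmp = pd.concat([table.df for table in tables[:-1]])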

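# Household and furniture detergents, etc.
# Promote the first row to the header, then number the rows from 1

df1 = df_tmp.iloc[1:].set_axis(df_tmp.iloc[0].to_list(), axis=1).reset_index(drop=True)
df1.index += 1
# utf_8_sig writes a BOM so that Excel detects the file as UTF-8
df1.to_csv("housing.csv", encoding="utf_8_sig")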

# Synthetic kitchen detergents, etc.
# The last table is the second list; promote its first row to the header

df2 = tables[-1].df.iloc[1:].set_axis(tables[-1].df.iloc[0].to_list(), axis=1)
df2.to_csv("kitchen.csv", encoding="utf_8_sig")
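
camelot returns each table as a DataFrame with integer column labels, with the header text sitting in the first data row. The idiom used above for promoting that row to the header can be tried on a small frame without any PDF. This is a minimal sketch: the helper name `promote_first_row` and the sample rows are made up for illustration and do not come from the NITE list.

```python
import pandas as pd


def promote_first_row(df):
    """Use the first row as column labels and number the remaining rows from 1."""
    header = df.iloc[0].to_list()
    body = df.iloc[1:].set_axis(header, axis=1).reset_index(drop=True)
    body.index += 1  # 1-based row numbers, as in housing.csv
    return body


# Made-up rows in the shape camelot produces: integer columns, header in row 0
raw = pd.DataFrame(
    [
        ["maker", "product"],
        ["A Co.", "Cleaner X"],
        ["B Co.", "Cleaner Y"],
    ]
)
clean = promote_first_row(raw)
print(clean)
```

The same function could be applied to both `df_tmp` and `tables[-1].df`, which would also give the second CSV a 1-based index; whether that is wanted depends on how the files are consumed.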