Converting the COCO'S Breakfast Buffet Store List PDF to CSV

Posted at 2020-10-01

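Introduction

Following the article "COCO'S朝食バイキング実施店舗一覧PDFファイルを取得してCSVにする" (Fetching the COCO'S breakfast buffet store list PDF and converting it to CSV), I parsed the PDF with camelot, cleansed the data with pandas, and wrote the result to CSV.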

Setup

apt install python3-tk ghostscript
pip install "camelot-py[cv]"
pip install pandas

Data cleansing

import camelot
import pandas as pd

tables = camelot.read_pdf(
    "https://www.cocos-jpn.co.jp/menu_pdf/bvshoplist.pdf",
    pages="all",
    split_text=True,
    strip_text="\n",
    line_scale=40,
)

# Column names: join the two header rows of the first table, cell by cell
columns = ["".join(i) for i in zip(*(tables[0].df.head(2).values))]

dfs = [table.df.iloc[3:].set_axis(columns, axis=1) for table in tables]

# Concatenate all pages and renumber the rows starting from 1
df = pd.concat(dfs).reset_index(drop=True)
df.index += 1

# Replace empty strings with missing values
df.mask(df == "", inplace=True)

# When 実施日 (days offered) is 毎日 (every day), use the 平日 (weekday) / 土日 (weekend) label instead
df["実施日"] = df["実施日"].where(df["ご利用料金"].isnull(), df["ご利用料金"])

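# Fill the store details down into the extra rows of the every-day stores
# (fillna(method="ffill") is deprecated in recent pandas; ffill() is equivalent)
df.ffill(inplace=True)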

# Drop the ご利用料金 (fee) column
df.drop("ご利用料金", axis=1, inplace=True)

# Tax-included price for 大人 (adults)
adult = (
    df["大人"]
    .str.extractall("([0-9]+)")
    .unstack()
    .rename(columns={0: "大人_税抜", 1: "大人_税込"}, level=1)
)
adult.columns = adult.columns.droplevel(level=0)
df["大人"] = adult["大人_税込"].astype(int)

# Tax-included price for 小学生以下 (elementary school age and under)
child = (
    df["小学生以下"]
    .str.extractall("([0-9]+)")
    .unstack()
    .rename(columns={0: "小人_税抜", 1: "小人_税込"}, level=1)
)
child.columns = child.columns.droplevel(level=0)
df["小学生以下"] = child["小人_税込"].astype(int)

# Rename the address column
df.rename(columns={"住所字以降": "住所"}, inplace=True)

# Apply Unicode NFKC normalization to the address and remove spaces
df["住所"] = df["住所"].str.normalize("NFKC").str.replace(" ", "")

df.to_csv("cocos.csv", encoding="utf_8_sig")
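
To make the cleansing steps easier to follow without downloading the PDF, here is a minimal sketch that runs the same pandas operations on a tiny hand-made table. The rows below (store names, addresses, and the price format "800円(税込880円)") are assumptions made up for illustration, not values taken from the actual PDF; only the column names come from the script above.

```python
import pandas as pd

# Hypothetical rows imitating the layout the script expects.
# Store A is open every day, so its weekday/weekend fees span two rows
# and the second row has no store details.
raw = pd.DataFrame(
    {
        "店舗名": ["A店", "", "B店"],
        "住所": ["１-２-３ 中央", "", "４-５ 北町"],
        "実施日": ["毎日", "", "土日祝"],
        "ご利用料金": ["平日", "土日", ""],
        "大人": ["800円(税込880円)", "900円(税込990円)", "900円(税込990円)"],
    }
)

# Empty strings become NaN so that where() and ffill() can see the gaps.
df = raw.mask(raw == "")

# Where the fee column carries a weekday/weekend label, use it as the day.
df["実施日"] = df["実施日"].where(df["ご利用料金"].isnull(), df["ご利用料金"])

# Copy the store name and address down into the continuation row.
df = df.ffill().drop(columns="ご利用料金")

# Each price cell holds two numbers; the second match is the tax-included one.
prices = df["大人"].str.extractall("([0-9]+)").unstack()
prices.columns = ["tax_excluded", "tax_included"]
df["大人"] = prices["tax_included"].astype(int)

# NFKC folds full-width digits to ASCII; spaces are then removed.
df["住所"] = df["住所"].str.normalize("NFKC").str.replace(" ", "")

print(df)
```

Note that the pattern "([0-9]+)" would split a price written with a thousands separator (for example "1,000") into two matches, so this sketch, like the script above, assumes the prices contain no commas.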

Reference

COCO'S朝食バイキング実施店舗一覧PDFファイルを取得してCSVにする (Fetching the COCO'S breakfast buffet store list PDF and converting it to CSV)