Converting a PDF to CSV with pdfplumber

Posted at 2020-11-06

pdfplumber

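Treating dotted lines as solid lines with camelot (Hough transform)
https://qiita.com/barobaro/items/af850ac29dbc983eb39b

Here too, camelot has trouble extracting tables whose lines are not solid.
pdfplumber seems to extract them easily.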
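
If the default settings miss a table, pdfplumber also accepts a `table_settings` dictionary. The sketch below is only an illustration and is not taken from the linked article: it switches both strategies to `"text"`, which infers cell boundaries from aligned words instead of drawn lines. The helper name `extract_by_text_alignment` is hypothetical, and whether this helps depends on the PDF.

```python
# Settings passed to pdfplumber's Page.extract_table(); the "text" strategy
# infers column and row boundaries from aligned words rather than drawn lines.
TEXT_STRATEGY = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_by_text_alignment(page):
    """Hypothetical helper: extract one table from a pdfplumber page by text alignment."""
    rows = page.extract_table(table_settings=TEXT_STRATEGY)
    # extract_table() returns None when it finds no table on the page.
    return rows or []
```

With a real file this would be called as `extract_by_text_alignment(pdf.pages[0])` inside a `with pdfplumber.open("data.pdf") as pdf:` block.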

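Could not convert

Go To Eat campaign official site, Shiga Prefecture
Published cases of violations of labor standards laws and regulations
The text is not recognized; extraction works with camelot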

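Converted successfully

List of medical institutions that provide medical care by telephone or information and communication devices

List of medical institutions that provide medical care by telephone or information and communication devices (Hyogo Prefecture)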

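wget https://www.mhlw.go.jp/content/000691131.pdf -O data.pdf
pip install pdfplumber

import pdfplumber
import pandas as pd

with pdfplumber.open("data.pdf") as pdf:

    dfs = []

    for page in pdf.pages:

        # extract_table() returns the page's table as a list of rows, or None if no table is found
        data = page.extract_table()
        if data is None:
            continue

        # Row 1 holds the column names; the data starts at row 2 (row 0 is skipped)
        df_tmp = pd.DataFrame(data[2:], columns=data[1])

        dfs.append(df_tmp)

# Combine the per-page tables into one DataFrame
df = pd.concat(dfs)

# utf_8_sig adds a BOM, which helps Excel detect the file as UTF-8
df.to_csv("hyogo.csv", encoding="utf_8_sig")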
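
`extract_table()` returns a list of rows, and a cell can be `None` (an empty cell) or contain a line break where text wrapped inside the cell. A minimal sketch of normalizing such rows before building the DataFrame follows. The helper name `clean_rows` and the sample rows are made up for illustration.

```python
def clean_rows(rows):
    """Replace None cells with "" and remove line breaks inside cells."""
    cleaned = []
    for row in rows:
        cleaned.append(
            ["" if cell is None else cell.replace("\n", "").strip() for cell in row]
        )
    return cleaned


# Made-up rows in the shape extract_table() returns (list of lists of str or None).
sample = [["Name", "Address"], ["Clinic\nA", None]]
print(clean_rows(sample))
```

In the loop above this could be applied as `data = clean_rows(data)` right after the `extract_table()` call, before the DataFrame is built.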

Go To Eat PDF for Chiba Prefecture

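wget https://www.chiba-gte.jp/downloads/store_list.pdf -O data.pdf

import pdfplumber
import pandas as pd

with pdfplumber.open("data.pdf") as pdf:

    dfs = []

    for page in pdf.pages:

        # All extracted rows are kept; columns get the default integer labels 0, 1, 2, ...
        data = page.extract_table()
        df_tmp = pd.DataFrame(data)

        dfs.append(df_tmp)

df = pd.concat(dfs)

# Turn empty strings into NaN, then keep rows with at least 4 non-empty cells
df1 = df.mask(df.isna() | (df == "")).dropna(thresh=4)

df2 = df1[df1[0] != ""].reset_index(drop=True)

# Name the columns: (blank), 電子 (electronic), 店舗名 (store name), 住所 (address), TEL.
# set_axis returns a new DataFrame; the inplace argument was removed in pandas 2.0.
df2 = df2.set_axis(["", "電子", "店舗名", "住所", "TEL"], axis=1)

# Number the rows from 1
df2.index += 1

df2.to_csv("data.csv")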
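
The `mask(...).dropna(thresh=4)` step keeps only rows that have at least four non-empty cells. The same rule is sketched below in plain Python, to show what survives the filter. `keep_dense_rows` is a hypothetical name and the sample rows (including the phone number) are made up.

```python
def keep_dense_rows(rows, min_filled=4):
    """Keep rows with at least `min_filled` cells that are neither None nor ""."""
    kept = []
    for row in rows:
        filled = sum(1 for cell in row if cell not in (None, ""))
        if filled >= min_filled:
            kept.append(row)
    return kept


# Made-up rows: a data row, a section heading row, and an empty row.
sample = [
    ["1", "○", "Shop A", "Chiba City", "043-000-0000"],
    ["Chiba City", "", "", "", None],
    ["", "", "", "", ""],
]
print(keep_dense_rows(sample))  # only the first row has four or more filled cells
```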