
Converting a PDF with many pages to CSV with camelot

Posted at 2021-08-26

When converting a PDF to CSV with camelot on Colaboratory, the runtime crashes from running out of memory once the PDF goes past about 200 pages.

As a workaround, convert the PDF in chunks of about 50 pages and concatenate the results afterward.
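
To make the chunking concrete, here is a minimal sketch using only the standard library. The helper name `page_ranges` and the 231-page example are assumptions made up for illustration; the article itself uses `more_itertools.chunked` for the same purpose. Each chunk becomes a page string such as `"1-50"`, which is the format camelot's `pages` argument accepts.

```python
def page_ranges(pages, size=50):
    """Split a list of page numbers into chunks and format each chunk as a page string."""
    ranges = []
    for start in range(0, len(pages), size):
        chunk = pages[start:start + size]
        if chunk[0] == chunk[-1]:
            # A chunk with a single page is written as just that page number
            ranges.append(str(chunk[0]))
        else:
            ranges.append(f"{chunk[0]}-{chunk[-1]}")
    return ranges


# Hypothetical 231-page document
example = page_ranges(list(range(1, 232)), size=50)
print(example)
# ['1-50', '51-100', '101-150', '151-200', '201-231']
```

A smaller chunk size lowers peak memory per call at the cost of more calls, so 50 is a starting point rather than a fixed rule.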

The example uses the PDF listing positive cases published by Kawasaki City.

# Download URL
!wget https://www.city.kawasaki.jp/350/cmsfiles/contents/0000116/116827/4.pdf -O data.pdf

!apt update
!apt install -y python3-tk ghostscript
!pip install camelot-py[cv]

import camelot
import pandas as pd

from tqdm.notebook import tqdm
from more_itertools import chunked

# Get the list of page numbers (_get_pages is a private camelot method and may change between versions)
handler = camelot.handlers.PDFHandler("data.pdf")
pages = handler._get_pages(pages="all")

# Build the list of page ranges, 50 pages per chunk
pages_list = [str(i[0]) if i[0] == i[-1] else f"{i[0]}-{i[-1]}" for i in chunked(pages, 50)]
pages_list

dfs = []

for page in tqdm(pages_list):

    tables = camelot.read_pdf(
        "data.pdf",
        pages=page,
        split_text=True,
    )

    for table in tables:
        # The first row of each table is used as the header, the rest as data
        dfs.append(pd.DataFrame(table.data[1:], columns=table.data[0]))

df = pd.concat(dfs).reset_index(drop=True)

df.to_csv("data.csv")
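
The loop above treats the first row of every extracted table as its header and then concatenates everything. As a rough illustration of that merge step without camelot or pandas, the sketch below uses made-up rows in place of `table.data`; the helper `merge_tables` and the sample values are hypothetical. Unlike `pd.concat`, which aligns columns by name, this sketch assumes every table has the same header.

```python
import csv
import io


def merge_tables(tables):
    """Merge tables whose first row is a header, keeping the header only once."""
    merged = []
    for rows in tables:
        if not rows:
            continue  # Skip empty tables
        header, body = rows[0], rows[1:]
        if not merged:
            merged.append(header)
        merged.extend(body)
    return merged


# Hypothetical stand-ins for table.data from two pages
page1 = [["no", "date"], ["1", "2021-08-01"], ["2", "2021-08-02"]]
page2 = [["no", "date"], ["3", "2021-08-03"]]

merged = merge_tables([page1, page2])

# Write the merged rows as CSV text (an in-memory buffer instead of a file)
buf = io.StringIO()
csv.writer(buf).writerows(merged)
csv_text = buf.getvalue()
print(csv_text)
```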
