More than 3 years have passed since last update.

Converting a PDF with many pages to CSV with camelot

When converting a PDF to CSV with camelot on Colaboratory, the runtime runs out of memory and crashes once the PDF exceeds roughly 200 pages.

As a workaround, split the PDF into chunks of about 50 pages, convert each chunk, and then concatenate the results.

The example uses the PDF of Kawasaki City's list of positive cases.

# Download the PDF
!wget https://www.city.kawasaki.jp/350/cmsfiles/contents/0000116/116827/4.pdf -O data.pdf
!apt update
!apt install python3-tk ghostscript
!pip install camelot-py[cv]
import camelot
import pandas as pd

from tqdm.notebook import tqdm
from more_itertools import chunked

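# Get the list of page numbers
# (_get_pages is a private camelot method; its signature may differ between versions)
handler = camelot.handlers.PDFHandler("data.pdf")
pages = handler._get_pages(pages="all")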

# Build the list of page ranges (50 pages per chunk)
pages_list = [str(i[0]) if i[0] == i[-1] else f"{i[0]}-{i[-1]}" for i in chunked(pages, 50)]
pages_list

dfs = []

for page in tqdm(pages_list):

    tables = camelot.read_pdf(
        "data.pdf",
        pages=page,
        split_text=True,
    )

    for table in tables:
        dfs.append(pd.DataFrame(table.data[1:], columns=table.data[0]))

df = pd.concat(dfs).reset_index(drop=True)

df.to_csv("data.csv")
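
The page-range step above can also be sketched with the standard library alone, which may help if more_itertools is unavailable. This is a minimal sketch, assuming the pages are numbered consecutively from 1 and the total page count is already known. `page_ranges` is a hypothetical helper name used for illustration, not part of camelot.

```python
def page_ranges(total_pages, chunk_size=50):
    """Return page strings such as "1-50" that together cover 1..total_pages.

    Hypothetical helper (not a camelot API). Assumes pages are numbered
    consecutively starting at 1.
    """
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")

    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A chunk holding a single page is written as "101", not "101-101"
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


print(page_ranges(120))  # ['1-50', '51-100', '101-120']
```

Each resulting string can be passed as the `pages` argument of `camelot.read_pdf`, in the same way as the entries of `pages_list`. A smaller `chunk_size` should lower peak memory use at the cost of more calls.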