
Creating a CSV from Tokyo's PDF of indicators for the national stage assessment (camelot)

Posted at 2021-08-26

pdfplumber
https://qiita.com/barobaro/items/75d076f4fbe9771a0b3a

A post on Twitter said camelot could not convert this PDF, so I made the CSV with camelot.

With lattice, the data can be extracted using "process_background=True", but the result is not laid out as a table, so it needs post-processing. (Note: tedious.)
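
For comparison, the lattice attempt described above might look like the following minimal sketch. The function name read_lattice and the default page "1" are assumptions for illustration; only flavor="lattice" and process_background=True come from the note above.

```python
def read_lattice(path="data.pdf", pages="1"):
    """Read tables with the lattice flavor, also detecting lines drawn in the background."""
    import camelot  # imported lazily; requires camelot to be installed

    return camelot.read_pdf(
        path,
        pages=pages,
        flavor="lattice",
        process_background=True,  # the option quoted in the note above
    )
```

Calling read_lattice() and looking at tables[0].df shows the extracted cells, which here would still need reshaping before they form a usable table.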

Converting with stream

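Specifying a table area in camelot did not narrow down the extracted range.
With the default "edge_tol=50", the captured range includes content outside the table; adding 10 to get "edge_tol=60" captured just the table's range.
The row spacing is large, so I adjusted it with "row_tol=40"; I simply started at 10 and kept adding 10.
The label "病床のひっ迫具合" (hospital bed strain) ends up merged with "入院率" (hospitalization rate), so the Python version corrects it.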
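
The step-by-step search for row_tol can be sketched as a small loop. This is only an illustration under assumptions: the helper names first_row_tol_with_rows and camelot_rows, the upper bound of 100, and the use of an expected row count as the stopping check are all hypothetical. The notes above only say the value was raised from 10 in steps of 10.

```python
from typing import Callable, Optional, Sequence


def first_row_tol_with_rows(
    read_rows: Callable[[int], Sequence],
    expected_rows: int,
    start: int = 10,
    step: int = 10,
    stop: int = 100,
) -> Optional[int]:
    """Return the first row_tol in start, start+step, ... (up to stop)
    for which read_rows(row_tol) yields expected_rows rows, else None."""
    for row_tol in range(start, stop + 1, step):
        if len(read_rows(row_tol)) == expected_rows:
            return row_tol
    return None


def camelot_rows(row_tol: int, path: str = "data.pdf"):
    """Read the first stream table with camelot and return its rows as lists."""
    import camelot  # imported lazily so the helper above works without it

    tables = camelot.read_pdf(path, flavor="stream", edge_tol=60, row_tol=row_tol)
    return tables[0].df.values.tolist()
```

With the number of rows counted by hand from the PDF, the call would be first_row_tol_with_rows(camelot_rows, expected_rows=...). Checking the resulting table by eye is still advisable, since a matching row count does not guarantee the cells are split correctly.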

Download

!wget https://www.fukushihoken.metro.tokyo.lg.jp/iryo/kansen/corona_portal/info/kunishihyou.files/kuni0824.pdf -O data.pdf

Command

camelot -p 1 -o data.csv -f csv stream -e 60 -r 40 data.pdf
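
The command above does not apply the label fix that the Python version does. A minimal sketch of patching the exported CSV afterwards with the standard csv module is shown here. The helper names patch_cell and patch_csv are hypothetical, and the default file name is an assumption: camelot usually appends the page and table number to the output name, so check which file was actually written. It also assumes the CLI output has the same row layout as the DataFrame in the Python version (merged label at row 6, column 0).

```python
import csv


def patch_cell(rows, row, col, value):
    """Return a copy of rows with rows[row][col] replaced by value."""
    patched = [list(r) for r in rows]
    patched[row][col] = value
    return patched


def patch_csv(path="data-page-1-table-1.csv", row=6, col=0, value="入院率"):
    """Rewrite the CSV at path with one cell replaced (path is an assumed default)."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(patch_cell(rows, row, col, value))
```

Keeping patch_cell separate from the file handling makes it easy to try the fix on a few rows in memory before overwriting the file.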

Python

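import camelot

# Stream flavor with the tolerances found above; strip_text removes spaces and newlines from cells
tables = camelot.read_pdf(
    "data.pdf", flavor="stream", edge_tol=60, row_tol=40, strip_text=" \n"
)

# The first detected table as a pandas DataFrame
df = tables[0].df

# Overwrite the merged label cell so that it reads only "入院率"
df.iat[6, 0] = "入院率"

df

# Write the CSV without the index or the numeric header row
df.to_csv("data.csv", index=False, header=False)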
