
Converting PDF Tables to Excel Tables with Python

Last updated at 2023-06-06, posted at 2023-05-29

Prerequisites

Windows 10
Python3
Java
pandas
tabula-py

The official tabula-py documentation is here:
https://tabula-py.readthedocs.io/en/latest/tabula.html

The official pandas documentation is here:
https://pandas.pydata.org/docs/

tabula-py requires Java to be installed.
~~Setting up the Java environment is not covered here.~~

I have written a separate article on setting up the Java environment.
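
Because tabula-py depends on Java, it can help to confirm that a `java` executable is reachable before running the sample. The following is a minimal sketch using only the standard library. `find_java` and its `search_path` parameter are hypothetical names introduced here for illustration; they are not part of tabula-py.

```python
import shutil
from typing import Optional


def find_java(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of the java executable, or None if it is not found.

    search_path is a hypothetical override for PATH; None means "use PATH".
    """
    return shutil.which("java", path=search_path)


if __name__ == "__main__":
    java = find_java()
    if java is None:
        print("java was not found on PATH; install Java before using tabula-py")
    else:
        print(f"java found at: {java}")
```

If the script reports that `java` was not found, the usual cause is that Java is not installed or its `bin` directory is not on `PATH`.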

The following PDF is used as the sample:
https://www.mhlw.go.jp/content/10906000/001094070.pdf

Purpose

Use tabula-py to extract the tables from a PDF and write them to Excel as tables.

Environment setup

pip install pandas tabula-py openpyxl

openpyxl is included because pandas needs it to write .xlsx files.
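
After installation, a quick check that the packages are importable can save a confusing error later. This is a minimal sketch using only the standard library. `missing_packages` is a hypothetical helper name, and the inclusion of `openpyxl` in the list is an assumption based on pandas typically needing it to write .xlsx files.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(import_names: Iterable[str]) -> List[str]:
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]


if __name__ == "__main__":
    # "tabula" is the import name of tabula-py; "openpyxl" is an assumption:
    # it is the package pandas typically needs to write .xlsx files.
    missing = missing_packages(["pandas", "tabula", "openpyxl"])
    print("missing:", missing if missing else "none")
```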

Sample

pdftable2excel.py

# -*- coding: utf-8 -*-

import pandas as pd
from tabula import read_pdf

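# Extract the tables from the PDF; lattice=True detects cells from the ruled lines.
# Note: read_pdf reads only page 1 by default; pass pages="all" to read every page.
dfs = read_pdf("https://www.mhlw.go.jp/content/10906000/001094070.pdf", lattice=True)
# Output path (the file name means "COVID-19 positive cases and number of people tested by PCR etc.")
excel = 'C:\\pdf\\新型コロナウイルス陽性者数とPCR検査等実施人数.xlsx'

with pd.ExcelWriter(excel) as writer:
    # Write each PDF table to its own Excel sheet, named by its index
    for i, df in enumerate(dfs):
        df.to_excel(writer, sheet_name=str(i))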
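
The sample names each sheet by its index, which is always safe. If more descriptive sheet names are wanted, Excel's naming rules apply: a sheet name is limited to 31 characters and may not contain `\ / ? * [ ] :`. The following is a minimal sketch of a helper that handles only those two rules. `safe_sheet_name` and its `label` argument are hypothetical names introduced here, not part of pandas or tabula-py.

```python
import re

# Characters Excel does not allow in sheet names: \ / ? * [ ] :
_INVALID_SHEET_CHARS = re.compile(r"[\\/?*\[\]:]")
_MAX_SHEET_NAME_LENGTH = 31  # Excel's limit on sheet-name length


def safe_sheet_name(label: str, index: int) -> str:
    """Build a sheet name such as '0_label' that fits Excel's basic rules.

    Forbidden characters are replaced with '_' and the result is truncated
    to 31 characters. An empty label falls back to the index alone.
    """
    cleaned = _INVALID_SHEET_CHARS.sub("_", label).strip()
    name = f"{index}_{cleaned}" if cleaned else str(index)
    return name[:_MAX_SHEET_NAME_LENGTH]


if __name__ == "__main__":
    print(safe_sheet_name("PCR/tests: 2023", 0))
```

Inside the loop, the call would then look like `df.to_excel(writer, sheet_name=safe_sheet_name("table", i))`. Keeping the index as a prefix keeps the names unique unless truncation cuts into a very long index.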

Appending to a CSV file instead

pdftable2csv.py

# -*- coding: utf-8 -*-

from tabula import read_pdf

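# Extract the tables from the PDF (read_pdf reads only page 1 unless pages is given)
dfs = read_pdf("https://www.mhlw.go.jp/content/10906000/001094070.pdf", lattice=True)
# Output path (the file name means "COVID-19 positive cases and number of people tested by PCR etc.")
csv_path = 'C:\\pdf\\新型コロナウイルス陽性者数とPCR検査等実施人数.csv'

# Append every table to the same CSV file, without index or header rows
for df in dfs:
    df.to_csv(csv_path, mode='a', encoding='shift-jis', index=False, header=False)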
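
Because the CSV is opened in append mode, running the script a second time adds the same rows again to the existing file. One way to avoid that is to delete any previous output before the loop starts. This is a minimal sketch using only the standard library; `reset_output` is a hypothetical helper name introduced here.

```python
from pathlib import Path


def reset_output(path: str) -> bool:
    """Delete a previous output file so appended rows start from an empty file.

    Returns True if an old file existed and was removed, otherwise False.
    """
    target = Path(path)
    if target.exists():
        target.unlink()
        return True
    return False
```

Calling `reset_output` once with the output path, before the `for` loop, means each run produces a file containing only the rows from that run.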

Motivation

I wanted to extract tables from a PDF, save them as an Excel file, and reuse the data, so I wrote this program.
