How to use tabula.read_pdf

tabula.read_pdf

Last updated at 2024-05-28, posted at 2024-05-28

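What is tabula.read_pdf?
tabula.read_pdf is a function in the Python module tabula (installed as tabula-py) that extracts tables from PDF files. Other modules can also read PDFs, but tabula reportedly specializes in table extraction and returns each table as a pandas DataFrame. Note that it relies on tabula-java under the hood, so Java must be installed.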
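
Since Java is a prerequisite, it can help to confirm that a `java` executable is reachable before calling tabula. The following is a minimal sketch, assuming Java is expected to be found on PATH; `find_java` is a hypothetical helper name, not part of tabula-py.

```python
import shutil


def find_java():
    """Return the full path of the `java` executable on PATH, or None if it is absent."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    # Assumption: without `java` on PATH, tabula cannot start its Java backend.
    print("Java was not found on PATH. Install Java before using tabula.")
else:
    print(f"Java found at: {java_path}")
```

If the message says Java was not found right after installing it, opening a new Command Prompt window so that the updated PATH is picked up is worth trying.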

Environment
OS: Windows 10

Usage
Installing tabula
Open Command Prompt and install tabula.

$ pip install tabula-py

Using tabula.read_pdf

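import pandas as pd
import tabula

# Replace the placeholder with the path to your PDF.
# read_pdf returns a list of DataFrames, one per table it detects.
data = tabula.read_pdf("path/to/file.pdf", lattice=True, pages='all', pandas_options={'header': None})
print(data)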

Calling tabula.read_pdf with the path to the PDF reads the file. The example above passes a few options; remove any you do not need.
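
To make it easier to decide which options to keep, here is a rough sketch that annotates the three options used in this article and lets you override or drop them. `ARTICLE_OPTIONS`, `build_options`, and `read_tables` are hypothetical names introduced for illustration; only `tabula.read_pdf` and its `lattice`, `pages`, and `pandas_options` arguments come from tabula-py.

```python
# The options used in this article, with a note on what each one does.
ARTICLE_OPTIONS = {
    "lattice": True,                     # use the ruling lines of the table to find cell borders
    "pages": "all",                      # read every page (tabula-py reads only page 1 by default)
    "pandas_options": {"header": None},  # do not treat the first row as column names
}


def build_options(**overrides):
    """Return the article's options with overrides applied; a value of None drops that option."""
    options = dict(ARTICLE_OPTIONS)
    options.update(overrides)
    return {key: value for key, value in options.items() if value is not None}


def read_tables(pdf_path, **overrides):
    """Call tabula.read_pdf with the merged options."""
    import tabula  # imported here so the helpers above can be used without tabula-py installed
    return tabula.read_pdf(pdf_path, **build_options(**overrides))
```

For example, `read_tables("path/to/file.pdf", pages=1)` would read only the first page, and `read_tables("path/to/file.pdf", lattice=None)` would fall back to tabula's default table detection.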

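When the PDF contains multiple tables

If the PDF contains multiple tables, they are returned as a list. For example:

list[0] → first table (DataFrame)
list[1] → second table (DataFrame)

This is how you can extract tables from a PDF.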
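
Because the result is a list, each table can be handled separately by index or in a loop. As one possible follow-up, the sketch below writes every table to its own CSV file. `save_tables` and the `table_N.csv` naming are assumptions made for illustration, and `header=False` is chosen only because the example in this article reads tables without a header row.

```python
from pathlib import Path


def save_tables(tables, out_dir, stem="table"):
    """Write each DataFrame in `tables` to its own CSV file and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, table in enumerate(tables):
        path = out_dir / f"{stem}_{index}.csv"
        # index=False omits the row numbers; header=False omits the column labels
        table.to_csv(path, index=False, header=False)
        paths.append(path)
    return paths


# Example usage (requires tabula-py, Java, and a real PDF):
#   import tabula
#   tables = tabula.read_pdf("path/to/file.pdf", lattice=True, pages="all",
#                            pandas_options={"header": None})
#   save_tables(tables, "output")
```

With two tables in the PDF, this would produce `output/table_0.csv` and `output/table_1.csv`.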
