Parsing PDF files with pdfplumber in Python

Posted at 2025-11-24, last updated at 2025-11-24

Goal

Parse a PDF file with Python and extract its text and other page elements.

Target file: Summary of the Consumer Confidence Survey results (消費動向調査結果の概要)
https://www.esri.cao.go.jp/jp/stat/shouhi/gaiyou.pdf

pdfplumber

The pdfplumber library is used for parsing.

Installation

pip install pdfplumber

Example program

import sys
import pdfplumber

# Print every kind of element found on each page
def output_all(pdf_file):
    with pdfplumber.open(pdf_file) as pdf:
        for page in pdf.pages:
            print(f"annots: {page.annots}\n")
            print(f"texts: {page.extract_text()}\n")
            print(f"words: {page.extract_words()}\n")
            print(f"chars: {page.chars}\n")
            print(f"tables: {page.extract_tables()}\n")
            print(f"lines: {page.lines}\n")
            print(f"rects: {page.rects}\n")
            print(f"images: {page.images}\n")
            print(f"hyperlinks: {page.hyperlinks}\n")

pdf_file = sys.argv[1]
output_all(pdf_file)

Output (excerpt)

text

page.extract_text() returns only the text of the page, without any position information.

texts: 報道資料
消費動向調査（令和７ (2025)年10月実施分）
結果の概要
＜消 費者マインド＞

words

page.extract_words() returns runs of consecutive text together with their position information.

words: [
{'text': '報道資料', 'x0': 53.4, 'top': 62.88347999999996, 'width': 56.160000000000004, 'height': 14.040000000000077, ...},
{'text': '消費動向調査（令和７', 'x0': 177.96, 'top': 93.56399999999996, 'width': 120.47999999999999, 'height': 12.0, ...},
...
]

The output above has been broken across lines for readability.
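Because each word carries its own `top` and `x0`, the words can be regrouped into visual lines. The following is a minimal sketch, not part of pdfplumber itself: the function name `group_words_into_lines` and the 3-point `tolerance` are assumptions chosen for illustration, and it relies only on the `text`, `x0`, and `top` keys visible in the output above.

```python
def group_words_into_lines(words, tolerance=3.0):
    """Group word dicts into text lines by their vertical position.

    `words` is a list of dicts with at least 'text', 'x0' and 'top' keys,
    such as the list returned by page.extract_words().
    `tolerance` (in points) is an assumed threshold, not a pdfplumber default.
    """
    lines = []
    # Walk the words from top to bottom, then left to right.
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Compare against the first word of the current line.
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Inside each line, order the words from left to right and join them.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]

# Usage (assuming `page` is a pdfplumber page):
#   for text in group_words_into_lines(page.extract_words()):
#       print(text)
```

Joining with a space suits Latin text; for Japanese text an empty separator may read better.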

chars

page.chars returns the text one character at a time, together with position information.

chars: [
{
 'object_type': 'char',
 'page_number': 1,
 'x0': 298.92, 'y0': 25.91808, 'x1': 304.79136, 'y1': 36.47808,
 'text': '1',
 'matrix': (10.56, 0.0, 0.0, 10.56, 298.92, 29.16),
 'fontname': 'BVTPJW+Century',
 'width': 5.8713599999999815, 'height': 10.559999999999999, 'size': 10.559999999999999,
 'top': 805.44192, 'bottom': 816.0019199999999, 'doctop': 805.44192
 ...
},
...
]

The output above has been broken across lines for readability.
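The per-character `fontname` and `size` values make it possible to see which fonts a page uses, which can help tell headings from body text. This is a rough sketch under that assumption; `summarize_fonts` is a hypothetical helper name, and sizes are rounded to one decimal place only to absorb floating-point noise such as `10.559999999999999`.

```python
from collections import Counter

def summarize_fonts(chars):
    """Count characters per (fontname, rounded size) pair.

    `chars` is a list of dicts with 'fontname' and 'size' keys,
    such as page.chars.
    """
    return Counter((c["fontname"], round(c["size"], 1)) for c in chars)

# Usage (assuming `page` is a pdfplumber page):
#   for (fontname, size), count in summarize_fonts(page.chars).most_common():
#       print(fontname, size, count)
```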

tables

page.extract_tables() returns the text of the tables on the page.

tables: [
[['から見た10月の消費者マインドは、持ち直している。...', ...],
 ['二人以上の世帯、季節調整値）', ...],
 ...
],
...
]

The output above has been broken across lines for readability.
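Each table is a list of rows, and each row is a list of cell values, so a table maps naturally onto CSV. The sketch below uses only the standard library; `table_to_csv_text` is a hypothetical helper name, and replacing empty (`None`) cells with an empty string and in-cell line breaks with spaces are choices made for this example.

```python
import csv
import io

def table_to_csv_text(table):
    """Convert one table (a list of rows of cell values) to CSV text.

    Cells may be None when a cell is empty; those become empty strings.
    Line breaks inside a cell are replaced with spaces.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow([(cell or "").replace("\n", " ") for cell in row])
    return buffer.getvalue()

# Usage (assuming `page` is a pdfplumber page):
#   for table in page.extract_tables():
#       print(table_to_csv_text(table))
```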

lines

page.lines returns the lines drawn on the page.

lines: [
{'object_type': 'line',
 'page_number': 1,
 'x0': 63.23153740615001, 'y0': 518.86018180672, 'x1': 522.98285490615, 'y1': 518.86018180672,
 'width': 459.7513175, 'height': 0.0,
 'pts': [(63.23153740615001, 323.05981819327997), (522.98285490615, 323.05981819327997)],
 'linewidth': 0.0, 'stroke': True, 'fill': False, 'evenodd': False, 'stroking_color': 0, 'non_stroking_color': 0.0,
 'tag': 'Artifact', ...},
...
]
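Line objects such as the one above can be sorted by direction using their endpoint coordinates, which is often a first step when looking for table rules. This is a minimal sketch; `split_lines_by_orientation` and the 0.5-point `tolerance` are assumptions for illustration, and only the `x0`, `y0`, `x1`, and `y1` keys shown in the output are used.

```python
def split_lines_by_orientation(lines, tolerance=0.5):
    """Split line dicts into horizontal, vertical and other lines.

    `lines` is a list of dicts with 'x0', 'y0', 'x1' and 'y1' keys,
    such as page.lines. `tolerance` (in points) is an assumed threshold.
    """
    groups = {"horizontal": [], "vertical": [], "other": []}
    for line in lines:
        dx = abs(line["x1"] - line["x0"])
        dy = abs(line["y1"] - line["y0"])
        if dy <= tolerance and dx > tolerance:
            groups["horizontal"].append(line)
        elif dx <= tolerance and dy > tolerance:
            groups["vertical"].append(line)
        else:
            # Diagonal lines and zero-length lines end up here.
            groups["other"].append(line)
    return groups

# Usage (assuming `page` is a pdfplumber page):
#   groups = split_lines_by_orientation(page.lines)
#   print(len(groups["horizontal"]), len(groups["vertical"]))
```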

rects

page.rects returns the rectangles drawn on the page.

rects: [
{'object_type': 'rect',
 'page_number': 1,
 'x0': 45.8, 'y0': 759.22, 'x1': 117.0, 'y1': 784.47,
 'width': 71.2, 'height': 25.25,
 'tag': 'TextBox',
 ...
},
...
]
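Rectangles and words can be combined to find the text that sits inside a given box. The sketch below assumes that both the rect dict and the word dicts carry `x0`, `x1`, `top`, and `bottom` keys (the rect output above is truncated, so `top` and `bottom` are not shown there); `words_inside_rect` and the 1-point `margin` are hypothetical choices for this example.

```python
def words_inside_rect(rect, words, margin=1.0):
    """Return the word dicts that lie entirely inside `rect`.

    Both `rect` and each word are assumed to have 'x0', 'x1', 'top'
    and 'bottom' keys measured in the same coordinate system.
    `margin` (in points) loosens the comparison slightly.
    """
    return [
        w for w in words
        if w["x0"] >= rect["x0"] - margin
        and w["x1"] <= rect["x1"] + margin
        and w["top"] >= rect["top"] - margin
        and w["bottom"] <= rect["bottom"] + margin
    ]

# Usage (assuming `page` is a pdfplumber page):
#   words = page.extract_words()
#   for rect in page.rects:
#       print([w["text"] for w in words_inside_rect(rect, words)])
```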
