
Getting the coordinates of each text block in a PDF with Python

Last updated at 2023-07-06, posted at 2023-07-05

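Conclusion

With pdfminer.six, you can get the coordinates of each text block by retrieving its layout object.

Prerequisites

These are the Python and pdfminer.six versions in use when this article was written.

Name	Version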
Python	3.10.12
pdfminer.six	20221105

Setup

Install pdfminer.six so that the sample code below can run. Run the following command in a console.

pip install pdfminer.six
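To check that the install succeeded and to compare against the version in the table above, something like the following minimal sketch may help. It relies only on the standard library's `importlib.metadata`. The helper name `installed_version` is made up for illustration and is not part of pdfminer.six.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "pdfminer.six" is the distribution name used in the pip command above
    print(installed_version("pdfminer.six"))
```

If this prints `None`, the package is not visible to the interpreter running the script, which often means pip installed it into a different environment.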

Sample code

.py

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams, LTTextContainer
from pdfminer.converter import PDFPageAggregator

def main():
    manager = PDFResourceManager()

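    # pdf_file avoids shadowing the built-in input()
    with open('sample.pdf', 'rb') as pdf_file:
        with PDFPageAggregator(manager, laparams=LAParams()) as device:
            # Create the PDFPageInterpreter that renders each page into the device
            iprtr = PDFPageInterpreter(manager, device)
            # Process the PDF one page at a time
            for page in PDFPage.get_pages(pdf_file):
                iprtr.process_page(page)
                # Layout objects found on this page
                layouts = device.get_result()
                for layout in layouts:
                    # Skip non-text objects such as ruled lines, which would otherwise raise an error
                    if isinstance(layout, LTTextContainer):
                        # The origin is the bottom-left corner of each page
                        # x0: x coordinate of the left edge of the text
                        # x1: x coordinate of the right edge of the text
                        # y0: y coordinate of the bottom edge of the text
                        # y1: y coordinate of the top edge of the text
                        # width: width of the text (x1 - x0)
                        # height: height of the text (y1 - y0)
                        print(f'{layout.get_text().strip()}, x0={layout.x0:.2f}, x1={layout.x1:.2f}, y0={layout.y0:.2f}, y1={layout.y1:.2f}, width={layout.width:.2f}, height={layout.height:.2f}')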

if __name__ == '__main__':
    main()
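Because the coordinates above use the bottom-left corner of the page as the origin, they may need to be flipped before being used with tools that measure from the top-left, as many image libraries do. The following is a minimal sketch of that conversion. `Box` and `to_top_left` are hypothetical names introduced here, and the page height is assumed to be known, for example from the `height` attribute of the page layout object returned by `device.get_result()`.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A bounding box: (x0, y0) and (x1, y1) are opposite corners."""
    x0: float
    y0: float
    x1: float
    y1: float


def to_top_left(box: Box, page_height: float) -> Box:
    """Convert a bottom-left-origin box to a top-left-origin box.

    Only the y axis is flipped. The old top edge (y1) becomes the new,
    smaller y0, so y0 <= y1 still holds in the result.
    """
    return Box(box.x0, page_height - box.y1, box.x1, page_height - box.y0)


if __name__ == "__main__":
    # Hypothetical values chosen for easy arithmetic, not taken from sample.pdf
    print(to_top_left(Box(10.0, 20.0, 110.0, 50.0), page_height=100.0))
```

The width and height of the box are unchanged by this conversion. Only the vertical position is measured from the opposite edge of the page.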

Run

Let's run the sample code.
The sample.pdf prepared for this run looks like the image below.

The result was as follows.

.log

1 ページ目, x0=128.42, x1=303.47, y0=668.34, y1=704.34, width=175.05, height=36.00
サンプル 1, x0=311.47, x1=484.37, y0=569.46, y1=605.48, width=172.90, height=36.02
サンプル 2, x0=162.14, x1=507.95, y0=422.75, y1=494.78, width=345.81, height=72.02
サンプル 3, x0=66.14, x1=181.34, y0=329.19, y1=353.19, width=115.20, height=24.00
, x0=85.10, x1=90.38, y0=728.33, y1=738.89, width=5.28, height=10.56
2 ページ目, x0=104.42, x1=279.47, y0=663.90, y1=699.90, width=175.05, height=36.00
サンプル 4, x0=116.18, x1=346.63, y0=478.94, y1=526.94, width=230.45, height=48.00
サンプル 5, x0=341.47, x1=427.87, y0=292.35, y1=310.35, width=86.40, height=18.00
, x0=85.10, x1=90.38, y0=728.33, y1=738.89, width=5.28, height=10.56

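The output confirms that coordinates are retrieved for each text block.
Note that the 5th and 9th lines of the output contain coordinates for empty text. I believe this is caused by a few blank lines that were left in the document when the PDF was created in Word.

That's all.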
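If the empty entries are unwanted, one option is to skip any text block whose text is blank after stripping whitespace. The following is a minimal sketch of that idea. `iter_non_blank` is a hypothetical helper, and it uses duck typing (any object with `get_text()` and `bbox`) rather than the `isinstance(layout, LTTextContainer)` check from the sample code, so that it can be tried without a PDF at hand.

```python
from typing import Iterable, Iterator, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iter_non_blank(layouts: Iterable[object]) -> Iterator[Tuple[str, BBox]]:
    """Yield (text, bbox) for layout objects whose text is not blank."""
    for layout in layouts:
        get_text = getattr(layout, "get_text", None)
        if get_text is None:
            # Not a text object (e.g. a ruled line), so skip it
            continue
        text = get_text().strip()
        if text:
            # Whitespace-only text, such as a lone newline, is dropped
            yield text, layout.bbox


class FakeTextBox:
    """Stand-in for a pdfminer text container, for demonstration only."""

    def __init__(self, text: str, bbox: BBox) -> None:
        self._text = text
        self.bbox = bbox

    def get_text(self) -> str:
        return self._text


if __name__ == "__main__":
    boxes = [
        FakeTextBox("サンプル 1\n", (311.47, 569.46, 484.37, 605.48)),
        FakeTextBox(" \n", (85.10, 728.33, 90.38, 738.89)),
    ]
    for text, bbox in iter_non_blank(boxes):
        print(text, bbox)
```

In the sample code, the same effect could be had by adding a check on `layout.get_text().strip()` next to the existing `isinstance` test.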
