
[PDFMiner] Extracting text from a PDF

Last updated at 2018-06-07, posted at 2018-03-30

This article uses PDFMiner.six, the fork of PDFMiner that supports Python 3.

Installation

$ pip install pdfminer.six

If the command does not work

wget https://pypi.python.org/packages/source/p/pdfminer.six/pdfminer.six-20160202.zip
unzip pdfminer.six-20160202.zip
cd pdfminer.six-20160202
python setup.py install
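
Whichever route was used, a quick check that the package landed in the Python environment you actually run can save some confusion. The following is only a minimal sketch: the helper name `is_installed` is hypothetical, and the check only confirms that a top-level module named `pdfminer` can be found (PDFMiner.six is imported under that name), not which version or fork it is.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed('pdfminer'):
    print('pdfminer is importable')
else:
    print('pdfminer was not found; check which Python environment pip installed into')
```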

With Anaconda

Diagram

Source: Programming with PDFMiner

Class	Role
PDFParser	Fetches data from the PDF file
PDFDocument	Stores the fetched data
PDFPageInterpreter	Processes the page contents
PDFDevice	Converts the result into the required format
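
The roles in the table can be sketched as a single function. This is a rough illustration, not the canonical way to use the library: the function name `extract_text` is hypothetical, `TextConverter` is just one possible choice of PDFDevice, and the pdfminer.six imports are placed inside the function only so that the file loads even where the package is missing.

```python
import io


def extract_text(pdf_path):
    """Return the text of a PDF as one string (sketch under the assumptions above)."""
    # Imported here so this module can be loaded without pdfminer.six installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    out = io.StringIO()
    resource_manager = PDFResourceManager()
    # PDFDevice role: TextConverter turns the interpreted page into plain text.
    device = TextConverter(resource_manager, out, laparams=LAParams())
    # PDFPageInterpreter role: processes each page and feeds the device.
    interpreter = PDFPageInterpreter(resource_manager, device)

    with open(pdf_path, 'rb') as f:
        parser = PDFParser(f)            # PDFParser role: reads data from the file.
        document = PDFDocument(parser)   # PDFDocument role: holds the parsed data.
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)

    device.close()
    return out.getvalue()
```

Calling `extract_text('some.pdf')` with the path of a real PDF should return its text. The sample later in this article skips the explicit PDFParser and PDFDocument steps by using `PDFPage.get_pages()` instead.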

Processing flow

Layout

Sample

Source: http://gihyo.jp/book/2017/978-4-7741-8367-1

This time, let's parse the Oculus Best Practices PDF and write the result to a text file.

print_pdf_textboxes.py

import sys

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def find_textboxes_recursively(layout_obj):
    """
    Recursively search for text boxes (LTTextBox) and return them as a list.
    """
    # An object that inherits from LTTextBox is returned as a one-element list.
    if isinstance(layout_obj, LTTextBox):
        return [layout_obj]

    # An object that inherits from LTContainer has children, so search them recursively.
    if isinstance(layout_obj, LTContainer):
        boxes = []
        for child in layout_obj:
            boxes.extend(find_textboxes_recursively(child))

        return boxes

    return []  # Anything else yields an empty list.

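# Set the Layout Analysis parameters. Enable detection of vertical writing.
laparams = LAParams(detect_vertical=True)

# Create the resource manager that manages shared resources.
resource_manager = PDFResourceManager()

# Create a PDFPageAggregator object that collects each page's layout.
device = PDFPageAggregator(resource_manager, laparams=laparams)

# Create the interpreter object.
interpreter = PDFPageInterpreter(resource_manager, device)

# Text file for the output. UTF-8 is set explicitly so that non-ASCII text
# can be written regardless of the platform's default encoding.
output_txt = open('output.txt', 'w', encoding='utf-8')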

def print_and_write(txt):
    print(txt)
    output_txt.write(txt)
    output_txt.write('\n')

with open(sys.argv[1], 'rb') as f:
    # Pass the file object to PDFPage.get_pages() to get the PDFPage objects one by one.
    # For files that take a long time, pass the keyword argument pagenos with a list of
    # the page numbers (0-based) to process.
    for page in PDFPage.get_pages(f):
        # Page separator ('ページ区切り' means "page break").
        print_and_write('\n====== ページ区切り ======\n')
        interpreter.process_page(page)  # Process the page.
        layout = device.get_result()  # Get the LTPage object.

        # Get the list of text boxes on the page.
        boxes = find_textboxes_recursively(layout)

        # Sort the text boxes by the coordinates of their top-left corner.
        # y1 (the Y coordinate) grows toward the top of the page, so its sign is flipped.
        boxes.sort(key=lambda b: (-b.y1, b.x0))

        for box in boxes:
            print_and_write('-' * 10)  # Print a separator line for readability.
            print_and_write(box.get_text().strip())  # Print the text inside the text box.

output_txt.close()
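
The sort key `(-b.y1, b.x0)` used in the script is easy to try on its own. The sketch below uses a hypothetical `Box` namedtuple as a stand-in for LTTextBox, with made-up coordinates, just to show that boxes higher on the page come first and that boxes at the same height are ordered left to right.

```python
from collections import namedtuple

# Hypothetical stand-in for LTTextBox: only the attributes used by the sort key.
Box = namedtuple('Box', ['name', 'x0', 'y1'])

boxes = [
    Box('footer', x0=50, y1=40),
    Box('right column', x0=300, y1=700),
    Box('left column', x0=50, y1=700),
    Box('title', x0=50, y1=780),
]

# Same key as the script: larger y1 (higher on the page) first, then smaller x0.
boxes.sort(key=lambda b: (-b.y1, b.x0))
reading_order = [b.name for b in boxes]
print(reading_order)  # ['title', 'left column', 'right column', 'footer']
```

Note that this ordering is a simple heuristic: in a multi-column layout it interleaves the columns row by row rather than reading one column to the end.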

Run

$ python print_pdf_textboxes.py <target PDF>
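
For a large PDF, the `pagenos` keyword argument mentioned in the script's comments limits which pages are processed. Since `pagenos` is 0-based while people usually think in 1-based page numbers, a small converter can help. This is a sketch only: the helper `to_pagenos` and its `'1-3,5'` spec format are hypothetical and not part of PDFMiner.

```python
def to_pagenos(spec):
    """Convert a 1-based page spec such as '1-3,5' into a sorted list of 0-based page numbers."""
    pagenos = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            first, last = part.split('-', 1)
            # Pages first..last (1-based, inclusive) become first-1..last-1 (0-based).
            pagenos.update(range(int(first) - 1, int(last)))
        else:
            pagenos.add(int(part) - 1)
    return sorted(pagenos)


print(to_pagenos('1-3,5'))  # [0, 1, 2, 4]

# In the script, the result would be passed as:
#     for page in PDFPage.get_pages(f, pagenos=to_pagenos('1-3,5')):
```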

output.txt


====== ページ区切り ======

----------
Oculus Best Practices
----------
Version 310-30000-02

====== ページ区切り ======

----------
2 | Introduction | Best Practices
----------
Copyrights and Trademarks
----------
© 2017 Oculus VR, LLC. All Rights Reserved.
----------
OCULUS VR, OCULUS, and RIFT are trademarks of Oculus VR, LLC. (C) Oculus VR, LLC. All rights reserved.
BLUETOOTH is a registered trademark of Bluetooth SIG, Inc. All other trademarks are the property of their
respective owners. Certain materials included in this publication are reprinted with the permission of the
copyright holder.
----------
2 |  |

====== ページ区切り ======

----------
Best Practices | Contents | 3
----------
Contents
----------
Introduction to Best Practices..............................................................................4
----------
Binocular Vision, Stereoscopic Imaging and Depth Cues................................. 10
----------
Field of View and Scale.....................................................................................13
----------
Rendering Techniques....................................................................................... 15
----------
Motion................................................................................................................ 17
----------
Tracking.............................................................................................................. 20
----------
Simulator Sickness..............................................................................................23
----------
User Interface..................................................................................................... 30
----------
User Input and Navigation.................................................................................34
----------
Closing Thoughts............................................................................................... 36

====== ページ区切り ======

----------
4 | Introduction to Best Practices | Best Practices
----------
Introduction to Best Practices
----------
VR is an immersive medium. It creates the sensation of being entirely transported into a virtual (or real, but
digitally reproduced) three-dimensional world, and it can provide a far more visceral experience than screen-
based media. These best practices are intended to help developers produce content that provides a safe and
enjoyable consumer experience on Oculus hardware. Developers are responsible for ensuring their content
conforms to all standards and industry best practices on safety and comfort, and for keeping abreast of all
relevant scientific literature on these topics.
・
・
・

Bonus

Apparently, opening a file stored on Google Drive with Google Docs converts it to text. That route gives more accurate results.
Convert PDF and photo files to text
