
Extracting Text from PDFs with Python

Last updated at 2019-10-20 / Posted at 2019-10-20

Introduction

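For purposes such as full-text search, you sometimes need to extract the contents of a PDF as text.
The PyPDF2 library looks usable, but for PDFs that contain Japanese,
either pdfminer.six or Apache Tika can be used to extract the Japanese text.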

These are my notes on the libraries related to text extraction.
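The samples in this article use Tika, so here is a minimal sketch of the pdfminer.six route for comparison. It assumes a pdfminer.six release that provides pdfminer.high_level.extract_text. The function names extract_with_pdfminer and contains_japanese are hypothetical helpers introduced only for illustration. The second one is a rough check that Japanese characters actually came through.

```python
def contains_japanese(text):
    """Return True if the text has hiragana, katakana, or common kanji."""
    for ch in text:
        code = ord(ch)
        # U+3040-U+30FF: hiragana and katakana, U+4E00-U+9FFF: CJK ideographs
        if 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
            return True
    return False


def extract_with_pdfminer(path):
    """Extract all text from the PDF at `path` using pdfminer.six."""
    # Imported lazily so contains_japanese works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return extract_text(path)
```

Calling contains_japanese(extract_with_pdfminer(path)) on a Japanese PDF gives a quick sanity check before the text is passed to a search index.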

Sample: extracting with Tika

Installing Tika

pip install tika

Sample

pdf.py

from tika import parser

pdf = parser.from_file("C:/var/k_ryouyouhi_shinseisho.pdf")

print(pdf["content"])

Running the script downloads the Tika server JAR file, as the log below shows, and the text was extracted correctly.

python .\pdf.py
2019-10-20 18:08:52,392 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\username\AppData\Local\Temp\tika-server.jar.
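The sample prints pdf["content"] as-is. For full-text indexing, a small wrapper can be handy, because the "content" value is None when Tika finds no text (for example in a scanned PDF) and the extracted text often contains many blank lines. The following is only a sketch: normalize_content and extract_pdf_text are hypothetical names, not part of the tika package.

```python
def normalize_content(content):
    """Strip each line and drop empty ones; return '' when content is None."""
    if content is None:
        return ""
    stripped = (line.strip() for line in content.splitlines())
    return "\n".join(line for line in stripped if line)


def extract_pdf_text(path):
    """Parse the file at `path` with Tika and return cleaned-up text."""
    # Imported lazily so normalize_content can be used without Tika installed.
    from tika import parser

    parsed = parser.from_file(path)  # dict with "metadata" and "content" keys
    return normalize_content(parsed.get("content"))
```

With this in place, print(extract_pdf_text("C:/var/k_ryouyouhi_shinseisho.pdf")) would replace the last two lines of pdf.py.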

pytesseract

pytesseract: https://pypi.org/project/pytesseract/

Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images.

When a document contains images or charts, it appears that the text inside them can also be extracted.
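I have not verified this myself, but a minimal sketch of OCR with pytesseract might look like the following. It assumes that the Tesseract engine and its Japanese language data ("jpn") are installed separately, that Pillow is available, and that the PDF page has already been rendered to an image file, since pytesseract reads images rather than PDFs. The names build_lang_arg and ocr_image are hypothetical.

```python
def build_lang_arg(languages):
    """Join Tesseract language codes into the 'jpn+eng' form it expects."""
    return "+".join(languages)


def ocr_image(image_path, languages=("jpn", "eng")):
    """Run OCR on a single image file and return the recognized text."""
    # Imported lazily; both packages need to be installed for this call.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=build_lang_arg(languages))
```

Passing several languages lets Tesseract handle pages that mix Japanese and English text.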

camelot

camelot: https://camelot-py.readthedocs.io/en/master/

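camelot appears to be able to extract tables from a PDF as data frames.
On my local machine, however, it failed with a Ghostscript error. Installing both the 32-bit and 64-bit versions of Ghostscript resolved the "not installed" error,
but the error below then appeared, so I stopped testing for the time being.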

File "C:\Python37\lib\site-packages\camelot\ext\ghostscript\_gsprint.py", line 169, in init_with_args 
    rc = libgs.gsapi_init_with_args(instance, len(argv), c_argv)
OSError: exception: access violation writing 0x00000080

Running pip install camelot-py[cv] also installs the related packages click, jdcal, et-xmlfile, openpyxl, PyPDF2, sortedcontainers, pdfminer.six, and opencv-python along with camelot-py.

After that, download and install Ghostscript. (On Windows, the 32-bit version seems to be required.)
ghostscript: https://www.ghostscript.com/download/gsdnld.html
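For reference, here is a sketch of what I was trying to run, plus a rough pre-check for a Ghostscript installation. The helper names find_ghostscript and read_tables are hypothetical. The check only looks for the usual Ghostscript command names on PATH. The traceback above shows camelot calling into the Ghostscript library through libgs, so finding the command does not guarantee that camelot will work, and this check would not diagnose the access violation.

```python
import shutil

# Usual Ghostscript command names: 64-bit Windows, 32-bit Windows, Unix-like.
GHOSTSCRIPT_COMMANDS = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(which=shutil.which):
    """Return the path of the first Ghostscript command found, else None."""
    for name in GHOSTSCRIPT_COMMANDS:
        path = which(name)
        if path:
            return path
    return None


def read_tables(pdf_path, pages="1"):
    """Return the tables found on the given pages as pandas DataFrames."""
    # Imported lazily; requires camelot-py and a working Ghostscript.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    return [table.df for table in tables]
```

If find_ghostscript() returns None, installing Ghostscript from the link above is the first thing to try before calling read_tables.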

That's all.
