
I want to embed text in a PDF with Python (PyOCR + pdf2image + Tesseract)

Last updated at 2021-04-01, posted at 2021-03-31

This time, I want to run OCR on scanned documents so that their text becomes searchable.

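Problem: the off-the-shelf OCR products on sale are not great

Dedicated OCR software

Among off-the-shelf products, the following three are the main candidates:

読取革命 by Panasonic (now sold by Sourcenext): ~¥12,980
e.Typist by Media Drive: ~¥12,980
本格読取 by Sourcenext: ~¥3,784
http://monomania.sblo.jp/article/55737163.html

→ Off-the-shelf OCR software is expensive and supports only Windows.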

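Bundled OCR features

Software that is not a dedicated OCR product, but that ships a sophisticated, specialized OCR engine as one of its features:

Adobe Acrobat DC: ~¥37,000
DocuWorks 9 Japanese edition: ¥13,500

→ Both are a bit pricey, so...

I will try building one myself.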

Overview

PDF file
↓(pdf2image + poppler)
Image
↓(PyOCR + Tesseract)
Text
↓(Tesseract, with textonly_pdf=1)
Text only PDF
↓(qpdf)
PDF with Text

Step 1: Preparing Tesseract OCR

Tesseract OCR is an open-source OCR engine. Version 4 supports LSTM.

I use Homebrew on a Mac, so it can be installed with brew install tesseract.

From the tessdata_best repository, download

eng.traineddata for English,
jpn.traineddata for Japanese, and
jpn_vert.traineddata for vertical Japanese text,

and, for Homebrew's Tesseract, put them in /usr/local/Cellar/tesseract/X.X.X/share/tessdata/ (apparently the eng.traineddata that is there by default is not the best version).
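
Before running OCR, it may help to confirm that the three files actually ended up in the tessdata directory. The following is a minimal sketch under that assumption; the function name missing_traineddata is hypothetical (it is not part of Tesseract or PyOCR), and the directory you pass in is whatever your installation uses.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, langs=("eng", "jpn", "jpn_vert")):
    """Return the language codes whose <lang>.traineddata file is absent.

    `tessdata_dir` is the directory the traineddata files were copied to.
    """
    tessdata_dir = Path(tessdata_dir)
    return [
        lang
        for lang in langs
        if not (tessdata_dir / f"{lang}.traineddata").is_file()
    ]
```

Calling missing_traineddata with your tessdata path (for Homebrew, the directory above with the real version number filled in) returns an empty list when all three files are in place.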

Using Tesseract from the command line

https://github.com/tesseract-ocr/tesseract
For example, to recognize the English text in the image hoge.png and write it to hoge.txt:

tesseract hoge.png hoge

If the characters to recognize are limited to digits:

tesseract hoge.png hoge digits

Here, digits refers to the file /usr/local/Cellar/tesseract/X.X.X/share/tessdata/configs/digits, which contains the line "tessedit_char_whitelist 0123456789-.". You can add config files of this kind yourself.
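
As an illustration of adding your own config file, here is a small sketch that writes one in the same "name value" format as digits. The helper names render_config and write_config and the config name hexdigits are hypothetical; only tessedit_char_whitelist comes from the text above.

```python
from pathlib import Path


def render_config(variables):
    """Render Tesseract config text: one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())


def write_config(configs_dir, name, variables):
    """Write the config file into `configs_dir` and return its path."""
    path = Path(configs_dir) / name
    path.write_text(render_config(variables), encoding="utf-8")
    return path
```

For example, write_config(configs_dir, "hexdigits", {"tessedit_char_whitelist": "0123456789ABCDEF"}), with configs_dir pointing at the tessdata/configs directory, should produce a file that can be passed in the same position as digits: tesseract hoge.png hoge hexdigits.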

To specify the language, use commands such as:

tesseract hoge.png hoge -l eng   # English
tesseract hoge.png hoge -l jpn -c preserve_interword_spaces=1  # Japanese
tesseract hoge.png hoge -l eng+jpn  # English and Japanese
tesseract hoge.png hoge -l snum  # serial numbers

The corresponding *.traineddata must be present in /usr/local/Cellar/tesseract/X.X.X/share/tessdata.
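
The command lines above can also be assembled from Python. This is a rough sketch, not an official API: tesseract_cmd and run_tesseract are hypothetical helper names, and the argument order simply mirrors the examples above (image, output base, -l, -c options, then config names).

```python
import subprocess


def tesseract_cmd(image, out_base, langs=("eng",), configs=(), variables=None):
    """Build an argument list equivalent to the command lines shown above."""
    cmd = ["tesseract", str(image), str(out_base), "-l", "+".join(langs)]
    for name, value in (variables or {}).items():
        cmd += ["-c", f"{name}={value}"]  # e.g. preserve_interword_spaces=1
    cmd += list(configs)                  # e.g. "digits"
    return cmd


def run_tesseract(image, out_base, **kwargs):
    """Run Tesseract and raise CalledProcessError if it exits with an error."""
    subprocess.run(tesseract_cmd(image, out_base, **kwargs), check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with file names that contain spaces or parentheses.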

Step 2: Using OCR from Python

Install PyOCR, a Python wrapper for OCR tools:
pip install pyocr
The following three OCR tools are currently supported:

Libtesseract
Tesseract
Cuneiform
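
Since PyOCR may report more than one tool, it can be safer to select one by name than by list position. The sketch below assumes each tool object exposes get_name(), as PyOCR's tool modules do; pick_tool and demo are hypothetical names, and matching on a substring of the name is a simplification.

```python
def pick_tool(tools, preferred="tesseract"):
    """Return the first tool whose name contains `preferred`, else the first tool."""
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract installed and on PATH?")
    for tool in tools:
        if preferred.lower() in tool.get_name().lower():
            return tool
    return tools[0]


def demo():
    """Example usage; pyocr is imported lazily so it is only needed when called."""
    import pyocr

    tool = pick_tool(pyocr.get_available_tools())
    print(tool.get_name(), tool.get_available_languages())
```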

Step 2 (continued): Preparing to read PDFs

pdf2image

pip install pdf2image

Used as is, it fails with the following error:
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Install poppler with brew:
$ brew install poppler
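
To catch the missing-poppler case before pdf2image raises, one option is to check PATH first. This is a sketch under the assumption that pdfinfo and pdftoppm are the poppler programs pdf2image needs; missing_poppler_tools and convert_pdf are hypothetical helper names.

```python
import shutil


def missing_poppler_tools(tools=("pdfinfo", "pdftoppm"), which=shutil.which):
    """Return the poppler command names that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


def convert_pdf(pdf_path, dpi=300):
    """Convert a PDF to a list of page images, failing early without poppler."""
    missing = missing_poppler_tools()
    if missing:
        raise RuntimeError(
            "poppler tools not found: " + ", ".join(missing)
            + " (try: brew install poppler)"
        )
    # Imported lazily so the PATH check works even without pdf2image installed
    from pdf2image import convert_from_path

    return convert_from_path(str(pdf_path), dpi=dpi)
```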

Step 3: Processing the PDF

$ brew install qpdf

Reference:
https://askuscs.jp/python/tesseract2
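
In this article's script, qpdf is used to overlay the text-only PDF on the original PDF with a command of the form qpdf --overlay to.pdf -- org.pdf out.pdf. The sketch below builds that command from Python; qpdf_overlay_cmd and overlay are hypothetical helper names, and the check that the output differs from the input is an added safeguard, not something the script does.

```python
import subprocess
from pathlib import Path


def qpdf_overlay_cmd(text_pdf, original_pdf, output_pdf):
    """Build: qpdf --overlay <text_pdf> -- <original_pdf> <output_pdf>."""
    if Path(output_pdf).resolve() == Path(original_pdf).resolve():
        raise ValueError("output_pdf must differ from original_pdf")
    return ["qpdf", "--overlay", str(text_pdf), "--",
            str(original_pdf), str(output_pdf)]


def overlay(text_pdf, original_pdf, output_pdf):
    """Run qpdf and raise CalledProcessError if it reports an error."""
    subprocess.run(qpdf_overlay_cmd(text_pdf, original_pdf, output_pdf), check=True)
```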

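Step 4: Writing the code

As a test, I read only the first page of the Ministry of Health, Labour and Welfare's Monthly Labour Survey (preliminary results for September 2018, etc.).

The result of the recognition looks like this!
Searching for 「現金」 ("cash")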

Supplemental

import os
import re
import subprocess
from pathlib import Path

import pyocr
import pyocr.builders
from pdf2image import convert_from_path

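# Use the first OCR tool that PyOCR can find
tools = pyocr.get_available_tools()
if not tools:
    raise SystemExit("No OCR tool found")
tool = tools[0]
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
# The OCR calls in this script pass 'jpn' explicitly, so report that language
# instead of whatever happens to be second in the list
lang = 'jpn'
print("Will use lang '%s'" % (lang))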

# Name of the target PDF file (without extension)
Nm='vyvo(raw)'

# Input and output locations
pdf_path = Path("./pdf_file/"+Nm+".pdf")
txt_path = Path("./txt_file/"+Nm + ".txt")
out_path = Path("./pdf_file/output/"+Nm+".pdf")


# Convert the PDF into a list of page images (300 dpi)
# (the dpi setting may not make much difference)
pages = convert_from_path(str(pdf_path),300)

# OCR every page, then remove whitespace between adjacent Japanese characters
txt=''
for page in pages:
    txt = txt + tool.image_to_string(page,lang='jpn',builder=pyocr.builders.TextBuilder(tesseract_layout=6))
txt = re.sub(r'([あ-んア-ン一-龥ー])\s+(?=[あ-んア-ン一-龥ー])',r'\1', txt)
    
# Write the recognized text to a file
txt_path.parent.mkdir(parents=True, exist_ok=True)
with open(txt_path, mode='w', encoding='utf-8') as f:
    f.write(txt)

# Save the pages as a multi-page TIFF
image_dir = Path("./image_file")
image_dir.mkdir(parents=True, exist_ok=True)
tiff_name = pdf_path.stem + ".tif"
image_path = image_dir / tiff_name
to_pdf = pdf_path.stem + "_TO"
topdf_path = image_dir / to_pdf

# Delete files left over from a previous run
if os.path.exists(image_path):
    os.remove(image_path)
if os.path.exists(str(topdf_path)+'.pdf'):
    os.remove(str(topdf_path)+'.pdf')

pages[0].save(str(image_path), "TIFF", compression="tiff_deflate", save_all=True, append_images=pages[1:])

# Generate the text-only PDF (Tesseract appends ".pdf" to the output base name)
cmd = ['tesseract', '-c', 'page_separator=[PAGE SEPARATOR]', '-c', 'textonly_pdf=1',
       str(image_path), str(topdf_path), '-l', 'jpn', 'pdf']
print(' '.join(cmd))
subprocess.run(cmd, check=True)

# Overlay the text-only PDF on the original PDF to produce a PDF of minimal size.
# The basic qpdf command is:
#   qpdf --overlay to.pdf -- org.pdf out.pdf
# where to.pdf = text-only PDF, org.pdf = original PDF, out.pdf = overlaid PDF
out_path.parent.mkdir(parents=True, exist_ok=True)
if os.path.exists(out_path):
    print('remove ' + str(out_path))
    os.remove(out_path)
cmd = ['qpdf', '--overlay', str(topdf_path) + '.pdf', '--', str(pdf_path), str(out_path)]
print(' '.join(cmd))
subprocess.run(cmd, check=True)
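
The regular expression used in the script after OCR removes whitespace that Tesseract inserts between Japanese characters. The following standalone sketch applies the same character ranges to a sample string; the names JP and strip_jp_spaces are hypothetical and introduced only for this illustration.

```python
import re

# Hiragana, katakana, common kanji, and the long-vowel mark
# (the same ranges as in the script)
JP = "あ-んア-ン一-龥ー"
_JP_GAP = re.compile(rf"([{JP}])\s+(?=[{JP}])")


def strip_jp_spaces(text):
    """Remove whitespace that sits between two Japanese characters."""
    return _JP_GAP.sub(r"\1", text)


print(strip_jp_spaces("現 金 給与 ABC def"))
```

The lookahead leaves the second character unconsumed, so runs such as "現 金 給" collapse in one pass, while the spaces around Latin text are kept. Because \s also matches newlines, line breaks between two Japanese characters are removed as well.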
