Installing Tesseract on Ubuntu and Using It from Python

Last updated at 2019-05-14. Posted at 2019-05-08.

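I installed Tesseract on Ubuntu and tried using it through a Python wrapper.

Environment

The only real prerequisites are curl and pyenv. For installing and configuring pyenv, see the article "UbuntuにpyenvとvenvでPython開発環境構築" (Building a Python development environment on Ubuntu with pyenv and venv).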

Item	Version	Notes
OS	Ubuntu 18.04.1 LTS	Running in a virtual machine
Tesseract	4.1.0	Latest as of 2019-04-01
pyenv	1.2.9	Used because I sometimes need multiple Python environments
Python	3.7.2	Python 3.7.2 under pyenv; packages are managed with venv
pytesseract	0.2.6	Python wrapper for Tesseract

Installation steps

1. Installing Tesseract

I followed the official Installation instructions.

1.1. Adding the PPA

Add the PPA (Personal Package Archive).

sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update

1.2. Installing Tesseract

Install Tesseract itself.

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

Check the version. The installed build reports 4.1.0 (more precisely a 4.1.0-rc1 snapshot, as the output shows).

$ tesseract -v
tesseract 4.1.0-rc1-184-g497d
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE
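The version check above can also be done from a script, for example to fail early when a machine has an older Tesseract. The following is a minimal sketch, not part of the original setup: the helper names `parse_tesseract_version` and `installed_tesseract_version` are hypothetical, and it assumes the first line of `tesseract -v` has the form shown above.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(output):
    """Return (major, minor, patch) parsed from the first line of `tesseract -v` output."""
    lines = output.strip().splitlines()
    if not lines:
        raise ValueError("empty version output")
    # Expected form (assumption): "tesseract 4.1.0-rc1-184-g497d"
    match = re.match(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", lines[0])
    if match is None:
        raise ValueError("unrecognized version line: {!r}".format(lines[0]))
    return tuple(int(part) for part in match.groups())


def installed_tesseract_version():
    """Run `tesseract -v`; return None when the binary is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    # Some builds print the version to stderr, so both streams are merged.
    result = subprocess.run(
        ["tesseract", "-v"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_tesseract_version(result.stdout)


if __name__ == "__main__":
    print(installed_tesseract_version())
```

Comparing the returned tuple with something like `(4, 0, 0)` is enough to check for the 4.x LSTM engine.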

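1.3. Installing the trained models

Install the trained models. Packages with the "vert" suffix are for vertical text. The "script" packages are not handwriting models: they are trained per writing system rather than per language (Jpan is the Japanese script), and they appear as Japanese and Japanese_vert in the language list.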

sudo apt install tesseract-ocr-jpn  tesseract-ocr-jpn-vert
sudo apt install tesseract-ocr-script-jpan tesseract-ocr-script-jpan-vert

Confirm that the models were installed.

$ tesseract --list-langs
List of available languages (6):
Japanese
Japanese_vert
eng
jpn
jpn_vert
osd
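If the OCR step is going to run unattended, it may help to verify the models before calling Tesseract. The sketch below parses the `--list-langs` output shown above; the function names are hypothetical, and it assumes the header line starts with "List of available languages".

```python
def parse_available_langs(output):
    """Return the set of model names listed by `tesseract --list-langs`."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header (assumed to start with this phrase).
        if not line or line.startswith("List of available languages"):
            continue
        langs.add(line)
    return langs


def missing_langs(output, required):
    """Return the required model names that are not installed, sorted."""
    return sorted(set(required) - parse_available_langs(output))


# Output copied from the listing above.
SAMPLE = """List of available languages (6):
Japanese
Japanese_vert
eng
jpn
jpn_vert
osd
"""

if __name__ == "__main__":
    print(missing_langs(SAMPLE, ["jpn", "jpn_vert"]))
```

In practice the `output` argument would come from running `tesseract --list-langs` with `subprocess.run`, merging stdout and stderr because the stream used can differ between versions.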

1.4. Running Tesseract

All that is left is to run it. The following example reads the file "order3.pdf0.png" and writes the Japanese OCR result to the text file "order3_out.txt".

tesseract order3.pdf0.png order3_out -l jpn

The following example writes the result as a TSV file ("order3_out.tsv") instead.

tesseract order3.pdf0.png order3_out -l jpn tsv
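The TSV output has one row per page, block, paragraph, line and word, with the columns level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf and text. As a rough sketch of how it could be consumed, the hypothetical helper `read_words` below keeps only rows that carry text and meet a confidence threshold. The sample rows are hand-written for illustration and are not real output from order3.pdf0.png.

```python
import csv
import io


def read_words(tsv_text, min_conf=0.0):
    """Return (text, conf, left, top, width, height) tuples for recognized words."""
    # QUOTE_NONE keeps quote characters inside recognized text from confusing the parser.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Structural rows (page, block, line) have conf -1 and no text.
        if text and conf >= min_conf:
            words.append((text, conf, int(row["left"]), int(row["top"]),
                          int(row["width"]), int(row["height"])))
    return words


# Illustrative, hand-written sample in the TSV layout (not real OCR output).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t20\t30\t40\t96\t注文\n"
    "5\t1\t1\t1\t1\t2\t50\t20\t30\t40\t35\t書\n"
)

if __name__ == "__main__":
    for word in read_words(SAMPLE_TSV, min_conf=60):
        print(word)
```

To use it on the file produced by the command above, read "order3_out.tsv" with `open(path, encoding="utf-8", newline="")` and pass the contents to `read_words`.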

Tesseract does not accept PDF as input, so a PDF must first be converted to an image format.
See "Command Line Usage" for details on running from the command line.
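The article does not say which tool was used for the PDF-to-image conversion. One possible approach, shown here only as a sketch, is to call `pdftoppm` from poppler-utils, which renders each page to a PNG and appends a page-number suffix to the output prefix. The function names and the 300 dpi default are assumptions made for this example.

```python
import shutil
import subprocess


def build_pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command line that renders each PDF page to a PNG."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, out_prefix]


def pdf_to_png(pdf_path, out_prefix, dpi=300):
    """Render the pages of a PDF to PNG files named after out_prefix."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError(
            "pdftoppm not found; on Ubuntu it is provided by the poppler-utils package"
        )
    subprocess.run(build_pdftoppm_command(pdf_path, out_prefix, dpi), check=True)


if __name__ == "__main__":
    # Only prints the command; call pdf_to_png() to actually run it.
    print(" ".join(build_pdftoppm_command("order3.pdf", "order3")))
```

The resulting PNG files can then be passed to `tesseract` exactly as in the commands above.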

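Tesseract reference links

ImproveQuality: tips for improving accuracy. I was surprised by how much preprocessing Tesseract already does automatically.
Manual Page: details of the command-line options.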

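2. Installing the Python wrapper

My Python environment uses pyenv to manage Python versions and venv to manage packages. According to the somewhat dated blog entry "Tesseractの各言語のラッパーいろいろ（随時更新）" (Tesseract wrappers for various languages, updated as needed), there appear to be three Python wrappers. I chose "pytesseract", which had the most information in Google searches and on Stack Overflow.
Note: for some reason "pyocr" was more common on Qiita, even though it did not even have a tag on Stack Overflow. I wonder why.
Note: I realized afterwards that I should have started from the links on the official site.

2.1. Activating the venv virtual environment

Activate the venv virtual environment.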

source <path>/bin/activate

2.2. Installing pytesseract

Install it into the virtual environment with pip. Pillow is installed along with it as a dependency.

pip install pytesseract

2.3. Running pytesseract to verify

Run pytesseract to confirm that it works.

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

# Image file containing Japanese text
FILENAME = './pdf/order3.pdf0.png'

# Runs with the default language (English), so the result is meaningless for Japanese text
print(pytesseract.image_to_string(Image.open(FILENAME)))

# Plain text output in Japanese
print(pytesseract.image_to_string(Image.open(FILENAME), lang='jpn'))

# Box output (one character per line, with coordinates)
print(pytesseract.image_to_boxes(Image.open(FILENAME), lang='jpn'))

# TSV output (probably the most detailed)
print(pytesseract.image_to_data(Image.open(FILENAME), lang='jpn'))

# OSD (Orientation and script detection)
print(pytesseract.image_to_osd(Image.open(FILENAME), lang='jpn'))

# hOCR output (returned as bytes)
print(pytesseract.image_to_pdf_or_hocr(FILENAME, extension='hocr'))
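The TSV data becomes easier to work with when pytesseract returns it as a dictionary of column lists. The sketch below groups recognized words back into text lines using the block, paragraph and line numbers. `lines_from_data` is a hypothetical helper, the sample dictionary is hand-written for illustration, and it assumes the installed pytesseract supports `output_type=pytesseract.Output.DICT`.

```python
from collections import OrderedDict


def lines_from_data(data, min_conf=0.0, sep=""):
    """Group words from an image_to_data dictionary into text lines.

    `sep` joins the words of one line; an empty string suits Japanese,
    a space suits languages that separate words with spaces.
    """
    lines = OrderedDict()
    for i, text in enumerate(data["text"]):
        text = str(text).strip()
        # Skip structural rows (no text) and low-confidence words.
        if not text or float(data["conf"][i]) < min_conf:
            continue
        key = (data["page_num"][i], data["block_num"][i],
               data["par_num"][i], data["line_num"][i])
        lines.setdefault(key, []).append(text)
    return [sep.join(words) for words in lines.values()]


# Hand-written stand-in for the dictionary that would come from:
#   pytesseract.image_to_data(Image.open(FILENAME), lang="jpn",
#                             output_type=pytesseract.Output.DICT)
SAMPLE_DATA = {
    "text": ["", "注文", "書", "合計"],
    "conf": ["-1", 95, 90, 20],
    "page_num": [1, 1, 1, 1],
    "block_num": [0, 1, 1, 1],
    "par_num": [0, 1, 1, 1],
    "line_num": [0, 1, 1, 2],
}

if __name__ == "__main__":
    for line in lines_from_data(SAMPLE_DATA, min_conf=50):
        print(line)
```

Raising `min_conf` is a simple way to drop noise such as ruled lines misread as characters, at the cost of losing some genuine text.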

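Bonus

Based on the code in "サムネイル画像に対するテキスト認識の性能比較について" (A performance comparison of text recognition on thumbnail images), this code draws a bounding box around each detected character and saves the annotated image to a file.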

import cv2
import pytesseract

img = cv2.imread("./pdf/order3.pdf0.png")
h, w, _ = img.shape

# Each line of the box output is "<char> <left> <bottom> <right> <top> <page>"
boxes = pytesseract.image_to_boxes(img, lang="jpn")
boxes = [list(map(int, b.split(" ")[1:-1])) for b in boxes.splitlines() if b.strip()]

# Box coordinates have their origin at the bottom-left, so flip the y axis for OpenCV
for b in boxes:
    img = cv2.rectangle(img, (b[0], h - b[1]), (b[2], h - b[3]), (0, 255, 0), 2)

def calc_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

# Fraction of the image area covered by character boxes
print(sum(calc_area(box) for box in boxes) / (h * w))

# Show the image in a pop-up window
cv2.imshow('dst', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

# Save to a file
cv2.imwrite("./pdf/order3.pdf0_out.png", img)
