OCR with Python

More than 3 years have passed since last update.

TL;DR

How to call tesseract from Python through pyocr.

Install an OCR tool

This article uses tesseract.
See the following page for how to install it in each environment.
https://github.com/tesseract-ocr/tesseract/wiki

For Windows

At the time of writing, the latest version was 4.0.0 alpha (the Windows Installer made with MinGW-w64).

In Choose Components, select Japanese under Additional language data.

After installation, add "C:\Program Files (x86)\Tesseract-OCR" to Path.
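
To confirm that the Path change took effect, one option is to ask Python where it finds the executable. This is a minimal sketch, not part of the original procedure; the helper name find_tesseract is made up for illustration, and it assumes the executable is named "tesseract".

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path to the executable, or None if it is not on PATH.

    find_tesseract is a hypothetical helper name used only in this sketch.
    """
    return shutil.which(command)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH")
else:
    print("tesseract found at: %s" % path)
```

If this reports that tesseract was not found, opening a new terminal after editing Path is worth trying, since already-open terminals keep the old value.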

Install pyocr

pip install pyocr

On Windows, installing Anaconda or a similar distribution beforehand makes this easier.
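
A quick way to see whether the install worked is to check that the modules used later in this article can be found. This is only a sketch; is_installed is a hypothetical helper name, and PIL is the module name provided by the Pillow package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "pyocr" and "PIL" are the two modules imported by the scripts in this article.
for name in ("pyocr", "PIL"):
    status = "installed" if is_installed(name) else "missing"
    print("%s: %s" % (name, status))
```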

Check that pyocr can call tesseract

Check with the code from the official pyocr page.
https://github.com/openpaperwork/pyocr

from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.

Output (example)
Will use tool 'Tesseract (sh)'
Available languages: eng, equ, jpn, osd
Will use lang 'eng'
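
In the example output, langs[0] is 'eng' even though 'jpn' is installed, so picking the first entry is not enough when Japanese is wanted. A small sketch of choosing a language by preference order follows; pick_language and its default preference order are assumptions made for illustration, not part of pyocr.

```python
def pick_language(available, preferred=("jpn", "eng")):
    """Return the first preferred language code that is available, else None.

    pick_language is a hypothetical helper; `available` is the list
    returned by tool.get_available_languages().
    """
    for code in preferred:
        if code in available:
            return code
    return None


# The language list from the example output above.
langs = ["eng", "equ", "jpn", "osd"]
print(pick_language(langs))  # prints: jpn
```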

Character recognition

Image to recognize (iroha.png)

from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.

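# Recognize Japanese text; "jpn" is passed explicitly instead of langs[0].
txt = tool.image_to_string(
    Image.open('iroha.png'),
    lang="jpn",
    # tesseract_layout is Tesseract's page segmentation mode;
    # 6 treats the image as a single uniform block of text.
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)
print(txt)
# txt is a Python string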

Output
い ろ は に ほ へ と ち り ぬ
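
The recognized string comes back with a space between each character. If that is not wanted, the spaces can be stripped afterwards. The following is a sketch under the assumption that only spaces sitting between Japanese characters should go; the helper name and the character ranges (hiragana, katakana, common kanji) are choices made for this example.

```python
import re

# Hiragana and katakana (U+3040-U+30FF) plus CJK unified ideographs.
# This range is deliberately narrow and is an assumption of this sketch.
_JP = r"[\u3040-\u30ff\u4e00-\u9fff]"
_SPACE_BETWEEN_JP = re.compile(r"(?<=%s)[ \u3000]+(?=%s)" % (_JP, _JP))


def remove_spaces_between_japanese(text):
    """Delete spaces that sit directly between two Japanese characters."""
    return _SPACE_BETWEEN_JP.sub("", text)


print(remove_spaces_between_japanese("い ろ は に ほ へ と ち り ぬ"))
# prints: いろはにほへとちりぬ
```

Spaces next to Latin text are left alone, so mixed strings such as "い ろ abc" become "いろ abc".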

Notes

Recognition does not work well on low-resolution text images. At the very least, the resolution of Slack emoji was too low.
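
One thing that may be worth trying for small images is enlarging them with Pillow before passing them to image_to_string. This was not tested in this article, so treat it as an unverified idea; the helper names and the default factor of 4 are assumptions.

```python
def scaled_size(width, height, factor):
    """Return the (width, height) after scaling by factor (must be >= 1)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return (int(round(width * factor)), int(round(height * factor)))


def upscale(image, factor=4):
    """Return an enlarged copy of a PIL image using Lanczos resampling.

    Hypothetical helper for this sketch; whether it improves recognition
    depends on the image and is not confirmed here.
    """
    # Imported here so scaled_size can be used without Pillow installed.
    from PIL import Image
    width, height = image.size
    return image.resize(scaled_size(width, height, factor), Image.LANCZOS)


print(scaled_size(16, 16, 4))  # prints: (64, 64)
```

The enlarged image would then be used in place of Image.open('iroha.png'), for example upscale(Image.open('iroha.png')).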

References

Installing the OCR tool "Tesseract OCR" and using it from Python http://73spica.tech/blog/tesseract_for_python/

it__ssei
Likes automated testing and leaving work on time. Specializes in physics (atmospheric dynamics, dynamical systems).
https://twitter.com/it__ssei