More than 3 years have passed since last update.


OCR in Python

TL;DR

How to call Tesseract from Python through pyocr.

Install an OCR tool

This article uses Tesseract.
See the following wiki for installation instructions for each platform.
https://github.com/tesseract-ocr/tesseract/wiki

For Windows

At the time of writing, the latest version is the 4.0.0 alpha Windows installer (made with MinGW-w64).

On the Choose Components screen, select Japanese under Additional language data.

After installation, add "C:\Program Files (x86)\Tesseract-OCR" to Path.
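Before moving on, it may help to confirm that the Path change took effect. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` is hypothetical and is not part of pyocr or Tesseract. It assumes the executable is named `tesseract` (on Windows, `shutil.which` also tries the extensions listed in PATHEXT, such as `.exe`).

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the executable if it is on PATH, else None.

    `find_tesseract` is a hypothetical helper written for this check.
    """
    return shutil.which(command)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH")
else:
    print("tesseract found at: %s" % path)
```

If the command is not found, opening a new terminal after editing Path is usually needed, since running shells keep the old value.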

Install pyocr

pip install pyocr

On Windows, installing Anaconda or a similar distribution beforehand makes this easier.

Check that pyocr can call Tesseract

Check with the sample code from the official pyocr page.
https://github.com/openpaperwork/pyocr

from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
Output (example)
Will use tool 'Tesseract (sh)'
Available languages: eng, equ, jpn, osd
Will use lang 'eng'
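Note that the sample picks `langs[0]`, which in the output above happens to be `eng` even though `jpn` is installed. One way to choose a language explicitly is sketched below. `pick_language` is a hypothetical helper (not a pyocr function), and the preference order `("jpn", "eng")` is only an assumption for this article's use case.

```python
def pick_language(available, preferred=("jpn", "eng")):
    """Return the first preferred language code that is available.

    Falls back to the first available code, or None if the list is empty.
    `pick_language` is a hypothetical helper, not part of pyocr.
    """
    for code in preferred:
        if code in available:
            return code
    return available[0] if available else None


# The language list from the example output above.
langs = ["eng", "equ", "jpn", "osd"]
print("Will use lang '%s'" % pick_language(langs))
```

In the real script, `langs` would come from `tool.get_available_languages()`.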

Character recognition

Image to recognize (iroha.png)

[Image: iroha.png]

from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.

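# Run OCR on the image. lang="jpn" selects the Japanese language data;
# tesseract_layout=6 is Tesseract's page segmentation mode 6
# (assume a single uniform block of text).
txt = tool.image_to_string(
    Image.open('iroha.png'),
    lang="jpn",
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)
print(txt)
# txt is a Python string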

Output
い ろ は に ほ へ と ち り ぬ
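The recognized text comes back with a space between each character. If that is unwanted, a small post-processing step can remove it. This is only a sketch: `strip_spaces` is a hypothetical helper, and it removes every ASCII and full-width space, so it would also join English words and is only suitable for text that is entirely Japanese.

```python
import re


def strip_spaces(text):
    """Remove ASCII and full-width spaces, keeping line breaks intact.

    `strip_spaces` is a hypothetical helper for post-processing OCR output.
    """
    return re.sub(r"[ \u3000]+", "", text)


# The recognized string from the output above.
print(strip_spaces("い ろ は に ほ へ と ち り ぬ"))
```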

Notes

Recognition does not work well on low-resolution images of text. At the very least, it failed at the resolution of Slack emoji.
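One thing that might be worth trying for small images is enlarging them with Pillow before passing them to pyocr. This was not tested in this article, so treat it as a sketch under assumptions: the helper names `scaled_size` and `upscale_for_ocr` are hypothetical, and the 64-pixel minimum height is an arbitrary placeholder, not a value recommended by Tesseract.

```python
import math


def scaled_size(width, height, min_height=64):
    """Return a (width, height) enlarged by a whole-number factor.

    The result is at least `min_height` pixels tall. Sizes that are
    already tall enough are returned unchanged. `min_height=64` is an
    arbitrary assumption.
    """
    if height <= 0 or width <= 0:
        raise ValueError("width and height must be positive")
    if height >= min_height:
        return (width, height)
    factor = math.ceil(min_height / height)
    return (width * factor, height * factor)


def upscale_for_ocr(image, min_height=64):
    """Resize a PIL image so that it is at least `min_height` pixels tall."""
    # Imported here so that scaled_size() can be used without Pillow installed.
    from PIL import Image

    new_size = scaled_size(image.size[0], image.size[1], min_height)
    return image.resize(new_size, Image.LANCZOS)


# A 16x16 image would be enlarged by a factor of 4.
print(scaled_size(16, 16))
```

The enlarged image returned by `upscale_for_ocr(Image.open('iroha.png'))` could then be passed to `tool.image_to_string` in place of the original. Whether this actually improves recognition depends on the image.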

References

Installing the OCR tool "Tesseract OCR" and using it from Python: http://73spica.tech/blog/tesseract_for_python/
