
Avoiding UnicodeDecodeError: '<encoding>' codec can't decode byte 0x81 when using tesseract from Python

Posted at 2017-10-02

Test environment

Python 2.7 (WinPython-32bit-2.7.10.3)
Spyder 3.0.0 (bundled)
pytesseract 0.1.7
Tesseract-OCR 3.02
Windows 7

Symptoms

Error

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 44: character maps to <undefined>

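'charmap' here is the default encoding that had been set.
Out of the box the default is ascii. After various tweaks I changed it to utf-8 and then to cp1252 (charmap), but the same error kept occurring.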
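As a small illustration of why none of those three encodings helps, the sketch below decodes a two-byte sample with each codec. The sample bytes 0x81 0x40 are an assumption chosen for illustration (in cp932 they form the full-width space); the article does not show the actual bytes that triggered the error.

```python
# -*- coding: utf-8 -*-
# Illustrative sample only: 0x81 is a lead byte in cp932 (Shift_JIS family).
SAMPLE = b'\x81\x40'


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return 'error: %s' % exc


# ascii, utf-8 and cp1252 all reject 0x81; cp932 accepts it.
for name in ('ascii', 'utf-8', 'cp1252', 'cp932'):
    print('%-6s -> %r' % (name, try_decode(SAMPLE, name)))
```

Only the cp932 line decodes successfully; the cp1252 line reports the same 'charmap' message quoted above.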

Code


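# -*- coding: utf-8 -*-

import sys
# reload(sys) is needed first: site.py removes setdefaultencoding at startup
reload(sys)
# Leftover from trial and error; without it the same error occurs with ascii
sys.setdefaultencoding('cp1252')

import pytesseract
# Raw string so that \t is not read as a tab; 'path' is a placeholder
pytesseract.pytesseract.tesseract_cmd = r'path\tesseract.exe'

from PIL import Image
img = Image.open('images/test.png')

txt = pytesseract.image_to_string(img)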

Workaround

Specify cp932 (the Windows variant of Shift_JIS) instead of utf8 or cp1252.


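# -*- coding: utf-8 -*-

import sys
# reload(sys) is needed first: site.py removes setdefaultencoding at startup
reload(sys)
sys.setdefaultencoding('cp932')

import pytesseract
# Raw string so that \t is not read as a tab; 'path' is a placeholder
pytesseract.pytesseract.tesseract_cmd = r'path\tesseract.exe'

from PIL import Image
img = Image.open('images/test.png')

txt = pytesseract.image_to_string(img)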

pytesseract itself appears to use subprocess internally, so calling tesseract directly would probably behave the same way.
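If you do call the executable directly, one option is to capture its output as bytes and decode it with an explicit encoding rather than relying on the process-wide default. The following is a minimal sketch under that assumption, not what pytesseract does internally. `run_and_decode` is a hypothetical helper, and a small Python child process stands in for tesseract so the example runs without it installed.

```python
import subprocess
import sys


def run_and_decode(cmd, encoding='cp932'):
    """Run cmd, capture stdout/stderr as bytes, and decode them explicitly."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    # errors='replace' keeps undecodable bytes from raising UnicodeDecodeError.
    return (proc.returncode,
            out.decode(encoding, 'replace'),
            err.decode(encoding, 'replace'))


# Stand-in child process that writes the cp932 bytes 0x82 0xA0 to stdout.
child = ("import sys; "
         "getattr(sys.stdout, 'buffer', sys.stdout).write(b'\\x82\\xa0')")
code, out, err = run_and_decode([sys.executable, '-c', child])
print(repr(out))
```

For tesseract the command list would take the form `[r'path\tesseract.exe', 'images/test.png', 'out']`, where `out` is the output file base name and `path` remains a placeholder.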

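References

How to deal with UnicodeDecodeError when running a Python script - Over&Out その後

Found by searching for
UnicodeDecodeError: 'ascii' codec can't decode
The error message changed after following it, so I decided to experiment with the encoding.

Japanese in Python - $Recycle.Bin

"The Command Prompt's character encoding is cp932 (almost the same as sjis), so normally output is converted to sjis when printed."

cp1252 had not worked, so this was my last glimmer of hope.

AttributeError: 'module' object has no attribute 'setdefaultencoding'
This is raised when sys.setdefaultencoding is called without reload(sys) first. After the reload, Spyder's standard output console appears to change. (The execution console and the output console became separate, so I switched from print debugging to checking return values in the Variable explorer.)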
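One possible explanation for the console change is that reload(sys) resets sys.stdin, sys.stdout and sys.stderr to their startup values, discarding the stream objects an IDE has installed. The sketch below assumes that is the cause and restores the streams after the reload. `set_default_encoding` is a hypothetical helper name, and the approach applies to Python 2 only.

```python
import sys


def set_default_encoding(name):
    """Set Python 2's default encoding while keeping the current std streams.

    Returns False on Python 3, where sys.setdefaultencoding does not exist.
    """
    if sys.version_info[0] >= 3:
        return False
    saved = sys.stdin, sys.stdout, sys.stderr
    reload(sys)  # Python 2 builtin; brings back sys.setdefaultencoding
    sys.setdefaultencoding(name)
    # reload(sys) resets the streams, so put back whatever the IDE installed.
    sys.stdin, sys.stdout, sys.stderr = saved
    return True


changed = set_default_encoding('cp932')
print('%s %s' % (changed, sys.getdefaultencoding()))
```

Changing the default encoding affects every implicit str/unicode conversion in the process, so decoding the specific byte strings explicitly is usually the less invasive choice.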
