[Python] Getting Started with pytesseract: The Shortest Route to Extracting Text from Images

Last updated at 2026-01-16, posted at 2026-01-16

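Introduction

OCR (Optical Character Recognition), the extraction of text from images, is a technology that quietly comes up a lot in real work, for example in:
・form and document processing
・automated testing
・security verification

This article uses pytesseract and aims to take you not just "to the point where it runs" but "to the point where you can use it in real work without getting stuck".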

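What is pytesseract?

pytesseract is not an OCR engine.

OCR engine: Tesseract (C++)
pytesseract: a wrapper that calls Tesseract from Python

In other words:

image → pytesseract → Tesseract → string

pytesseract itself plays a subprocess-like role: it launches the tesseract executable and hands its output back to Python.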
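
To make that subprocess-like role concrete, here is a minimal sketch of a wrapper that calls the tesseract command-line tool directly. It is a simplification for illustration, not pytesseract's actual implementation (which also deals with temporary files, image conversion, and error reporting). The names build_tesseract_command and run_tesseract are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argv list for a plain-text OCR run.

    Passing "stdout" as the output base makes tesseract print the
    recognized text instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_tesseract(image_path, lang="eng"):
    """Run the tesseract binary and return its text output (hypothetical helper)."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

With Tesseract installed, run_tesseract("sample.png") would return the recognized text as a string, which is roughly what pytesseract.image_to_string does for you with more safeguards.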

Installation (the first hurdle)

1. Install Tesseract itself (required)

macOS

brew install tesseract

Ubuntu

sudo apt install tesseract-ocr

2. Python libraries

pip install pytesseract pillow

Without Pillow you cannot load images, so you are stuck before OCR even starts.

Minimal example

from PIL import Image
import pytesseract

# Load the image and run OCR with the English language model
img = Image.open("sample.png")
text = pytesseract.image_to_string(img, lang="eng")

print(text)

If text is printed, it worked.
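
The string returned by image_to_string usually needs a little cleanup before it can be compared or stored: it tends to contain trailing whitespace, blank lines, and, depending on the Tesseract version, a trailing form-feed character. A minimal sketch of such a cleanup step follows. clean_ocr_text is a hypothetical helper, not part of pytesseract.

```python
def clean_ocr_text(raw):
    """Normalize raw OCR output into a list of non-empty lines."""
    # Drop the form-feed page separator that some Tesseract versions append
    raw = raw.replace("\f", "")
    # Strip each line and discard the empty ones
    return [line.strip() for line in raw.splitlines() if line.strip()]
```

Working with a list of cleaned lines makes later steps, such as asserting on expected text in a test, less sensitive to layout noise.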

Windows-specific pitfall

On Windows you often need to specify the path to Tesseract explicitly.

pytesseract.pytesseract.tesseract_cmd = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe"
)

If you do not, and tesseract.exe is not on PATH, you get:

TesseractNotFoundError

pytesseract is just the guide, so show it the way.
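
Hard-coding the path works on one machine but breaks on the next. One possible approach, sketched below, is to look on PATH first and fall back to the install location shown above only if that file exists. resolve_tesseract_cmd and DEFAULT_WINDOWS_PATH are hypothetical names, and the fallback path is an assumption about where Tesseract was installed.

```python
import os
import shutil

# Assumed default install location; adjust if Tesseract lives elsewhere
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(name="tesseract", fallback=DEFAULT_WINDOWS_PATH):
    """Return a usable tesseract path, or None if none can be found."""
    on_path = shutil.which(name)  # searches the directories listed in PATH
    if on_path:
        return on_path
    if os.path.isfile(fallback):
        return fallback
    return None
```

If the result is not None, assign it to pytesseract.pytesseract.tesseract_cmd before the first OCR call. If it is None, failing early with a clear message is usually friendlier than waiting for TesseractNotFoundError.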

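Japanese

Check installed language packs

tesseract --list-langs

macOS (all languages at once)

brew install tesseract-lang

Usage example

text = pytesseract.image_to_string(img, lang="jpn+chi_sim")

Joining language codes with + tells Tesseract to use several languages at once (here Japanese and Simplified Chinese).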
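
A missing language pack only surfaces as an error at OCR time, so it can help to check the requested codes up front. The sketch below assumes you already have the list of installed codes, for example parsed from tesseract --list-langs or obtained from pytesseract.get_languages(). missing_languages is a hypothetical helper.

```python
def missing_languages(requested, installed):
    """Return the codes in a "jpn+chi_sim"-style string that are not installed."""
    available = set(installed)
    return [code for code in requested.split("+") if code not in available]
```

An empty result means every requested language is available. Anything else names the packs that still need to be installed.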

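The biggest factor in OCR accuracy: image preprocessing

As a rule of thumb, roughly 80% of OCR success is decided by preprocessing.

Standard preprocessing pattern

This pattern needs OpenCV in addition to the libraries above (pip install opencv-python).

import cv2
import pytesseract
from PIL import Image

img = cv2.imread("sample.png")  # OpenCV loads images in BGR order
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Pixels brighter than 150 become white (255), the rest black (0)
_, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

pil_img = Image.fromarray(thresh)
text = pytesseract.image_to_string(pil_img)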

Rule of thumb

Original image → ❌ (poor)
Grayscale → △ (fair)
Binarized → ◎ (best)

Most of the time it is not that "OCR is weak"; the image simply is not OCR-friendly.
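
The fixed threshold of 150 in the pattern above works only when lighting and contrast are predictable. A common alternative is Otsu's method, which derives the threshold from the image histogram. OpenCV offers it through cv2.threshold with the cv2.THRESH_OTSU flag. The pure-Python sketch below is only meant to show what that computation does, assuming 8-bit grayscale values passed as a flat list. otsu_threshold is a hypothetical name.

```python
def otsu_threshold(pixels):
    """Pick the threshold that best separates dark and bright pixels.

    `pixels` is a flat iterable of integers in the range 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]          # pixels at or below t (background)
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # pixels above t (foreground)
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor)
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t
```

In practice you would let OpenCV do this: calling cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU) ignores the 0 and returns the computed threshold together with the binarized image.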

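OCR with coordinates

Get text plus position information

data = pytesseract.image_to_data(
    img, output_type=pytesseract.Output.DICT
)

# Each index i describes one detected element; skip empty entries
for i, word in enumerate(data["text"]):
    if word.strip():
        x = data["left"][i]
        y = data["top"][i]
        w = data["width"][i]
        h = data["height"][i]
        print(word, (x, y, w, h))  # bounding box in pixels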
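
The dictionary returned by image_to_data also carries a conf entry per element, which is useful for discarding noise. The sketch below turns the column-oriented dict into a list of word records and drops low-confidence entries. extract_words is a hypothetical helper, and the cutoff of 60 is an arbitrary assumption to tune for your images.

```python
def extract_words(data, min_conf=60.0):
    """Turn image_to_data's column-oriented dict into a list of word records."""
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may arrive as int, float, or str
        if not text.strip() or conf < min_conf:
            continue  # skip structural rows (conf -1) and low-confidence words
        words.append({
            "text": text,
            "conf": conf,
            "box": (data["left"][i], data["top"][i],
                    data["width"][i], data["height"][i]),
        })
    return words
```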

Uses

OCR + Selenium automation
Screen diff detection
Text verification in UI tests
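
For the OCR + Selenium case, the usual step after OCR is to locate a label and compute where to click. A minimal sketch, assuming the words have already been reduced to (text, left, top, width, height) tuples from image_to_data; find_click_point is a hypothetical helper, and the actual click through Selenium is left out.

```python
def find_click_point(words, target):
    """Return the center (x, y) of the first word equal to target, else None.

    `words` is a list of (text, left, top, width, height) tuples.
    """
    for text, left, top, width, height in words:
        if text.strip().lower() == target.lower():
            return (left + width // 2, top + height // 2)
    return None
```

The coordinates are in screenshot pixels, which may not match the browser's CSS pixels on high-DPI displays, so a scale factor may be needed before clicking.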

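APIs beyond image_to_string

API	Purpose
image_to_string	Plain OCR (text only)
image_to_data	Words with coordinates and confidence
image_to_boxes	Per-character bounding boxes
image_to_pdf_or_hocr	Searchable PDF or hOCR output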
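
One detail that trips people up with image_to_boxes is its coordinate system: it returns text in Tesseract's box format, one character per line, with y measured upward from the bottom of the image, unlike image_to_data. The sketch below converts that text into top-left-origin boxes. parse_boxes is a hypothetical helper, and image_height must be supplied by the caller.

```python
def parse_boxes(box_text, image_height):
    """Parse image_to_boxes output into top-left-origin boxes.

    Each input line looks like "<char> <left> <bottom> <right> <top> <page>",
    with y measured upward from the bottom edge of the image.
    """
    boxes = []
    for line in box_text.splitlines():
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # ignore blank or malformed lines
        char = parts[0]
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append({
            "char": char,
            "left": left,
            "top": image_height - top,  # flip the y axis
            "width": right - left,
            "height": top - bottom,
        })
    return boxes
```

After the conversion the boxes can be drawn or compared with the same conventions as Pillow and OpenCV, where the origin is the top-left corner.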

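Strengths and weaknesses of pytesseract

Strengths

Scanned documents
UI text extraction
English and printed Japanese
Automation and verification work

Weaknesses

Handwriting
Distorted CAPTCHAs
Text over photographic backgrounds

For security purposes:
"a CAPTCHA that OCR can break is a weak CAPTCHA to begin with"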

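Comparison with other OCR libraries

Library	Characteristics
pytesseract	Lightweight, runs locally
PaddleOCR	Particularly strong on Chinese text
EasyOCR	Deep-learning based
Google Vision	High accuracy, paid

In a sentence:

pytesseract is a practical workhorse; it does not do AI magic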

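Practical and security use cases

Selenium + OCR automated testing
Text extraction from PDF documents
UI change detection
Testing weak CAPTCHAs
Assisting automated operations for red teams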

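Summary

pytesseract is the command post for OCR
The actual engine is Tesseract
Preprocessing accounts for roughly 80% of success
OCR with coordinates is the real prize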
