LTS Group Advent Calendar 2024

I investigated the accuracy of image recognition

Last updated at 2024-12-15, posted at 2024-12-15

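Introduction

Nice to meet you! I'm @keisuke_0503.
I suddenly got curious about how far image recognition has come lately, so I looked into a few options and would like to share what I found.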

Purpose

The goal of this article is to compare several services related to image recognition and summarize how they differ.

What this article covers

The Python pytesseract library
AWS Textract
Azure Document Intelligence
Google Cloud Vision

What this article does not cover

I will skip basic explanations of Docker and similar tools.

Prerequisites

I use a Mac, Docker, and Python to try out image recognition libraries and external services.

Dockerfile

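FROM python:3
USER root

# tesseract-ocr-jpn adds the Japanese language data that lang='jpn' needs
RUN apt-get update && apt-get install -y vim less tesseract-ocr tesseract-ocr-jpn
RUN pip install --upgrade pip setuptools
# boto3 (AWS Textract) and google-cloud-vision are imported by the scripts below
RUN pip install pytesseract pillow boto3 azure-ai-formrecognizer azure-identity google-cloud-vision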

docker-compose.yml

version: '3'
services:
  python3:
    restart: always
    build: .
    container_name: 'python3'
    working_dir: '/root/'
    tty: true
    volumes:
      - ./test:/root/test

Directory structure

dir
┗test
    ┗images
    ┗Python files to run
    ┗Google credentials JSON
┗Dockerfile
┗docker-compose.yml

Images used

I read three image files: the opening of Alice's Adventures in Wonderland in English, the same opening in Japanese, and a photo of my personal training record.

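Expected results

Before running anything, here is my prediction for each library's recognition results.
In short: "Text recognition will not differ much between services, but each one will show its own character on the training record."

In a bit more detail, I expect the Python library to struggle with hard-to-read images such as the training record, while the external services handle them well and each show their own tendencies.

Now let's run the images through image recognition!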

Pattern 1: Image recognition with the Python pytesseract library

File used

pytesseract_image_recognition.py

from PIL import Image
import pytesseract

def extract_text_from_image(image_path):
    # Load the image
    img = Image.open(image_path)

    # Extract text from the image
    text = pytesseract.image_to_string(img, lang='eng')  # use 'jpn' for Japanese

    return text

# Path to the image file
image_path = 'test/images/eng_text.png'

# Extract the text
extracted_text = extract_text_from_image(image_path)

# Show the result
print("Extracted text:")
print(extracted_text)

Command

docker compose exec python3 python test/pytesseract_image_recognition.py

Results

The results for the three images were as follows.

Alice was beginning to get very tired of sitting by her sister on the bank, and of having
nothing to do: once or twice she had peeped into the book her sister was reading, but it
had no pictures or conversations in it, 'and what is the use of a book,' thought Alice
‘without pictures or conversation?’

So she was considering in her own mind (as well as she could, for the hot day made
her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be
worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit
with pink eyes ran close by her.

アリスは川辺でおねえさんのよこにすわって、なんにもすることがないのでとても退
屈 (たいくつ) しはじめていました。一、二回はおねえさんの読んでいる本をのぞいて
みたけれど、そこには絵も会話もないのです。「絵や会話のない本なんて、なんの役にも
たたないじゃないの」とアリスは思いました。

そこでアリスは、頭のなかで、ひなぎくのくさりをつくったら楽しいだろうけれど、起
きあがってひなぎくをつむのもめんどくさいし、どうしようかと考えていました (とい
っても、昼間で財いし、とってもねむくて頭もまわらなかったので、これもたいへんだ
ったのですが)。そこへいきなり、ピンクの目をした白うさぎが近くを走ってきたので
す。

[concept 2. PI

am” =
Display Menu -

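Pattern 1: Discussion

The text images were recognized with really high accuracy! The Japanese result was almost perfect: the only difference was that 「昼間で暑いし」 came out as 「昼間で財いし」. For English, everything was recognized correctly apart from line-break positions and quotation marks.

That said, as expected, accuracy drops sharply once the input is in a format that is hard for a system to read, such as the training record. I had guessed that I would at least get the rows of numbers, but it seems only the text outside the monitor's screen was detected.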
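
To put a number on "almost perfect", one option is the character error rate: the edit distance between the correct text and the OCR output, divided by the length of the correct text. The following is a minimal sketch, not something the original experiment ran. The function names are my own, and the sample strings are just the one line where pytesseract misread a kanji.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length (0.0 means identical)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# The line containing the misread kanji (correct text vs. pytesseract output)
reference = "っても、昼間で暑いし、とってもねむくて"
hypothesis = "っても、昼間で財いし、とってもねむくて"

print("edit distance:", levenshtein(reference, hypothesis))
print("character error rate:", character_error_rate(reference, hypothesis))
```

Running the same function over the full reference text and each service's output would give one comparable number per service, as long as whitespace and line breaks are normalized the same way for all of them.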

Pattern 2: Image recognition with AWS Textract

Setup

In IAM, create a user with the AmazonTextractFullAccess policy attached, and obtain its access key and secret access key.
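
Rather than pasting the keys into a script, they can also be supplied through the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, which boto3 reads on its own. The sketch below is one possible way to do that and is not part of the original setup. The helper names `missing_aws_variables` and `create_textract_client` are hypothetical, and how the variables reach the container (for example via docker-compose) is left open.

```python
import os

# Standard environment variable names that boto3 looks up by itself
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_variables(environ=os.environ):
    """Return the names of the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


def create_textract_client(region_name="us-east-1"):
    """Create a Textract client without hard-coding any keys in the source."""
    missing = missing_aws_variables()
    if missing:
        raise RuntimeError("Set these environment variables first: " + ", ".join(missing))
    import boto3  # imported here so the check above also runs where boto3 is absent
    # No keys are passed: boto3 resolves them from the environment
    return boto3.client("textract", region_name=region_name)


if __name__ == "__main__":
    print("missing variables:", missing_aws_variables())
```

The benefit is mainly that the script can be shared or committed without exposing the secret access key.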

File used

aws_textact.py

import boto3

def extract_text_from_image(file_path):
    # Create the Textract client
    textract = boto3.client(
        'textract',
        aws_access_key_id='Enter your access key',
        aws_secret_access_key='Enter your secret access key',
        region_name='us-east-1'
    )

    # Open the local image file
    with open(file_path, 'rb') as document:
        image_bytes = document.read()

    # Call the Textract API
    response = textract.detect_document_text(Document={'Bytes': image_bytes})

    # Collect the extracted text
    extracted_text = ''
    for item in response['Blocks']:
        if item['BlockType'] == 'LINE':  # take the text of each line
            extracted_text += item['Text'] + '\n'

    return extracted_text

# Path to the image file
image_path = 'test/images/eng_text.png'

# Extract the text
text = extract_text_from_image(image_path)

# Show the result
print("Extracted text:")
print(text)

Command

docker compose exec python3 python test/aws_textact.py

Results

The results for the three images were as follows.

Alice was beginning to get very tired of sitting by her sister on the bank, and of having
nothing to do: once or twice she had peeped into the book her sister was reading, but it
had no pictures or conversations in it, 'and what is the use of a book," thought Alice
'without pictures or conversation?"
So she was considering in her own mind (as well as she could, for the hot day made
her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be
worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit
with pink eyes ran close by her.

EF
(tELI<) -
(th
DT, ,
to

View Detail
5:00
May 11 2023
time meter /500m 5/11
5:00.0
1236 2:01.3 17
1:00.0
243 2:03.4 18
2:00.0
249 2:00.4 18
3:00.0
244 2:02.9 17
4:00.0
247 2:01.4 18
5:00.0
254 1:58.1 17

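Pattern 2: Discussion

This was a little surprising, or perhaps obvious in hindsight: the English text was recognized perfectly apart from a few quotation marks, just as with pytesseract, but the Japanese text did not work at all. It seems Textract does not support Japanese in the first place. I look forward to Japanese support being added, and will leave that for another time.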

Pattern 3: Image recognition with Azure Document Intelligence

Setup

Create a Document Intelligence resource, then get the endpoint and a key from the Keys and Endpoint page (there are two keys; either one works).

File used

azure_document_intelligence.py

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Azure endpoint and key
endpoint = 'Enter your endpoint'
credential = AzureKeyCredential('Enter your key')

# Create the Document Analysis client
document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=credential)

# Path to the image file
image_path = 'test/images/jpn_text.png'

def extract_text_from_document(image_path):
    with open(image_path, "rb") as image:
        poller = document_analysis_client.begin_analyze_document("prebuilt-read", image)
        result = poller.result()

    # Collect the extracted text
    extracted_text = ""
    for page in result.pages:
        for line in page.lines:
            extracted_text += line.content + "\n"

    return extracted_text

# Extract the text
text = extract_text_from_document(image_path)
print("Extracted text:")
print(text)

Command

docker compose exec python3 python test/azure_document_intelligence.py

Results

Alice was beginning to get very tired of sitting by her sister on the bank, and of having
nothing to do: once or twice she had peeped into the book her sister was reading, but it
had no pictures or conversations in it, 'and what is the use of a book,' thought Alice
'without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made
her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be
worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit
with pink eyes ran close by her.

アリスは川辺でおねえさんのよこにすわって、なんにもすることがないのでとても退
屈(たいくつ)しはじめていました。一、二回はおねえさんの読んでいる本をのぞいて
みたけれど、そこには絵も会話もないのです。「絵や会話のない本なんて、なんの役にも
たたないじゃないの」とアリスは思いました。
そこでアリスは、頭のなかで、ひなぎくのくさりをつくったら楽しいだろうけれど、起
きあがってひなぎくをつむのもめんどくさいし、どうしようかと考えていました(とい
っても、昼間で暑いし、とってもねむくて頭もまわらなかったので、これもたいへんだ
ったのですが)。そこへいきなり、ピンクの目をした白うさぎが近くを走ってきたので
す。

concept 2®
PI
View Detail
5:00
May 11 2023
time meter
/500m
5:00.0
1236 2:01.3 17
000
4
1:00.0
243 2:03.4
18
2:00.0
249
2:00.4
18
3:00.0
244 2:02.9
17
4:00.0
247 2:01.4
18
5:00.0
254 1:58.1
17
Display
Menu
Units

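Pattern 3: Discussion

This is quite good! Since AWS did not support Japanese, I wondered how Azure would do, and it recognized the Japanese text beautifully. It also got the kanji that pytesseract misread right, which I think makes it quite capable.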

Pattern 4: Image recognition with Google Cloud Vision

Setup

In the Google Cloud console, click "Add API" and add the Cloud Vision API.
From the API management screen, click "Credentials", add a service account, and create a key in JSON format.

File used

google_cloud_vision.py

from google.cloud import vision
import os

def extract_text_from_image(image_path):
    # Set the path to the service account key in an environment variable
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'test/google_application_credentials.json'

    # Create the Vision API client
    client = vision.ImageAnnotatorClient()

    # Read the image file
    with open(image_path, "rb") as image_file:
        content = image_file.read()
    image = vision.Image(content=content)

    # Run OCR
    response = client.text_detection(image=image)

    # Collect the extracted text
    texts = response.text_annotations
    if texts:
        print("Extracted text:")
        print(texts[0].description)  # the full text
        return texts[0].description
    else:
        print("No text was detected.")
        return ""

# Path to the image file
image_path = 'test/images/training_result.jpg'

# Extract the text
extracted_text = extract_text_from_image(image_path)

Command

docker compose exec python3 python test/google_cloud_vision.py

Results

Alice was beginning to get very tired of sitting by her sister on the bank, and of having
nothing to do: once or twice she had peeped into the book her sister was reading, but it
had no pictures or conversations in it, 'and what is the use of a book,' thought Alice
'without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made
her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be
worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit
with pink eyes ran close by her.

アリスは川辺でおねえさんのよこにすわって、 なんにもすることがないのでとても退
屈(たいくつ)しはじめていました。 一、二回はおねえさんの読んでいる本をのぞいて
みたけれど、そこには絵も会話もないのです。 「絵や会話のない本なんて、なんの役にも
たたないじゃないの」 とアリスは思いました。
そこでアリスは、頭のなかで、 ひなぎくのくさりをつくったら楽しいだろうけれど、起
きあがってひなぎくをつむのもめんどくさいし、 どうしようかと考えていました(とい
っても、昼間で暑いし、とってもねむくて頭もまわらなかったので、これもたいへんだ
ったのですが)。 そこへいきなり、ピンクの目をした白うさぎが近くを走ってきたので
す。

Oconcept 2.
View Detail
5:00
May 11 2023
time
S
meter 500m m
5:00.0 1236 2:01.3 17
Pl
1:00.0 243 2:03.4 18
2:00.0
249 2:00.4 18
3:00.0
244 2:02.9 17
4:00.0
247 2:01.4 18
5:00.0
254 1:58.1 17
Units
Display
Menu

Pattern 4: Discussion

This one is also quite good! It handled English, of course, and Japanese properly as well.

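Overall discussion

Summarizing the results so far in a table gives the following.

Service	English text	Japanese text	Result screen
pytesseract	◎	○	×
AWS Textract	◎	×	○
Azure Document Intelligence	◎	◎	○
Google Cloud Vision	◎	◎	○

As I predicted at the start, the biggest differences showed up in how each service behaves when handed an image that is hard for a system to read, such as the result screen.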
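
One rough way to back up the "Result screen" column with a number is to count how many of the expected values appear in each output. The sketch below is only an illustration under assumptions: I do not have the photo itself, so `expected_tokens` uses the distance and pace values on which the AWS, Azure, and Google outputs all agree, and the function name `token_recall` is my own.

```python
def token_recall(expected_tokens, ocr_text):
    """Fraction of expected tokens found among the whitespace-separated tokens of ocr_text."""
    expected = set(expected_tokens)
    if not expected:
        raise ValueError("expected_tokens must not be empty")
    found = expected & set(ocr_text.split())
    return len(found) / len(expected)


# Assumed ground truth: values that the three cloud services all returned
expected_tokens = [
    "1236", "2:01.3", "243", "2:03.4", "249", "2:00.4",
    "244", "2:02.9", "247", "2:01.4", "254", "1:58.1",
]

# Outputs copied from the results sections of this article
pytesseract_output = "[concept 2. PI\n\nam” =\nDisplay Menu -"
google_output = """Oconcept 2.
View Detail
5:00
May 11 2023
time
S
meter 500m m
5:00.0 1236 2:01.3 17
Pl
1:00.0 243 2:03.4 18
2:00.0
249 2:00.4 18
3:00.0
244 2:02.9 17
4:00.0
247 2:01.4 18
5:00.0
254 1:58.1 17
Units
Display
Menu"""

print("pytesseract:", token_recall(expected_tokens, pytesseract_output))
print("Google Cloud Vision:", token_recall(expected_tokens, google_output))
```

This only checks whether each value shows up somewhere. It says nothing about layout or extra text, which is where the services differ most.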

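If you are transcribing prose from an image, it may not matter much which one you use, but differences are likely to appear when you feed in text that is not prose.

Personally, I feel each service has its own quirks in how it outputs the result screen.

AWS produces quite clean output. You can see that the line breaks are consistent. If this is highly reproducible, it may be easy to build into a system.
Another characteristic may be that it does not pick up extraneous text. Azure and Google also read the text outside the monitor's screen, but only AWS picked up just what was inside it.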
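
As an illustration of why regular line breaks make the AWS output easy to build on, here is a minimal sketch that turns it into structured rows. It assumes the pattern seen in this one result always holds: a time line such as `5:00.0`, followed by a line with three values. The column names are my own guesses from the header row. In particular `rate` is a hypothetical name for the last column, whose header the OCR returned as `5/11`.

```python
import re

TIME_LINE = re.compile(r"^\d+:\d{2}\.\d$")                  # e.g. 5:00.0
VALUES_LINE = re.compile(r"^(\d+) (\d+:\d{2}\.\d) (\d+)$")  # e.g. 1236 2:01.3 17


def parse_result_rows(ocr_text):
    """Pair each time line with the values line that directly follows it."""
    rows = []
    pending_time = None
    for line in ocr_text.splitlines():
        line = line.strip()
        if TIME_LINE.match(line):
            pending_time = line
            continue
        values = VALUES_LINE.match(line)
        if values and pending_time is not None:
            rows.append({
                "time": pending_time,
                "meter": int(values.group(1)),
                "pace_per_500m": values.group(2),
                "rate": int(values.group(3)),  # hypothetical column name
            })
        pending_time = None  # a values line must come right after its time line
    return rows


# AWS Textract output for the training record, copied from the results above
aws_output = """View Detail
5:00
May 11 2023
time meter /500m 5/11
5:00.0
1236 2:01.3 17
1:00.0
243 2:03.4 18
2:00.0
249 2:00.4 18
3:00.0
244 2:02.9 17
4:00.0
247 2:01.4 18
5:00.0
254 1:58.1 17"""

for row in parse_result_rows(aws_output):
    print(row)
```

Whether this holds up depends on the reproducibility mentioned above. If Textract splits the lines differently on another photo, the pairing logic would need to change.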

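My impression of Azure is that it reads quite accurately. Of the services tested here, only Azure read the ® correctly.
It also reads steadily from top-left to bottom-right, so it was easy to tell which part of the image each piece of text came from.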
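
The Azure output splits some rows across several lines (for example `243 2:03.4` and `18`). If position information is available for each line, fragments can be regrouped into rows by their vertical position. The following is a sketch under assumptions. The coordinates are invented for illustration. I believe the SDK exposes a position per line (as `line.polygon`), but treat that attribute name as something to verify against your SDK version. `group_into_rows` is a hypothetical helper.

```python
def group_into_rows(fragments, y_tolerance):
    """Group (text, x, y) fragments whose y values are close, then order each row by x."""
    rows = []
    for text, x, y in sorted(fragments, key=lambda fragment: fragment[2]):
        if rows and abs(y - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["items"].append((x, text))  # same row as the previous fragment
        else:
            rows.append({"y": y, "items": [(x, text)]})  # start a new row
    return [" ".join(text for _, text in sorted(row["items"])) for row in rows]


# Invented coordinates: (text, x, y) of each fragment's top-left corner
fragments = [
    ("1:00.0", 10, 100),
    ("243 2:03.4", 80, 101),
    ("18", 200, 99),
    ("2:00.0", 10, 130),
    ("249", 80, 131),
    ("2:00.4", 130, 130),
    ("18", 200, 129),
]

for row in group_into_rows(fragments, y_tolerance=5):
    print(row)
```

The tolerance has to be tuned to the image resolution, and a tilted photo would need the rotation corrected first.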

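Google also seems to capture shapes faithfully. It could not read the ®, but the "." it output was probably there because of the ®, and I found it interesting that it read the circle to the left of concept2 as "O". Google was the only one that tried hard to convert symbols into characters as well.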

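Summary

How did you find it?

I looked into this out of personal curiosity, and my impression is that each service has more of its own quirks than I expected.

Of course, I cannot know whether those quirks really hold up on a large volume of images without trying it, but I got a rough understanding, explored some new technology, and above all satisfied my curiosity, so I am content for now.

Thank you for reading to the end!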
