Getting Started with OCR Using Document AI

Last updated at 2025-01-14, posted at 2025-01-14

Introduction

Document AI is one of Google Cloud's AI products.

It can extract, classify, and split elements from both structured and unstructured documents.

In other words, it is Google's OCR offering.

In this article, I'll use Document AI to extract text from an image.

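Creating a processor

In Document AI, the AI that performs OCR is managed in units called processors.

We send the image we want to extract text from to a processor.

Let's create one.

1. Open the processor gallery

Open the Document AI page, choose "My processors" from the left menu, then select "Processor gallery" at the top.

2. Create a Document OCR processor

Under "General", select "Create processor" on "Document OCR".

3. Create the processor

Enter a processor name and select the Create button.

4. Processor created

The processor has now been created.

Make a note of the processor's "ID" and "Region" here. You will need both later when calling the processor from code.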
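The ID and region noted here end up in two strings used by the code later in this article: the regional API endpoint and the processor's full resource name. The following is a minimal sketch of how they are assembled. The helper names `documentai_endpoint` and `processor_name` are hypothetical, and the values passed in are placeholders, not real identifiers.

```python
def documentai_endpoint(location: str) -> str:
    """Regional API endpoint for the processor's region, e.g. "us"."""
    return f"{location}-documentai.googleapis.com"


def processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Full resource name that identifies one processor."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


# Placeholder values for illustration only; substitute your own.
print(documentai_endpoint("us"))
print(processor_name("my-project", "us", "abc123"))
```

Keeping these as small functions makes it harder to mix up the region in the endpoint with the region in the resource name, which must be the same.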

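Running a sample image

1. Prepare a sample image

This time, I'll try a screenshot of Google search results.

Save the image below somewhere on your local machine.

2. Run the sample

On the details screen of the processor you just created, select "Upload test document" and choose the sample image.

3. View the result

If each group of text and the recognized strings are displayed as shown below, it worked.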

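Calling from a program

Next, let's reproduce the same thing in Python.

The code uses Application Default Credentials (ADC), so switch to the target project with the gcloud command beforehand.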

import google.auth
import google.auth.transport.requests  # needed for the Request() used below
from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# Settings (replace the placeholder strings with your own values)
location = "us"  # region of the processor noted earlier
processor_id = "YOUR_PROCESSOR_ID"  # processor ID noted earlier
file_path = "PATH_TO_SAMPLE_IMAGE"  # path of the sample image saved locally
mime = "YOUR_MIME_TYPE"  # MIME type of the file

# Create credentials from ADC; pid is the project selected with gcloud
credential, pid = google.auth.default()
auth_req = google.auth.transport.requests.Request()
credential.refresh(auth_req)

# Create the client against the regional endpoint
opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com", quota_project_id=pid)
client = documentai.DocumentProcessorServiceClient(client_options=opts, credentials=credential)

# OCR request
with open(file_path, "rb") as image:
    buff = image.read()

raw_document = documentai.RawDocument(content=buff, mime_type=mime)
name = f"projects/{pid}/locations/{location}/processors/{processor_id}"
request = documentai.ProcessRequest(name=name, raw_document=raw_document)
response = client.process_document(request=request)

# OCR response
text = response.document.text
print(text)

The extracted text is stored in response.document.text.
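The script above also needs the MIME type of the input file. As a rough sketch, it could be derived from the file extension with Python's standard `mimetypes` module instead of being typed by hand. The helper name `guess_mime` is hypothetical, and this only inspects the file name: it does not check the file contents or whether Document AI accepts that type.

```python
import mimetypes


def guess_mime(file_path: str) -> str:
    """Guess a MIME type from the file extension; raise if it is unknown."""
    mime, _encoding = mimetypes.guess_type(file_path)
    if mime is None:
        raise ValueError(f"Cannot guess MIME type for {file_path!r}")
    return mime


print(guess_mime("sample.png"))
```

Raising on an unknown extension surfaces the problem locally, before any request is sent.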

Next, let's also get the per-block information, as in the sample run.

Add the following code after the response handling.

document = response.document
for page in document.pages:
    for block in page.blocks:
        print("---")
        # Each segment gives the start/end offsets of this block within document.text
        for segment in block.layout.text_anchor.text_segments:
            print(document.text[segment.start_index:segment.end_index])

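As shown, the response tells you which range of the full text string each block covers, so you slice the string yourself.

Besides blocks, page also holds information for other structural units such as lines and paragraphs, so it is best to retrieve the text at whatever unit your analysis needs.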
