Extracting figures from a PDF file with Azure Document Intelligence and outputting them as images

Posted at 2024-04-15

Overview

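With the February 2024 update to Azure AI Document Intelligence, the layout model now supports figure detection. As Microsoft Learn also notes, figure detection is available in preview API versions such as 2024-02-29-preview. Note that this API version can only be used with accounts created in the East US, West US2, and West Europe regions (it is not yet available in the Japan regions).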

It also appears to be available with 2023-10-31-preview, the default API version of the DocumentIntelligenceClient in the Python SDK.

Figure detection in Document Intelligence

The Document Models - Get Analyze Result API of Document Intelligence now appears to return detected figures as part of the document analysis result. Specifically, figures is returned as a root-level property, as shown below. In SDK terms, the analysis result (azure.ai.documentintelligence.models.AnalyzeResult) contains azure.ai.documentintelligence.models.DocumentFigure objects.

{
  "apiVersion": "2023-10-31-preview",
  "modelId": "prebuilt-layout",
  "stringIndexType": "textElements",
  "content": "...",
  "pages": [...],
  "tables": [...],
  "paragraphs": [...],
  "styles": [...],
  "contentFormat": "text",
  "sections": [...],
  "figures": [
    {
      "boundingRegions": [
        {
          "pageNumber": 1,
          "polygon": [
            2.2868,
            3.4992,
            2.6877,
            3.4991,
            2.6877,
            3.8465,
            2.2869,
            3.8466
          ]
        }
      ],
      "spans": [
        {
          "offset": 433,
          "length": 0
        }
      ]
    },
    ...
  ]
}

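figures is an array with one entry per detected figure. Each entry is an associative array holding the figure's page number in the document (pageNumber), its location on the page (polygon), and spans, which indicates where the figure appears in content.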
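As a rough illustration of how spans ties a figure back to content, the sketch below pulls the text just before and after each figure. The function name figure_contexts and the window parameter are hypothetical, and the sample data is made up. It assumes the offsets can be used directly as Python string indices, which holds for code-point offsets but may drift with other stringIndexType values when the text contains combining characters or emoji.

```python
from typing import Any


def figure_contexts(result: dict[str, Any], window: int = 40) -> list[dict[str, Any]]:
    """Return the page numbers and nearby text for each detected figure.

    Assumption: span offsets are valid Python string indices into content.
    """
    content = result.get("content", "")
    contexts = []
    for index, figure in enumerate(result.get("figures", [])):
        pages = [region["pageNumber"] for region in figure.get("boundingRegions", [])]
        spans = figure.get("spans", [])
        if spans:
            start = spans[0]["offset"]
            end = start + spans[0]["length"]
            before = content[max(0, start - window):start]
            after = content[end:end + window]
        else:
            before = after = ""
        contexts.append({"index": index, "pages": pages, "before": before, "after": after})
    return contexts


# Made-up sample in the same shape as the analysis result shown above
sample = {
    "content": "Figure 1 shows the architecture. The next section explains it.",
    "figures": [
        {
            "boundingRegions": [
                {"pageNumber": 1, "polygon": [2.2868, 3.4992, 2.6877, 3.4991, 2.6877, 3.8465, 2.2869, 3.8466]}
            ],
            "spans": [{"offset": 33, "length": 0}],
        }
    ],
}
contexts = figure_contexts(sample, window=12)
```

A span with length 0, as in the sample response, marks an insertion point rather than a range of text, so before and after simply meet at that offset.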

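polygon gives the corners of the quadrilateral that encloses the detected figure, in the order shown below. The unit is inches when the analyzed file is a PDF and pixels when it is an image.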

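"polygon": [
    2.2868, // X coordinate of the top-left corner of the detected figure
    3.4992, // Y coordinate of the top-left corner of the detected figure
    2.6877, // X coordinate of the top-right corner of the detected figure
    3.4991, // Y coordinate of the top-right corner of the detected figure
    2.6877, // X coordinate of the bottom-right corner of the detected figure
    3.8465, // Y coordinate of the bottom-right corner of the detected figure
    2.2869, // X coordinate of the bottom-left corner of the detected figure
    3.8466  // Y coordinate of the bottom-left corner of the detected figure
]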
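Because the four corners are not guaranteed to form a perfectly axis-aligned rectangle (the sample values differ slightly in the fourth decimal place), one way to get a crop box is to take the minimum and maximum of the X and Y values. The following is a minimal sketch assuming PDF input, so the values are in inches and multiplying by the rendering DPI gives pixels. The helper name polygon_to_pixel_box is hypothetical and not part of the SDK.

```python
import math


def polygon_to_pixel_box(polygon, dpi):
    """Convert an 8-value polygon in inches to (left, top, right, bottom) in pixels."""
    xs = polygon[0::2]  # X coordinates of the four corners
    ys = polygon[1::2]  # Y coordinates of the four corners
    # Round outward so the box fully covers the detected region
    left = math.floor(min(xs) * dpi)
    top = math.floor(min(ys) * dpi)
    right = math.ceil(max(xs) * dpi)
    bottom = math.ceil(max(ys) * dpi)
    return left, top, right, bottom


polygon = [2.2868, 3.4992, 2.6877, 3.4991, 2.6877, 3.8465, 2.2869, 3.8466]
box = polygon_to_pixel_box(polygon, dpi=150)
```

For image input, where the polygon is already in pixels, the same helper could be called with dpi=1.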

Analyzing a PDF file with Document Intelligence

As a bonus, here is Python code that analyzes a PDF file with the layout model in Azure Document Intelligence to extract text.

# pip install azure-core azure-ai-documentintelligence
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import DocumentAnalysisFeature

# Path to the PDF file to process
pdf_file_path = ""

# Endpoint and key of the Azure AI Document Intelligence account to use
endpoint = "https://xxx.cognitiveservices.azure.com/"
key = ""

credential = AzureKeyCredential(key)
client = DocumentIntelligenceClient(endpoint=endpoint, credential=credential)

with open(pdf_file_path, "rb") as f:
    poller = client.begin_analyze_document(
        "prebuilt-layout",
        analyze_request=f,
        locale="ja-JP",
        features=[DocumentAnalysisFeature.OCR_HIGH_RESOLUTION],
        content_type="application/octet-stream",
    )
    ocr_result = poller.result()
    ocr_result = ocr_result.as_dict()  # when you want to work with a dict

This code uses key authentication. To authenticate with a managed identity instead, use DefaultAzureCredential for the credential variable. In that case, the principal used by the caller needs the Cognitive Services User role.

# pip install azure-identity
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()

Also, the code above uploads a local file to Document Intelligence, but you can instead pass the URL plus SAS of a file already uploaded to Blob Storage. If the managed identity of the Document Intelligence account is enabled, and that principal has been granted the Storage Blob Data Reader role (or higher) on the Blob Storage that holds the target PDF file, it appears that the URL alone is enough, with no SAS.

from azure.ai.documentintelligence.models import AnalyzeDocumentRequest

# url: URL of the PDF in Blob Storage (with the SAS appended if one is required)
poller = client.begin_analyze_document(
    "prebuilt-layout",
    analyze_request=AnalyzeDocumentRequest(url_source=url),
    locale="ja-JP",
    features=[DocumentAnalysisFeature.OCR_HIGH_RESOLUTION],
)

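Specifying DocumentAnalysisFeature.OCR_HIGH_RESOLUTION in features when starting the analysis enables the high resolution extraction add-on, which improves OCR accuracy. Note, however, that it also raises the analysis price, and that it can (probably) only be used with PDF files (it definitely cannot be used with Office files).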

features=[DocumentAnalysisFeature.OCR_HIGH_RESOLUTION]

For details, see:
Azure AI Document Intelligence client library for Python | Microsoft Learn

Outputting the detected figures as images

To output the figures detected in the PDF file as image files, I take the following steps.

Convert each page of the PDF file to an image
Using the detected figure information, crop the figure regions out of each page image and output them as images

Implemented in Python, it looks like this.

# pip install pillow opencv-python PyMuPDF azure-core azure-ai-documentintelligence
import cv2
import fitz
import numpy as np
from PIL import Image
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import DocumentAnalysisFeature

# Path to the PDF file to process
pdf_file_path = ""

# Endpoint and key of the Azure AI Document Intelligence account to use
endpoint = "https://xxx.cognitiveservices.azure.com/"
key = ""

credential = AzureKeyCredential(key)
client = DocumentIntelligenceClient(endpoint=endpoint, credential=credential)

# Analyze the specified PDF file with Document Intelligence
with open(pdf_file_path, "rb") as f:
    poller = client.begin_analyze_document(
        "prebuilt-layout",
        analyze_request=f,
        locale="ja-JP",
        features=[DocumentAnalysisFeature.OCR_HIGH_RESOLUTION],
        content_type="application/octet-stream",
    )
    ocr_result = poller.result()
    ocr_result = ocr_result.as_dict()  # when you want to work with a dict

# Group the polygons of the detected figures by page
page_figures = {}
for figure in ocr_result.get("figures", []):  # the key may be absent when no figures are detected
    regions = figure["boundingRegions"]
    for region in regions:
        pageNumber = region["pageNumber"]
        polygon = region["polygon"]
        if pageNumber not in page_figures:
            page_figures[pageNumber] = []
        page_figures[pageNumber].append(polygon)

# For each page, crop out the detected figures and write them to files
dpi = 150
# Open the PDF once, rather than once per page
reader = fitz.open(pdf_file_path)
for page_number in page_figures.keys():
    # Render the PDF page to an image with PyMuPDF
    page = reader[page_number - 1]
    pix = page.get_pixmap(dpi=dpi)
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)

    # Crop each figure region out of the rendered page and save it.
    # polygon is in inches for PDF input, so multiplying by dpi gives pixels.
    # cv2.imwrite does not create directories, so images/ must already exist.
    polygons = page_figures.get(page_number, [])
    for i, polygon in enumerate(polygons):
        cropped_img = img[int(polygon[1] * dpi) : int(polygon[5] * dpi), int(polygon[0] * dpi) : int(polygon[4] * dpi)]
        cv2.imwrite(f"images/{page_number}_{i}.png", cropped_img)

First, the specified PDF file is analyzed with Document Intelligence.
Next, the polygon values in the detected figure information (figures) are grouped by page.
Finally, the following is done for each page:

Render the PDF page to an image with PyMuPDF (concretely, PIL.Image.Image → numpy.ndarray)
Using the figure information detected for that page, crop the figure regions out of the rendered page image and output them
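One edge case the listing above does not guard against is a crop box that falls partly or wholly outside the rendered page, which would yield a truncated or empty slice that cv2.imwrite may fail to write. The sketch below shows one possible guard. The helper name clamp_box is hypothetical, and the page size in the example (1275 x 1650 pixels, a US Letter page at 150 dpi) is only an illustration.

```python
def clamp_box(box, width, height):
    """Clamp (left, top, right, bottom) to the image bounds.

    Returns None when nothing of the box remains inside the image.
    """
    left, top, right, bottom = box
    left = max(0, min(left, width))
    right = max(0, min(right, width))
    top = max(0, min(top, height))
    bottom = max(0, min(bottom, height))
    if right <= left or bottom <= top:
        return None
    return left, top, right, bottom


# A box fully inside the page is returned unchanged
inside = clamp_box((343, 524, 404, 577), width=1275, height=1650)
# A box that spills over the edges is trimmed to the page
trimmed = clamp_box((-5, 10, 50, 2000), width=100, height=100)
# A box entirely outside the page yields None, so the caller can skip it
outside = clamp_box((120, 10, 150, 20), width=100, height=100)
```

With a box in this form, the crop would be img[top:bottom, left:right], skipped when the helper returns None.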

Summary

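The February 2024 update to Azure AI Document Intelligence made figure detection possible
The figure detection result contains the page number and position of each detected figure
Cropping figures as images directly from the PDF file is difficult, so I first rendered each page of the PDF to an image and then cropped the figures out using the position information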
