[Latest] Analyzing document structure with Azure AI Document Intelligence (Markdown, figures, sections)

Last updated at 2024-11-29, posted at 2024-05-04

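Combining Azure AI Document Intelligence with Azure AI Search can further strengthen data ingestion in a RAG architecture. The latest layout model (prebuilt-layout) combines an enhanced version of Microsoft's powerful optical character recognition (OCR) capabilities with deep learning models to extract text, tables, selection marks, and document structure. In this article, we look at the structural data it returns, including the latest Markdown output, figures, and sections.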

11/29: Introducing the latest service, Azure AI Content Understanding 🆕

This article runs document structure analysis on the default sample for the layout model (layout-pageobject.pdf). With the Python SDK, you can analyze it from a URL as follows.

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, AnalyzeResult

endpoint = "<your-endpoint>"
key = "<your-key>"
docUrl = "https://documentintelligence.ai.azure.com/documents/samples/layout/layout-pageobject.pdf"

document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout", AnalyzeDocumentRequest(url_source=docUrl), output_content_format="markdown"
)
result: AnalyzeResult = poller.result()

Supported file formats

PDF, JPEG, JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX, HTML

Specifying output_content_format="markdown" in begin_analyze_document returns the document content in Markdown format. It is a new feature for the generative AI era. Combined with LangChain's MarkdownHeaderTextSplitter, it makes semantic chunking easy to implement. Semantic chunking is also explained in the third Azure OpenAI Developers seminar.

<!-- PageHeader="This is the header of the document." -->
This is title
===
# 1\. Text
Latin refers to an ancient Italic language originating in the region of Latium in ancient Rome.

# 2\. Page Objects
## 2.1 Table
Here's a sample table below, designed to be simple for easy understand and quick reference.

| Name | Corp | Remark |
| - | - | - |
| Foo | | |
| Bar | Microsoft | Dummy |

Table 1: This is a dummy table

## 2.2. Figure
<figure>
<figcaption>
Figure 1: Here is a figure with text
</figcaption>

![](figures/0)
<!-- FigureContent="Values 500 450 400 400 350 300 300 250 200 200 100 0 Jan Feb Mar Apr May Jun Months" -->
</figure>

# 3\. Others
AI Document Intelligence is an AI service that applies advanced machine learning to extract text, key-value pairs, tables, and structures from documents automatically and accurately:
 :selected:
clear
 :selected:
precise
 :unselected:
vague
 :selected:
coherent
 :unselected:
Incomprehensible

Turn documents into usable data and shift your focus to acting on information rather than compiling it. Start with prebuilt models or create custom models tailored to your documents both on premises and in the cloud with the AI Document Intelligence studio or SDK. Learn how to accelerate your business processes by automating text extraction with AI Document Intelligence. This webinar features hands-on demos for key use cases such as document processing, knowledge mining, and industry-specific AI model customization.

<!-- PageFooter="This is the footer of the document." -->
<!-- PageNumber="1 | Page" -->

print(result.content)
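
To illustrate the header-based chunking idea, here is a minimal sketch that splits Markdown at # and ## headings using only the standard library. It is a simplified stand-in, not LangChain's MarkdownHeaderTextSplitter. The function name split_markdown_by_headers and the layout of the chunk dictionaries are assumptions made for this example, and Setext-style headings (such as the "This is title" / "===" pair in the output above) are not handled.

```python
import re

def split_markdown_by_headers(markdown_text, max_level=2):
    """Split Markdown into chunks at ATX headings (#, ##, ...) up to max_level."""
    heading_re = re.compile(r"^(#{1,6})\s+(.*)$")
    chunks = []
    current_headers = {}  # heading level -> heading text
    current_lines = []

    def flush():
        # Store the accumulated body text together with the active headings.
        text = "\n".join(current_lines).strip()
        if text:
            chunks.append({"headers": dict(current_headers), "content": text})
        current_lines.clear()

    for line in markdown_text.splitlines():
        match = heading_re.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new heading closes every heading at the same or a deeper level.
            for stale in [k for k in current_headers if k >= level]:
                del current_headers[stale]
            current_headers[level] = match.group(2).strip()
        else:
            current_lines.append(line)
    flush()
    return chunks

# Hypothetical input, shortened from the sample output above.
sample_markdown = "# Text\nLatin paragraph.\n\n# Page Objects\n## Table\nA sample table.\n"
chunks = split_markdown_by_headers(sample_markdown)
for chunk in chunks:
    print(chunk["headers"], "->", chunk["content"])
```

Keeping the active headings as metadata on each chunk means a retrieved chunk still carries its position in the document outline, which is the main point of chunking by headers.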

paragraphs

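As a top-level object of AnalyzeResult, paragraphs holds the text blocks extracted paragraph by paragraph. Each entry in this collection represents one text block and contains the extracted text (content) and the polygon coordinates of its bounding region. role stores the logical role of the block, such as title, section heading, page header, or page footer.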

def get_paragraphs(result):
    paragraphs = []
    for idx, paragraph in enumerate(result.paragraphs):
        item = {
            "id": "/paragraphs/" + str(idx),
            "content": paragraph.content if paragraph.content else "",
            "role": paragraph.role if paragraph.role else "",
            "polygon": paragraph.get("boundingRegions")[0]["polygon"],
            "pageNumber": paragraph.get("boundingRegions")[0]["pageNumber"]
        }
        paragraphs.append(item)
    return paragraphs

get_paragraphs(result)
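
One possible use of role is to drop page furniture before indexing. The sketch below works on dictionaries shaped like the output of get_paragraphs; the sample data is hypothetical and shortened, and the helper names drop_boilerplate and headings_only are assumptions made for this example.

```python
# Roles that usually repeat on every page and add little to retrieval.
BOILERPLATE_ROLES = {"pageHeader", "pageFooter", "pageNumber"}

def drop_boilerplate(paragraphs):
    """Keep only paragraphs whose role is not a page header, footer, or number."""
    return [p for p in paragraphs if p["role"] not in BOILERPLATE_ROLES]

def headings_only(paragraphs):
    """Return the text of paragraphs marked as title or section heading."""
    return [p["content"] for p in paragraphs if p["role"] in {"title", "sectionHeading"}]

# Hypothetical data in the same shape as the get_paragraphs output.
sample_paragraphs = [
    {"id": "/paragraphs/0", "content": "This is the header of the document.", "role": "pageHeader"},
    {"id": "/paragraphs/1", "content": "This is title", "role": "title"},
    {"id": "/paragraphs/2", "content": "1. Text", "role": "sectionHeading"},
    {"id": "/paragraphs/3", "content": "Latin refers to an ancient Italic language.", "role": ""},
]

body_paragraphs = drop_boilerplate(sample_paragraphs)
outline = headings_only(body_paragraphs)
print(outline)
```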

sections

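sections stores the result of hierarchical document structure analysis. It provides a data structure that is useful for understanding the hierarchy of an unstructured document. Specifically, it contains hierarchy information that identifies the sections and the relationship between each section and the objects inside it.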

def get_sections(result):
    sections = []
    for section in result.sections:
        sections.append(section.elements)
    return sections

get_sections(result)

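Running the code above returns a sections array like the following. The n in /sections/n corresponds to the index in the sections array. For example, /figures/0 appears in the entry at index 4, so it belongs to /sections/4. This tells you which part of the document objects such as figures and tables are located in.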

[['/paragraphs/1', '/sections/1', '/sections/2', '/sections/5'],
 ['/paragraphs/2', '/paragraphs/3'],
 ['/paragraphs/4', '/sections/3', '/sections/4'],
 ['/paragraphs/5', '/paragraphs/6', '/tables/0'],
 ['/paragraphs/15', '/figures/0'],
 ['/paragraphs/37',
  '/paragraphs/38',
  '/paragraphs/39',
  '/paragraphs/40',
  '/paragraphs/41',
  '/paragraphs/42',
  '/paragraphs/43',
  '/paragraphs/44']]
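
The lookup described above ("which section contains /figures/0?") can be sketched as a reverse index from each element to its parent section. The sections list is copied from the output above; the helper names build_parent_index and ancestors are assumptions made for this example.

```python
def build_parent_index(sections):
    """Map every element reference to the /sections/n entry that lists it."""
    parent = {}
    for idx, elements in enumerate(sections):
        for ref in elements:
            parent[ref] = f"/sections/{idx}"
    return parent

def ancestors(ref, parent):
    """Walk upward from ref and return the chain of enclosing sections."""
    chain = []
    seen = set()  # guards against malformed, cyclic input
    while ref in parent and ref not in seen:
        seen.add(ref)
        ref = parent[ref]
        chain.append(ref)
    return chain

sections = [
    ['/paragraphs/1', '/sections/1', '/sections/2', '/sections/5'],
    ['/paragraphs/2', '/paragraphs/3'],
    ['/paragraphs/4', '/sections/3', '/sections/4'],
    ['/paragraphs/5', '/paragraphs/6', '/tables/0'],
    ['/paragraphs/15', '/figures/0'],
    ['/paragraphs/37', '/paragraphs/38', '/paragraphs/39', '/paragraphs/40',
     '/paragraphs/41', '/paragraphs/42', '/paragraphs/43', '/paragraphs/44'],
]

parent = build_parent_index(sections)
figure_path = ancestors('/figures/0', parent)
print(figure_path)
```

Here the figure sits in /sections/4, which is nested in /sections/2, which in turn belongs to the root entry /sections/0.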

To turn this flat data structure into a hierarchical view, a recursive traversal such as the following can be used.

def explore_sections(input_data, indices, depth=0):
    indent = ' ' * depth  # indentation for the current depth
    for idx in indices:
        if idx < len(input_data):
            for path in input_data[idx]:
                print(indent + f"{idx}: {path}")
                if 'sections' in path:
                    number = int(path.split('/')[-1])
                    # Recurse into the referenced section
                    explore_sections(input_data, [number], depth + 2)

def generate_hierarchy(input_data):
    initial_indices = [0]
    # Start from index 0 so that every element of the first list is printed, then explore from there
    explore_sections(input_data, initial_indices)

# Generate the hierarchy
generate_hierarchy(get_sections(result))
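
If the hierarchy is needed as data rather than printed text, the same traversal can return a nested dictionary. This is a sketch under the assumption that /sections/0 is the root; the function name build_tree, the "id" and "children" keys, and the small sample input are all hypothetical.

```python
def build_tree(sections, index=0, visited=None):
    """Return a nested dict for /sections/<index>, expanding nested section references."""
    visited = set() if visited is None else visited
    node = {"id": f"/sections/{index}", "children": []}
    # Stop on out-of-range or already visited sections to avoid infinite recursion.
    if index >= len(sections) or index in visited:
        return node
    visited.add(index)
    for ref in sections[index]:
        if ref.startswith("/sections/"):
            child_index = int(ref.rsplit("/", 1)[-1])
            node["children"].append(build_tree(sections, child_index, visited))
        else:
            # Paragraphs, tables, and figures are leaves.
            node["children"].append({"id": ref})
    return node

# Hypothetical, shortened input in the same shape as the get_sections output.
small_sections = [
    ["/paragraphs/0", "/sections/1"],
    ["/paragraphs/1", "/figures/0"],
]
tree = build_tree(small_sections)
print(tree)
```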

tables

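tables outputs tables in structured form. The extracted table information includes the number of rows and columns, and the row span and column span.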

def get_tables(result):
    tables = []
    # result.tables can be None when the document contains no tables
    for table in result.tables or []:
        cells = []
        for cell in table.cells:
            cells.append({
                "row_index": cell.row_index,
                "column_index": cell.column_index,
                "content": cell.content,
            })
        tab = {
            "row_count": table.row_count,
            "column_count": table.column_count,
            "cells": cells
        }
        tables.append(tab)
    # Return after the loop so that every table is collected, not only the first
    return tables
    
get_tables(result)

[{'row_count': 3,
  'column_count': 3,
  'cells': [{'row_index': 0, 'column_index': 0, 'content': 'Name'},
   {'row_index': 0, 'column_index': 1, 'content': 'Corp'},
   {'row_index': 0, 'column_index': 2, 'content': 'Remark'},
   {'row_index': 1, 'column_index': 0, 'content': 'Foo'},
   {'row_index': 1, 'column_index': 1, 'content': ''},
   {'row_index': 1, 'column_index': 2, 'content': ''},
   {'row_index': 2, 'column_index': 0, 'content': 'Bar'},
   {'row_index': 2, 'column_index': 1, 'content': 'Microsoft'},
   {'row_index': 2, 'column_index': 2, 'content': 'Dummy'}]}]
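
The flat cell list can be turned back into rows and columns when a grid or a Markdown table is easier to work with. The sketch below uses the output above as input; the helper names table_to_grid and grid_to_markdown are assumptions made for this example, and merged cells (row or column spans) are deliberately ignored.

```python
def table_to_grid(table):
    """Place each cell's content at [row_index][column_index] in a 2D list."""
    grid = [["" for _ in range(table["column_count"])] for _ in range(table["row_count"])]
    for cell in table["cells"]:
        grid[cell["row_index"]][cell["column_index"]] = cell["content"]
    return grid

def grid_to_markdown(grid):
    """Render the grid as a Markdown table, treating the first row as the header."""
    header, *rows = grid
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("-" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

sample_table = {
    "row_count": 3,
    "column_count": 3,
    "cells": [
        {"row_index": 0, "column_index": 0, "content": "Name"},
        {"row_index": 0, "column_index": 1, "content": "Corp"},
        {"row_index": 0, "column_index": 2, "content": "Remark"},
        {"row_index": 1, "column_index": 0, "content": "Foo"},
        {"row_index": 1, "column_index": 1, "content": ""},
        {"row_index": 1, "column_index": 2, "content": ""},
        {"row_index": 2, "column_index": 0, "content": "Bar"},
        {"row_index": 2, "column_index": 1, "content": "Microsoft"},
        {"row_index": 2, "column_index": 2, "content": "Dummy"},
    ],
}

grid = table_to_grid(sample_table)
markdown_table = grid_to_markdown(grid)
print(markdown_table)
```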

figures

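For figures in the document (charts, images), you can retrieve the following, as shown below: caption, the figure's caption if one exists; boundingRegions, the spatial position of the figure on the document page (polygon coordinates, in inches for PDF input); pageNumber, the page number; and elements, the identifiers of the text elements or paragraphs in the document that relate to or describe the figure.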

Incidentally, even when a PDF page is a single scanned image, Azure AI Document Intelligence can detect just the region of a particular figure within it, which is very powerful.

if result.figures:
    for idx, figure in enumerate(result.figures):
        print(f"--------Analysis of Figures #{idx + 1}--------")

        # Reset for every figure so the comparison below also works when there is no caption
        caption_polygon = None

        if figure.caption:
            title = figure.caption.get("content")
            if title:
                print(f"Caption: {title}")

            elements = figure.caption.get("elements")
            if elements:
                print("...caption elements involved:")
                for item in elements:
                    print(f"......Item #{item}")

            caption_bounding_regions = figure.caption.get("boundingRegions")
            if caption_bounding_regions:
                print("...caption bounding regions involved:")
                for item in caption_bounding_regions:
                    print(f"......Item pageNumber: {item.get('pageNumber')}")
                    print(f"......Item polygon: {item.get('polygon')}")
                    caption_polygon = item.get('polygon')

        if figure.elements:
            print("Elements involved:")
            for item in figure.elements:
                print(f"...Item #{item}")

        bounding_regions = figure.get("boundingRegions")
        if bounding_regions:
            print("Bounding regions involved:")
            for item in bounding_regions:
                # Skip the region that is just the caption's polygon
                if caption_polygon != item.get('polygon'):
                    print(f"......Item pageNumber: {item.get('pageNumber')}")
                    print(f"......Item polygon: {item.get('polygon')}")

--------Analysis of Figures #1--------
Caption: Figure 1: Here is a figure with text
...caption elements involved:
......Item #/paragraphs/16
...caption bounding regions involved:
......Item pageNumber: 1
......Item polygon: [1.4183, 6.8082, 3.591, 6.8082, 3.591, 6.9657, 1.4183, 6.9657]
Elements involved:
...Item #/paragraphs/16-36
...
Bounding regions involved:
......Item pageNumber: 1
......Item polygon: [1.0301, 7.1098, 4.1763, 7.1074, 4.1781, 9.0873, 1.0324, 9.0891]

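If a figure has a caption, it is stored in the caption property of that figure. This data structure lets you retrieve a figure and its caption separately. elements contains the text blocks that belong to the figure, from /paragraphs/16 through /paragraphs/36.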
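
Because elements holds references rather than text, the text inside a figure can be recovered by resolving each /paragraphs/n reference against the paragraph list. The sketch below uses hypothetical, shortened data in the shape returned by get_paragraphs; the helper name resolve_paragraph_refs is an assumption made for this example.

```python
def resolve_paragraph_refs(element_refs, paragraphs):
    """Return the content of every /paragraphs/<n> reference, skipping other kinds."""
    texts = []
    for ref in element_refs:
        kind, _, index = ref.strip("/").partition("/")
        if kind == "paragraphs":
            texts.append(paragraphs[int(index)]["content"])
    return texts

# Hypothetical data: three paragraphs and a figure that refers to two of them.
sample_paragraphs = [
    {"content": "Values"},
    {"content": "Figure 1: Here is a figure with text"},
    {"content": "Months"},
]
figure_elements = ["/paragraphs/0", "/paragraphs/2"]

figure_text = resolve_paragraph_refs(figure_elements, sample_paragraphs)
print(" ".join(figure_text))
```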

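Cropping and saving figures

Code introduced on the Microsoft Tech Community happens to be ready to use, so I quote it here. It uses the PyMuPDF library to crop the region directly from the PDF as an image.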

from PIL import Image
import fitz  # PyMuPDF
import mimetypes

def crop_image_from_image(image_path, page_number, bounding_box):
    """
    Crops an image based on a bounding box.

    :param image_path: Path to the image file.
    :param page_number: The page number of the image to crop (for TIFF format).
    :param bounding_box: A tuple of (left, upper, right, lower) coordinates for the bounding box.
    :return: A cropped image.
    :rtype: PIL.Image.Image
    """
    with Image.open(image_path) as img:
        if img.format == "TIFF":
            # Open the TIFF image
            img.seek(page_number)
            img = img.copy()
            
        # The bounding box is expected to be in the format (left, upper, right, lower).
        cropped_image = img.crop(bounding_box)
        return cropped_image

def crop_image_from_pdf_page(pdf_path, page_number, bounding_box):
    """
    Crops a region from a given page in a PDF and returns it as an image.

    :param pdf_path: Path to the PDF file.
    :param page_number: The page number to crop from (0-indexed).
    :param bounding_box: A tuple of (x0, y0, x1, y1) coordinates for the bounding box.
    :return: A PIL Image of the cropped area.
    """
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    
    # Cropping the page. fitz.Rect takes (x0, y0, x1, y1) in points (1/72 inch).
    # bounding_box comes from Document Intelligence in inches (PDF input), so multiply by 72.
    bbx = [x * 72 for x in bounding_box]
    rect = fitz.Rect(bbx)
    pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), clip=rect)
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    doc.close()

    return img

def crop_image_from_file(file_path, page_number, bounding_box):
    """
    Crop an image from a file.

    Args:
        file_path (str): The path to the file.
        page_number (int): The page number (for PDF and TIFF files, 0-indexed).
        bounding_box (tuple): The bounding box coordinates in the format (x0, y0, x1, y1).

    Returns:
        A PIL Image of the cropped area.
    """
    mime_type = mimetypes.guess_type(file_path)[0]
    
    if mime_type == "application/pdf":
        return crop_image_from_pdf_page(file_path, page_number, bounding_box)
    else:
        return crop_image_from_image(file_path, page_number, bounding_box)

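With these functions defined, you can crop the figure by passing the top-left and bottom-right coordinates of the polygon, as shown below. Note that for PDF input the coordinates returned in figures are in inches, so crop_image_from_pdf_page multiplies them by 72 to convert them to points, and the transformation matrix passed to get_pixmap (300/72) scales the page so that the image is rendered at 300 DPI.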

polygon = [1.0301, 7.1098, 4.1763, 7.1074, 4.1781, 9.0873, 1.0324, 9.0891]
bounding_box = (polygon[0], polygon[1], polygon[4], polygon[5])
image = crop_image_from_file("layout-pageobject.pdf", 0, bounding_box)
#image.show()
image.save("figure_1.png")
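
Picking polygon[0], polygon[1], polygon[4], polygon[5] assumes the polygon starts at the top-left corner and is nearly axis-aligned. A slightly more defensive sketch takes the minimum and maximum of all x and y values instead. The helper names polygon_to_bbox and bbox_pixel_size are assumptions made for this example, and the pixel size is only an estimate of what a 300 DPI render would produce.

```python
def polygon_to_bbox(polygon):
    """Convert [x0, y0, x1, y1, ...] corner points into (left, top, right, bottom)."""
    xs, ys = polygon[0::2], polygon[1::2]
    return (min(xs), min(ys), max(xs), max(ys))

def bbox_pixel_size(bbox, dpi=300):
    """Estimate the rendered size in pixels of a bounding box given in inches."""
    left, top, right, bottom = bbox
    return (round((right - left) * dpi), round((bottom - top) * dpi))

polygon = [1.0301, 7.1098, 4.1763, 7.1074, 4.1781, 9.0873, 1.0324, 9.0891]
bbox = polygon_to_bbox(polygon)
size = bbox_pixel_size(bbox)
print(bbox, size)
```

The resulting tuple can be passed as bounding_box to crop_image_from_file in place of the hand-picked corners.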

Integration with Azure AI Search

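By implementing these latest structure extraction features, including Markdown output, figures, and sections, as a custom skill in Azure AI Search, you can automate data ingestion in a RAG architecture and make it richer. These features also solve a problem from a previous article, where I built an image vector search system but could not tell where the extracted images were positioned in the document.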
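
As a rough sketch of what the core of such a custom skill might look like, the function below follows the request and response envelope that Azure AI Search custom Web API skills use (a "values" array of records, each with "recordId" and "data"). Everything inside "data" is an assumption made for this example: the "url" input, the "markdown" and "figures" outputs, and the fake_analyze stand-in, which would be replaced by a real call to Document Intelligence.

```python
def run_skill(request_body, analyze):
    """Apply analyze() to each record and wrap the results in the skill response envelope."""
    results = []
    for record in request_body.get("values", []):
        entry = {"recordId": record["recordId"], "data": {}, "errors": [], "warnings": []}
        try:
            entry["data"] = analyze(record["data"])
        except Exception as exc:
            # Report the failure for this record only, so other records still succeed.
            entry["errors"].append({"message": str(exc)})
        results.append(entry)
    return {"values": results}

def fake_analyze(data):
    """Hypothetical stand-in for the Document Intelligence call."""
    return {"markdown": f"# Content of {data['url']}", "figures": []}

request = {
    "values": [
        {"recordId": "1", "data": {"url": "https://example.com/a.pdf"}},
        {"recordId": "2", "data": {}},  # missing url, so this record fails
    ]
}
response = run_skill(request, fake_analyze)
print(response)
```

Handling errors per record keeps one bad document from failing the whole batch that the indexer sends.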

Use these features to the full to strengthen your multimodal RAG architecture.

GitHub
