Azure AI Document Intelligence で表データを抽出してみる

Last updated at 2024-04-01Posted at 2024-03-26

やりたいこと

OpenAI や Azure OpenAI Service を使って独自のドキュメント内容に基づいて回答させる際、ドキュメント内の表を抽出するのに苦労することがあります。今回は、Azure AI Document Intelligence の機能を使って、ドキュメントのレイアウト分析を行い表を抽出してみます。

Azure AI Document Intelligence

様々なドキュメントを AI によってレイアウトやコンテンツを分析してくれるサービス (旧称: Form Recognizer)。請求書や領収書などの事前構築済みのモデルを利用できたりカスタムモデルを利用することもできる。

コード

Document Intelligence を使って、PDF ファイルの内容を Markdown 形式にしてみる。読み込むドキュメントは MS のサンプル。docs ディレクトリを作成してそこに置いておく。

requirements.txt

azure-ai-documentintelligence==1.0.0b2
openai==1.14.0
python-dotenv==1.0.0
requests==2.31.0

import os

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv

load_dotenv()

# create document intelligence client
AI_SERVICES_ENDPOINT = os.getenv("AI_SERVICES_ENDPOINT")
AI_SERVICES_API_KEY = os.getenv("AI_SERVICES_API_KEY")
di_client = DocumentIntelligenceClient(
    credential=AzureKeyCredential(AI_SERVICES_API_KEY),
    endpoint=AI_SERVICES_ENDPOINT,
)

# load document
path_to_sample_documents = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),
    "docs",
    "invoice-logic-apps-tutorial.pdf",
)

# analyze document
with open(path_to_sample_documents, "rb") as doc:
    poller = di_client.begin_analyze_document(
        "prebuilt-layout",
        analyze_request=doc,
        output_content_format="markdown",
        content_type="application/octet-stream",
    )
result = poller.result()

print(result.content)

出力結果

CONTOSO LTD.

Contoso Headquarters 123 456th St New York, NY, 10001

INVOICE
===

INVOICE DATE: 11/15/2019 CUSTOMER ID: CID-12345 DUE DATE: 12/15/2019 INVOICE: INV-100

CUSTOMER NAME: MICROSOFT CORPORATION SERVICE PERIOD: 10/14/2019 - 11/14/2019

Microsoft Corp 123 Other St, Redmond WA, 98052

Redmond WA, 98052

BILL TO: Microsoft Finance 123 Bill St, Microsoft Delivery SHIP TO: 123 Ship St, Redmond WA, 98052

SERVICE ADDRESS: Microsoft Services 123 Service St, Redmond WA, 98052

| SALESPERSON | P.O. NUMBER | REQUISITIONER | SHIPPED VIA | F.O.B. POINT | TERMS |
| - | - | - | - | - | - |
| | PO-3333 | | | | |

| DATE | ITEM CODE | DESCRIPTION | QTY | UM | PRICE | TAX | AMOUNT |
| - | - | - | - | - | - | - | - |
| 3/4/2021 | A123 | Consulting Services | 2 | hours | $30.00 | 10% | $60.00 |
| 3/5/2021 | B456 | Document Fee | 3 | | $10.00 | 5% | $30.00 |
| 3/6/2021 | C789 | Printing Fee | 10 | pages | $1.00 | 20% | $10.00 |


|||
| - | - |
| SUBTOTAL | $100.00 |
| SALES TAX | $10.00 |
| TOTAL | $110.00 |
| PREVIOUS UNPAID BALANCE | $500.00 |
| AMOUNT DUE | $610.00 |

THANK YOU FOR YOUR BUSINESS!

REMIT TO: Contoso Billing 123 Remit St New York, NY, 10001

output_content_format="markdown" の部分で出力形式を Markdown に指定している。この指定を入れることで Markdown で出力される。指定しない場合は、1行ごとにただのテキストとして出力されてしまう。特に表の部分は何を意味するのかまったく分からなくなる。

この出力結果をチャンキングして、LLM　へのプロンプトに Context として与えれば表の内容について回答してくれるようになる。

ただ、ファイルの内容全体が Markdown になっていて、固定サイズでチャンキングする場合には表が途切れてしまうかもしれないので、もう少し工夫してみる。

上では content を使ったが、poller.result() には、tables という、分析して検出した個別の表がリストで格納されているので、このリストを利用する。table オブジェクトは、表内のセルをすべてリストで持っているので、一度先頭から列数分だけ取り出し行をつくる。その後、各行ごとに Markdown の書式になるように整えていく。列のヘッダを行を区切る |----|----| は表によって位置が変わるため対応させていない (区切りがなくても LLM は表として認識して回答してくれた)。

def get_markdown_table(table) -> str:
    cell_count = len(table.cells)
    column_count = table.column_count

    # split cells into rows
    index = 0
    rows = []
    while index != cell_count:
        rows.append(
            [cell.content for cell in table.cells[index : index + column_count]]
        )
        index += column_count

    # generate markdown table
    markdown_table = ""
    for i, row in enumerate(rows):
        for column in row:
            markdown_table += f"| {column} "
        markdown_table += "|\n"

    return markdown_table

print("These are the tables in the document:")
markdown_tables = []
for i, table in enumerate(result.tables):
    markdown_table = get_markdown_table(table)
    print(f"Table {i}:")
    print(markdown_table)
    markdown_tables.append(markdown_table)

出力結果

These are the tables in the document:
Table 0:
| SALESPERSON | P.O. NUMBER | REQUISITIONER | SHIPPED VIA | F.O.B. POINT | TERMS |
|  | PO-3333 |  |  |  |  |

Table 1:
| DATE | ITEM CODE | DESCRIPTION | QTY | UM | PRICE | TAX | AMOUNT |
| 3/4/2021 | A123 | Consulting Services | 2 | hours | $30.00 | 10% | $60.00 |
| 3/5/2021 | B456 | Document Fee | 3 |  | $10.00 | 5% | $30.00 |
| 3/6/2021 | C789 | Printing Fee | 10 | pages | $1.00 | 20% | $10.00 |

Table 2:
| SUBTOTAL | $100.00 |
| SALES TAX | $10.00 |
| TOTAL | $110.00 |
| PREVIOUS UNPAID BALANCE | $500.00 |
| AMOUNT DUE | $610.00 |

ということで

ドキュメント内の表を LLM に与えて回答してもらう際のデータ準備として、Document Intelligence を使って表の抽出を行ってみました。簡単に分析してくれて表を抽出してくれるのはやはり便利です。
また、今回は取り上げていませんが、章タイトルや段落といったところも分析して出力してくれるので、ファイルを RAG で扱う際の精度向上に使えるかもしれません。出力形式を Markdown に指定する機能は最近追加されているようで、RAG を意識したものだと思います。そういった意味で、今後も RAG や LLM 向けの便利な機能が追加されるのではと思います。

以上です！

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up