More than 1 year has passed since last update.

【Azure AI Document Intelligence】PDFなど非構造型データから生成AIの回答を生成する

Posted at 2024-02-02

はじめに

生成AIで回答の品質を向上するためのナレッジを以下の軸でまとめています。前回の記事では外部ソースのデータを使った回答生成の方法をAzure AI Searchを活用しながら紹介しました。前回はCSVなどデータ形式の整った構造型のデータだったのでシンプルに扱うことができました。今回はPDFなど非構造型のデータを活用して回答を生成する方法をご紹介していきます。他の記事もセットでぜひチェックしてみてください。

本記事は筆者の調査した内容や試した成功/失敗の体験をリアルにお伝えすることで、Azureの活用ノウハウを共有すると共にもっと効率的な方法などあればディスカッションすることで改善していくことを目的にしています。システムの動作を保証するものでないことはご了承ください。

カテゴリ	記事の概要	公開予定日
プロンプトエンジニアリング	目的に合った回答を生成できるように制限を与える：システムメッセージエンジニアリング	公開済み
	プロンプトエンジニアリングでよい回答を得るために意識したい方針	公開済み
外部ソースのデータを使った回答(RAG)	【Azure AI Search】外部ソースのデータを活用した生成AIの回答生成の仕組み	公開済み
	非構造型データを使った回答の作成	2/2
安全に生成AIを活用する	避けるべき情報が含まれないかチェックする	2/3
	モデルの信頼性を評価する	2/8

本記事でやりたいこと

LLMモデルは学習済みのデータに関する情報しか持っていません。信頼できるソースを基に生成AIによる回答を生成することで活用の幅も広がりますし、回答の信頼性向上にもつながります。
一方で信頼できるソースはCSVのような構造化された扱いやすいものばかりではありません。PDFなどの非構造なソースであっても活用できるようにすることでさらに活用の幅を広げることができます。
本記事ではAzure AI SearchとAzure AI Document Intelligenceを活用し非構造データを基に生成AIの回答を生成する方法を実際のユースケースに沿ってご紹介します。

本記事で紹介する仕組みの全体像の概念図は以下です。

Azure AI Document Intelligence

Azure AI Document Intelligence (旧称 Azure Form Recognizer)はPDFや画像などのデータからテキストの抽出を実現するツールです。本記事ではAzure AI Document Intelligenceの事前構築済みレイアウトモデルを使用して文書からテキストとレイアウト情報を抽出しますが、他にも、請求書や領収書、IDや保険証など様々なフォーマットに対応した事前学習済みのモデルが用意されている他、カスタムモデルを構築することも可能です。

LLM関連の論文に関するPDF形式のデータセットからユーザーのクエリに対する回答を生成する

以下のシナリオに基づいて実際の動作を確認していきます。シナリオは以下Microsoft WhatTheHack（セルフ学習、トレーニングイベント開催用のコンテンツ集）を参照しています。データソースとして使用するPDFドキュメントもこちらのものを使っています。

想定シナリオ

8ファイル（合計179ページ）のLLM関連の論文に関するPDF形式のデータセットをもとに、ユーザーからのクエリ（自動プロンプトエンジニアリングとは何か？）への回答を作成するユースケースを通じて、非構造型データをもとに回答を生成する仕組みとコードの具体例を紹介します。

非構造のデータソース（PDF）からテキストを抽出する
Azure AI Document Intelligenceを活用し、PDFからテキストを抽出します。

クエリに関連する記述を検索する
Azure AI Searchで構成したインデックスを活用し、ユーザーからのクエリに関連する情報を取得します。

取得したリファレンスを使用してAzure OpenAIによる回答を生成する
取得した情報をLLMモデルにインプットとして与え、ユーザーからのクエリに対する回答を生成します。

非構造のデータソース（PDF）からテキストを抽出する

Azure AI Document Intelligenceを活用します。まずは前提条件に沿ってAzure Portalからリソースを作成し、Keyとエンドポイントを控えておきます。

以下のコードを使用してPDFデータをJSON形式に抽出して保存します。

extract_local_single_file: Azure AI Document Intelligenceの事前構築済みレイアウトモデルを使用してコンテンツを取得します。DocumentAnalysisClient Classを使っているので詳細なパラメータはドキュメントから確認ください。
extract_files: 抽出した情報をファイルごとにJSON形式で書き込み保存します
get_page_content: 抽出結果をページ番号とページコンテンツの形式で構造化します

# -- raw data
RAW_DATA_FOLDER= 'データを抽出したい対象のフォルダへのパス'
# -- extracted json file 
EXTRACTED_DATA_FOLDER = '抽出したデータを保存するフォルダへのパス'

def extract_local_single_file(file_name: str):
    not_completed = True
    while not_completed:
        with open(file_name, "rb") as f:
            poller = document_analysis_client.begin_analyze_document(
                "prebuilt-layout", document=f
            )
            not_completed=False
    result = poller.result()
    return get_page_content(file_name, result)

def extract_files( folder_name: str, destination_folder_name: str):
    os.makedirs(destination_folder_name, exist_ok=True)
    for file in os.listdir(folder_name):
        if file[-3:].upper() in ['PDF','JPG','PNG']:
            print('Processing file:', file, end='')
        
            page_content = extract_local_single_file(os.path.join(folder_name, file))
            output_file = os.path.join(destination_folder_name, file[:-3] +'json')
            print(f'  write output to {output_file}')
            with open(output_file, "w") as f:
                f.write(json.dumps(page_content))

def get_page_content(file_name:str, result):
    page_content = []
    for page in result.pages:
        all_lines_content = []
        for line_idx, line in enumerate(page.lines):
            all_lines_content.append(' '.join([word.content for word in line.get_words()]))
        page_content.append({'page_number':page.page_number, 
                                'page_content':' '.join(all_lines_content)})
    return {'filename':file_name, 'content':page_content}

extract_files(RAW_DATA_FOLDER, EXTRACTED_DATA_FOLDER)

クエリに関連する記述を検索する

検索にはAzure AI Searchを活用します。詳細は前回の記事もご確認ください。今回のテストにはストレージのサイズの問題でFree Tier(50MBまで)が活用できず、Basic以上が必要です。

検索のためインデックス（reseach-paper-index）を作成します。

document_id: ファイル名＋ページ番号
page_number: ページ番号
file_path: ファイルの保存場所
document_name: ファイル名
page_text: 前章で抽出したページコンテンツ

作成したインデックスに対応する形でドキュメントをアップロードします。

documents=[]
for file in os.listdir(EXTRACTED_DATA_FOLDER):
    with open(os.path.join(EXTRACTED_DATA_FOLDER, file)) as f:
        page_content= json.loads(f.read())
    documents.extend(
        [
            {
                'document_id':page_content['filename'].split('\\')[-1].split('.')[0] + '-' + str(page['page_number']),
                'document_name':page_content['filename'].split('\\')[-1],
                'file_path':page_content['filename'],              
                'page_number':page['page_number'],
                'page_text':page['page_content']
            }
            for page in page_content['content']
        ]
    )

search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
result = search_client.upload_documents(documents)  
print(f"Uploaded {len(documents)} documents")

インデックスが準備できたので、ユーザーのプロンプト"What is automated prompt engineering?"に対して検索してみます。関連するページが出てきました。フルテキスト検索で3番目にヒットしたものが人間の目で確認したときに一番参考になりそうですが、順番はともかくヒットしたことが確認できました。

取得したリファレンスを使用してAzure OpenAIによる回答を生成する

前節までで取得したリファレンスを使用して生成AIによる回答の生成を行います。具体的には以下のプロンプトを活用します。

ユーザークエリに対して検索結果のリストから一貫した回答を生成する
ユーザークエリ：”What is automated prompt engineering?（自動プロンプトエンジニアリングとは何か？）”
抽出されたページのリスト：前節で取得した検索結果（Top3）

prompt = f"""
Provided below are user query and list of extracted pages from research papers separated by triple backticks.
Your task is to extract key pieces of information from that list based on the user query and phrase that as a comprehensive answer. 

User Query: ```{query}```
List of Extracted Pages: ```{top_results['page_chunks'].to_list()}```

Answer:
"""

response = get_completion(prompt)
print(response)

リファレンスを与えなかった場合は、自然言語処理における自動プロンプトエンジニアリングだけでなくより広範な回答が生成されました。一方でリファレンスを与えた場合はリファレンスに記載の情報のみについて回答されました。RAGでリファレンスを与えて回答を生成すること、信頼性の高い回答の生成と回答の曖昧性を排除する効果が期待できます。

生成AIによる回答（リファレンスあり）

Automated prompt engineering refers to the process of automatically generating natural language instructions to steer large language models (LLMs) towards desired behaviors. This is done by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. The quality of the selected instruction is evaluated by evaluating the zero-shot performance of another LLM following the selected instruction. A number of guides on prompt engineering as well as tools to aid in prompt discovery have been developed. Some examples of these include Promptomania, MagicPrompt-Stable Diffusion, and the Automatic Prompt Engineer (APE) proposed in a research paper by Yongchao Zhou et al. APE was shown to outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. Another research paper by Taylor Shin et al. proposed AUTOPROMPT, an automated method to create prompts for a diverse set of tasks, based on a gradient-guided search. Using AUTOPROMPT, they showed that masked language models (MLMs) have an inherent capability to perform sentiment analysis and natural language inference without additional parameters or finetuning, sometimes achieving performance on par with recent state-of-the-art supervised models. They also showed that their prompts elicit more accurate factual knowledge from MLMs than the manually created prompts on the LAMA benchmark, and that MLMs can be used as relation extractors more effectively than supervised relation extraction models.

生成AIによる回答（リファレンスなし）

Automated prompt engineering is the process of designing and implementing automated prompts or messages that are delivered to users through various channels such as email, SMS, or chatbots. These prompts are designed to provide users with relevant information, reminders, or calls to action based on their behavior or preferences. The engineering aspect involves using machine learning algorithms and natural language processing techniques to create personalized and effective prompts that can improve user engagement and satisfaction.

まとめ

PDFなど非構造のデータソースに対しても生成AIの回答を生成する方法をまとめました。企業のポリシーや論文など非構造なドキュメント形式で保存されているデータをソースとして活用したいケースは多いと思いますので、ぜひ実際に試してみてください。ご不明な点やもっと効率の良い方法などあればぜひコメントなどで教えてください。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up