Microsoft Azure TechAdvent Calendar 2024

Splitting PDFs by Section and Registering Them in an Index with Azure AI Search

Last updated at 2024-11-21 / Posted at 2024-11-20

Introduction

At Microsoft Ignite 2024, the Document Layout skill for Azure AI Search was released in preview. The Document Layout skill uses the Document Intelligence layout model to convert unstructured data such as PDFs into Markdown text.

In Markdown, chapter and section titles are written as section headers that begin with #.

Markdown example

# Section 1
Content for section 1.

## Subsection 1.1
Content for subsection 1.1.

# Section 2
Content for section 2.

With the Document Layout skill, a PDF can be converted to Markdown and then split section by section, as shown below.

{
  "markdown_document": [
    {
      "content": "Content for section 1.\r\n",
      "sections": {
        "h1": "Section 1",
        "h2": ""
      },
      "ordinal_position": 1
    },
    {
      "content": "Content for subsection 1.1.\r\n",
      "sections": {
        "h1": "Section 1",
        "h2": "Subsection 1.1"
      },
      "ordinal_position": 2
    },
    {
      "content": "Content for section 2.\r\n",
      "sections": {
        "h1": "Section 2",
        "h2": ""
      },
      "ordinal_position": 3
    }
  ]
}
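To make the shape of this output concrete, here is a minimal Python sketch that splits a Markdown string by its headers and produces the same structure. It is only an illustration of the idea, not how the Document Layout skill is actually implemented; the function name `split_markdown_by_header` and the `depth` parameter are assumptions introduced for this example.

```python
import re

# Matches ATX-style Markdown headers such as "# Title" or "## Subtitle".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_header(markdown: str, depth: int = 2) -> list[dict]:
    """Split Markdown into sections labeled with their h1..h<depth> headers.

    Hypothetical helper: it mimics the output shape shown above, nothing more.
    """
    current = {f"h{i}": "" for i in range(1, depth + 1)}
    sections: list[dict] = []
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered body text (if any) under the current headers.
        content = "\n".join(buffer).strip()
        if content:
            sections.append({
                "content": content,
                "sections": dict(current),
                "ordinal_position": len(sections) + 1,
            })
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= depth:
            flush()
            level = len(match.group(1))
            current[f"h{level}"] = match.group(2).strip()
            # A new header resets every deeper header level.
            for deeper in range(level + 1, depth + 1):
                current[f"h{deeper}"] = ""
        else:
            # Body text, or a header deeper than `depth`, stays in the content.
            buffer.append(line)
    flush()
    return sections
```

Feeding the Markdown example above into this function yields three sections with `ordinal_position` 1 to 3, where the second one carries both "Section 1" and "Subsection 1.1" as its headers.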

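With conventional splitting by character count, a section title could end up separated from its body text and fail to match a search. Chunks that are cut off mid-sentence also affect RAG accuracy to some degree. The layout skill, by contrast, handles text in semantically coherent units, so it is likely to improve RAG accuracy.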

Setup procedure

The setup procedure, including the Document Layout skill, is described in the official documentation.

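It would be very convenient to click through the setup in the Azure Portal's "Import and vectorize data" wizard, but that is not possible as of November 2024, so the index has to be built manually.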

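(Added 2024/11/21) In the "East US" region of AI Search, the document layout model can now be selected from "Import and vectorize data"! This post uses the "Japan East" region, where it cannot be selected yet, but it looks like it will become available in the future.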

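To build the index manually, you write the "index", "data source", "skillset", and "indexer" definitions as JSON and register them through the Azure Portal, the REST API, or a similar route. This post uses the VS Code extension REST Client; adapt the variables to your own environment as needed.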

Environment variables

## Azure AI Search endpoint
@endpoint=https://<resource_name>.search.windows.net
## Index name
@index_name=test-index-1
## Skillset name
@skillset_name=test-skillset-1
## Data source name
@datasource_name=test-datasource-1
## Indexer name
@indexer_name=test-indexer-1
## Azure AI Search API key
@admin_key=<admin_key>
## Azure AI Search API version
@api_version=2024-11-01-preview
## Azure OpenAI resource URI
@aoai_resourceUri=https://<resource_name>.openai.azure.com
## Azure OpenAI API key
@aoai_apiKey=<aoai_apiKey>
## Deployment name of the embedding model
@aoai_embedding_deploymentId=text-embedding-ada-002
## Model name of the embedding model
@aoai_embedding_modelName=text-embedding-ada-002
## Blob Storage connection string
@blob_connectionString=<blob_connectionString>
## Blob Storage container name
@blob_container_name=<blob_container_name>

For the API version, specify 2024-11-01-preview, the latest at the time of writing.
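For readers who are not using REST Client, the following Python sketch shows how these variables combine into a request: the endpoint and path form the URL, the API version goes into the `api-version` query parameter, and the admin key goes into the `api-key` header. The helper name `build_request` is an assumption made for this example, and the function only builds the request object without sending it.

```python
import json
import urllib.request


def build_request(endpoint, path, admin_key, api_version, method="GET", body=None):
    """Build (but do not send) an Azure AI Search REST request.

    Hypothetical helper mirroring the REST Client variables above.
    """
    url = f"{endpoint.rstrip('/')}/{path.lstrip('/')}?api-version={api_version}"
    # Serialize the JSON body only when one is given (e.g. for POST).
    data = None if body is None else json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json", "api-key": admin_key},
    )


# Placeholder values; replace them with your own resource name and key.
request = build_request(
    endpoint="https://<resource_name>.search.windows.net",
    path="/indexes",
    admin_key="<admin_key>",
    api_version="2024-11-01-preview",
)
print(request.get_method(), request.full_url)
```

To actually send the request, pass it to `urllib.request.urlopen(request)` once real values are filled in.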

Creating the index

The index is created as follows.

### Create index
POST {{endpoint}}/indexes?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

{
  "name": "{{index_name}}",
  "fields": [
    {
      "name": "chunk_id",
      "type": "Edm.String",
      "key": true,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": true,
      "facetable": false,
      "analyzer": "keyword",
      "synonymMaps": []
    },
    {
      "name": "parent_id",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": false,
      "filterable": true,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "chunk",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "analyzer": "ja.lucene",
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "title",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "analyzer": "ja.lucene",
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "text_vector",
      "type": "Collection(Edm.Single)",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": [],
      "dimensions": 1536,
      "vectorSearchProfile": "azureOpenAi-text-profile"
    },
    {
      "name": "header_1",
      "type": "Edm.String",
      "searchable": true,
      "analyzer": "ja.lucene",
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false
    },
    {
      "name": "header_2",
      "type": "Edm.String",
      "searchable": true,
      "analyzer": "ja.lucene",
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false
    },
    {
      "name": "header_3",
      "type": "Edm.String",
      "searchable": true,
      "analyzer": "ja.lucene",
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false
    }
  ],
  "scoringProfiles": [],
  "suggesters": [],
  "analyzers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "normalizers": [],
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
  },
  "semantic": {
    "defaultConfiguration": "semantic-configuration",
    "configurations": [
      {
        "name": "semantic-configuration",
        "prioritizedFields": {
          "titleField": {
            "fieldName": "title"
          },
          "prioritizedContentFields": [
            {
              "fieldName": "chunk"
            }
          ],
          "prioritizedKeywordsFields": []
        }
      }
    ]
  },
  "vectorSearch": {
    "algorithms": [
      {
        "name": "vector-algorithm",
        "kind": "hnsw",
        "hnswParameters": {
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500,
          "metric": "cosine"
        }
      }
    ],
    "profiles": [
      {
        "name": "azureOpenAi-text-profile",
        "algorithm": "vector-algorithm",
        "vectorizer": "azureOpenAi-text-vectorizer"
      }
    ],
    "vectorizers": [
      {
        "name": "azureOpenAi-text-vectorizer",
        "kind": "azureOpenAI",
        "azureOpenAIParameters": {
          "resourceUri": "{{aoai_resourceUri}}",
          "deploymentId": "{{aoai_embedding_deploymentId}}",
          "apiKey": "{{aoai_apiKey}}",
          "modelName": "{{aoai_embedding_modelName}}"
        }
      }
    ],
    "compressions": []
  }
}

### Delete index
DELETE {{endpoint}}/indexes/{{index_name}}?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

The fields header_1, header_2, and header_3 are added to store the section titles.
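Because the three header fields differ only in their name, they can be generated rather than written by hand. The sketch below builds the `header_N` field definitions together with the matching `indexProjections` mappings used in the skillset of this post, so the two stay in sync when the header depth changes. The function names `header_fields` and `header_mappings` are assumptions for this example; the attribute values are copied from the index definition above.

```python
import json


def header_fields(depth: int = 3, analyzer: str = "ja.lucene") -> list[dict]:
    """Return index field definitions header_1..header_<depth>."""
    return [
        {
            "name": f"header_{level}",
            "type": "Edm.String",
            "searchable": True,
            "analyzer": analyzer,
            "filterable": False,
            "retrievable": True,
            "stored": True,
            "sortable": False,
            "facetable": False,
            "key": False,
        }
        for level in range(1, depth + 1)
    ]


def header_mappings(depth: int = 3) -> list[dict]:
    """Return the projection mappings that fill header_1..header_<depth>."""
    return [
        {
            "name": f"header_{level}",
            "source": f"/document/markdownDocument/*/sections/h{level}",
        }
        for level in range(1, depth + 1)
    ]


print(json.dumps(header_fields(depth=1), indent=2))
```

The `depth` used here should match the `markdownHeaderDepth` set on the Document Layout skill ("h3" in this post).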

Data source

The data source is created as follows.

POST {{endpoint}}/datasources?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

{
  "name": "{{datasource_name}}",
  "description": null,
  "type": "azureblob",
  "subtype": null,
  "credentials": {
    "connectionString": "{{blob_connectionString}};"
  },
  "container": {
    "name": "{{blob_container_name}}",
    "query": null
  },
  "dataChangeDetectionPolicy": null,
  "dataDeletionDetectionPolicy": null,
  "encryptionKey": null,
  "identity": null
}

### Delete data source
DELETE {{endpoint}}/datasources/{{datasource_name}}?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

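For connectionString, enter the connection information for the blob. There are several options for authenticating from AI Search to Blob Storage, such as a storage account SAS connection string or a managed identity. The connection string format differs for each option, so check the official documentation for details.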

For blob_container_name, enter the name of the container that holds the documents to register in AI Search.

Skillset

The skillset is created as follows.

POST {{endpoint}}/skillsets?api-version={{api_version}}
content-type: application/json
api-key: {{admin_key}}

{
  "name": "{{skillset_name}}",
  "description": "Skillset to chunk documents and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
      "name": "my_document_intelligence_layout_skill",
      "context": "/document",
      "outputMode": "oneToMany",
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/file_data"
        }
      ],
      "outputs": [
        {
          "name": "markdown_document",
          "targetName": "markdownDocument"
        }
      ],
      "markdownHeaderDepth": "h3"
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#1",
      "description": "Split skill to chunk documents",
      "context": "/document/markdownDocument/*",
      "inputs": [
        {
          "name": "text",
          "source": "/document/markdownDocument/*/content",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "defaultLanguageCode": "ja",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "unit": "characters"
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "#2",
      "context": "/document/markdownDocument/*/pages/*",
      "inputs": [
        {
          "name": "text",
          "source": "/document/markdownDocument/*/pages/*",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "text_vector"
        }
      ],
      "resourceUri": "{{aoai_resourceUri}}",
      "deploymentId": "{{aoai_embedding_deploymentId}}",
      "apiKey": "{{aoai_apiKey}}",
      "modelName": "{{aoai_embedding_modelName}}",
      "dimensions": 1536
    }
  ],
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "{{index_name}}",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/markdownDocument/*/pages/*",
        "mappings": [
          {
            "name": "text_vector",
            "source": "/document/markdownDocument/*/pages/*/text_vector",
            "inputs": []
          },
          {
            "name": "chunk",
            "source": "/document/markdownDocument/*/pages/*",
            "inputs": []
          },
          {
            "name": "title",
            "source": "/document/title",
            "inputs": []
          },
          {
            "name": "header_1",
            "source": "/document/markdownDocument/*/sections/h1"
          },
          {
            "name": "header_2",
            "source": "/document/markdownDocument/*/sections/h2"
          },
          {
            "name": "header_3",
            "source": "/document/markdownDocument/*/sections/h3"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}

### Delete skillset
DELETE {{endpoint}}/skillsets/{{skillset_name}}?api-version={{api_version}}
content-type: application/json
api-key: {{admin_key}}

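The Document Layout skill #Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill converts the file (PDF, etc.) to Markdown, converts the Markdown to a [header + content] format, and outputs the result as markdown_document (stored under the target name markdownDocument).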

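The text split skill #Microsoft.Skills.Text.SplitSkill takes the per-section text at /document/markdownDocument/*/content, splits it further by character count, and outputs the result to pages. The maximum length is 2,000 characters and the overlap is 500 characters.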
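The effect of the 2,000-character limit and the 500-character overlap can be pictured with a simple sliding window. The Python sketch below is a simplified approximation, not the Split skill's actual algorithm (the real skill, as far as I understand, also tries to respect sentence boundaries, so its chunk edges will differ). The function name `split_with_overlap` is an assumption for this example.

```python
def split_with_overlap(text: str, max_length: int = 2000, overlap: int = 500) -> list[str]:
    """Split text into character windows where neighbors share `overlap` chars.

    Simplified illustration only; real SplitSkill output will differ.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be in [0, max_length)")

    step = max_length - overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_length])
        if start + max_length >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With these defaults, a 4,500-character section becomes three chunks starting at offsets 0, 1,500, and 3,000, so each chunk repeats the last 500 characters of the previous one.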

The AOAI Embedding skill #Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill vectorizes each chunk at /document/markdownDocument/*/pages/*, which has been split by section and then by character count, using the Azure OpenAI embedding model, and stores the result in text_vector.

indexProjections maps the stored text and vectors to the index fields.
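Conceptually, the projection walks every page of every section (the `sourceContext` `/document/markdownDocument/*/pages/*`) and emits one index document per page, copying the section headers and the parent's title onto each one. The Python sketch below imitates that flattening on a simplified in-memory tree. The dictionary layout (each page as a dict with `text` and `text_vector`) and the `chunk_id` format are assumptions made for illustration; the real enrichment tree and the keys generated by AI Search look different.

```python
def project_chunks(document: dict, parent_id: str) -> list[dict]:
    """Flatten a simplified enrichment tree into one row per page (chunk)."""
    rows: list[dict] = []
    for s_idx, section in enumerate(document.get("markdownDocument", [])):
        headers = section.get("sections", {})
        for p_idx, page in enumerate(section.get("pages", [])):
            rows.append({
                # Hypothetical key format; AI Search generates its own keys.
                "chunk_id": f"{parent_id}_s{s_idx}_p{p_idx}",
                "parent_id": parent_id,
                "chunk": page["text"],
                "text_vector": page["text_vector"],
                "title": document.get("title", ""),
                "header_1": headers.get("h1", ""),
                "header_2": headers.get("h2", ""),
                "header_3": headers.get("h3", ""),
            })
    return rows
```

Every chunk that comes from the same section ends up with identical `header_1` to `header_3` values, which is what makes the section titles searchable alongside the chunk text.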

Indexer

The indexer is created as follows.

POST {{endpoint}}/indexers?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

{
  "name": "{{indexer_name}}",
  "description": null,
  "dataSourceName": "{{datasource_name}}",
  "skillsetName": "{{skillset_name}}",
  "targetIndexName": "{{index_name}}",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": null,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "allowSkillsetToReadFileData": true
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "title",
      "mappingFunction": null
    }
  ],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}

### Delete indexer
DELETE {{endpoint}}/indexers/{{indexer_name}}?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

Setting the allowSkillsetToReadFileData field to true lets the indexer pass file data to the skillset. If it is false, the Document Layout skill cannot be used.
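Since this flag is easy to forget, a small pre-flight check on the JSON definitions can catch the mismatch before anything is registered. The following Python sketch inspects a skillset definition and an indexer definition (as dictionaries) and reports a problem when the layout skill is present but the flag is not true. The function name `check_file_data_access` and the message text are assumptions for this example.

```python
LAYOUT_SKILL = "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill"


def check_file_data_access(skillset: dict, indexer: dict) -> list[str]:
    """Return a list of problems; an empty list means the pair looks consistent."""
    problems: list[str] = []
    uses_layout = any(
        skill.get("@odata.type") == LAYOUT_SKILL
        for skill in skillset.get("skills", [])
    )
    # "parameters" and "configuration" may be missing or null in the JSON.
    config = (indexer.get("parameters") or {}).get("configuration") or {}
    if uses_layout and config.get("allowSkillsetToReadFileData") is not True:
        problems.append(
            "The skillset uses the Document Layout skill, but "
            "allowSkillsetToReadFileData is not true in the indexer."
        )
    return problems
```

Running it against the skillset and indexer definitions in this post should return an empty list.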

Register the index, data source, skillset, and indexer in AI Search and run the indexer, and the index is built in AI Search.
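The order of the calls can be summarized in a short Python sketch. The indexer refers to the data source, skillset, and index by name, so it is registered last. The last two entries, the indexer `run` and `status` endpoints, belong to the Azure AI Search REST API but do not appear in the snippets of this post, so treat them as an addition; the function name `registration_plan` is likewise an assumption. The function only lists the requests and does not send anything.

```python
def registration_plan(indexer_name: str, api_version: str) -> list[tuple[str, str]]:
    """Return (method, path) pairs in the order they would be called."""
    query = f"?api-version={api_version}"
    return [
        ("POST", f"/indexes{query}"),       # index definition
        ("POST", f"/datasources{query}"),   # data source definition
        ("POST", f"/skillsets{query}"),     # skillset definition
        ("POST", f"/indexers{query}"),      # indexer, which references the three above
        ("POST", f"/indexers/{indexer_name}/run{query}"),    # run on demand
        ("GET", f"/indexers/{indexer_name}/status{query}"),  # check progress and errors
    ]


for method, path in registration_plan("test-indexer-1", "2024-11-01-preview"):
    print(method, path)
```

As far as I know, creating an indexer without a schedule already triggers a first run, so the explicit `run` call is mainly useful for re-running after a change.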

Checking the results

As sample data, I used a PDF of the official Azure OpenAI page (its content is a little dated).

I ran a search with an arbitrary query from the index's search box.
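The same check can be done outside the portal with the Search Documents REST call. The Python sketch below builds the path and JSON body for such a query; it selects the title, the three header fields, and the chunk, and leaves out text_vector to keep the response small. The function names and the default `top` value are assumptions for this example, and nothing is sent over the network.

```python
import json


def search_path(index_name: str, api_version: str) -> str:
    """Path of the Search Documents (POST) endpoint for the given index."""
    return f"/indexes/{index_name}/docs/search?api-version={api_version}"


def build_search_body(query: str, top: int = 3) -> dict:
    """Body for a simple keyword search that returns the section headers."""
    return {
        "search": query,
        # Comma-separated list of fields to return; the vector is omitted.
        "select": "title,header_1,header_2,header_3,chunk",
        "top": top,
    }


print(search_path("test-index-1", "2024-11-01-preview"))
print(json.dumps(build_search_body("Azure OpenAI"), ensure_ascii=False))
```

Send the body with `POST` to the endpoint plus this path, using the same `api-key` header as the other requests.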

The header fields contain the section titles, as expected. The splitting also worked well: the first paragraph was extracted cleanly.

Azure OpenAI Serviceでは、GPT-4、GPT-4 Turbo with Vision、GPT-3.5-Turbo、埋め込\r\nみモデル シリーズなど OpenAl の強力な言語モデルに、REST API でのアクセスを提供\r\nします。また、新しい GPT-4 と GPT-3.5-Turbo モデルシリーズは一般提供になりまし\r\nた。これらのモデルは、特定のタスクに合わせて簡単に調整できます。たとえば、コ\r\nンテンツの生成、要約、画像の解釈、セマンティック検索、自然言語からコードへの\r\n翻訳などです。ユーザーは、REST API、Python SDK、またはAzure OpenAI Studioの\r\nWeb ベースのインターフェイスを介してサービスにアクセスできます。

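Tables were also converted into table form within the Markdown output (an HTML <table> in the sample below), and the text was split cleanly at the end of the table.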

<table>\r\n<tr>\r\n<th>機能</th>\r\n<th>Azure OpenAI</th>\r\n</tr>\r\n<tr>\r\n<td rowspan=\"4\">使用できるモデル</td>\r\n<td>GPT-4 シリーズ (GPT-4 Turbo with Vision を含む)</td>\r\n</tr>\r\n<tr>\r\n<td>GPT-3.5-Turbo シリーズ</td>\r\n</tr>\r\n<tr>\r\n<td>埋め込みシリーズ</td>\r\n</tr>\r\n<tr>\r\n<td>詳細については、モデルに関するページを参照してください。</td>\r\n</tr>\r\n<tr>\r\n<td rowspan=\"3\">微調整(プレビュー)</td>\r\n<td>GPT-3.5-Turbo (0613)</td>\r\n</tr>\r\n<tr>\r\n<td>babbage-002</td>\r\n</tr>\r\n<tr>\r\n<td>davinci-002</td>\r\n</tr>\r\n<tr>\r\n<td>Price</td>\r\n<td>こちらで入手可能 GPT-4 Turbo with Vision について詳しくは、特別価格情報を参照 してください。</td>\r\n</tr>\r\n<tr>\r\n<td>仮想ネットワークのサポー トとプライベート リンクの サポート</td>\r\n<td>はい(独自のデータに基づくAzure OpenAIを使用しない限り)。</td>\r\n</tr>\r\n<tr>\r\n<td>マネージド ID</td>\r\n<td>はい。Microsoft Entra ID を使用</td>\r\n</tr>\r\n<tr>\r\n<td>UI エクスペリエンス</td>\r\n<td>Azure portal (アカウントとリソースの管理)、 モデルの探索と微調整にはAzure OpenAI Service Studio</td>\r\n</tr>\r\n<tr>\r\n<td>FPGA の リ ー ジ ョ ン 別 の 提 供 状況</td>\r\n<td>モ デ ル の 可 用 性</td>\r\n</tr>\r\n<tr>\r\n<td>コンテンツのフィルター処 理 ☐</td>\r\n<td>プロンプトと入力候補は、自動システムを使ってコンテンツ ポリ シーに対して評価されます。重大度の高いコンテンツはフィルタ ーで除外されます。</td>\r\n</tr>\r\n</table>\r\n\r\n\r\n<!-- PageBreak -->

That said, the detection of section titles occasionally looked off, so I am hoping for accuracy improvements in the future.

Conclusion

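This post showed how to split documents by section and register them in an index. As of November 2024 the setup takes some effort, but once it is built, the indexer's scheduling feature can keep the index updated automatically. It is great to see RAG implementation getting more and more convenient. This feature is unique to AI Search, so please give it a try.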
