Building a Multimodal RAG with Azure AI Search (Part 1)

Last updated at 2024-06-14 / Posted at 2024-06-13

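This article is aimed at readers who have not built much RAG before, who are not systems engineers but work in presales and want to learn a bit of the technology, and who are about to start using Azure. (I am exactly that kind of author.)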

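Introduction

With GPT-4o attracting so much attention, multimodal RAG looks set to draw growing interest. There are many situations in which you might build a multimodal RAG; in this article I assume the scenario below and build toward it.

[Assumed scenario]
Company A writes its own manual for new employees in Word, converts it to PDF (explanatory text plus the images that accompany it), then prints and distributes it.
Even so, the HR department receives many questions about the manual and struggles to keep up with them.

[Proposed solution]
Use the PDF as the data source for RAG: index the explanatory text and the images in the PDF with Azure AI Search, and finally generate answers with GPT-4o.
Note: this time I use Azure resources wherever possible.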

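The road to a solution

I first listed the steps needed to get there.
Roughly, I think it can be implemented in the following steps.

◆ Create the index (AI Search)
1. Extract images from the PDF
2. Vectorize the images
3. Link the documents and the images

◆ Connect GPT-4o to AI Search so that answer generation can draw on the extended knowledge

This article focuses on the index creation above.
As sample data, I downloaded the Wikipedia pages on Japan's World Heritage Sites as PDFs, stored them in Blob Storage, and used that data for the analysis.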

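Let's implement it!

① Extract images from the PDF + ② Vectorize the images

There are several possible approaches; a few are listed below.

Use the built-in AI Search feature to write the images out to Blob Storage
Use Azure Document Intelligence
Use OSS

→ This time I adopted the first one, the AI Search approach.

There are various ways to use AI Search; below I describe a beginner-friendly one. First, you need to understand how AI Search works: an indexer (plus a skillset) runs against a data source and creates an index. In other words, to get AI Search working you need at minimum ① a data source, ② an indexer definition (plus a skillset definition), and ③ an index definition.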
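To make the relationship between these pieces concrete, here is a minimal sketch that only builds the REST calls you would send if you created them by hand instead of through the portal. It does not call the service. The service name my-search-service is hypothetical, and the API version 2024-05-01-preview is my assumption for a preview version that accepts the skillset used later in this article. The indexer comes last because it references the other three by name.

```python
from typing import List, Tuple

# Assumption: a preview API version; adjust to whatever your service supports.
API_VERSION = "2024-05-01-preview"


def build_create_requests(
    service: str,
    data_source: str,
    index: str,
    skillset: str,
    indexer: str,
    api_version: str = API_VERSION,
) -> List[Tuple[str, str]]:
    """Return (method, url) pairs in an order that satisfies the references."""
    base = f"https://{service}.search.windows.net"
    ordered = [
        ("datasources", data_source),  # where the PDFs live (Blob Storage)
        ("indexes", index),            # the schema the indexer writes into
        ("skillsets", skillset),       # enrichment steps (OCR, vectorize, ...)
        ("indexers", indexer),         # ties the three resources above together
    ]
    return [
        ("PUT", f"{base}/{collection}/{name}?api-version={api_version}")
        for collection, name in ordered
    ]


# "my-search-service" is a hypothetical service name; the other names
# are the ones used in the JSON definitions in this article.
requests_to_send = build_create_requests(
    "my-search-service",
    "pdfs-33",
    "azureblob-index",
    "azureblob-skillset",
    "azureblob-indexer2",
)
for method, url in requests_to_send:
    print(method, url)
```

Each request would carry the matching JSON definition as its body and an api-key header; both are left out here to keep the sketch short.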

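The easiest way is to select the "Import data" button on the AI Search Overview page and click through the wizard. As a beginner, I found it easiest to then fine-tune the result with the JSON definition examples below as a reference, so that is what I recommend!

Index definition example

Below is an example of configuring the index in JSON.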

{
  "name": "azureblob-index",
  "defaultScoringProfile": "",
  "fields": [
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_content_type",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_size",
      "type": "Edm.Int64",
      "searchable": false,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_last_modified",
      "type": "Edm.DateTimeOffset",
      "searchable": false,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_content_md5",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_name",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_path",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": true,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "keyword",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata_storage_file_extension",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata_content_type",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata_language",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "metadata_creation_date",
      "type": "Edm.DateTimeOffset",
      "searchable": false,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "merged_content",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "text",
      "type": "Collection(Edm.String)",
      "searchable": true,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "layoutText",
      "type": "Collection(Edm.String)",
      "searchable": true,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "imageTags",
      "type": "Collection(Edm.String)",
      "searchable": true,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "imageCaption",
      "type": "Collection(Edm.String)",
      "searchable": true,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "content_vector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": 1024,
      "vectorSearchProfile": "vector-profile-1717031400733",
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "parentkey2",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "pages",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": true,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    },
    {
      "name": "original_path",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "vectorEncoding": null,
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "corsOptions": {
    "allowedOrigins": [
      "*"
    ],
    "maxAgeInSeconds": 300
  },
  "suggesters": [],
  "analyzers": [],
  "normalizers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "encryptionKey": null,
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
    "k1": null,
    "b": null
  },
  "semantic": null,
  "vectorSearch": {
    "algorithms": [
      {
        "name": "vector-config-1717031401775",
        "kind": "hnsw",
        "hnswParameters": {
          "metric": "cosine",
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500
        },
        "exhaustiveKnnParameters": null
      }
    ],
    "profiles": [
      {
        "name": "vector-profile-1717031400733",
        "algorithm": "vector-config-1717031401775",
        "vectorizer": "vectorizer-1717031407637",
        "compression": null
      }
    ],
    "vectorizers": [
      {
        "name": "vectorizer-1717031407637",
        "kind": "azureOpenAI",
        "azureOpenAIParameters": {
          "resourceUri": "[Enter your resource url]",
          "deploymentId": "text-embedding-3-large",
          "apiKey": "<redacted>",
          "modelName": "text-embedding-3-large",
          "authIdentity": null
        },
        "customWebApiParameters": null,
        "aiServicesVisionParameters": null,
        "amlParameters": null
      }
    ],
    "compressions": []
  }
}

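Each field of the index is registered under fields.
The last part (vectorSearch) selects the algorithm used for vector search and the embedding model used for vectorization. It is needed because fields includes content_vector, which holds the vectors generated for the images extracted from the PDFs.

Later on I use index projections, a feature that can create parent and child index documents. For that, the key field must be configured to use keyword as its analyzer, and you also need a field that stores the key of the parent search document.
Reference: https://learn.microsoft.com/ja-jp/azure/search/index-projections-concept-intro?tabs=kstore-rest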
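The two requirements above are easy to miss in a long JSON definition, so here is a small sketch that checks them on an index definition loaded as a Python dict. The function check_index_for_projections is a hypothetical helper of my own, not part of any SDK. It checks only what this article relies on: the key field uses the keyword analyzer, the parent key field exists, and every vectorSearchProfile a field refers to is actually defined.

```python
from typing import Dict, List


def check_index_for_projections(index: Dict, parent_key_field: str) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: List[str] = []
    fields = {f["name"]: f for f in index.get("fields", [])}

    # The key field must use the keyword analyzer for index projections.
    key_fields = [f for f in fields.values() if f.get("key")]
    if len(key_fields) != 1:
        problems.append("the index needs exactly one key field")
    elif key_fields[0].get("analyzer") != "keyword":
        problems.append(
            f"key field '{key_fields[0]['name']}' should use the keyword analyzer"
        )

    # A separate field must exist to hold the parent document's key.
    parent = fields.get(parent_key_field)
    if parent is None:
        problems.append(f"parent key field '{parent_key_field}' is missing")
    elif parent.get("key"):
        problems.append("the parent key field must not be the key field itself")

    # Every profile referenced by a field must be defined under vectorSearch.
    vector_search = index.get("vectorSearch") or {}
    profiles = {p["name"] for p in vector_search.get("profiles", [])}
    for field in fields.values():
        profile = field.get("vectorSearchProfile")
        if profile and profile not in profiles:
            problems.append(
                f"field '{field['name']}' uses undefined profile '{profile}'"
            )
    return problems


# A trimmed-down version of the index definition shown above.
sample_index = {
    "fields": [
        {"name": "metadata_storage_path", "type": "Edm.String",
         "key": True, "analyzer": "keyword"},
        {"name": "parentkey2", "type": "Edm.String", "key": False},
        {"name": "content_vector", "type": "Collection(Edm.Single)",
         "vectorSearchProfile": "vector-profile-1717031400733"},
    ],
    "vectorSearch": {"profiles": [{"name": "vector-profile-1717031400733"}]},
}
print(check_index_for_projections(sample_index, "parentkey2"))  # prints []
```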

Indexer definition example

{
  "name": "azureblob-indexer2",
  "description": "",
  "dataSourceName": "pdfs-33",
  "skillsetName": "azureblob-skillset",
  "targetIndexName": "azureblob-index",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": true,
    "configuration": {
      "imageAction": "generateNormalizedImages",
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "allowSkillsetToReadFileData": true
    }
  },
  "fieldMappings": [],
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/merged_content",
      "targetFieldName": "merged_content"
    },
    {
      "sourceFieldName": "/document/normalized_images/*/text",
      "targetFieldName": "text"
    },
    {
      "sourceFieldName": "/document/normalized_images/*/layoutText",
      "targetFieldName": "layoutText"
    },
    {
      "sourceFieldName": "/document/normalized_images/*/imageTags/*/name",
      "targetFieldName": "imageTags"
    },
    {
      "sourceFieldName": "/document/normalized_images/*/imageCaption",
      "targetFieldName": "imageCaption"
    }
  ],
  "cache": null,
  "encryptionKey": null
}

The crux of the indexer definition is setting "imageAction" to "generateNormalizedImages". With this setting, images can be extracted from each PDF.
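As a quick sanity check before running the indexer, the following sketch verifies the two things that tripped me up most easily: that image extraction is switched on, and that every outputFieldMappings target actually exists in the index. The helper check_indexer is hypothetical (my own naming), and it works on the indexer definition loaded as a Python dict.

```python
from typing import Dict, Iterable, List


def check_indexer(indexer: Dict, index_field_names: Iterable[str]) -> List[str]:
    """Return a list of problems found in an indexer definition."""
    problems: List[str] = []

    # "parameters" and "configuration" may be null in exported JSON.
    parameters = indexer.get("parameters") or {}
    configuration = parameters.get("configuration") or {}
    if configuration.get("imageAction", "none") == "none":
        problems.append("imageAction is not set, so no images will be extracted")

    # Each output field mapping must point at a field defined in the index.
    known = set(index_field_names)
    for mapping in indexer.get("outputFieldMappings") or []:
        target = mapping["targetFieldName"]
        if target not in known:
            problems.append(f"target field '{target}' is not in the index")
    return problems


# A trimmed-down version of the indexer definition shown above.
sample_indexer = {
    "parameters": {
        "configuration": {"imageAction": "generateNormalizedImages"}
    },
    "outputFieldMappings": [
        {"sourceFieldName": "/document/merged_content",
         "targetFieldName": "merged_content"},
        {"sourceFieldName": "/document/normalized_images/*/text",
         "targetFieldName": "text"},
    ],
}
print(check_indexer(sample_indexer, ["merged_content", "text"]))  # prints []
```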

Skillset definition example

{
  "name": "azureblob-skillset",
  "description": "Skillset created from the portal. skillsetName: azureblob-skillset; contentField: merged_content; enrichmentGranularity: document; knowledgeStoreStorageAccount: sorageaicd2nd;",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill",
      "name": "#0",
      "description": null,
      "context": "/document/normalized_images/*",
      "modelVersion": "2023-04-15",
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*"
        }
      ],
      "outputs": [
        {
          "name": "vector",
          "targetName": "vector"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "name": "#1",
      "description": null,
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        },
        {
          "name": "itemsToInsert",
          "source": "/document/normalized_images/*/text"
        },
        {
          "name": "offsets",
          "source": "/document/normalized_images/*/contentOffset"
        }
      ],
      "outputs": [
        {
          "name": "mergedText",
          "targetName": "merged_content"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "name": "#2",
      "description": null,
      "context": "/document/normalized_images/*",
      "textExtractionAlgorithm": null,
      "lineEnding": "Space",
      "defaultLanguageCode": "ja",
      "detectOrientation": true,
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*"
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "text"
        },
        {
          "name": "layoutText",
          "targetName": "layoutText"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Vision.ImageAnalysisSkill",
      "name": "#3",
      "description": null,
      "context": "/document/normalized_images/*",
      "defaultLanguageCode": "ja",
      "visualFeatures": [
        "tags",
        "description"
      ],
      "details": [],
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*"
        }
      ],
      "outputs": [
        {
          "name": "tags",
          "targetName": "imageTags"
        },
        {
          "name": "description",
          "targetName": "imageCaption"
        }
      ]
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
    "description": null,
    "key": null
  },
  "knowledgeStore": {
    "storageConnectionString": null,
    "identity": null,
    "projections": [
      {
        "tables": [],
        "objects": [],
        "files": [
          {
            "storageContainer": "azureblob-skillset-image-projection",
            "referenceKeyName": null,
            "generatedKeyName": "imagepath",
            "source": "/document/normalized_images/*",
            "sourceContext": null,
            "inputs": []
          }
        ]
      }
    ],
    "parameters": {
      "synthesizeGeneratedKeyName": true
    }
  },
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "azureblob-index",
        "parentKeyFieldName": "parentkey2",
        "sourceContext": "/document/normalized_images/*",
        "mappings": [
          {
            "name": "content_vector",
            "source": "/document/normalized_images/*/vector",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "pages",
            "source": "/document/normalized_images/*/pageNumber",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "metadata_storage_name",
            "source": "/document/metadata_storage_name",
            "sourceContext": null,
            "inputs": []
          }
        ]
      }
    ],
    "parameters": {}
  },
  "encryptionKey": null
}

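The skillset is the heart of this implementation!
In the indexProjections section, for each image extracted from the PDF (the images are stored under sourceContext: "/document/normalized_images/*"), the mappings pick up the image's vector, the page number of the image in the source PDF, and the name of the source PDF.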
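To picture what these mappings produce, here is a plain-Python simulation of the idea. It is not the service's implementation: the function project_children and the child key format are made up for illustration, and the enriched document is a hand-written stand-in. The point is that one PDF fans out into one child document per image, and a child only receives the fields that are mapped.

```python
from typing import Dict, List


def project_children(parent_key: str, enriched_doc: Dict,
                     parent_key_field: str = "parentkey2") -> List[Dict]:
    """Simulate one indexProjections selector over normalized_images/*."""
    children = []
    for i, image in enumerate(enriched_doc["normalized_images"]):
        children.append({
            # Hypothetical key format; the service generates its own keys.
            "metadata_storage_path": f"{parent_key}_image_{i}",
            # parentKeyFieldName: links the child back to the source PDF.
            parent_key_field: parent_key,
            # The three mappings from the skillset definition above.
            "content_vector": image["vector"],
            "pages": str(image["pageNumber"]),
            "metadata_storage_name": enriched_doc["metadata_storage_name"],
        })
    return children


# Hand-written stand-in for an enriched document with two extracted images.
enriched = {
    "metadata_storage_name": "sample.pdf",
    "normalized_images": [
        {"vector": [0.1, 0.2], "pageNumber": 1},
        {"vector": [0.3, 0.4], "pageNumber": 3},
    ],
}
children = project_children("parent-key-123", enriched)
for child in children:
    print(child["parentkey2"], child["pages"], child["metadata_storage_name"])
```

Fields that have no mapping, such as original_path, simply never appear in the child documents.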

Of the various skills used above, "#Microsoft.Skills.Vision.VectorizeSkill" is one of the features released in preview at Build 2024. It generates embeddings for image or text input using the Azure AI Vision multimodal embeddings API.
Reference: https://learn.microsoft.com/ja-jp/azure/search/cognitive-search-skill-vision-vectorize

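Implementation results

I did not prepare a UI this time, so what I can show is only an example of how the index was generated.

Because the vector values take up a lot of space, here is the output with the vectors excluded via the settings.

As you can see, images were extracted from each PDF, and for every image the index stores its vector, its page number in the source PDF, and the name of the source PDF.

Some readers may wonder why original_path is null here.
The simple answer is that it is not defined in the indexer. However, I expect it would still be null even if it were defined there. If the per-PDF documents are treated as the parent index documents, each image becomes a child index document, and the fields of a child document have to be defined in the skillset's indexProjections when the child is created.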
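Following that reasoning, one way to try filling original_path would be to add a mapping to the indexProjections selector. The sketch below only edits the skillset definition as a Python dict. The helper add_projection_mapping is hypothetical, and the source path "/document/metadata_storage_path" is my assumption by analogy with the existing metadata_storage_name mapping; I have not verified the result against the service.

```python
import copy
from typing import Dict


def add_projection_mapping(skillset: Dict, target_index: str,
                           field_name: str, source: str) -> Dict:
    """Return a copy of the skillset with one more projection mapping."""
    updated = copy.deepcopy(skillset)  # leave the caller's definition untouched
    for selector in updated["indexProjections"]["selectors"]:
        if selector["targetIndexName"] != target_index:
            continue
        if any(m["name"] == field_name for m in selector["mappings"]):
            raise ValueError(f"'{field_name}' is already mapped")
        selector["mappings"].append({"name": field_name, "source": source})
        return updated
    raise ValueError(f"no selector targets index '{target_index}'")


# A trimmed-down version of the skillset definition shown above.
sample_skillset = {
    "indexProjections": {
        "selectors": [{
            "targetIndexName": "azureblob-index",
            "parentKeyFieldName": "parentkey2",
            "sourceContext": "/document/normalized_images/*",
            "mappings": [
                {"name": "metadata_storage_name",
                 "source": "/document/metadata_storage_name"},
            ],
        }]
    }
}
# Assumption: the blob path is available at /document/metadata_storage_path.
patched = add_projection_mapping(
    sample_skillset, "azureblob-index",
    "original_path", "/document/metadata_storage_path",
)
print([m["name"] for m in patched["indexProjections"]["selectors"][0]["mappings"]])
```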

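That said, looking only at Search Explorer in AI Search, the parent and child index documents do not seem to be displayed correctly. (This is my personal impression.)

So I ran a search with Azure AI Tester, which is introduced in the article below. The text content of the PDFs was also returned as hits, which confirmed that the parent index documents had been created as well.

Reference: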

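Discussion and future work

With this implementation I succeeded in extracting the images from the PDFs and, with each source PDF's index document as the parent, creating one child index document per image that is linked back to its source PDF. As a result, when a question about an image comes in, a prompt flow can search for similar images and then follow the link to the text of the corresponding PDF page.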
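The retrieval half of that flow can be sketched as two search request bodies. This is only a sketch under several assumptions: the query vector is produced elsewhere with the same multimodal embedding model that vectorized the images, the bodies are meant for a POST to the index's docs/search endpoint, and the helper names are my own. Nothing here calls the service.

```python
from typing import Dict, Sequence


def build_image_query(query_vector: Sequence[float], k: int = 3) -> Dict:
    """Step 1: find the child (image) documents closest to the query vector."""
    return {
        # Leave content_vector out of the response; it is large.
        "select": "parentkey2,pages,metadata_storage_name",
        "vectorQueries": [{
            "kind": "vector",
            "vector": list(query_vector),
            "fields": "content_vector",
            "k": k,
        }],
    }


def build_parent_query(parent_key: str) -> Dict:
    """Step 2: fetch the parent (PDF) document that an image hit points to."""
    escaped = parent_key.replace("'", "''")  # OData escapes quotes by doubling
    return {
        "search": "*",
        "filter": f"metadata_storage_path eq '{escaped}'",
        "select": "metadata_storage_name,merged_content",
    }


# Toy two-dimensional vector; real vectors must match the field's dimensions.
image_query = build_image_query([0.12, 0.34], k=3)
# "parent-key-123" stands in for the parentkey2 value returned by step 1.
parent_query = build_parent_query("parent-key-123")
print(image_query["vectorQueries"][0]["fields"])
print(parent_query["filter"])
```

The merged_content returned by step 2, together with the pages value from step 1, is what would be passed to GPT-4o as context.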

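One remaining issue is that this approach cannot tell where on the page an image was located; it only knows which page of the source PDF the image came from. The flow therefore pulls in the text of the entire PDF page. If a user asks about a particular image, for example "Which temple is this?", the system may not be able to answer. To handle that, the caption printed below the image also needs to be stored in the image's index document.

The method used here is not sufficient for that, and I think another approach, such as Azure Document Intelligence, needs to be tried.

In conclusion, this method can extract images from a PDF and obtain their original page numbers, but it cannot obtain fine-grained information such as each image's caption.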
