AI Search: Document IntelligenceとOpenAI を使用して画像言語化(REST API)

Posted at 2025-10-21

Azure AI SearchでAzure Document Intelligenceを使って画像切り抜きの後に、画像を言語化してインデックス化しました。以下のチュートリアルとほぼ同じです。
PDFファイルの画像を切り抜いて、テキスト化、ベクトル化します。

完成時のパイプライン

デバッガセッション使うとこんなパイプラインに可視化できます。

前提

REST Client 0.25.1 をVS Codeから使って実行しています。

REST Client 設定で Decode Escaped Unicode Characters を ON にするとHTTP Response Bodyの日本語がデコードされます。

また、以下の記事の1～5までのStepも前提作業です。

Steps

1. インデックス作成

1.0. 固定値定義

固定値を定義しておきます。

## Azure AI Searchのエンドポイント 
@searchUrl = https://<ai searchresource name>.search.windows.net

## Azure AI SearchのAPI Key
@searchApiKey=<ai search key>

## Blob Storageの接続文字列
@storageConnection=<connection string>

## Blob Storageのコンテナ名(画像格納)
@imageProjectionContainer=images

## Blob Storageのコンテナ名(PDF格納)
@blob_container_name=rag-doc-test

## Document Intelligence のAI Servicesのエンドポイント
@cognitiveServicesUrl = https://<AI Services resource name>.cognitiveservices.azure.com/

# Embedding の情報
@openAIResourceUri = https://<aoai resource name>.openai.azure.com
@openAIKey = <aoai key>

# Chat Completion の情報(画像のテキスト化)
@chatCompletionResourceUri = https://<aoai researce name>.openai.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2025-01-01-preview
@chatCompletionKey = <aoai key>

1.1. データソースを作成

公式チュートリアルと異なり、ADLS Gen2を使用。

### Create a data source
POST {{searchUrl}}/datasources?api-version=2025-08-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{searchApiKey}}

  {
    "name": "doc-intelligence-image-verbalization-ds",
    "description": "A data source to store multi-modality documents",
    "type": "adlsgen2",
    "subtype": null,
    "credentials":{
      "connectionString":"{{storageConnection}}"
    },
    "container": {
      "name": "{{blob_container_name}}",
      "query": null
    },
    "dataChangeDetectionPolicy": null,
    "dataDeletionDetectionPolicy": null,
    "encryptionKey": null,
    "identity": null
  }

1.2. インデックスを作成

公式チュートリアルと以下を変更

location_metadataからlocationMetadataに項目名を修正
page_numberからpageNumberに項目名修正
bounding_polygonsからboundingPolygonsに項目名修正
offsetは削除(不使用なので)

### Create an index
POST {{searchUrl}}/indexes?api-version=2025-08-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{searchApiKey}}

{
    "name": "doc-intelligence-image-verbalization-index",
    "fields": [
        {
            "name": "content_id",
            "type": "Edm.String",
            "retrievable": true,
            "key": true,
            "analyzer": "keyword"
        },
        {
            "name": "text_document_id",
            "type": "Edm.String",
            "searchable": false,
            "filterable": true,
            "retrievable": true,
            "stored": true,
            "sortable": false,
            "facetable": false
        },          
        {
            "name": "document_title",
            "type": "Edm.String",
            "searchable": true
        },
        {
            "name": "image_document_id",
            "type": "Edm.String",
            "filterable": true,
            "retrievable": true
        },
        {
            "name": "content_text",
            "type": "Edm.String",
            "searchable": true,
            "retrievable": true
        },
        {
            "name": "content_embedding",
            "type": "Collection(Edm.Single)",
            "dimensions": 3072,
            "searchable": true,
            "retrievable": true,
            "vectorSearchProfile": "hnsw"
        },
        {
            "name": "content_path",
            "type": "Edm.String",
            "searchable": false,
            "retrievable": true
        },
        {
            "name": "locationMetadata",
            "type": "Edm.ComplexType",
            "fields": [
                {
                "name": "pageNumber",
                "type": "Edm.Int32",
                "searchable": false,
                "retrievable": true
                },
                {
                "name": "boundingPolygons",
                "type": "Edm.String",
                "searchable": false,
                "retrievable": true,
                "filterable": false,
                "sortable": false,
                "facetable": false
                }
            ]
        }         
    ],
    "vectorSearch": {
        "profiles": [
            {
                "name": "hnsw",
                "algorithm": "defaulthnsw",
                "vectorizer": "demo-vectorizer"
            }
        ],
        "algorithms": [
            {
                "name": "defaulthnsw",
                "kind": "hnsw",
                "hnswParameters": {
                    "m": 4,
                    "efConstruction": 400,
                    "metric": "cosine"
                }
            }
        ],
        "vectorizers": [
            {
              "name": "demo-vectorizer",
              "kind": "azureOpenAI",    
              "azureOpenAIParameters": {
                "resourceUri": "{{openAIResourceUri}}",
                "deploymentId": "text-embedding-3-large",
                "apiKey": "{{openAIKey}}",
                "modelName": "text-embedding-3-large"
              }
            }
        ]
    },
    "semantic": {
        "defaultConfiguration": "semanticconfig",
        "configurations": [
            {
                "name": "semanticconfig",
                "prioritizedFields": {
                    "titleField": {
                        "fieldName": "document_title"
                    },
                    "prioritizedContentFields": [
                    ],
                    "prioritizedKeywordsFields": []
                }
            }
        ]
    }
}

1.3. スキルセットを作成

特にチュートリアルから変更点ありませんでした。

### Create a skillset
POST {{searchUrl}}/skillsets?api-version=2025-08-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{searchApiKey}}

{
  "name": "doc-intelligence-image-verbalization-skillset",
  "description": "A sample skillset for multi-modality using image verbalization",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill",
      "name": "document-cracking-skill",
      "description": "Document Layout skill for document cracking",
      "context": "/document",
      "outputMode": "oneToMany",
      "outputFormat": "text",
      "extractionOptions": ["images", "locationMetadata"],
      "chunkingProperties": {     
          "unit": "characters",
          "maximumLength": 2000, 
          "overlapLength": 200
      },
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/file_data"
        }
      ],
      "outputs": [
        { 
          "name": "text_sections", 
          "targetName": "text_sections" 
        }, 
        { 
          "name": "normalized_images", 
          "targetName": "normalized_images" 
        } 
      ]
    },
    {
    "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
    "name": "text-embedding-skill",
    "description": "Azure Open AI Embedding skill for text",
    "context": "/document/text_sections/*",
    "inputs": [
        {
        "name": "text",
        "source": "/document/text_sections/*/content"
        }
    ],
    "outputs": [
        {
        "name": "embedding",
        "targetName": "text_vector"
        }
    ],
    "resourceUri": "{{openAIResourceUri}}",
    "deploymentId": "text-embedding-3-large",
    "apiKey": "{{openAIKey}}",
    "dimensions": 3072,
    "modelName": "text-embedding-3-large"
    },
    {
    "@odata.type": "#Microsoft.Skills.Custom.ChatCompletionSkill",
    "uri": "{{chatCompletionResourceUri}}",
    "timeout": "PT1M",
    "apiKey": "{{chatCompletionKey}}",
    "name": "genAI-prompt-skill",
    "description": "GenAI Prompt skill for image verbalization",
    "context": "/document/normalized_images/*",
    "inputs": [
        {
        "name": "systemMessage",
        "source": "='You are tasked with generating concise, accurate descriptions of images, figures, diagrams, or charts in documents. The goal is to capture the key information and meaning conveyed by the image without including extraneous details like style, colors, visual aesthetics, or size.\n\nInstructions:\nContent Focus: Describe the core content and relationships depicted in the image.\n\nFor diagrams, specify the main elements and how they are connected or interact.\nFor charts, highlight key data points, trends, comparisons, or conclusions.\nFor figures or technical illustrations, identify the components and their significance.\nClarity & Precision: Use concise language to ensure clarity and technical accuracy. Avoid subjective or interpretive statements.\n\nAvoid Visual Descriptors: Exclude details about:\n\nColors, shading, and visual styles.\nImage size, layout, or decorative elements.\nFonts, borders, and stylistic embellishments.\nContext: If relevant, relate the image to the broader content of the technical document or the topic it supports.\n\nExample Descriptions:\nDiagram: \"A flowchart showing the four stages of a machine learning pipeline: data collection, preprocessing, model training, and evaluation, with arrows indicating the sequential flow of tasks.\"\n\nChart: \"A bar chart comparing the performance of four algorithms on three datasets, showing that Algorithm A consistently outperforms the others on Dataset 1.\"\n\nFigure: \"A labeled diagram illustrating the components of a transformer model, including the encoder, decoder, self-attention mechanism, and feedforward layers.\"'"
        },
        {
        "name": "userMessage",
        "source": "='Please describe this image.'"
        },
        {
        "name": "image",
        "source": "/document/normalized_images/*/data"
        }
        ],
        "outputs": [
            {
            "name": "response",
            "targetName": "verbalizedImage"
            }
        ]
    },    
    {
    "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
    "name": "verbalizedImage-embedding-skill",
    "description": "Azure Open AI Embedding skill for verbalized image embedding",
    "context": "/document/normalized_images/*",
    "inputs": [
        {
        "name": "text",
        "source": "/document/normalized_images/*/verbalizedImage",
        "inputs": []
        }
    ],
    "outputs": [
        {
        "name": "embedding",
        "targetName": "verbalizedImage_vector"
        }
    ],
    "resourceUri": "{{openAIResourceUri}}",
    "deploymentId": "text-embedding-3-large",
    "apiKey": "{{openAIKey}}",
    "dimensions": 3072,
    "modelName": "text-embedding-3-large"
    },
    {
      "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
      "name": "#5",
      "context": "/document/normalized_images/*",
      "inputs": [
        {
          "name": "normalized_images",
          "source": "/document/normalized_images/*",
          "inputs": []
        },
        {
          "name": "imagePath",
          "source": "='my_container_name/'+$(/document/normalized_images/*/imagePath)",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "output",
          "targetName": "new_normalized_images"
        }
      ]
    }      
  ], 
   "indexProjections": {
      "selectors": [
        {
          "targetIndexName": "doc-intelligence-image-verbalization-index",
          "parentKeyFieldName": "text_document_id",
          "sourceContext": "/document/text_sections/*",
          "mappings": [    
            {
            "name": "content_embedding",
            "source": "/document/text_sections/*/text_vector"
            },                      
            {
              "name": "content_text",
              "source": "/document/text_sections/*/content"
            },
            {
              "name": "locationMetadata",
              "source": "/document/text_sections/*/locationMetadata"
            },                
            {
              "name": "document_title",
              "source": "/document/document_title"
            }   
          ]
        },        
        {
          "targetIndexName": "doc-intelligence-image-verbalization-index",
          "parentKeyFieldName": "image_document_id",
          "sourceContext": "/document/normalized_images/*",
          "mappings": [    
            {
            "name": "content_text",
            "source": "/document/normalized_images/*/verbalizedImage"
            },  
            {
            "name": "content_embedding",
            "source": "/document/normalized_images/*/verbalizedImage_vector"
            },                                           
            {
              "name": "content_path",
              "source": "/document/normalized_images/*/new_normalized_images/imagePath"
            },                    
            {
              "name": "document_title",
              "source": "/document/document_title"
            },
            {
              "name": "locationMetadata",
              "source": "/document/normalized_images/*/locationMetadata"
            }             
          ]
        }
      ],
      "parameters": {
        "projectionMode": "skipIndexingParentDocuments"
      }
  },  
  "knowledgeStore": {
    "storageConnectionString": "{{storageConnection}}",
    "identity": null,
    "projections": [
      {
        "files": [
          {
            "storageContainer": "{{imageProjectionContainer}}",
            "source": "/document/normalized_images/*"
          }
        ]
      }
    ]
  }
}

AI 使って項目のフローをマーメイド記法で書きました。少し不足はありますが、正しいです。
ただ、Qiitaで見ると小さいので以下のツールなどを使ってみてください。

1.4. インデクサーの作成と実行

name がなかったので追加

### Create and run an indexer
POST {{searchUrl}}/indexers?api-version=2025-08-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{searchApiKey}}

{
  "name": "doc-intelligence-image-verbalization-indexer",
  "dataSourceName": "doc-intelligence-image-verbalization-ds",
  "targetIndexName": "doc-intelligence-image-verbalization-index",
  "skillsetName": "doc-intelligence-image-verbalization-skillset",
  "parameters": {
    "maxFailedItems": -1,
    "maxFailedItemsPerBatch": 0,
    "batchSize": 1,
    "configuration": {
      "allowSkillsetToReadFileData": true
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "document_title"
    }
  ],
  "outputFieldMappings": []
}

2. 検索

2.1. フル検索

locationMetadataも出力したフル検索

### Query the index
POST {{searchUrl}}/indexes/doc-intelligence-image-verbalization-index/docs/search?api-version=2025-08-01-preview   HTTP/1.1
  Content-Type: application/json
  api-key: {{searchApiKey}}
  
  {
    "search": "*",
    "count": true,
    "select": "content_id, document_title, content_text, content_path, locationMetadata"
  }

画像の検索結果です。テキスト側も似たようにboundingPolygonsが出ます。

検索結果(画像エントリ抜粋)

    {
      "@search.score": 1.0,
      "content_id": "c2f566e3de96_aHR0cHM6Ly9zdG9yYWdlcmFnanBlLmJsb2IuY29yZS53aW5kb3dzLm5ldC9yYWctZG9jLXRlc3QvMDEucGRm0_normalized_images_0",
      "document_title": "01.pdf",
      "content_text": "The image appears to be a logo featuring two stylized letter \"S\" shapes that intertwine, suggesting movement or flow. The design emphasizes a dynamic interaction between the elements, although specific details such as colors or visual styles are not described. This logo likely represents a brand or organization, with the letters serving as a central focus.",
      "content_path": "my_container_name/aHR0cHM6Ly9zdG9yYWdlcmFnanBlLmJsb2IuY29yZS53aW5kb3dzLm5ldC9yYWctZG9jLXRlc3QvMDEucGRm0/normalized_images_0.jpg",
      "locationMetadata": {
        "pageNumber": 2,
        "boundingPolygons": "[[{\"x\":0.0474,\"y\":0.0402},{\"x\":0.5967,\"y\":0.0403},{\"x\":0.5968,\"y\":0.5351},{\"x\":0.0475,\"y\":0.535}]]"
      }
    },

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up