AI Search: Converting Files to Markdown and Verbalizing Images with Content Understanding (REST API)

Posted at 2026-02-23

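In Azure AI Search, I used Azure Content Understanding (part of Foundry Tools) to convert files to Markdown, crop out their images, verbalize those images, and index the results.

The supported regions are listed at the following link.

What this does is similar to the article linked below. The difference is that Document Intelligence can split chunks by section, whereas Content Understanding cannot. On the other hand, the strength of Content Understanding is that it handles image cropping and Markdown conversion of parsed tables in a single pass.

The finished pipeline

A debug session visualizes the pipeline like this.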

image.png

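Prerequisites

I run the requests from VS Code with REST Client 0.25.1.

Turning ON "Decode Escaped Unicode Characters" in the REST Client settings decodes Japanese text in the HTTP response body.

Steps 1 through 5 of the following article are also prerequisites.

Steps

1. Create the index

To make trial and error easier, each step also has a delete request at the end.

1.0. Define constants

Define the constants first.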

## Azure AI Search endpoint
@endpoint = https://<ai search resource name>.search.windows.net
## Index name
@index_name=test-index-cu-small02
## Skillset name
@skillset_name=test-skillset-cu-small02
## Data source name
@datasource_name=test-datasource-cu-small02
## Indexer name
@indexer_name=test-indexer-cu-small02
## Azure AI Search API key
@admin_key=<ai search key>
## Azure AI Search API version
@api_version=2025-11-01-Preview
## Azure OpenAI resource URI
@aoai_resourceUri=https://<AOAI resource name>.openai.azure.com
## Azure OpenAI API key
@aoai_apiKey=<key>
## Deployment name of the embedding model
@aoai_embedding_deploymentId=text-embedding-3-large
## Model name of the embedding model
@aoai_embedding_modelName=text-embedding-3-large
## Number of dimensions of the embedding model
@aoai_embedding_dimension=3072
## Blob Storage container name
@blob_container_name=rag-doc-test
## Blob Storage connection string
@blob_connectionString=<connection string>
# Chat Completion settings (image verbalization). The deployment name must also be changed.
@chatCompletionResourceUri = https://<aoai resource name>.openai.azure.com/openai/deployments/<deploy>/chat/completions?api-version=2025-01-01-preview
# Chat Completion API key
@chatCompletionKey = <key>
## Blob Storage container name (for storing images)
@imageProjectionContainer=images
## AI Services API key
@ai_service_key=<key>
## AI Services endpoint
@ai_service_endpoint=https://<resource name>.cognitiveservices.azure.com

1.1. Create the data source

I am using ADLS Gen2.

### Create or update the data source
PUT {{endpoint}}/datasources('{{datasource_name}}')?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

{
  "name": "{{datasource_name}}",
  "description": null,
  "type": "adlsgen2",
  "subtype": null,
  "credentials": {
    "connectionString": "{{blob_connectionString}};"
  },
  "container": {
    "name": "{{blob_container_name}}",
    "query": null
  },
  "dataChangeDetectionPolicy": null,
  "dataDeletionDetectionPolicy": null,
  "encryptionKey": null
}

### Delete the data source
DELETE {{endpoint}}/datasources/{{datasource_name}}?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}
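
For anyone who wants to script this step instead of using REST Client, here is a minimal Python sketch of the same PUT request. It only builds the request object and sends nothing. The function name and the placeholder values are my own assumptions for illustration; the URL shape, headers, and body fields mirror the request above.

```python
import json
import urllib.request


def build_datasource_request(endpoint, name, api_version, admin_key,
                             connection_string, container):
    """Build (but do not send) the PUT request that creates or updates a data source."""
    body = {
        "name": name,
        "type": "adlsgen2",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }
    url = f"{endpoint}/datasources('{name}')?api-version={api_version}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )


# Placeholder values for illustration only; nothing is sent over the network here.
request = build_datasource_request(
    endpoint="https://example.search.windows.net",
    name="demo-datasource",
    api_version="2025-11-01-Preview",
    admin_key="dummy-key",
    connection_string="<connection string>",
    container="rag-doc-test",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it, given a real endpoint and key.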

1.2. Create the index

Fields that hold Japanese text use the ja.lucene analyzer. ja.microsoft should work as well.
Unlike Document Intelligence, Content Understanding unfortunately does not extract section headings.

### Create the index
PUT {{endpoint}}/indexes('{{index_name}}')?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

{
  "name": "{{index_name}}",
  "fields": [
    {
      "name": "chunk_id",
      "type": "Edm.String",
      "key": true,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": false,
      "sortable": true,
      "facetable": false,
      "analyzer": "keyword",
      "synonymMaps": []
    },
    {
      "name": "parent_id",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": false,
      "filterable": true,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "content_text",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "analyzer": "ja.lucene",
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "title",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "analyzer": "ja.lucene",
      "filterable": true,
      "sortable": true,
      "facetable": false,
      "synonymMaps": []
    },
    {
      "name": "content_embedding",
      "type": "Collection(Edm.Single)",
      "key": false,
      "retrievable": false,
      "stored": false,
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "synonymMaps": [],
      "dimensions": {{aoai_embedding_dimension}},
      "vectorSearchProfile": "azureOpenAi-text-profile"
    },
    {
        "name": "image_document_id",
        "type": "Edm.String",
        "filterable": true,
        "retrievable": true
    },
    {
        "name": "content_path",
        "type": "Edm.String",
        "searchable": false,
        "retrievable": true
    },
    {
        "name": "locationMetadata",
        "type": "Edm.ComplexType",
        "fields": [
            {
            "name": "pageNumberFrom",
            "type": "Edm.Int32",
            "searchable": false,
            "retrievable": true
            },
            {
            "name": "pageNumberTo",
            "type": "Edm.Int32",
            "searchable": false,
            "retrievable": true
            },
            {
            "name": "ordinalPosition",
            "type": "Edm.Int32",
            "searchable": false,
            "retrievable": true
            },
            {
            "name": "source",
            "type": "Edm.String",
            "searchable": false,
            "retrievable": true,
            "filterable": false,
            "sortable": false,
            "facetable": false
            }
        ]
    }  
  ],
  "scoringProfiles": [],
  "suggesters": [],
  "analyzers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "normalizers": [],
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
  },
  "semantic": {
    "defaultConfiguration": "semantic-configuration",
    "configurations": [
      {
        "name": "semantic-configuration",
        "prioritizedFields": {
          "titleField": {
            "fieldName": "title"
          },
          "prioritizedContentFields": [
            {
              "fieldName": "content_text"
            }
          ],
          "prioritizedKeywordsFields": []
        }
      }
    ]
  },
  "vectorSearch": {
    "algorithms": [
      {
        "name": "vector-algorithm",
        "kind": "hnsw",
        "hnswParameters": {
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500,
          "metric": "cosine"
        }
      }
    ],
    "profiles": [
      {
        "name": "azureOpenAi-text-profile",
        "algorithm": "vector-algorithm",
        "vectorizer": "azureOpenAi-text-vectorizer"
      }
    ],
    "vectorizers": [
      {
        "name": "azureOpenAi-text-vectorizer",
        "kind": "azureOpenAI",
        "azureOpenAIParameters": {
          "resourceUri": "{{aoai_resourceUri}}",
          "deploymentId": "{{aoai_embedding_deploymentId}}",
          "apiKey": "{{aoai_apiKey}}",
          "modelName": "{{aoai_embedding_modelName}}"
        }
      }
    ],
    "compressions": []
  }
}

### Delete the index
DELETE {{endpoint}}/indexes('{{index_name}}')?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}
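
The index definition links several names together: the vector field points at a profile, and the profile points at an algorithm and a vectorizer. A typo in any of them only surfaces when the PUT fails, so a small local check can help. The sketch below is an assumption of mine, not part of any SDK. It walks a dict shaped like the index JSON above and reports dangling references.

```python
def find_vector_config_problems(index: dict) -> list[str]:
    """Return a message for every vector profile/field that references an undefined name."""
    vector_search = index.get("vectorSearch", {})
    algorithms = {a["name"] for a in vector_search.get("algorithms", [])}
    vectorizers = {v["name"] for v in vector_search.get("vectorizers", [])}
    profiles = {p["name"]: p for p in vector_search.get("profiles", [])}

    problems = []
    for name, profile in profiles.items():
        if profile.get("algorithm") not in algorithms:
            problems.append(f"profile '{name}': unknown algorithm '{profile.get('algorithm')}'")
        vectorizer = profile.get("vectorizer")
        if vectorizer is not None and vectorizer not in vectorizers:
            problems.append(f"profile '{name}': unknown vectorizer '{vectorizer}'")
    for field in index.get("fields", []):
        profile_name = field.get("vectorSearchProfile")
        if profile_name is not None and profile_name not in profiles:
            problems.append(f"field '{field['name']}': unknown profile '{profile_name}'")
    return problems


# A trimmed-down copy of the index definition above, kept only to exercise the check.
sample_index = {
    "fields": [
        {"name": "content_embedding", "vectorSearchProfile": "azureOpenAi-text-profile"},
        {"name": "content_text"},
    ],
    "vectorSearch": {
        "algorithms": [{"name": "vector-algorithm", "kind": "hnsw"}],
        "profiles": [{
            "name": "azureOpenAi-text-profile",
            "algorithm": "vector-algorithm",
            "vectorizer": "azureOpenAi-text-vectorizer",
        }],
        "vectorizers": [{"name": "azureOpenAi-text-vectorizer", "kind": "azureOpenAI"}],
    },
}
print(find_vector_config_problems(sample_index))
```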

image.png

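1.3. Create the skillset

The request failed with a non-preview api-version, so I use a preview version.
The prompt for image verbalization is written in Japanese.
I set the chunk size and overlap to smaller values than I would use for English, to suit Japanese text.
The link below is for the Content Understanding skill, which is the core of this pipeline.

For the parameters of the GenAI Prompt skill, I referred to the following.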
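
To get a feel for what a character-based chunk size of 1250 with an overlap of 250 means, here is a rough sliding-window sketch in Python. This is only a mental model under my own assumptions. The Content Understanding skill does its own chunking, and I do not know whether it splits exactly like this (it may respect sentence or Markdown boundaries, for example).

```python
def chunk_by_characters(text: str, maximum_length: int = 1250,
                        overlap_length: int = 250) -> list[str]:
    """Split text into windows of maximum_length characters that share overlap_length characters."""
    if maximum_length <= 0:
        raise ValueError("maximum_length must be positive")
    if not 0 <= overlap_length < maximum_length:
        raise ValueError("overlap_length must be in [0, maximum_length)")

    step = maximum_length - overlap_length  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + maximum_length])
        if start + maximum_length >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


# 3000 characters of dummy Japanese text, just to see the window sizes.
sample_text = "".join(chr(0x3041 + (i % 80)) for i in range(3000))
print([len(chunk) for chunk in chunk_by_characters(sample_text)])
```

With these settings each new chunk starts 1000 characters after the previous one, so consecutive chunks repeat 250 characters.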

### Create the skillset
PUT {{endpoint}}/skillsets/{{skillset_name}}?api-version={{api_version}}
content-type: application/json
api-key: {{admin_key}}

{
  "name": "{{skillset_name}}",
  "description": "Skillset to chunk documents and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.ContentUnderstandingSkill",
      "name": "contentUnderstandingSkill",
      "description": "Extract text and metadata from documents, and convert to markdown format",
      "context": "/document",
      "extractionOptions": [
        "images",
        "locationMetadata"
      ],
      "inputs": [
        { "name": "file_data", "source": "/document/file_data"}
      ],
      "outputs": [
        { "name": "text_sections", "targetName": "text_sections" },
        { "name": "normalized_images", "targetName": "normalized_images" }
      ],
      "chunkingProperties": {
        "unit": "characters",
        "maximumLength": 1250,
        "overlapLength": 250
      }
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "text_embedding_skill",
      "description": "Generate embeddings",
      "context": "/document/text_sections/*",
      "inputs": [
        {
          "name": "text",
          "source": "/document/text_sections/*/content",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "text_vector"
        }
      ],
      "resourceUri": "{{aoai_resourceUri}}",
      "deploymentId": "{{aoai_embedding_deploymentId}}",
      "apiKey": "{{aoai_apiKey}}",
      "modelName": "{{aoai_embedding_modelName}}",
      "dimensions": {{aoai_embedding_dimension}}
    },
    {
    "@odata.type": "#Microsoft.Skills.Custom.ChatCompletionSkill",
    "name": "genAI_prompt_skill",
    "description": "GenAI Prompt skill for image verbalization",
    "uri": "{{chatCompletionResourceUri}}",
    "timeout": "PT3M50S",
    "apiKey": "{{chatCompletionKey}}",
    "extraParameters": {
      "reasoning_effort": "low"
    },
    "extraParametersBehavior": "pass-through",
    "context": "/document/normalized_images/*",
    "inputs": [
        {
          "name": "systemMessage",
          "source": "='あなたはPDFから切り出した図版画像を、RAG用のプレーンテキストに変換します。\\n\\nルール:\\n- 画像内の文字を可能な限り漏れなく抽出(日本語/英語混在そのまま)\\n- 読み順は「上→下、左→右」を基本に、図のレイアウト(列/行/枠/矢印)を優先する\\n- 文章としてつながるように整形するが、勝手な補完や推測はしない\\n- 読めない箇所は [illegible] と書く\\n- 出力はプレーンテキストのみ(JSON禁止)\\n\\n出力フォーマット:\\n- 先頭にメタ情報行を置く(後でチャンクしても出どころが残るように)\\n 例:\\n  [source_doc=...][page=...][figure_id=...]\\n- その後、図の要旨を1〜2文で書く(見えている範囲で。推測しない)\\n  例: \"要旨: ...\"\\n- 以降は、図のテキストをブロック順に列挙する\\n  - 大きな区切りは \"##\" 見出し\\n  - 小見出し/レーンは \"-\" 箇条書き\\n  - 箇条書きの連続はまとめてよい'"
        },
        {
          "name": "userMessage",
          "source": "='添付画像はPDF内の図版です。RAG用にプレーンテキスト化してください。\\n\\nメタ情報:\\n- source_doc: \"<PDF名またはID>\"\\n- page: <ページ番号>\\n- figure_id: \"<任意>\"\\n\\n要件:\\n- 画像内の文字は可能な限り漏れなく\\n- 読めない箇所は [illegible]\\n- 出力はプレーンテキストのみ'"
        },
        {
        "name": "image",
        "source": "/document/normalized_images/*/data"
        }
        ],
        "outputs": [
            {
            "name": "response",
            "targetName": "verbalizedImage"
            }
        ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "verbalized-image-embedding-skill",
      "description": "Embedding skill for verbalized images",
      "context": "/document/normalized_images/*",
      "inputs": [
          {
          "name": "text",
          "source": "/document/normalized_images/*/verbalizedImage",
          "inputs": []
          }
      ],
      "outputs": [
          {
          "name": "embedding",
          "targetName": "verbalizedImage_vector"
          }
      ],
      "resourceUri": "{{aoai_resourceUri}}",
      "deploymentId": "{{aoai_embedding_deploymentId}}",
      "apiKey": "{{aoai_apiKey}}",
      "dimensions": {{aoai_embedding_dimension}},
      "modelName": "{{aoai_embedding_modelName}}"
    },
    {
      "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
      "name": "shaper-skill",
      "description": "Shaper skill to reshape the data to fit the index schema",
      "context": "/document/normalized_images/*",
      "inputs": [
        {
          "name": "imagePath",
          "source": "='{{imageProjectionContainer}}/'+$(/document/normalized_images/*/imagePath)",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "output",
          "targetName": "new_normalized_images"
        }
      ]
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.AIServicesByKey",
    "key": "{{ai_service_key}}",
    "subdomainUrl": "{{ai_service_endpoint}}"
  },
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "{{index_name}}",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/text_sections/*",
        "mappings": [
          {
            "name": "content_embedding",
            "source": "/document/text_sections/*/text_vector"
          },
          {
            "name": "content_text",
            "source": "/document/text_sections/*/content"
          },
          {
            "name": "locationMetadata",
            "source": "/document/text_sections/*/locationMetadata"
          },
          {
            "name": "title",
            "source": "/document/title"
          }
        ]
      },
        {
          "targetIndexName": "{{index_name}}",
          "parentKeyFieldName": "image_document_id",
          "sourceContext": "/document/normalized_images/*",
          "mappings": [    
            {
            "name": "content_text",
            "source": "/document/normalized_images/*/verbalizedImage"
            },  
            {
            "name": "content_embedding",
            "source": "/document/normalized_images/*/verbalizedImage_vector"
            },                                           
            {
              "name": "content_path",
              "source": "/document/normalized_images/*/new_normalized_images/imagePath"
            },                    
            {
              "name": "title",
              "source": "/document/title"
            },
            {
              "name": "locationMetadata",
              "source": "/document/normalized_images/*/locationMetadata"
            }            
          ]
        }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}

### Delete the skillset
DELETE {{endpoint}}/skillsets/{{skillset_name}}?api-version={{api_version}}
content-type: application/json
api-key: {{admin_key}}

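I had an AI draw the field flow in Mermaid notation. I made minor changes after generating it (the text side now also maps locationMetadata), so the diagram is not completely accurate.
It also renders small on Qiita, so try viewing it with a tool such as the following.

1.4. Register the indexer

Nothing in particular to note here.

### Create the indexer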
PUT {{endpoint}}/indexers/{{indexer_name}}?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

{
  "name": "{{indexer_name}}",
  "description": null,
  "dataSourceName": "{{datasource_name}}",
  "skillsetName": "{{skillset_name}}",
  "targetIndexName": "{{index_name}}",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": null,
    "maxFailedItemsPerBatch": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "allowSkillsetToReadFileData": true
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "title",
      "mappingFunction": null
    }
  ],
  "outputFieldMappings": [],
  "encryptionKey": null
}

### Delete the indexer
DELETE {{endpoint}}/indexers/{{indexer_name}}?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}
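
After the indexer is created it runs once, and the result can be read from `GET {{endpoint}}/indexers/{{indexer_name}}/status`. As a rough aid for reading that response, the sketch below condenses a status document into a few lines. The field names (`status`, `lastResult`, `itemsProcessed`, `itemsFailed`, `errors`, `errorMessage`) are what I understand the status response to contain. The sample payload is made up for illustration and is not real output from this pipeline.

```python
def summarize_indexer_status(status: dict) -> str:
    """Condense an indexer status response into a short, human-readable summary."""
    last = status.get("lastResult") or {}
    lines = [
        f"indexer={status.get('status', 'unknown')} "
        f"last_run={last.get('status', 'none')} "
        f"processed={last.get('itemsProcessed', 0)} "
        f"failed={last.get('itemsFailed', 0)}"
    ]
    for error in last.get("errors") or []:
        lines.append(f"  error: {error.get('errorMessage', '')}")
    return "\n".join(lines)


# Hypothetical payload, shaped like a status response; not actual output.
sample_status = {
    "status": "running",
    "lastResult": {
        "status": "transientFailure",
        "itemsProcessed": 3,
        "itemsFailed": 1,
        "errors": [{"key": "doc-1", "errorMessage": "Web API skill timed out"}],
    },
}
print(summarize_indexer_status(sample_status))
```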

2. Search

2.1. Full search

A full search that also returns locationMetadata.

### Query the index
POST {{endpoint}}/indexes/{{index_name}}/docs/search?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

{
  "search": "*",
  "count": true,
  "select": "chunk_id, content_text, title, content_path, image_document_id, locationMetadata"
}

These are the search results for an image. The text side returns boundingPolygons in a similar way.

Search results (excerpt)
{
  "@odata.context": "https://ais-rag-ncus.search.windows.net/indexes('test-index-cu-small02')/$metadata#docs(*)",
  "@odata.count": 31,
  "value": [
    {
      "@search.score": 1.0,
      "chunk_id": "5c415b8fa3ae_aHR0cHM6Ly9zdG9yYWdlcmFnbmN1cy5ibG9iLmNvcmUud2luZG93cy5uZXQvdGVzdC8wMXBvaW50LnBkZg2_text_sections_2",
      "content_text": "1年にFTTH (Fiber To The Hom<rest omitted>",
      "title": "01point.pdf",
      "image_document_id": null,
      "content_path": null,
      "locationMetadata": null
    },
    {
      "@search.score": 1.0,
      "chunk_id": "5c415b8fa3ae_aHR0cHM6Ly9zdG9yYWdlcmFnbmN1cy5ibG9iLmNvcmUud2luZG93cy5uZXQvdGVzdC8wMXBvaW50LnBkZg2_normalized_images_3",
      "content_text": "---\n【抽出テキスト】\n- その (rest omitted)",
      "title": "01point.pdf",
      "image_document_id": "aHR0cHM6Ly9zdG9yYWdlcmFnbmN1cy5ibG9iLmNvcmUud2luZG93cy5uZXQvdGVzdC8wMXBvaW50LnBkZg2",
      "content_path": "images/aHR0cHM6Ly9zdG9yYWdlcmFnbmN1cy5ibG9iLmNvcmUud2luZG93cy5uZXQvdGVzdC8wMXBvaW50LnBkZg2/normalized_images_3.jpg",
      "locationMetadata": {
        "pageNumberFrom": 4,
        "pageNumberTo": 4,
        "ordinalPosition": 3,
        "source": "D(4,0.9114,3.0314,4.0109,3.0142,4.0351,5.2671,0.9175,5.287)"
      }
    },
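
The `source` value in locationMetadata packs the page and the region into one string. Judging only from the shape of the value above, I read it as `D(page, x1, y1, x2, y2, x3, y3, x4, y4)`, that is, a page number followed by the corner points of the region. That reading is my assumption, and so is the helper below. The unit of the coordinates is not shown in the response, so the sketch leaves the numbers as they are.

```python
import re


def parse_source(source: str) -> tuple[int, list[tuple[float, float]]]:
    """Parse 'D(page,x1,y1,...)' into (page, [(x1, y1), ...])."""
    match = re.fullmatch(r"D\((.*)\)", source.strip())
    if match is None:
        raise ValueError(f"unexpected source format: {source!r}")
    values = [value.strip() for value in match.group(1).split(",")]
    # One page number followed by x/y pairs means an odd count of at least 3 values.
    if len(values) < 3 or len(values) % 2 == 0:
        raise ValueError(f"expected a page number followed by x/y pairs: {source!r}")
    page = int(values[0])
    coordinates = [float(value) for value in values[1:]]
    points = list(zip(coordinates[0::2], coordinates[1::2]))
    return page, points


page, points = parse_source("D(4,0.9114,3.0314,4.0109,3.0142,4.0351,5.2671,0.9175,5.287)")
print(page, points)
```

Something like this could be used to draw a highlight over the original PDF page when showing a search hit.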

2.2. Search with conditions

### Query the index with filter
POST {{endpoint}}/indexes/{{index_name}}/docs/search?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

{
  "search": "*",
  "count": true,
  "filter": "title eq '01point.pdf' and locationMetadata/pageNumberFrom le 2",
  "orderby": "title asc, locationMetadata/pageNumberFrom asc, locationMetadata/ordinalPosition asc",
  "select": "chunk_id, content_text, title, content_path, image_document_id, locationMetadata/pageNumberFrom, locationMetadata/pageNumberTo, locationMetadata/ordinalPosition"
}
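
When the file name in the filter comes from user input, a single quote inside it breaks the OData expression unless it is doubled. The following Python sketch builds the same search body as the request above with that escaping applied. The function names are hypothetical, and only the filter and orderby strings are taken from the request.

```python
def odata_string(value: str) -> str:
    """Quote a value as an OData string literal (single quotes are escaped by doubling)."""
    return "'" + value.replace("'", "''") + "'"


def build_filtered_search(title: str, max_page: int) -> dict:
    """Build a search body that keeps chunks of one file starting on or before max_page."""
    return {
        "search": "*",
        "count": True,
        "filter": (
            f"title eq {odata_string(title)} "
            f"and locationMetadata/pageNumberFrom le {int(max_page)}"
        ),
        "orderby": (
            "title asc, locationMetadata/pageNumberFrom asc, "
            "locationMetadata/ordinalPosition asc"
        ),
    }


body = build_filtered_search("01point.pdf", 2)
print(body["filter"])
```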

Search results
{
  "@odata.context": "https://ais-rag-ncus.search.windows.net/indexes('test-index-cu-small03')/$metadata#docs(*)",
  "@odata.count": 4,
  "value": [
    {
      "@search.score": 1.0,
      "chunk_id": "ccfab6b43e18_aHR0cHM6Ly9zdG9yYWdlcmFnbmN1cy5ibG9iLmNvcmUud2luZG93cy5uZXQvdGVzdC8wMXBvaW50LnBkZg2_text_sections_0",
      "content_text": "令和5年 情報通信に関す  (rest omitted)",
      "title": "01point.pdf",
      "image_document_id": null,
      "content_path": null,
      "locationMetadata": {
        "pageNumberFrom": 1,
        "pageNumberTo": 2,
        "ordinalPosition": 0
      }
    },
    {
      "@search.score": 1.0,
      "chunk_id": "902ef4b70e6f_aHR0cHM6Ly9zdG9yYWdlcmFnbmN1cy5ibG9iLmNvcmUud2luZG93cy5uZXQvdGVzdC8wMXBvaW50LnBkZg2_normalized_images_0",
      "content_text": "[source_doc=<PDF名またはID>][page=<ページ番号>] (rest omitted)",
      "title": "01point.pdf",
      "image_document_id": "aHR0cHM6Ly9zdG9yYWdlcmFnbmN1cy5ibG9iLmNvcmUud2luZG93cy5uZXQvdGVzdC8wMXBvaW50LnBkZg2",
      "content_path": "images/aHR0cHM6Ly9zdG9yYWdlcmFnbmN1cy5ibG9iLmNvcmUud2luZG93cy5uZXQvdGVzdC8wMXBvaW50LnBkZg2/normalized_images_0.jpg",
      "locationMetadata": {
        "pageNumberFrom": 2,
        "pageNumberTo": 2,
        "ordinalPosition": 0
      }
    },
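
The `image_document_id` in these results looks like the URL-safe Base64 form of the blob URL, with a trailing digit that gives the number of stripped `=` padding characters. That interpretation is my assumption based on the value itself. Under that assumption, a short Python helper (hypothetical name) recovers the original URL.

```python
import base64


def decode_document_key(encoded: str) -> str:
    """Decode a URL-safe Base64 key whose last character is the padding count."""
    if not encoded or not encoded[-1].isdigit():
        raise ValueError("expected a trailing digit that holds the padding count")
    padding = int(encoded[-1])
    core = encoded[:-1]
    return base64.urlsafe_b64decode(core + "=" * padding).decode("utf-8")


key = "aHR0cHM6Ly9zdG9yYWdlcmFnbmN1cy5ibG9iLmNvcmUud2luZG93cy5uZXQvdGVzdC8wMXBvaW50LnBkZg2"
print(decode_document_key(key))
```

For the key above this yields the blob URL of 01point.pdf, which makes it easy to go from a search hit back to the source file.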

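Update history

  • 2026/3/6: Changed the index so that the vector field is neither stored nor retrievable (to save disk space)
  • 2026/3/6: Increased the timeout of the chat skill in the skillset to 230 seconds (it failed often)
  • 2026/3/6: Added reasoning effort to the chat skill in the skillset (the gpt-5.2 default of medium is slow)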
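
The skill's `timeout` is written as an ISO 8601 duration, so 230 seconds becomes `PT3M50S` in the skillset. If you want to try other values, a small Python helper like the one below does the conversion. It is a sketch of my own that covers only hours, minutes, and whole seconds.

```python
def seconds_to_iso8601_duration(total_seconds: int) -> str:
    """Convert whole seconds to an ISO 8601 duration such as 'PT3M50S'."""
    if total_seconds < 0:
        raise ValueError("total_seconds must not be negative")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    duration = "PT"
    if hours:
        duration += f"{hours}H"
    if minutes:
        duration += f"{minutes}M"
    if seconds or duration == "PT":
        duration += f"{seconds}S"  # keep 'PT0S' for a zero duration
    return duration


print(seconds_to_iso8601_duration(230))
```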