AI Search: Indexing PDF Files (REST API)

Last updated at 2025-10-24, posted at 2025-10-15

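Until now I have mostly used the Python SDK to work with Azure AI Search, but the REST API is more convenient in many respects, so I started by using it for basic indexer creation.
The steps are almost the same as the ones I carried out with Python in the following article, this time executed over REST.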

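I have not verified its accuracy, but when I asked an AI to compare the REST API with the Python SDK, it produced the table below. It looks right at a glance, but it may contain hallucinations, so treat it with caution.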

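Aspect	REST API	Python SDK
Coverage	Always first to reach new features (the API is the single source of truth)	Covers most features; new ones can lag temporarily
Amount of code	Building requests is verbose; managing the schema as JSON is intuitive	Type completion and models are convenient; less boilerplate
Type safety and helpers	Hand-rolled (dict/JSON); validation is up to you	Types, completion, validation, retries, pagination, etc. built in
Authentication	Implement API key / Azure AD (Bearer) yourself	Concise with `AzureKeyCredential` / `DefaultAzureCredential`
Debuggability	Plain HTTP, so transparent (easy to trace with Fiddler/curl)	Logging and diagnostic hooks available; HTTP details are abstracted away
Dependencies	No extra libraries needed (only `requests` or similar)	Requires `azure-search-documents`
Portability	Same pattern in any language or environment	Python only (other languages have their own SDKs)
Learning cost	Requires understanding the REST resource model	Object API gives a gentler learning curve
Advanced features	Vector configuration, hybrid, semantic, etc. usable immediately	Mostly supported; easy to handle through configuration objects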

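Finished picture

This is the index and indexer we are about to create.
It is a simple flow that splits the text into chunks of a fixed number of characters and then embeds each chunk.
After creating everything, I visualized the result in a debug session.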

Prerequisites

I run the requests from VS Code using REST Client 0.25.1.

Turning ON Decode Escaped Unicode Characters in the REST Client settings decodes Japanese text in the HTTP response body.

Steps 1 through 5 of the following article are also prerequisites.

REST

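Creating the index

Every operation is written as a create-or-update (upsert).
For cases where you want to remove a resource, a delete request is listed at the end of each section.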
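For readers who want to script the same upsert-and-delete pattern outside VS Code, here is a minimal sketch using only the Python standard library. The helper names `build_upsert_request` and `build_delete_request`, the example service name, and the key value are illustrative assumptions and not part of the original setup. The requests are only constructed here, not sent.

```python
import json
import urllib.request


def build_upsert_request(endpoint, collection, name, body, admin_key, api_version):
    """Build (but do not send) a PUT request that creates or updates a resource."""
    url = f"{endpoint}/{collection}/{name}?api-version={api_version}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )


def build_delete_request(endpoint, collection, name, admin_key, api_version):
    """Build (but do not send) the matching DELETE request."""
    url = f"{endpoint}/{collection}/{name}?api-version={api_version}"
    return urllib.request.Request(url, headers={"api-key": admin_key}, method="DELETE")


# Hypothetical values; replace them with your own service name and key.
req = build_upsert_request(
    "https://example.search.windows.net",
    "indexes",
    "test-index-1",
    {"name": "test-index-1", "fields": []},
    "<key>",
    "2025-09-01",
)
delete_req = build_delete_request(
    "https://example.search.windows.net", "indexes", "test-index-1", "<key>", "2025-09-01"
)
print(req.get_method(), req.full_url)
print(delete_req.get_method(), delete_req.full_url)
```

Passing either object to `urllib.request.urlopen` would actually send it. Because PUT replaces the whole definition, the same call works for both the first creation and later updates.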

Fixed value definitions

## Azure AI Search endpoint
@endpoint=https://<ai search resource>.search.windows.net
## Index name
@index_name=test-index-1
## Skillset name
@skillset_name=test-skillset-1
## Data source name
@datasource_name=test-datasource-1
## Indexer name
@indexer_name=test-indexer-1
## Azure AI Search API key
@admin_key=<key>
## Azure AI Search API version
@api_version=2025-09-01
## AOAI resource URI
@aoai_resourceUri=https://<aoai resource>.openai.azure.com
## AOAI API key
@aoai_apiKey=<key>
## Deployment name of the embedding model
@aoai_embedding_deploymentId=text-embedding-3-small
## Model name of the embedding model
@aoai_embedding_modelName=text-embedding-3-small
## Blob Storage container name
@blob_container_name=<blob container name>
## Blob Storage connection string
@blob_connectionString=<blob connection string>
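The `@name=value` lines above are file variables, and each `{{name}}` in a later request is replaced by the corresponding value. The sketch below approximates that substitution in Python so the resulting URLs are easy to predict. It is not REST Client's actual implementation, and `parse_variables`, `substitute`, and the example endpoint are hypothetical names chosen for illustration.

```python
import re


def parse_variables(text):
    """Collect `@name=value` definitions; comment lines and blanks are ignored."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("@") and "=" in line:
            name, value = line[1:].split("=", 1)
            variables[name.strip()] = value.strip()
    return variables


def substitute(template, variables):
    """Replace each `{{name}}` with its value; unknown names are left untouched."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda match: variables.get(match.group(1), match.group(0)),
        template,
    )


definitions = """
## Index name
@index_name=test-index-1
## API version
@api_version=2025-09-01
## Endpoint (hypothetical service name)
@endpoint=https://example.search.windows.net
"""

variables = parse_variables(definitions)
url = substitute("{{endpoint}}/indexes/{{index_name}}?api-version={{api_version}}", variables)
print(url)
```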

Index definition

### Create index
PUT {{endpoint}}/indexes('{{index_name}}')?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

{
  "name": "{{index_name}}",
  "fields": [
    {
      "name": "parent_id",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": true,
      "facetable": true,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "title",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": true,
      "facetable": true,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "chunk_id",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": true,
      "facetable": true,
      "key": true,
      "analyzer": "keyword",
      "synonymMaps": []
    },
    {
      "name": "chunk",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "text_vector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": false,
      "stored": false,
      "sortable": false,
      "facetable": false,
      "key": false,
      "dimensions": 1536,
      "vectorSearchProfile": "azureOpenAi-text-profile",
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "suggesters": [],
  "analyzers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "normalizers": [],
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
  },
  "semantic": {
    "defaultConfiguration": "semantic-configuration",
    "configurations": [
      {
        "name": "semantic-configuration",
        "prioritizedFields": {
          "titleField": {
            "fieldName": "title"
          },
          "prioritizedContentFields": [
            {
              "fieldName": "chunk"
            }
          ],
          "prioritizedKeywordsFields": []
        }
      }
    ]
  },
  "vectorSearch": {
    "algorithms": [
      {
        "name": "vector-algorithm",
        "kind": "hnsw",
        "hnswParameters": {
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500,
          "metric": "cosine"
        }
      }
    ],
    "profiles": [
      {
        "name": "azureOpenAi-text-profile",
        "algorithm": "vector-algorithm",
        "vectorizer": "azureOpenAi-text-vectorizer"
      }
    ],
    "vectorizers": [
      {
        "name": "azureOpenAi-text-vectorizer",
        "kind": "azureOpenAI",
        "azureOpenAIParameters": {
          "resourceUri": "{{aoai_resourceUri}}",
          "deploymentId": "{{aoai_embedding_deploymentId}}",
          "apiKey": "{{aoai_apiKey}}",
          "modelName": "{{aoai_embedding_modelName}}"
        }
      }
    ],
    "compressions": []
  }
}

### Delete index
DELETE {{endpoint}}/indexes/{{index_name}}?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

The field definitions after running the request.
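The index definition ties several names together: the vector field points at a vector profile, the profile points at an algorithm and a vectorizer, and exactly one field is the key. A typo in any of these names is easy to make when editing the JSON by hand. The following sketch checks those references locally before sending the request. The function name `check_index_references` and the trimmed-down sample index are assumptions for illustration, and the service performs its own, more thorough validation.

```python
def check_index_references(index):
    """Return a list of human-readable problems found in an index definition."""
    problems = []
    vector_search = index.get("vectorSearch", {})
    algorithm_names = {a["name"] for a in vector_search.get("algorithms", [])}
    vectorizer_names = {v["name"] for v in vector_search.get("vectorizers", [])}

    profile_names = set()
    for profile in vector_search.get("profiles", []):
        profile_names.add(profile["name"])
        if profile.get("algorithm") not in algorithm_names:
            problems.append(f"profile {profile['name']}: unknown algorithm")
        vectorizer = profile.get("vectorizer")
        if vectorizer is not None and vectorizer not in vectorizer_names:
            problems.append(f"profile {profile['name']}: unknown vectorizer")

    fields = index.get("fields", [])
    for field in fields:
        profile = field.get("vectorSearchProfile")
        if profile is not None and profile not in profile_names:
            problems.append(f"field {field['name']}: unknown vector profile")

    key_fields = [f["name"] for f in fields if f.get("key")]
    if len(key_fields) != 1:
        problems.append(f"expected exactly one key field, found {len(key_fields)}")
    return problems


# Trimmed-down version of the index above, keeping only the cross-referenced names.
sample_index = {
    "fields": [
        {"name": "chunk_id", "key": True},
        {"name": "text_vector", "vectorSearchProfile": "azureOpenAi-text-profile"},
    ],
    "vectorSearch": {
        "algorithms": [{"name": "vector-algorithm"}],
        "profiles": [
            {
                "name": "azureOpenAi-text-profile",
                "algorithm": "vector-algorithm",
                "vectorizer": "azureOpenAi-text-vectorizer",
            }
        ],
        "vectorizers": [{"name": "azureOpenAi-text-vectorizer"}],
    },
}

print(check_index_references(sample_index))
```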

Registering the data source

### Update data source
PUT {{endpoint}}/datasources('{{datasource_name}}')?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

{
  "name": "{{datasource_name}}",
  "description": null,
  "type": "adlsgen2",
  "subtype": null,
  "credentials": {
    "connectionString": "{{blob_connectionString}};"
  },
  "container": {
    "name": "{{blob_container_name}}",
    "query": null
  },
  "dataChangeDetectionPolicy": null,
  "dataDeletionDetectionPolicy": null,
  "encryptionKey": null
}

### Delete data source
DELETE {{endpoint}}/datasources/{{datasource_name}}?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

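Updating the skillset

The skills used are shown below. The definition of the text split skill is a little hard to follow: textSplitMode is pages and unit is not specified (so it defaults to character), which means maximumPageLength is treated as a number of characters.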
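To make the character-based sizing concrete, the sketch below cuts a string into windows of at most 2000 characters, with each window repeating the last 500 characters of the previous one. These are the same values as maximumPageLength and pageOverlapLength in the skillset that follows. This is a deliberately naive approximation: as far as I understand, the real skill also tries to avoid cutting sentences in the middle, which this sketch ignores. The function name `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, max_length=2000, overlap=500):
    """Split text into character windows of max_length that overlap by `overlap`."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be >= 0 and smaller than max_length")
    step = max_length - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_length])
        if start + max_length >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# 4500 characters of sample text (repeating alphabet).
sample = "".join(chr(ord("a") + i % 26) for i in range(4500))
pages = chunk_text(sample)
print([len(page) for page in pages])
```

With these settings each new chunk advances by 1500 characters, so a 4500-character document produces three chunks rather than the two or three non-overlapping ones you might expect.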

### Create skillset
PUT {{endpoint}}/skillsets/{{skillset_name}}?api-version={{api_version}}
content-type: application/json
api-key: {{admin_key}}

{
  "name": "{{skillset_name}}",
  "description": "Skillset to chunk documents and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "text_split_skill",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "maximumPagesToTake": 0,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "text_embedding_skill",
      "description": "Skill to generate embeddings via Azure OpenAI",
      "context": "/document/pages/*",
      "resourceUri": "{{aoai_resourceUri}}",
      "deploymentId": "{{aoai_embedding_deploymentId}}",
      "apiKey": "{{aoai_apiKey}}",
      "dimensions": 1536,
      "modelName": "{{aoai_embedding_modelName}}",
      "inputs": [
        {
          "name": "text",
          "source": "/document/pages/*",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "text_vector"
        }
      ]
    }
  ],
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "{{index_name}}",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "chunk",
            "source": "/document/pages/*",
            "inputs": []
          },
          {
            "name": "text_vector",
            "source": "/document/pages/*/text_vector",
            "inputs": []
          },
          {
            "name": "title",
            "source": "/document/metadata_storage_name",
            "inputs": []
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}

### Delete skillset
DELETE {{endpoint}}/skillsets/{{skillset_name}}?api-version={{api_version}}
content-type: application/json
api-key: {{admin_key}}
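The indexProjections block in the skillset is what turns one source file into many search documents: each element under /document/pages/* becomes its own document, and with skipIndexingParentDocuments the parent file itself is not indexed. The sketch below imitates that mapping on a simplified in-memory structure. The shape of `enriched_document`, the function name `project_chunks`, and especially the `chunk_id` format are assumptions made for illustration; the service generates the real key values itself.

```python
def project_chunks(parent_key, enriched_document):
    """Flatten one enriched parent document into one search document per page."""
    children = []
    for position, page in enumerate(enriched_document["pages"]):
        children.append(
            {
                # Placeholder key format; the service generates real chunk_id values.
                "chunk_id": f"{parent_key}_pages_{position}",
                # parentKeyFieldName in the skillset: links the chunk to its source file.
                "parent_id": parent_key,
                # Mapping from /document/pages/*
                "chunk": page["text"],
                # Mapping from /document/pages/*/text_vector
                "text_vector": page["text_vector"],
                # Mapping from /document/metadata_storage_name
                "title": enriched_document["metadata_storage_name"],
            }
        )
    return children


# Simplified stand-in for the enrichment tree of a single PDF.
enriched_document = {
    "metadata_storage_name": "sample.pdf",
    "pages": [
        {"text": "first chunk", "text_vector": [0.1, 0.2]},
        {"text": "second chunk", "text_vector": [0.3, 0.4]},
    ],
}

documents = project_chunks("parent-1", enriched_document)
for document in documents:
    print(document["chunk_id"], document["title"], document["chunk"])
```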

Registering the indexer

### Create indexer
PUT {{endpoint}}/indexers/{{indexer_name}}?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

{
  "name": "{{indexer_name}}",
  "description": null,
  "dataSourceName": "{{datasource_name}}",
  "skillsetName": "{{skillset_name}}",
  "targetIndexName": "{{index_name}}",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": null,
    "maxFailedItemsPerBatch": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "allowSkillsetToReadFileData": true
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "title",
      "mappingFunction": null
    }
  ],
  "outputFieldMappings": [],
  "encryptionKey": null
}

### Delete indexer
DELETE {{endpoint}}/indexers/{{indexer_name}}?api-version={{api_version}}
Content-Type: application/json
api-key: {{admin_key}}

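Search

I ran these queries mainly with reference to the following.

Listing indexes

This is not a search, but I included it because I expect to use it.

### List indexes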
GET {{endpoint}}/indexes?api-version={{api_version}}  HTTP/1.1
Content-Type: application/json
api-key: {{admin_key}}

Full-text search

POST {{endpoint}}/indexes/{{index_name}}/docs/search?api-version={{api_version}}  HTTP/1.1
Content-Type: application/json
api-key: {{admin_key}}

  {
      "search": "search phrase",
      "select": "parent_id, title, chunk, chunk_id",
      "searchFields": "title, chunk",
      "count": true
  }

Vector search

Searching with a vector

POST {{endpoint}}/indexes/{{index_name}}/docs/search?api-version={{api_version}}  HTTP/1.1
Content-Type: application/json
api-key: {{admin_key}}

    {
        "count": true,
        "select": "parent_id, title, chunk, chunk_id",
        "vectorQueries": [
            {
                "vector": [-0.045507785, 0.028645637, (rest omitted)],
                "k": 5,
                "fields": "text_vector",
                "kind": "vector",
                "exhaustive": true
            }
        ]
    }

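Searching with text

The query text is converted to an embedding automatically inside the service, using the vectorizer assigned to the field's vector profile. Setting kind to text enables this.

### Vector search (text input)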
POST {{endpoint}}/indexes/{{index_name}}/docs/search?api-version={{api_version}}  HTTP/1.1
Content-Type: application/json
api-key: {{admin_key}}

    {
        "count": true,
        "select": "parent_id, title, chunk, chunk_id",
        "vectorQueries": [
            {
                "text": "search phrase",
                "k": 5,
                "fields": "text_vector",
                "kind": "text",
                "exhaustive": true
            }
        ]
    }

Hybrid search

POST {{endpoint}}/indexes/{{index_name}}/docs/search?api-version={{api_version}}  HTTP/1.1
Content-Type: application/json
api-key: {{admin_key}}

    {
        "count": true,
        "search": "search phrase",
        "select": "parent_id, title, chunk, chunk_id",
        "vectorQueries": [
            {
                "vector": [-0.045507785, 0.028645637, (rest omitted)],
                "k": 5,
                "fields": "text_vector",
                "kind": "vector",
                "exhaustive": true
            }
        ]
    }
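A hybrid query runs the text search and the vector search side by side and then merges the two ranked lists. Azure AI Search documents Reciprocal Rank Fusion (RRF) as the merging method. The sketch below shows the general RRF formula only; the constant 60, the function name `reciprocal_rank_fusion`, and the document ids are illustrative assumptions, not values taken from this index.

```python
def reciprocal_rank_fusion(rankings, constant=60):
    """Merge several ranked id lists; each hit adds 1 / (constant + rank), rank from 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (constant + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


text_ranking = ["a", "b", "c"]    # hypothetical full-text result order
vector_ranking = ["b", "c", "d"]  # hypothetical vector result order
fused = reciprocal_rank_fusion([text_ranking, vector_ranking])
print(fused)
```

Documents that appear in both lists accumulate two contributions, which is why they tend to rise above documents that rank first in only one list.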

Semantic hybrid search

Set semanticConfiguration to the name of the semantic configuration assigned to the index.

POST {{endpoint}}/indexes/{{index_name}}/docs/search?api-version={{api_version}}  HTTP/1.1
Content-Type: application/json
api-key: {{admin_key}}

{
    "count": true,
    "search": "search phrase",
    "select": "parent_id, title, chunk, chunk_id",
    "queryType": "semantic",
    "semanticConfiguration": "semantic-configuration",
    "vectorQueries": [
        {
            "vector": [-0.045507785, 0.028645637, (rest omitted)],
            "k": 5,
            "fields": "text_vector",
            "kind": "vector",
            "exhaustive": true
        }
    ]
}
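The query bodies in this section differ only in which keys are present, so they can be generated from one helper. The sketch below is one possible way to do that in Python. The function name `build_search_body` and its parameters are assumptions for illustration; the JSON keys and the default values mirror the requests shown in this article.

```python
def build_search_body(search=None, vector=None, vector_text=None,
                      semantic_configuration=None, k=5):
    """Build a search request body for full-text, vector, hybrid, or semantic hybrid."""
    if vector is not None and vector_text is not None:
        raise ValueError("pass either vector or vector_text, not both")

    body = {"count": True, "select": "parent_id, title, chunk, chunk_id"}
    if search is not None:
        body["search"] = search  # full-text part of the query

    vector_query = {"k": k, "fields": "text_vector", "exhaustive": True}
    if vector is not None:
        # Caller supplies an embedding that was computed beforehand.
        body["vectorQueries"] = [{**vector_query, "kind": "vector", "vector": list(vector)}]
    elif vector_text is not None:
        # The service embeds the text with the index's vectorizer.
        body["vectorQueries"] = [{**vector_query, "kind": "text", "text": vector_text}]

    if semantic_configuration is not None:
        body["queryType"] = "semantic"
        body["semanticConfiguration"] = semantic_configuration
    return body


full_text = build_search_body(search="search phrase")
vector_by_text = build_search_body(vector_text="search phrase")
semantic_hybrid = build_search_body(
    search="search phrase",
    vector=[-0.045507785, 0.028645637],  # truncated example vector, not a real embedding
    semantic_configuration="semantic-configuration",
)
print(sorted(semantic_hybrid))
```

Note that the full-text example in this article also sets searchFields, which this helper leaves out to keep the sketch short.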
