Indexing PDF files with Azure AI Search (Python SDK)

Last updated at 2025-09-07, posted at 2025-08-14

Late to the party, but I indexed PDF files from Azure Data Lake Storage Gen2 with Azure AI Search.
I did nothing special and mostly just followed an existing tutorial, adding a few explanations for my own understanding.

Procedure

1. Create the required resources

I created the following resources.

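Type	Purpose	Notes
Azure AI Search	The star of this article	The pricing tier is Basic; Free probably works too (unverified)
Azure AI services multi-service account	Used by the Entity Recognition skill
Azure OpenAI	Used for embeddings
Storage account	Data source location for the PDFs	Data Lake Storage Gen2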

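2. Azure AI Search settings

2.1. API access control

In the Azure portal, open Settings -> Keys and set API access control to "Both".

2.2. Create a system-assigned managed identity

In the Azure portal, open Settings -> Identity and create the managed identity on the "System assigned" tab.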

3. Azure OpenAI

Do the following. Screenshots are omitted.

Deploy the text-embedding-3-small model
Add the "Cognitive Services OpenAI User" role assignment for the AI Search managed identity

4. Storage account

Create a container and upload the PDF files.
Add the "Storage Blob Data Contributor" role assignment for the AI Search managed identity. Screenshots are omitted.

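5. Azure AI services multi-service account

Add the "Cognitive Services Contributor" role assignment for the AI Search managed identity. I granted this while troubleshooting permission issues, so it may not actually be required (unverified). Screenshots are omitted.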

5. Write the Python script

Python 3.12.9
Ubuntu 22.04.5
Run in Jupyter

Excerpt from pyproject.toml

jupyterlab = "^4.4.5"
azure-identity = "^1.24.0"
azure-search-documents = "11.6.0b12"
python-dotenv = "^1.1.1"

5.1. Package imports and loading environment variables

It's long...

import os

from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SearchIndex,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    SplitSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    AzureOpenAIEmbeddingSkill,
    EntityRecognitionSkill,
    SearchIndexerIndexProjection,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    IndexProjectionMode,
    SearchIndexerSkillset,
    CognitiveServicesAccountKey,
    SearchIndexer,
    FieldMapping
)
from azure.search.documents.models import VectorizableTextQuery
from dotenv import load_dotenv

load_dotenv(override=True)

The environment variables are set in .env.

.env

AZURE_SEARCH_ENDPOINT="https://<resource>.search.windows.net"
AZURE_OPENAI_ENDPOINT="https://<resource>.openai.azure.com/"
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME="text-embedding-3-small"

# Key for the Azure AI services account
AI_SERVICE_KEY="<KEY>"
SUBSCRIPTION_ID="<ID>"
RESOURCE_GROUP_NAME="rg-rag-jp"
STORAGE_ACCOUNT_NAME="storageragjpe"
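
Every later cell reads these values with os.environ[...], so a missing entry only shows up as a KeyError at the point of use. Below is a minimal sketch of an up-front check. The helper name require_env and the REQUIRED_VARS tuple are assumptions made for illustration and are not part of the original notebook.

```python
import os
from typing import Mapping

# Names copied from the .env file above. Treating all of them as required
# is an assumption of this sketch.
REQUIRED_VARS = (
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
    "AI_SERVICE_KEY",
    "SUBSCRIPTION_ID",
    "RESOURCE_GROUP_NAME",
    "STORAGE_ACCOUNT_NAME",
)


def require_env(names=REQUIRED_VARS, environ: Mapping[str, str] = os.environ) -> dict:
    """Return the requested variables, or raise one error listing every missing one."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling require_env() right after load_dotenv() would report all missing names at once instead of one per failing cell.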

5.2. Initialization

Initial setup such as defining constants.

load_dotenv(override=True)
credential = DefaultAzureCredential()

AZURE_SEARCH_SERVICE: str = os.environ["AZURE_SEARCH_ENDPOINT"]
EMBEDDING_DEPLOYMENT: str = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"]
AZURE_OPENAI_SERVICE: str = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_DEPLOYMENT: str = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"]
AI_SERVICE_KEY: str = os.environ["AI_SERVICE_KEY"]
INDEX_NAME: str = "py-rag-tutorial-idx"
CONNECTION_STRING: str = (
 f"ResourceId=/subscriptions/{os.environ['SUBSCRIPTION_ID']}/"
 f"resourceGroups/{os.environ['RESOURCE_GROUP_NAME']}/providers/"
 f"Microsoft.Storage/storageAccounts/{os.environ['STORAGE_ACCOUNT_NAME']};"
)
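
The ResourceId=...; form is, as far as I understand, the connection string shape used when the indexer reaches the storage account through the search service's managed identity rather than an account key. The sketch below builds the same string from its three parts and rejects obviously malformed input. The function name is hypothetical.

```python
def build_resource_id_connection_string(
    subscription_id: str, resource_group: str, storage_account: str
) -> str:
    """Build a 'ResourceId=...;' string in the same shape as CONNECTION_STRING above."""
    parts = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "storage_account": storage_account,
    }
    for label, value in parts.items():
        # A slash or semicolon inside a part would silently change the path.
        if not value or "/" in value or ";" in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return (
        f"ResourceId=/subscriptions/{subscription_id}/"
        f"resourceGroups/{resource_group}/providers/"
        f"Microsoft.Storage/storageAccounts/{storage_account};"
    )
```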

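5.3. Create the index

5.3.1. Create the index schema

5.3.1.1. Define the fields

Apart from locations, these are the usual fields for a vector index built from files.
locations stores the results of entity recognition, so you normally do not need it.
Change the number of dimensions of text_vector to match your model. This article uses text-embedding-3-small.
The keyword analyzer assigned to the "chunk_id" field is documented as follows:

Treats the entire content of a field as a single token.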
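
To make the quoted behaviour concrete, here is a toy comparison in plain Python. It is only an illustration: neither function is the analyzer implementation used by Azure AI Search, and the sample key value is made up.

```python
import re


def keyword_tokens(value: str) -> list[str]:
    """Toy stand-in for a keyword analyzer: the whole value is one token."""
    return [value]


def word_tokens(value: str) -> list[str]:
    """Toy stand-in for a word-splitting analyzer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", value.lower())


sample_key = "a1b2c3_pages_0"  # hypothetical chunk_id value
print(keyword_tokens(sample_key))  # ['a1b2c3_pages_0']
print(word_tokens(sample_key))     # ['a1b2c3', 'pages', '0']
```

With a single token, a key can only match as a whole, which suits an identifier field.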

index_client = SearchIndexClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)
fields = [
    SearchField(name="parent_id", type=SearchFieldDataType.String),
    SearchField(name="title", type=SearchFieldDataType.String),
    SearchField(
        name="locations",
        type=SearchFieldDataType.Collection(SearchFieldDataType.String),
        filterable=True,
    ),
    SearchField(
        name="chunk_id",
        type=SearchFieldDataType.String,
        key=True,
        sortable=True,
        filterable=True,
        facetable=True,
        analyzer_name="keyword",
    ),
    SearchField(
        name="chunk",
        type=SearchFieldDataType.String,
        sortable=False,
        filterable=False,
        facetable=False,
    ),
    SearchField(
        name="text_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=1536, # for text-embedding-3-small; change this to match your model
        vector_search_profile_name="myHnswProfile",
    ),
]
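
Because the dimension count has to match the embedding model, one option is to look it up from the model name instead of hard-coding 1536. The sketch below assumes the default output sizes of the listed OpenAI embedding models. The mapping and the function name are mine, and it only works as-is when the deployment name equals the model name, as in the .env above.

```python
# Default output sizes of the models (without a custom "dimensions" setting).
DEFAULT_EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def embedding_dimensions(model_name: str) -> int:
    """Return the default vector size for a known embedding model."""
    try:
        return DEFAULT_EMBEDDING_DIMENSIONS[model_name]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model_name!r}") from None


print(embedding_dimensions("text-embedding-3-small"))  # 1536
```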

After the index is created, you can inspect the resulting fields in the Azure portal.

5.3.1.2. Define the vector profile

# Configure the vector search configuration
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(name="myHnsw"),
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw",
            vectorizer_name="myOpenAI",
        )
    ],
    vectorizers=[
        AzureOpenAIVectorizer(
            vectorizer_name="myOpenAI",
            kind="azureOpenAI",
            parameters=AzureOpenAIVectorizerParameters(
                resource_url=AZURE_OPENAI_SERVICE,
                deployment_name=AZURE_OPENAI_DEPLOYMENT,
                model_name=AZURE_OPENAI_DEPLOYMENT,
            ),
        ),
    ],
)
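
The names in this configuration have to line up: the field's vector_search_profile_name points at a profile, and the profile points at an algorithm and a vectorizer. Below is a sketch of that cross-reference check over plain dictionaries rather than SDK objects. The dictionary layout is an assumption made for the example.

```python
def check_vector_config(field_profile: str, config: dict) -> list[str]:
    """Return a list of dangling name references (empty when everything resolves)."""
    problems = []
    algorithms = set(config["algorithms"])
    vectorizers = set(config["vectorizers"])
    profiles = config["profiles"]  # profile name -> (algorithm name, vectorizer name)

    if field_profile not in profiles:
        problems.append(f"field references unknown profile {field_profile!r}")
    for name, (algorithm, vectorizer) in profiles.items():
        if algorithm not in algorithms:
            problems.append(f"profile {name!r} references unknown algorithm {algorithm!r}")
        if vectorizer not in vectorizers:
            problems.append(f"profile {name!r} references unknown vectorizer {vectorizer!r}")
    return problems


# The same names as in the VectorSearch definition above.
config = {
    "algorithms": ["myHnsw"],
    "vectorizers": ["myOpenAI"],
    "profiles": {"myHnswProfile": ("myHnsw", "myOpenAI")},
}
print(check_vector_config("myHnswProfile", config))  # []
```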

After the index is created, you can inspect the vector profile in the Azure portal.

5.3.1.3. Run the index creation

Create the index schema.

# Create the search index
index = SearchIndex(name=INDEX_NAME, fields=fields, vector_search=vector_search)
result = index_client.create_or_update_index(index)
print(f"{result.name} created")

5.3.2. Create the data source connection

Create the data source connection.

indexer_client = SearchIndexerClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)
container = SearchIndexerDataContainer(name="rag-doc-test")
data_source_connection = SearchIndexerDataSourceConnection(
    name="py-rag-tutorial-ds",
    type="adlsgen2",  # use "azureblob" for a plain Blob Storage container
    connection_string=CONNECTION_STRING,
    container=container
)
test = indexer_client.get_indexers()
data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)

print(f"Data source '{data_source.name}' created or updated")

Check the created data source in the portal.

5.3.3. Create the skillset

Define three skills.

skillset_name = "py-rag-tutorial-ss"

# Chunking
split_skill = SplitSkill(  
    description="Split skill to chunk documents",  
    text_split_mode="pages",  
    context="/document",  
    maximum_page_length=2000,  
    page_overlap_length=500,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/content"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="textItems", target_name="pages")  
    ],  
)

# Embedding
embedding_skill = AzureOpenAIEmbeddingSkill(  
    description="Skill to generate embeddings via Azure OpenAI",  
    context="/document/pages/*",  
    resource_url=AZURE_OPENAI_SERVICE,  
    deployment_name=AZURE_OPENAI_DEPLOYMENT,  
    model_name=AZURE_OPENAI_DEPLOYMENT,
    dimensions=1536,#1024,
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/pages/*"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="embedding", target_name="text_vector")  
    ],  
)

# Entity Recognition
entity_skill = EntityRecognitionSkill(
    description="Skill to recognize entities in text",
    context="/document/pages/*",
    categories=["Location"],
    default_language_code="en",
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/pages/*")
    ],
    outputs=[
        OutputFieldMappingEntry(name="locations", target_name="locations")
    ]
)
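
The SplitSkill settings above (maximum_page_length=2000, page_overlap_length=500) mean each chunk holds up to 2000 characters and shares 500 characters with the previous one. The sketch below shows only that arithmetic with a plain character window. It is not the skill's implementation, which as far as I understand also tries to avoid cutting sentences, and the function name is hypothetical.

```python
def chunk_text(text: str, max_len: int = 2000, overlap: int = 500) -> list[str]:
    """Split text into windows of max_len characters that overlap by `overlap`."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    step = max_len - overlap  # each new chunk starts this far after the previous one
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


sample = "".join(chr(ord("a") + i % 26) for i in range(5000))
chunks = chunk_text(sample)
print(len(chunks))                          # 3
print(chunks[0][-500:] == chunks[1][:500])  # True
```

A 5000-character document therefore yields three chunks starting at offsets 0, 1500, and 3000.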

Next, define the index projection and create the skillset.
You can check it in the Azure portal under Search management -> Skillsets in AI Search (it is just JSON, so I omit it here).

index_projections = SearchIndexerIndexProjection(  
    selectors=[  
        SearchIndexerIndexProjectionSelector(  
            target_index_name=INDEX_NAME,  
            parent_key_field_name="parent_id",  
            source_context="/document/pages/*",  
            mappings=[  
                InputFieldMappingEntry(name="chunk", source="/document/pages/*"),  
                InputFieldMappingEntry(name="text_vector", source="/document/pages/*/text_vector"),
                InputFieldMappingEntry(name="locations", source="/document/pages/*/locations"),  
                InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),  
            ],  
        ),  
    ],  
    parameters=SearchIndexerIndexProjectionsParameters(  
        projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS  
    ),  
) 

skillset = SearchIndexerSkillset(  
    name=skillset_name,  
    description="Skillset to chunk documents and generating embeddings",  
    skills=[split_skill, embedding_skill, entity_skill],  
    index_projection=index_projections,
    cognitive_services_account=CognitiveServicesAccountKey(key=AI_SERVICE_KEY),
)

client = SearchIndexerClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)  
client.create_or_update_skillset(skillset)  
print(f"{skillset.name} created")
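
The index projection turns one enriched source document into one index document per chunk, and with SKIP_INDEXING_PARENT_DOCUMENTS only those child documents are written. The sketch below mimics the selector's mappings over a simplified dictionary. The input layout and the chunk_id format are assumptions for illustration, since the service generates the real keys.

```python
def project_children(parent_key: str, enriched: dict) -> list[dict]:
    """Emit one index document per chunk, mirroring the selector's mappings."""
    children = []
    for i, page in enumerate(enriched["pages"]):
        children.append(
            {
                "chunk_id": f"{parent_key}_pages_{i}",  # hypothetical key format
                "parent_id": parent_key,  # parent_key_field_name="parent_id"
                "chunk": page["text"],  # /document/pages/*
                "text_vector": page["text_vector"],  # /document/pages/*/text_vector
                "locations": page["locations"],  # /document/pages/*/locations
                "title": enriched["metadata_storage_name"],  # same for every child
            }
        )
    return children


enriched_doc = {
    "metadata_storage_name": "sample.pdf",  # made-up file name
    "pages": [
        {"text": "first chunk", "text_vector": [0.1, 0.2], "locations": ["Tokyo"]},
        {"text": "second chunk", "text_vector": [0.3, 0.4], "locations": []},
    ],
}
for child in project_children("doc1", enriched_doc):
    print(child["chunk_id"], child["title"], child["locations"])
```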

5.3.4. Create the indexer

Create the indexer.

# Create an indexer  
indexer_name = "py-rag-tutorial-idxr" 

indexer_parameters = None

indexer = SearchIndexer(  
    name=indexer_name,  
    description="Indexer to index documents and generate embeddings",  
    skillset_name=skillset_name,  
    target_index_name=INDEX_NAME,  
    data_source_name=data_source.name,
    # Map the metadata_storage_name field to the title field in the index to display the PDF title in the search results  
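    # Not needed because the skillset's index projection already maps it (unverified, but the projection applies to the child documents, while the indexer's field_mappings only does a plain copy on the parent side)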
    # field_mappings=[FieldMapping(source_field_name="metadata_storage_name", target_field_name="title")],
    parameters=indexer_parameters
)  

# Create and run the indexer  
indexer_client = SearchIndexerClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)  
indexer_result = indexer_client.create_or_update_indexer(indexer)  

print(f' {indexer_name} is created and running. Give the indexer a few minutes before running a query.')
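
Instead of waiting a few minutes by hand, the notebook could poll until the run finishes. The sketch below is a generic polling loop that takes any callable returning the last run's status string. The function name, the timeout values, and the commented SDK wiring at the bottom are untested assumptions.

```python
import time
from typing import Callable, Optional


def wait_for_indexer(
    get_last_status: Callable[[], Optional[str]],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the status is something other than None or "inProgress"."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_last_status()
        if status is not None and status != "inProgress":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("indexer did not finish in time")
        sleep(interval_s)


# Possible wiring with the client from the cell above (untested assumption):
# def last_status():
#     result = indexer_client.get_indexer_status(indexer_name).last_result
#     return result.status if result else None
# print(wait_for_indexer(last_status))
```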

You can see the created indexer in the Azure portal.

5.4. Test

Run a test query against Azure AI Search.

# Hybrid search: keyword search on the query text plus a vector query, with the text vectorized by the index's vectorizer
query = "固定通信ネットワークとは？"  

search_client = SearchClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential, index_name=INDEX_NAME)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")
  
results = search_client.search(  
    search_text=query,  
    vector_queries= [vector_query],
    select=["chunk"],
    top=1
)  

for result in results:  
    print(f"Score: {result['@search.score']}")
    print(f"Chunk: {result['chunk']}")
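
In the query above, k_nearest_neighbors=50 asks the vector side for the 50 closest chunks, and top=1 keeps only the best final result. The toy below shows what "closest" means with an exact search, assuming cosine similarity as the metric. The real index uses HNSW, which is approximate, and the vectors here are made up.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: list[float], docs: dict, k: int) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(docs.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


docs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(nearest([1.0, 0.1], docs, k=2))  # ['a', 'b']
```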
