
Searching PDFs on Blob Storage with Azure Cognitive Search (Python)

Last updated at 2021-04-27, posted at 2021-04-20

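#About this article
This article summarizes how Azure Cognitive Search relates to Azure Blob Storage and how to connect the two. Microsoft publishes Azure Cognitive Search SDK samples for five languages [1]; this article uses the REST API from Python instead.

[1] https://docs.microsoft.com/ja-jp/azure/search/samples-rest, etc.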

###References
https://qiita.com/omiyu/items/7fa6cf49652a894cedb8
Introduction to Azure Cognitive Search
Cognitive Search and Blob

#Prerequisites
An Azure subscription
Azure Cognitive Search (free tier)
Azure Blob Storage
Python

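#Steps
We will go through the following four steps:

Create the data source
Create the skillset
Create the index
Create the indexer

Once all four steps are done, we will try searching the PDF files with the search service we built.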

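##1. Create the data source
Complete the prerequisites above before starting this step. In addition, the following must be done:

An Azure Blob container has been created

Upload a good number of dummy PDFs to the new Blob container. In my case the dummy PDFs are conference papers, some of which can be downloaded for free from this site.

Create a new Python file. Here it is named "create_datasource.py".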

create_datasource.py

import requests
import json

cognitive_url = "<your Azure Cognitive Search URL>" # usually "https://<your service name>.search.windows.net"
api_key = "<your Azure Cognitive Search API key>"

request_body = {
  "name" : "<your data source name>", # any data source name you like
  "description" : "Create the data source",
  "type" : "azureblob",
  "credentials" :
  { "connectionString" :
    "DefaultEndpointsProtocol=https;AccountName=<your Azure Storage account name>;AccountKey=<your Azure Storage account key>;"
  },
  "container" : { "name" : "<your Blob container name>" }
}

headers = { "api-key" : api_key, "Content-Type" : "application/json" } # the header name must not contain a trailing space
ret = requests.post(cognitive_url + "/datasources?api-version=2020-06-30", headers=headers, data=json.dumps(request_body))

print(ret)

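Run the script; a status code of 201 means the data source was created. Check the data source in the Azure portal: under Data sources you will see a name and a table. The name is the data source name you chose, and the table is the Blob container name.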
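Hard-coding the URL and the API key in every script is easy to get wrong. As a minimal sketch, the settings could be read from environment variables instead. The variable names AZURE_SEARCH_ENDPOINT and AZURE_SEARCH_API_KEY and the helper load_settings are hypothetical names chosen for this example, not something Azure defines.

```python
import os


def load_settings(env=os.environ):
    """Return (service_url, headers) built from environment variables.

    AZURE_SEARCH_ENDPOINT and AZURE_SEARCH_API_KEY are hypothetical
    variable names used only in this sketch.
    """
    endpoint = env["AZURE_SEARCH_ENDPOINT"].rstrip("/")
    # Accept a bare host name such as "myservice.search.windows.net".
    if "://" not in endpoint:
        endpoint = "https://" + endpoint
    headers = {
        "api-key": env["AZURE_SEARCH_API_KEY"],
        "Content-Type": "application/json",
    }
    return endpoint, headers


# Demonstration with a fake environment (no real service or key involved).
fake_env = {
    "AZURE_SEARCH_ENDPOINT": "myservice.search.windows.net/",
    "AZURE_SEARCH_API_KEY": "dummy-key",
}
service_url, headers = load_settings(fake_env)
print(service_url + "/datasources?api-version=2020-06-30")
```

The returned values can be used in place of cognitive_url and headers in the scripts of this article.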

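##2. Create the skillset
Skillsets are explained in the reference below.
Reference: https://docs.microsoft.com/ja-jp/rest/api/searchservice/create-skillset

This article uses the following four skills:

Entity Recognition
Language Detection
Text Split
Key Phrase Extraction

If you want to try other skills, see the reference below.
Reference: https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-predefined-skills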

create_skillset.py

import requests
import json


cognitive_url = "<your Azure Cognitive Search URL>" # usually "https://<your service name>.search.windows.net"
api_key = "<your Azure Cognitive Search API key>"

request_body = {
  "description": "Entity extraction, language detection, key phrase extraction",
  "skills":
  [
    {
      "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
      "categories": [ "Person", "Organization", "Location" ],
      "defaultLanguageCode": "en", # "ja" also works
      "inputs": [
        { "name": "text", "source": "/document/content" }
      ],
      "outputs": [
        { "name": "persons", "targetName": "persons" },
        { "name": "organizations", "targetName": "organizations" },
        { "name": "locations", "targetName": "locations" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
      "inputs": [
        { "name": "text", "source": "/document/content" }
      ],
      "outputs": [
        { "name": "languageCode", "targetName": "languageCode" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "textSplitMode" : "pages",
      "maximumPageLength": 4000,
      "inputs": [
        { "name": "text", "source": "/document/content" },
        { "name": "languageCode", "source": "/document/languageCode" }
      ],
      "outputs": [
        { "name": "textItems", "targetName": "pages" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
      "context": "/document/pages/*",
      "inputs": [
        { "name": "text", "source": "/document/pages/*" },
        { "name":"languageCode", "source": "/document/languageCode" }
      ],
      "outputs": [
        { "name": "keyPhrases", "targetName": "keyPhrases" }
      ]
    }
  ]
}

headers = { "api-key" : api_key, "Content-Type" : "application/json" } # the header name must not contain a trailing space
ret = requests.put(cognitive_url + "/skillsets/<skillset name of your choice>?api-version=2020-06-30", headers=headers, data=json.dumps(request_body))

print(ret)

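This skillset analyzes the document content, so the document content is the input of each skill. The output of each skill is used later in the index. In inputs, name is the kind of input the skill expects, and source is the path the input is read from. For example, if you wanted to extract the e-mail addresses of the people mentioned in a document, the skill definition would look like this.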

{
    "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
    "categories": [ "Person", "Organization", "Location", "Email" ],
    "defaultLanguageCode": "en",
    "inputs": [
        { "name": "text", "source": "/document/content" }
    ],
    "outputs": [
        { "name": "emails", "targetName": "emails" }
    ]
}

Note, however, that e-mail recognition supports only the defaultLanguageCode values de, en, es, fr, and zh-hans. Run the script; if everything is fine, you get status code 201. Check Skillsets in the Azure portal: the skillset you just created is listed there.
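It can help to see where each skill output ends up, because those paths are what later steps refer to. The following is a minimal sketch under one assumption: an output is stored at the skill's context (which defaults to /document) followed by its targetName. The helper output_paths is a hypothetical name, and the skill list is a trimmed-down copy of the skillset in this article.

```python
def output_paths(skills):
    """Return the enrichment path of every skill output.

    Assumption of this sketch: an output lives at <context>/<targetName>,
    where context defaults to "/document" and targetName defaults to the
    output name.
    """
    paths = []
    for skill in skills:
        context = skill.get("context", "/document")
        for out in skill.get("outputs", []):
            paths.append(context + "/" + out.get("targetName", out["name"]))
    return paths


# Trimmed-down version of the skillset used in this article.
skills = [
    {"outputs": [{"name": "persons", "targetName": "persons"}]},
    {"outputs": [{"name": "languageCode", "targetName": "languageCode"}]},
    {"outputs": [{"name": "textItems", "targetName": "pages"}]},
    {
        "context": "/document/pages/*",
        "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
    },
]

for path in output_paths(skills):
    print(path)
```

The last path, /document/pages/*/keyPhrases, shows why the key phrases are addressed per page: the Key Phrase Extraction skill runs once for each page produced by the Text Split skill.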

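##3. Create the index
An index organizes data much like a table in a database. It stores document data according to a JSON schema, and it can also define features such as suggesters and scoring profiles. The index uses the outputs of the skills from the previous step. The Entity Recognition skill recognizes the authors and other people, the organizations, and the locations mentioned in a document. The Language Detection skill recognizes the language of the document. The Text Split skill only supports the Key Phrase Extraction skill: it splits the content into pages, and the Key Phrase Extraction skill then recognizes keywords in each page as keyPhrases.

The skill outputs so far are persons, organizations, locations, languageCode, and keyPhrases. All of them go into the index, together with the id, the file name, and the document content.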
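Each index field consists of a name, an Edm type, and a handful of boolean attributes, so the definitions get repetitive. As a minimal sketch, a small helper can build them. make_field is a hypothetical helper written for this example; setting every unspecified attribute to False is a choice of this sketch, not the service's own default.

```python
def make_field(name, edm_type="Edm.String", key=False, **attrs):
    """Build one index field definition as a dict.

    Hypothetical helper: attributes that are not passed are set to False.
    """
    field = {"name": name, "type": edm_type}
    if key:
        field["key"] = True
    for attr in ("searchable", "filterable", "facetable", "sortable"):
        field[attr] = bool(attrs.get(attr, False))
    return field


fields = [
    make_field("id", key=True, searchable=True, sortable=True),
    make_field("fileName", searchable=True, sortable=True),
    make_field("content", searchable=True),
    make_field("persons", "Collection(Edm.String)",
               searchable=True, filterable=True, facetable=True),
]

for f in fields:
    print(f)
```

The resulting list can be used as the value of "fields" in the request body.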

create_index.py

import requests
import json


cognitive_url = "<your Azure Cognitive Search URL>" # usually "https://<your service name>.search.windows.net"
api_key = "<your Azure Cognitive Search API key>"

request_body = {
  "fields": [
    {
      "name": "id", # ID of the file
      "type": "Edm.String",
      "key": True,
      "searchable": True,
      "filterable": False,
      "facetable": False,
      "sortable": True
    },
    {
      "name": "metadata_storage_name", # blob name taken from the file metadata
      "type": "Edm.String",
      "searchable": False,
      "filterable": False,
      "facetable": False,
      "sortable": False
    },
    {
      "name": "fileName", # file name
      "type": "Edm.String",
      "sortable": True,
      "searchable": True,
      "filterable": False,
      "facetable": False
    },
    {
      "name": "content", # content of the file
      "type": "Edm.String",
      "sortable": False,
      "searchable": True,
      "filterable": False,
      "facetable": False
    },
    {
      "name": "languageCode", # language of the file
      "type": "Edm.String",
      "searchable": True,
      "filterable": False,
      "facetable": False
    },
    {
      "name": "keyPhrases", # key phrases of the file
      "type": "Collection(Edm.String)",
      "searchable": True,
      "filterable": False,
      "facetable": False
    },
    {
      "name": "persons", # authors and other people named in the file
      "type": "Collection(Edm.String)",
      "searchable": True,
      "sortable": False,
      "filterable": True,
      "facetable": True
    },
    {
      "name": "organizations", # organizations named in the file
      "type": "Collection(Edm.String)",
      "searchable": True,
      "sortable": False,
      "filterable": True,
      "facetable": True
    },
    {
      "name": "locations", # locations named in the file
      "type": "Collection(Edm.String)",
      "searchable": True,
      "sortable": False,
      "filterable": True,
      "facetable": True
    }
  ]
}

headers = { "api-key" : api_key, "Content-Type" : "application/json" } # the header name must not contain a trailing space
ret = requests.put(cognitive_url + "/indexes/<index name of your choice>?api-version=2020-06-30", headers=headers, data=json.dumps(request_body))

print(ret)

Run the script; if everything is fine, you get status code 201. Check Indexes in the Azure portal: the index you just created is listed there.
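The scripts in this article only print the response object, so the status code has to be read by eye. Here is a minimal sketch of a helper that turns the code into a short message. It assumes, as far as I understand the REST API, that a PUT on a name that already exists answers 204 instead of 201. describe_status is a hypothetical name.

```python
def describe_status(status_code):
    """Turn an HTTP status code into a short message.

    Assumption of this sketch: 201 means the resource was created and
    204 means an existing resource was updated.
    """
    if status_code == 201:
        return "created"
    if status_code == 204:
        return "updated (no content returned)"
    if 200 <= status_code < 300:
        return "succeeded"
    if 400 <= status_code < 500:
        return "client error: check the name, the api-key, and the request body"
    if status_code >= 500:
        return "server error"
    return "unexpected status"


for code in (201, 204, 403, 503):
    print(code, describe_status(code))
```

With the scripts above it would be called as describe_status(ret.status_code).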

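##4. Create the indexer
The last step is to create the indexer. An indexer automates building an index from a supported data source. Put simply, the indexer connects the data source, the index, and the skillset created earlier and maps data between them.

Create a new Python file and name it create_indexer.py.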

create_indexer.py

import requests
import json


cognitive_url = "<your Azure Cognitive Search URL>" # usually "https://<your service name>.search.windows.net"
api_key = "<your Azure Cognitive Search API key>"

request_body = {
  "name":"<indexer name of your choice>",
  "dataSourceName" : "<the data source name created earlier>",
  "targetIndexName" : "<the index name created earlier>",
  "skillsetName" : "<the skillset name created earlier>",
  "fieldMappings" : [
    {
      "sourceFieldName" : "metadata_storage_path",
      "targetFieldName" : "id",
      "mappingFunction" :
        { "name" : "base64Encode" }
    },
    {
      "sourceFieldName" : "metadata_storage_name",
      "targetFieldName" : "metadata_storage_name",
      "mappingFunction" :
        { "name" : "base64Encode" }
    },
    {
      "sourceFieldName" : "content",
      "targetFieldName" : "content"
    },
    {
      "sourceFieldName" : "metadata_storage_name",
      "targetFieldName" : "fileName"
    }
  ],
  "outputFieldMappings" :
  [
    {
      "sourceFieldName" : "/document/persons",
      "targetFieldName" : "persons"
    },
    {
      "sourceFieldName" : "/document/organizations",
      "targetFieldName" : "organizations"
    },
    {
      "sourceFieldName" : "/document/locations",
      "targetFieldName" : "locations"
    },
    {
      "sourceFieldName" : "/document/pages/*/keyPhrases/*",
      "targetFieldName" : "keyPhrases"
    },
    {
      "sourceFieldName": "/document/languageCode",
      "targetFieldName": "languageCode"
    }
  ],
  "parameters":
  {
    "maxFailedItems":-1,
    "maxFailedItemsPerBatch":-1,
    "configuration":
    {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "firstLineContainsHeaders": False,
      "delimitedTextDelimiter": ","
    }
  }
}

headers = { "api-key" : api_key, "Content-Type" : "application/json" } # the header name must not contain a trailing space
ret = requests.put(cognitive_url + "/indexers/<indexer name>?api-version=2020-06-30", headers=headers, data=json.dumps(request_body))

print(ret)

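When creating the indexer, dataSourceName and targetIndexName are required, and skillsetName is needed to attach the skillset. You only have to fill in the names created in the previous steps.

Next are fieldMappings and outputFieldMappings. Neither is strictly required, but they are easier to use once explained. They become necessary when a source field name differs from the target field name you want; fieldMappings describes that correspondence. For example, the index defines a field called fileName whose value should be the PDF file name from the data source, but the data source exposes that value as metadata_storage_name. So the indexer's fieldMappings assigns metadata_storage_name to the fileName field.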

{
   "sourceFieldName" : "metadata_storage_name",
   "targetFieldName" : "fileName"
}

Next is outputFieldMappings. This property maps the outputs of the skills to the fields of the index.
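A mapping whose target does not exist in the index is an easy mistake to make. As a minimal sketch, the check below compares the targets of both mapping lists with the field names of the index before anything is sent. unmapped_targets is a hypothetical helper, and the sample data is a shortened version of the definitions in this article with one field left out on purpose.

```python
def unmapped_targets(index_fields, indexer):
    """Return the mapping targets that are not defined as index fields."""
    names = {field["name"] for field in index_fields}
    mappings = (indexer.get("fieldMappings", [])
                + indexer.get("outputFieldMappings", []))
    missing = set()
    for mapping in mappings:
        # Without targetFieldName, the source name is used as the target.
        target = mapping.get("targetFieldName", mapping["sourceFieldName"])
        if target not in names:
            missing.add(target)
    return sorted(missing)


# Shortened sample data; "keyPhrases" is deliberately missing from the index.
index_fields = [{"name": "id"}, {"name": "fileName"}, {"name": "persons"}]
indexer = {
    "fieldMappings": [
        {"sourceFieldName": "metadata_storage_name",
         "targetFieldName": "fileName"},
    ],
    "outputFieldMappings": [
        {"sourceFieldName": "/document/persons",
         "targetFieldName": "persons"},
        {"sourceFieldName": "/document/pages/*/keyPhrases/*",
         "targetFieldName": "keyPhrases"},
    ],
}

print(unmapped_targets(index_fields, indexer))
```

An empty list means every mapping points at a field that exists in the index.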

Run the script; if everything is fine, you get status code 201. Check Indexers in the Azure portal: the indexer you just created is listed there.

#Trying out queries
The Azure Cognitive Search service is now set up. Next, let's try it out by searching the PDF files through the REST API. First, create a new .py file.

search_test.py

import requests
import json


cognitive_url = "<your Azure Cognitive Search URL>" # usually "https://<your service name>.search.windows.net"
api_key = "<your Azure Cognitive Search API key>"

keyword = "*"
query = "search=" + keyword + "&$count=true&$select=*"

headers = { "api-key" : api_key, "Content-Type" : "application/json" } # the header name must not contain a trailing space
ret = requests.get(cognitive_url + "/indexes/<your index name>/docs?"+ query + "&api-version=2020-06-30", headers=headers)

print(ret)
res = json.loads(ret.text)
print(res["value"])

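Fill in your own Cognitive Search URL, api_key, and index name, then call the API. The result comes back as JSON text. Because the PDF files are long, there is no need to print the whole JSON response. Comment out print(res["value"]) and add the following code below it.

print("Number of documents: " + str(len(res["value"])))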
for item in res["value"]:
    print(item["fileName"])

This confirms the number of documents and the fileName of each one.
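One caveat: as far as I know, a query returns at most 50 documents per request by default, so the length of "value" is not always the total number of matches. When $count=true is sent, the response also carries the total in "@odata.count". The sketch below reads both from a response; the sample response is made up for illustration, and summarize is a hypothetical helper.

```python
def summarize(response_json):
    """Return (total_matches, file_names) from a search response.

    Falls back to the number of returned documents when the response
    has no "@odata.count" entry.
    """
    hits = response_json.get("value", [])
    total = response_json.get("@odata.count", len(hits))
    names = [hit.get("fileName", "<no fileName>") for hit in hits]
    return total, names


# Made-up response for illustration only.
sample = {
    "@odata.count": 120,
    "value": [
        {"@search.score": 1.0, "fileName": "paper_a.pdf"},
        {"@search.score": 0.8, "fileName": "paper_b.pdf"},
    ],
}

total, names = summarize(sample)
print("Total matches:", total)
for name in names:
    print(name)
```

With the script above it would be called as summarize(res).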

Next, let's play with the keyword variable. For example, search for a document that contains a dummy word, such as this document.

Calling the API gives a result like this.
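Building the query string by concatenation works for simple keywords, but a keyword containing a character such as & would split the query string in the wrong place. As a minimal sketch, the standard library can build the URL with proper percent-encoding. build_search_url is a hypothetical helper, and the service and index names in the demonstration are placeholders.

```python
from urllib.parse import quote, urlencode


def build_search_url(service_url, index_name, keyword,
                     api_version="2020-06-30"):
    """Build the GET URL for a search, percent-encoding the keyword."""
    params = {
        "search": keyword,
        "$count": "true",
        "$select": "*",
        "api-version": api_version,
    }
    # Keep "$" and "*" readable; everything else unsafe is percent-encoded.
    query = urlencode(params, safe="$*", quote_via=quote)
    return (service_url.rstrip("/") + "/indexes/" + quote(index_name)
            + "/docs?" + query)


# Placeholder service and index names.
print(build_search_url("https://myservice.search.windows.net", "pdf-index", "ダミー"))
print(build_search_url("https://myservice.search.windows.net", "pdf-index", "R&D"))
```

The returned string can be passed directly to requests.get together with the headers.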

The same query can also be sent with POST. Change the previous code slightly.

...

#keyword = "ダミー"
#query = "search=" + keyword + "&$count=true&$select=*"

headers = { "api-key" : api_key, "Content-Type" : "application/json" } # the header name must not contain a trailing space
request_body = {
    "search" : "ダミー", # keyword ("dummy" in Japanese)
    "count" : True, # a boolean, not the string "true"
    "select" : "*"
}
#ret = requests.get(cognitive_url + "/indexes/<your index name>/docs?"+ query + "&api-version=2020-06-30", headers=headers)
ret = requests.post(cognitive_url + "/indexes/<your index name>/docs/search?api-version=2020-06-30", headers=headers, data=json.dumps(request_body))

...

The result is the same, but doesn't the POST version look a bit nicer?

Reference: https://docs.microsoft.com/ja-jp/azure/search/search-query-overview

#When the data is updated
When the data changes, that is, when blobs are added to or deleted from Blob Storage, searches do not reflect the latest data. In that case, reset the indexer once.

reset_indexer.py

import requests
import json

cognitive_url = "<your Azure Cognitive Search URL>" # usually "https://<your service name>.search.windows.net"
api_key = "<your Azure Cognitive Search API key>"

headers = { "api-key" : api_key, "Content-Type" : "application/json" } # the header name must not contain a trailing space

# Reset the indexer
ret = requests.post(cognitive_url + "/indexers/<your indexer name>/reset?api-version=2020-06-30", headers=headers)
print(ret)

After resetting the indexer, it takes a few seconds for its status to change. You can check the status under the Indexers tab in the Azure portal.

You can also check the indexer status by calling the API.

monitor_indexer.py

import requests
import json

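cognitive_url = "<your Azure Cognitive Search URL>" # usually "https://<your service name>.search.windows.net"
api_key = "<your Azure Cognitive Search API key>"

headers = { "api-key" : api_key, "Content-Type" : "application/json" } # the header name must not contain a trailing space

# Check the indexer status (the status endpoint is read with GET, not POST)
ret = requests.get(cognitive_url + "/indexers/<your indexer name>/status?api-version=2020-06-30", headers=headers)
print(ret)
print(ret.text) # the JSON body contains the indexer status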

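Once the status becomes "Reset", run the indexer. After the code below is executed, the indexer status becomes "In Progress". Wait a few seconds and the status becomes "Success". At that point the indexer restart flow is complete.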

run_indexer.py

import requests
import json

cognitive_url = "<your Azure Cognitive Search URL>" # usually "https://<your service name>.search.windows.net"
api_key = "<your Azure Cognitive Search API key>"

headers = { "api-key" : api_key, "Content-Type" : "application/json" } # the header name must not contain a trailing space

# Run the indexer
ret = requests.post(cognitive_url + "/indexers/<your indexer name>/run?api-version=2020-06-30", headers=headers)
print(ret)

If the indexer fails to start, refer to the status codes here.
Reference: https://docs.microsoft.com/ja-jp/rest/api/searchservice/http-status-codes
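Instead of checking the status by hand after each run, the waiting can be automated. The following is a minimal sketch of a polling loop. It takes a caller-supplied function that returns the parsed JSON of the status request, so it can be tried without a real service. The "lastResult" and "status" keys and the values "reset", "inProgress", and "success" are my reading of the status response and should be treated as assumptions; wait_for_indexer is a hypothetical helper.

```python
import time


def wait_for_indexer(get_status, interval=5.0, timeout=300.0,
                     sleep=time.sleep):
    """Poll until the last indexer run leaves the reset/inProgress states.

    get_status is a caller-supplied function returning the parsed JSON of
    the indexer status request. The key and value names used below are
    assumptions of this sketch.
    """
    waited = 0.0
    while True:
        last = (get_status().get("lastResult") or {}).get("status")
        if last not in (None, "reset", "inProgress"):
            return last
        if waited >= timeout:
            raise TimeoutError("indexer did not finish in time")
        sleep(interval)
        waited += interval


# Demonstration with canned responses instead of real HTTP calls.
canned = iter([
    {"lastResult": {"status": "reset"}},
    {"lastResult": {"status": "inProgress"}},
    {"lastResult": {"status": "success"}},
])
result = wait_for_indexer(lambda: next(canned), sleep=lambda seconds: None)
print("Final status:", result)
```

With a real service, get_status could be a small function that sends the GET request to the status endpoint with requests and returns ret.json().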

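#Conclusion
In this article I tried searching PDFs on Blob Storage with Azure Cognitive Search. The result is an impressive service: it can find which documents contain a given keyword, and it can also search by authors and other people, organizations, and locations. It feels much like the Google search engine, which is great.

That's all.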
