
Trying out semantic search with OCI Search with OpenSearch (2024/04/08)

Last updated at 2024-04-25. Posted at 2024-04-08.

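Introduction

OCI Search with OpenSearch supports semantic search in OpenSearch version 2.11 and later.

Compared with keyword search, semantic search takes the meaning of the query into account when retrieving results. OpenSearch implements semantic search using neural search.
In this article, I tried semantic search on OCI Search with OpenSearch using a pretrained model.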

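Prerequisites

An OpenSearch cluster provisioned with OpenSearch version 2.11 or later
- See, for example, the tutorial "OCI Search Service for OpenSearch を使って検索アプリケーションを作成しよう" (Build a search application with OCI Search Service for OpenSearch)
Access to OpenSearch Dashboards
- All steps are run from Dev Tools in OpenSearch Dashboards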

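Steps

Update the cluster settings so that semantic search can run
Step 1: Register a model group
Step 2: Register the model
Step 3: Deploy the model
Step 4: Create an ingestion pipeline
Step 5: Create an index
Step 6: Ingest documents
Step 7: Verify that the embeddings were generated correctly
Run semantic search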

Walkthrough

Update the cluster settings so that semantic search can run

![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/108635/73b6de10-c6d4-b276-7a80-385f443055da.png)
PUT _cluster/settings
{
  "persistent": {
    "plugins": {
      "ml_commons": {
        "only_run_on_ml_node": "false",
        "model_access_control_enabled": "true",
        "native_memory_threshold": "99",
        "rag_pipeline_feature_enabled": "true",
        "memory_feature_enabled": "true",
        "allow_registering_model_via_local_file": "true",
        "allow_registering_model_via_url": "true",
        "model_auto_redeploy.enable": "true",
        "model_auto_redeploy.lifetime_retry_times": 10
      }
    }
  }
}
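
OpenSearch also accepts cluster settings written as flat dotted keys (for example plugins.ml_commons.only_run_on_ml_node), which can be easier to grep or diff than the nested form. As a minimal sketch, the hypothetical helper flatten_settings below converts a nested settings dictionary into dotted keys. It only builds a dictionary and never contacts a cluster, and the subset of settings shown is just for illustration.

```python
def flatten_settings(nested, prefix=""):
    """Flatten nested settings into dotted keys (hypothetical helper)."""
    flat = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the dotted prefix along
            flat.update(flatten_settings(value, dotted))
        else:
            flat[dotted] = value
    return flat


# A subset of the settings from the request above
nested_settings = {
    "plugins": {
        "ml_commons": {
            "only_run_on_ml_node": "false",
            "model_access_control_enabled": "true",
            "model_auto_redeploy.enable": "true",
        }
    }
}

flat_settings = flatten_settings(nested_settings)
for name, value in sorted(flat_settings.items()):
    print(f"{name} = {value}")
```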

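Configuring the model

Set up the large language model used to generate vectors from text fields.

Step 1: Register a model group

Create a model group, which controls access to specific models.
Call the register model group API: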

POST /_plugins/_ml/model_groups/_register
{
  "name": "general pretrained models",
  "description": "A model group for pretrained models hosted by OCI Search with OpenSearch"
}

Record the model_group_id returned in the response.

{
  "model_group_id": "SGkEvI4BapwKtdhKEyeT",
  "status": "CREATED"
}

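Step 2: Register the model

Register a pretrained model. Provide the following information:

model_group_id: the model_group_id of the model group to register the model in
name: the model name of the pretrained model to use
version: the version number of the pretrained model to use
model_format: the model format (TORCH_SCRIPT or ONNX)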
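
Because a typo in any of these four values only surfaces after the request is sent, it can help to build the body in code first. The following is a minimal sketch, not part of any OpenSearch client: build_register_body and ALLOWED_FORMATS are hypothetical names, and the function only assembles and checks the JSON body for the register API.

```python
import json

# The two formats named in the parameter list above
ALLOWED_FORMATS = {"TORCH_SCRIPT", "ONNX"}


def build_register_body(name, version, model_group_id, model_format):
    """Build the JSON body for POST /_plugins/_ml/models/_register."""
    if model_format not in ALLOWED_FORMATS:
        raise ValueError(f"model_format must be one of {sorted(ALLOWED_FORMATS)}")
    fields = {
        "name": name,
        "version": version,
        "model_group_id": model_group_id,
    }
    for label, value in fields.items():
        if not value:
            raise ValueError(f"{label} must not be empty")
    fields["model_format"] = model_format
    return fields


register_body = build_register_body(
    name="huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
    version="1.0.2",
    model_group_id="SGkEvI4BapwKtdhKEyeT",
    model_format="TORCH_SCRIPT",
)
print(json.dumps(register_body, indent=2))
```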

Example model registration:

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.2",
  "model_group_id": "SGkEvI4BapwKtdhKEyeT",
  "model_format": "TORCH_SCRIPT"
}

Record the task_id returned in the response.

{
  "task_id": "SWkOvI4BapwKtdhK2yeo",
  "status": "CREATED"
}

Use the task_id to check the status of the model registration:

GET /_plugins/_ml/tasks/SWkOvI4BapwKtdhK2yeo

Confirm that state in the response is COMPLETED, then record the model_id returned in the response.

{
  "model_id": "808OvI4B0FtGqzkq3l8a",
  "task_type": "REGISTER_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "worker_node": [
    "0Hic_MLcR1WQhLq3zCgMuw"
  ],
  "create_time": 1712552074058,
  "last_update_time": 1712552118631,
  "is_async": true
}
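
Registration runs asynchronously, so the task may need to be checked several times before state becomes COMPLETED. The following is a rough sketch of that polling loop. wait_for_task is a hypothetical helper, and get_task stands for whatever function you use to call GET /_plugins/_ml/tasks/<task_id> and parse the JSON; here it is replaced by a fake so the example runs without a cluster. Treating FAILED as the failure state is an assumption of this sketch.

```python
import time


def wait_for_task(get_task, task_id, interval_s=2.0, max_attempts=30, sleep=time.sleep):
    """Poll an ML task until its state is COMPLETED; raise on FAILED or timeout."""
    for _ in range(max_attempts):
        task = get_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task
        if state == "FAILED":
            raise RuntimeError(f"task {task_id} failed: {task}")
        # Any other state is treated as "still in progress"
        sleep(interval_s)
    raise TimeoutError(f"task {task_id} not COMPLETED after {max_attempts} polls")


# Fake task responses standing in for the real tasks API
fake_responses = iter([
    {"state": "CREATED"},
    {"state": "RUNNING"},
    {"state": "COMPLETED", "model_id": "808OvI4B0FtGqzkq3l8a"},
])


def fake_get_task(task_id):
    return next(fake_responses)


finished_task = wait_for_task(fake_get_task, "SWkOvI4BapwKtdhK2yeo", sleep=lambda s: None)
print(finished_task["model_id"])
```

The same loop can be reused in Step 3 for the deploy task, since both tasks are checked through the same tasks API.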

Step 3: Deploy the model

Deploy the model to the cluster by specifying its model_id:

POST /_plugins/_ml/models/808OvI4B0FtGqzkq3l8a/_deploy

Record the task_id returned in the response.

{
  "task_id": "SmkTvI4BapwKtdhKryf1",
  "task_type": "DEPLOY_MODEL",
  "status": "CREATED"
}

Use the task_id to check the status of the model deployment:

GET /_plugins/_ml/tasks/SmkTvI4BapwKtdhKryf1

Confirm that state in the response is COMPLETED.

{
  "model_id": "808OvI4B0FtGqzkq3l8a",
  "task_type": "DEPLOY_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "worker_node": [
    "0Hic_MLcR1WQhLq3zCgMuw"
  ],
  "create_time": 1712552390644,
  "last_update_time": 1712552422211,
  "is_async": true
}

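Step 4: Create an ingestion pipeline

Create an ingestion pipeline that uses the deployed model.

At ingestion time, the pipeline uses the deployed model to automatically generate an embedding vector for each document.
- All you need to do is map the text field to be vectorized to the field that will hold its embedding.

Example ingestion pipeline: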

PUT _ingest/pipeline/pipeline_name01
{
  "description": "An example neural search pipeline",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "808OvI4B0FtGqzkq3l8a",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}

{
  "acknowledged": true
}
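
To make the role of field_map concrete, here is a small simulation of what the text_embedding processor does to a document: for each source field in field_map, it writes an embedding into the target field. This is only an illustration of the behavior, not the processor's actual implementation. apply_field_map and toy_embed are hypothetical names, and toy_embed returns three made-up numbers instead of calling the deployed model.

```python
def apply_field_map(document, field_map, embed):
    """Return a copy of document with one embedding added per mapped text field."""
    enriched = dict(document)
    for source_field, target_field in field_map.items():
        text = document.get(source_field)
        if text is not None:
            enriched[target_field] = embed(text)
    return enriched


def toy_embed(text):
    # Stand-in for the deployed model: NOT a real embedding, just 3 numbers
    return [float(len(text)), float(text.count(" ")), float(text.count("a"))]


field_map = {"passage_text": "passage_embedding"}
original_doc = {"passage_text": "summers are usually very hot"}
enriched_doc = apply_field_map(original_doc, field_map, toy_embed)
print(sorted(enriched_doc))
```

The original text field is kept and the embedding is added next to it, which is why both passage_text and passage_embedding show up in the stored document later on.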

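Step 5: Create an index

Create an index that uses the ingestion pipeline you created, specifying any available ANN engine.

Example using the Lucene engine

The passage_text field of the index matches the passage_text field in the ingestion pipeline, so the pipeline knows which field to create embeddings from and maps them into each document at ingestion time.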

PUT /lucene-index
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "pipeline_name01"
  },
  "mappings": {
    "properties": {
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name":"hnsw",
          "engine":"lucene",
          "space_type": "l2",
          "parameters":{
            "m":512,
            "ef_construction": 245
          }
        }
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}

Response when the index is created successfully:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "lucene-index"
}
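
For a neural query to work, the field that receives the embeddings (the field_map target from Step 4) has to be mapped as knn_vector with the dimension the model outputs. As a quick sanity check before sending the request, the sketch below compares a pipeline field_map with an index mapping. check_embedding_mapping is a hypothetical helper, and the dimension 768 is simply taken from the mapping above.

```python
def check_embedding_mapping(field_map, properties, expected_dimension):
    """Return a list of problems; an empty list means the mapping lines up."""
    problems = []
    for target_field in field_map.values():
        target = properties.get(target_field)
        if target is None:
            problems.append(f"embedding field '{target_field}' is not in the mapping")
        elif target.get("type") != "knn_vector":
            problems.append(f"'{target_field}' is not of type knn_vector")
        elif target.get("dimension") != expected_dimension:
            problems.append(
                f"'{target_field}' has dimension {target.get('dimension')}, "
                f"expected {expected_dimension}"
            )
    return problems


pipeline_field_map = {"passage_text": "passage_embedding"}

good_properties = {
    "passage_embedding": {"type": "knn_vector", "dimension": 768},
    "passage_text": {"type": "text"},
}
# A mapping whose vector field name does not match the pipeline target
bad_properties = {
    "some_other_field": {"type": "knn_vector", "dimension": 768},
    "passage_text": {"type": "text"},
}

print(check_embedding_mapping(pipeline_field_map, good_properties, 768))
print(check_embedding_mapping(pipeline_field_map, bad_properties, 768))
```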

Step 6: Ingest documents

Ingest the data (documents 1 to 4).

POST /lucene-index/_doc/1
{
  "passage_text": "there are many sharks in the ocean"
}

Response

{
  "_index": "lucene-index",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

POST /lucene-index/_doc/2
{
  "passage_text": "fishes must love swimming"
}

Response

{
  "_index": "lucene-index",
  "_id": "2",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}

POST /lucene-index/_doc/3
{
  "passage_text": "summers are usually very hot"
}

Response

{
  "_index": "lucene-index",
  "_id": "3",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 1
}

POST /lucene-index/_doc/4
{
  "passage_text": "florida has a nice weather all year round"
}

Response

{
  "_index": "lucene-index",
  "_id": "4",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 3,
  "_primary_term": 1
}
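
The four documents were sent one request at a time above. As an alternative, they could be sent in a single request with the bulk API, whose body is newline-delimited JSON: one action line followed by one document line per document, ending with a newline. The sketch below only builds that body for POST /lucene-index/_bulk. build_bulk_body is a hypothetical helper, and I did not run the bulk request in this article.

```python
import json


def build_bulk_body(docs):
    """Build an NDJSON bulk body from a {doc_id: passage_text} dictionary."""
    lines = []
    for doc_id, text in docs.items():
        # Action line: index the next line under the given _id
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        # Document line: same shape as the single-document requests above
        lines.append(json.dumps({"passage_text": text}))
    # The bulk body must end with a trailing newline
    return "\n".join(lines) + "\n"


passages = {
    "1": "there are many sharks in the ocean",
    "2": "fishes must love swimming",
    "3": "summers are usually very hot",
    "4": "florida has a nice weather all year round",
}

bulk_body = build_bulk_body(passages)
print(bulk_body, end="")
```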

Step 7: Verify that the embeddings were generated correctly

Confirm that the embedding was generated correctly:

GET /lucene-index/_doc/3

Response

{
  "_index": "lucene-index",
  "_id": "3",
  "_version": 1,
  "_seq_no": 2,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "passage_embedding": [
      -0.12549585,
      -0.31517762,
      0.03526806,
      0.39322084,
      -0.04755569,
      -0.12378363,
      -0.032554734,
(remainder omitted)
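
Checking the vector by eye works for one document, but the length is hard to confirm that way. As a minimal sketch, the hypothetical helper validate_embedding below checks that a document's _source contains a numeric vector of the expected length (768, matching the index mapping in Step 5). The sample input is a placeholder vector of zeros, not the real embedding returned above.

```python
import math


def validate_embedding(source, field, expected_dimension):
    """Check that source[field] is a list of finite numbers of the expected length."""
    vector = source.get(field)
    if not isinstance(vector, list):
        raise ValueError(f"'{field}' is missing or is not a list")
    if len(vector) != expected_dimension:
        raise ValueError(
            f"'{field}' has {len(vector)} values, expected {expected_dimension}"
        )
    for value in vector:
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not math.isfinite(value):
            raise ValueError(f"'{field}' contains a non-numeric value: {value!r}")
    return len(vector)


# Placeholder _source: zeros stand in for the real embedding values
placeholder_source = {
    "passage_text": "summers are usually very hot",
    "passage_embedding": [0.0] * 768,
}
checked_length = validate_embedding(placeholder_source, "passage_embedding", 768)
print(checked_length)
```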

Run semantic search

Run a semantic search using the ID of the model you registered and deployed:

GET lucene-index/_search
{
  "query": {
    "bool" : {
      "should" : [
        {
          "script_score": {
            "query": {
              "neural": {
                "passage_embedding": {
                  "query_text": "what are temperatures in miami like",
                  "model_id": "808OvI4B0FtGqzkq3l8a",
                  "k": 2
                }
              }
            },
            "script": {
              "source": "_score * 1.5"
            }
          }
        }
      ]
    }
  },
  "fields": [
    "passage_text"
  ],
  "_source": false
}

The query does not mention Florida, weather, or summer, but the model infers the semantic meaning of "temperatures" and "Miami" and returns the most relevant results.

{
  "took": 64,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.032253794,
    "hits": [
      {
        "_index": "lucene-index",
        "_id": "4",
        "_score": 0.032253794,
        "fields": {
          "passage_text": [
            "florida has a nice weather all year round"
          ]
        }
      },
      {
        "_index": "lucene-index",
        "_id": "3",
        "_score": 0.03148755,
        "fields": {
          "passage_text": [
            "summers are usually very hot"
          ]
        }
      }
    ]
  }
}
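
The scores look small because of how k-NN scoring works for space_type l2. As far as I understand the OpenSearch k-NN documentation, the score is 1 / (1 + d^2), where d is the Euclidean distance between the query vector and the document vector, and the script in this query then multiplies it by 1.5. Under that assumption, the sketch below inverts the formula to recover the distance behind each score. l2_distance_from_score is a hypothetical helper, and the two scores are copied from the response above.

```python
import math


def l2_distance_from_score(score, boost=1.0):
    """Invert score = boost / (1 + d^2) to recover the Euclidean distance d."""
    base = score / boost
    if not 0.0 < base <= 1.0:
        raise ValueError("score / boost must be in the range (0, 1]")
    return math.sqrt(1.0 / base - 1.0)


# (_id, _score) pairs from the script_score response; boost is the 1.5 in the script
scored_hits = [("4", 0.032253794), ("3", 0.03148755)]
distances = {
    doc_id: l2_distance_from_score(score, boost=1.5) for doc_id, score in scored_hits
}
for doc_id, distance in distances.items():
    print(f"doc {doc_id}: estimated L2 distance {distance:.3f}")
```

A higher score corresponds to a smaller distance, so document 4 is the closer of the two to the query vector.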

Run a semantic search with a plain neural query, again using the ID of the model you registered and deployed:

GET /lucene-index/_search
{
  "_source": {
    "excludes": [
      "passage_embedding"
    ]
  },
  "query": {
    "neural": {
      "passage_embedding": {
        "query_text": "good climate",
        "model_id": "808OvI4B0FtGqzkq3l8a",
        "k": 5
      }
    }
  }
}

{
  "took": 35,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 0.021540467,
    "hits": [
      {
        "_index": "lucene-index",
        "_id": "4",
        "_score": 0.021540467,
        "_source": {
          "passage_text": "florida has a nice weather all year round"
        }
      },
      {
        "_index": "lucene-index",
        "_id": "3",
        "_score": 0.020256795,
        "_source": {
          "passage_text": "summers are usually very hot"
        }
      },
      {
        "_index": "lucene-index",
        "_id": "1",
        "_score": 0.010399483,
        "_source": {
          "passage_text": "there are many sharks in the ocean"
        }
      },
      {
        "_index": "lucene-index",
        "_id": "2",
        "_score": 0.009416516,
        "_source": {
          "passage_text": "fishes must love swimming"
        }
      }
    ]
  }
}
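
The two searches return the passage text in different places: the first one under fields (because it sets "_source": false and asks for fields), the second one under _source. As a small sketch for reading either shape, the hypothetical helper summarize_hits below turns a search response into (id, score, text) rows. The sample responses are abbreviated copies of the ones shown above.

```python
def summarize_hits(response, text_field="passage_text"):
    """Return (id, score, text) tuples from a search response, in ranked order."""
    rows = []
    for hit in response["hits"]["hits"]:
        if "_source" in hit:
            text = hit["_source"].get(text_field)
        else:
            # With "fields", each value comes back wrapped in a list
            values = hit.get("fields", {}).get(text_field, [])
            text = values[0] if values else None
        rows.append((hit["_id"], hit["_score"], text))
    return rows


source_style = {"hits": {"hits": [
    {"_id": "4", "_score": 0.021540467,
     "_source": {"passage_text": "florida has a nice weather all year round"}},
    {"_id": "3", "_score": 0.020256795,
     "_source": {"passage_text": "summers are usually very hot"}},
]}}
fields_style = {"hits": {"hits": [
    {"_id": "4", "_score": 0.032253794,
     "fields": {"passage_text": ["florida has a nice weather all year round"]}},
]}}

source_rows = summarize_hits(source_style)
fields_rows = summarize_hits(fields_style)
for doc_id, score, text in source_rows + fields_rows:
    print(doc_id, score, text)
```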

Conclusion

With OpenSearch 2.11, I was able to generate vectors and run semantic search.

