
AWS Analytics Advent Calendar 2022

Similar document search with OpenSearch

Last updated at 2023-01-16, posted at 2022-12-17

Updated: 2023/1/16

An English translation of the article below, "Similar document search with OpenSearch", has been published on the OpenSearch Blog. Please take a look at it as well.

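What is similar document search?

Similar document search is the technique of retrieving relevant documents from a collection when a document itself is given as the query. It is what powers, for example, the "recommended articles" suggested on an article page of a news site. There are two broad approaches to retrieving similar documents: one based on the similarity of the documents themselves, and one based on records of which users viewed which documents. This article focuses on the former and explains how to implement it with OpenSearch, a search engine.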

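Similar document search in OpenSearch

1. Similar document search with the More like this query

One way to perform similar document search in OpenSearch is the More like this query (the link points to the documentation of Elasticsearch, the project OpenSearch was forked from).

The More like this query is a search based on term frequency, and it assumes that similar documents share the same words. The mechanism is intuitive, but it cannot pick up different words with similar meanings, so it lacks flexibility.

The More like this query scores documents based on a feature called tf-idf. tf-idf combines how often a term appears in a document (tf) with how rare the term is across all documents (idf); the higher the value, the more relevant the term is considered to be to that document. The More like this query analyzes the input document, splits it into terms, computes a tf-idf value for each term, and extracts the top k terms (that is, the k most important words). At search time, documents that match the extracted terms well are returned in descending order of score.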
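The term selection step described above can be sketched in a few lines of Python. This is a simplified illustration, not the scoring that OpenSearch (Lucene) actually implements: the tokenizer, the smoothed idf formula, the toy corpus, and the function names tokenize and top_k_terms are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters (a crude analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k_terms(query_doc, corpus, k=3):
    """Return the k terms of query_doc with the highest tf-idf weight."""
    n_docs = len(corpus)
    # Document frequency: in how many corpus documents each term appears.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))

    term_freq = Counter(tokenize(query_doc))
    scores = {}
    for term, tf in term_freq.items():
        # Smoothed idf (an assumption of this sketch) so unseen terms do not divide by zero.
        idf = math.log((n_docs + 1) / (doc_freq[term] + 1)) + 1.0
        scores[term] = tf * idf
    # Sort by score (descending), then alphabetically to keep ties stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


corpus = [
    "the customer service is terrible",
    "the product is great",
    "the delivery is fast, the product is fine",
]
query = "terrible customer service, the product is broken"
print(top_k_terms(query, corpus, k=3))
```

Common words such as "the" and "is" appear in every document, so their idf is low and they drop out of the selected terms, while rarer words are kept as the query's important terms.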

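2. Similar document search with k-NN

Similar document search with k-NN, on the other hand, converts documents into vectors using a model such as a neural network and retrieves similar documents by measuring the similarity between those vectors. Because the model is trained so that documents with similar meanings have nearby vectors even when they do not share the same words, this approach allows more advanced search. Its drawbacks are that the results depend heavily on the trained model and that it is hard to understand why a particular result was returned.

OpenSearch supports k-NN search, offering both regular (exact) k-NN search and approximate k-NN, which keeps the computational cost down as the amount of data grows (approximate k-NN is often called ANN, for approximate nearest neighbor). To use k-NN search, you previously had to vectorize the text of the target field with a machine learning model or similar as a preprocessing step, and then ingest the data into OpenSearch. OpenSearch 2.4, however, adds a feature for uploading custom models (TorchScript format only) and, built on top of it, a feature called Neural search, although both are still experimental. With these, once a machine learning model is uploaded, you no longer need to vectorize the data yourself at index time or search time.

Note, however, that at the time of writing the latest OpenSearch version supported by Amazon OpenSearch Service, the managed service for OpenSearch, is 2.3, so it does not support custom model upload or Neural search.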
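To make the idea concrete, here is a minimal sketch of exact (brute-force) k-NN over document vectors. The 3-dimensional vectors are hand-made stand-ins for model output, and the names l2_distance and knn are hypothetical; a real embedding model would produce vectors with hundreds of dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query_vec, doc_vecs, k=2):
    """Exact k-NN: measure the distance to every document and keep the k closest."""
    scored = [(doc_id, l2_distance(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hand-made "embeddings" for illustration only.
doc_vecs = {
    "broken-glass": [0.9, 0.1, 0.0],
    "slow-shipping": [0.1, 0.9, 0.1],
    "great-product": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.3, 0.0]
print(knn(query_vec, doc_vecs, k=2))
```

Because every document is compared with the query, the cost grows linearly with the number of documents, which is the motivation for approximate methods on large indices.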

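Search examples with More like this and k-NN in OpenSearch

Using OpenSearch 2.4, let's run a search with the More like this query and a k-NN search with the Neural search plugin.

Setting up the OpenSearch environment

OpenSearch can be launched with Docker. Prepare a docker-compose.yml like the one below. Because this is for demo purposes only, it uses a single node and disables the security plugin.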

docker-compose.yml

version: "3"
services:
  opensearch-node:
    image: opensearchproject/opensearch:latest
    container_name: opensearch-node
    environment:
      - discovery.type=single-node
      - "DISABLE_INSTALL_DEMO_CONFIG=true"
      - "DISABLE_SECURITY_PLUGIN=true"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - 9200:9200

  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:latest
    container_name: opensearch-dashboards
    ports:
      - 5601:5601
    expose:
      - "5601"
    environment:
      - "DISABLE_SECURITY_DASHBOARDS_PLUGIN=true"
      - "OPENSEARCH_HOSTS=http://opensearch-node:9200"

In the directory where you created docker-compose.yml, run the following command to start OpenSearch and OpenSearch Dashboards.

docker compose up -d

If you can open OpenSearch Dashboards at http://localhost:5601, the environment is ready.

Dataset

Next, prepare the data to search. This article uses The Multilingual Amazon Reviews Corpus, an open dataset that collects Amazon review data. A sample record is shown below; the similar document search in this article targets the review_body field.

{
  "review_id": "en_0802237",
  "product_id": "product_en_0417539",
  "reviewer_id": "reviewer_en_0649304",
  "stars": "3",
  "review_body": "I love this product so much i bought it twice! But their customer service is TERRIBLE. I received the second glassware broken and did not receive a response for one week and STILL have not heard from anyone to receive my refund. I received it on time, but am not happy at the moment.",
  "review_title": "Would recommend this product, but not the seller if something goes wrong.",
  "language": "en",
  "product_category": "kitchen"
}

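More like this search

Create an index using Dev Tools in OpenSearch Dashboards, an OpenSearch client, or similar. The example below creates an index named amazon-review-index. As mentioned above, the More like this query uses statistics obtained by splitting the input text into terms, and performing that analysis at index time improves search-time performance. For that reason, the review_body field below is set to "term_vector": "yes".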

PUT /amazon-review-index
{
  "mappings": {
    "properties": {
      "review_id": { "type": "keyword" },
      "product_id": { "type": "keyword" },
      "reviewer_id": { "type": "keyword" },
      "stars": { "type": "integer" },
      "review_body": { "type": "text", "term_vector": "yes" },
      "review_title": { "type": "text" },
      "language": { "type": "keyword" },
      "product_category": { "type": "keyword" }
    }
  }
}

Now ingest the data. There are many ways to do this, such as using curl, an OpenSearch client, or a data collection tool like Logstash. As one example, here is code that uses opensearch-py, the Python client for OpenSearch.

import json
from opensearchpy import OpenSearch


def payload_constructor(data):
    payload_string = ''
    for datum in data:
        action = {'index': {'_id': datum['review_id']}}
        action_string = json.dumps(action) + '\n'
        payload_string += action_string
        this_line = json.dumps(datum) + '\n'
        payload_string += this_line
    return payload_string


index_name = 'amazon-review-index'
batch_size = 1000

client = OpenSearch(
    hosts=[{'host': 'localhost', 'port': 9200}],
    http_compress=True,
)

with open('../json/train/dataset_en_train.json') as f:
    lines = f.readlines()

for start in range(0, len(lines), batch_size):
    data = []
    for line in lines[start:start+batch_size]:
        data.append(json.loads(line))
    response = client.bulk(body=payload_constructor(data), index=index_name)
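The ingestion code above stores each bulk response but never inspects it. As a hedged sketch, the helper below (summarize_bulk_response is a hypothetical name) shows one way to detect per-document failures. The sample response is hand-written in the general shape the Bulk API returns, not real output.

```python
def summarize_bulk_response(response):
    """Return (number of successful items, list of (doc id, error) for failed items)."""
    succeeded = 0
    failures = []
    for item in response.get("items", []):
        # Each item is keyed by its action type ("index", "create", ...).
        for result in item.values():
            if "error" in result:
                failures.append((result.get("_id"), result["error"]))
            else:
                succeeded += 1
    return succeeded, failures


# Hand-written example response for illustration (not output from a real cluster).
sample_response = {
    "took": 5,
    "errors": True,
    "items": [
        {"index": {"_id": "en_0000001", "status": 201}},
        {"index": {"_id": "en_0000002", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field [stars]"}}},
    ],
}
succeeded, failures = summarize_bulk_response(sample_response)
print(succeeded, failures)
```

In the ingestion loop, calling this on each response and logging the failures would make silently dropped documents visible.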

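With the data ingested, we can run More like this queries. There are two kinds of search: one that specifies a document ID and one that specifies text. When text is specified, it is analyzed on every search, so search latency may increase. To find documents similar to one that is already indexed, it is better to specify the ID.
(Reference: "ElasticsearchのMore like this内部実装とパフォーマンス問題の解決", an article on the internal implementation of More like this in Elasticsearch and on resolving its performance problems)

Below, we search using the sample record shown in the dataset section above as the query. min_term_freq is the minimum number of times a term must appear in the input document; terms with a lower frequency are excluded from the input. The default is 2, but because the reviews here are fairly short, that could exclude terms needed for the search, so it is set to 1.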

GET amazon-review-index/_search?size=5
{
  "query" : {
    "more_like_this" : {
      "fields" : ["review_body"],
      "like": {
        "_id": "en_0802237"
      },
      "min_term_freq": 1
    }
  },
  "fields": ["review_body", "stars"],
  "_source": false
}
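The two variants of the query (by document ID and by free text) differ only in the like clause. The sketch below builds either request body as a plain dictionary; build_mlt_query is a hypothetical helper name, and the size value in the body is used here in place of the ?size=5 URL parameter.

```python
import json


def build_mlt_query(fields, doc_id=None, text=None, min_term_freq=1, size=5):
    """Build a More like this request body from either a document ID or raw text."""
    if (doc_id is None) == (text is None):
        raise ValueError("pass exactly one of doc_id or text")
    # An indexed document is referenced by ID; free text is passed as a string.
    like = {"_id": doc_id} if doc_id is not None else text
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": like,
                "min_term_freq": min_term_freq,
            }
        },
        "fields": ["review_body", "stars"],
        "_source": False,
    }


by_id = build_mlt_query(["review_body"], doc_id="en_0802237")
by_text = build_mlt_query(["review_body"], text="The customer service is terrible.")
print(json.dumps(by_id, indent=2))
```

With opensearch-py, such a body could be passed as client.search(index="amazon-review-index", body=by_id).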

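The search results are shown below. The query text is, on the whole, a review expressing dissatisfaction with customer service: the reviewer likes the product enough to have bought it twice, but the second glassware arrived broken, and a week later there has been no response and no refund. The texts in the search results also contain words such as response, customer, service, and terrible, which confirms that broadly similar content is being returned.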

{
  "took" : 53,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 779,
      "relation" : "eq"
    },
    "max_score" : 35.644646,
    "hits" : [
      {
        "_index" : "amazon-review-index",
        "_id" : "en_0398542",
        "_score" : 35.644646,
        "fields" : {
          "stars" : [
            1
          ],
          "review_body" : [
            "Used it twice and the plastic clip that holds the strap snapped off. Useless now. Haven’t received a response from customer service. 5 months later, still no response. Terrible."
          ]
        }
      },
      {
        "_index" : "amazon-review-index",
        "_id" : "en_0157395",
        "_score" : 27.64468,
        "fields" : {
          "stars" : [
            1
          ],
          "review_body" : [
            "Lost pressure in first month. Never received any response from their customer service . Total waste of money and time. Bought a different brand that works great, Do not buy this product."
          ]
        }
      },
      {
        "_index" : "amazon-review-index",
        "_id" : "en_0439049",
        "_score" : 26.461647,
        "fields" : {
          "stars" : [
            1
          ],
          "review_body" : [
            "The product arrived damage, opened with a damaged soaking box. The seller was contacted and there was no response and no refund offered. The customer service is terrible. I do not recommend this seller."
          ]
        }
      },
      {
        "_index" : "amazon-review-index",
        "_id" : "en_0021763",
        "_score" : 26.441845,
        "fields" : {
          "stars" : [
            1
          ],
          "review_body" : [
            "DO NOT ORDER FROM THEM!! I placed my order on February 12th and STILL have not received my item. To make matters worse I emailed the seller with my issue on March 10th and haven't so much as even gotten a response back. Terrible customer service and they just TOOK my money."
          ]
        }
      },
      {
        "_index" : "amazon-review-index",
        "_id" : "en_0582638",
        "_score" : 26.065836,
        "fields" : {
          "stars" : [
            1
          ],
          "review_body" : [
            "Never received it! Was supposed to receive March 6. Now it’s the end of March. I’ve tried to contact the company twice with no response. Poor customer service! Don’t buy! Never received item and never got my money back! If I could give zero stars I would!!!! Amazon please intervene!"
          ]
        }
      }
    ]
  }
}

k-NN search

Next, we run a k-NN search using Neural search, following the documentation for custom model upload and Neural search. The machine learning model used here is a sentence-transformers model from Hugging Face.

First, upload the custom model.

POST /_plugins/_ml/models/_upload
{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.0",
  "description": "test model",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}

This returns a response like the following.

{
  "task_id" : "NHBlGYUBej1j0hjelDel", 
  "status" : "CREATED"
}

To check the status of the model upload, call the following API. Pass the task_id from the previous response after /_plugins/_ml/tasks/.

GET /_plugins/_ml/tasks/<task_id>

When the response shows "state" : "COMPLETED" as below, the model upload is complete.

{
  "model_id" : "NXBlGYUBej1j0hjelTc0",
  "task_type" : "UPLOAD_MODEL",
  "function_name" : "TEXT_EMBEDDING",
  "state" : "COMPLETED",
  "worker_node" : "hGSG_GzpSGePCkmCmY3cvg",
  "create_time" : 1671168365569,
  "last_update_time" : 1671168376567,
  "is_async" : true
}

Next, load the model onto a node. Replace <model_id> with the model_id from the previous response.

POST /_plugins/_ml/models/<model_id>/_load

Using the task_id in the load API response, call the _ml/tasks API described above. When it shows "state" : "COMPLETED" as below, the model has been loaded.

{
  "model_id" : "NXBlGYUBej1j0hjelTc0",
  "task_type" : "LOAD_MODEL",
  "function_name" : "TEXT_EMBEDDING",
  "state" : "COMPLETED",
  "worker_node" : "hGSG_GzpSGePCkmCmY3cvg",
  "create_time" : 1671238820338,
  "last_update_time" : 1671238820447,
  "is_async" : true
}
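Both the upload and the load steps wait for a task to reach the COMPLETED state. When scripting this, the wait can be written as a small polling loop. The sketch below is an assumption-laden illustration: wait_for_task is a hypothetical helper, the "FAILED" state name is assumed rather than taken from this article, and the API call is replaced by a fake function so the code runs without a cluster.

```python
import time


def wait_for_task(get_task, task_id, interval_sec=1.0, max_attempts=60):
    """Poll get_task(task_id) until the task reports COMPLETED.

    get_task is any callable that returns the task document as a dict.
    """
    for _ in range(max_attempts):
        task = get_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task
        if state == "FAILED":  # assumed name of the failure state
            raise RuntimeError(f"task {task_id} failed: {task}")
        time.sleep(interval_sec)
    raise TimeoutError(f"task {task_id} did not complete after {max_attempts} attempts")


# Stand-in for the real tasks API so the sketch runs without a cluster.
_fake_states = iter(["CREATED", "RUNNING", "COMPLETED"])


def fake_get_task(task_id):
    return {"state": next(_fake_states), "model_id": "example-model-id"}


result = wait_for_task(fake_get_task, "example-task-id", interval_sec=0)
print(result["state"], result["model_id"])
```

With opensearch-py, get_task could be something like lambda task_id: client.transport.perform_request("GET", f"/_plugins/_ml/tasks/{task_id}"); the model_id in the returned document is what the following steps need.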

Next, create an ingest pipeline that uses the uploaded model. Put the model ID in place of <model_id>, and use field_map to specify which field's text is converted into a vector and which field stores the result. Here, the text in the review_body field is mapped to a vector field named review_embedding.

PUT _ingest/pipeline/nlp-pipeline
{
  "description": "An example neural search pipeline",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
           "review_body": "review_embedding"
        }
      }
    }
  ]
}

Next, create the index. Set default_pipeline to the name of the pipeline created above. Also define a field named review_embedding and configure its k-NN settings as follows.

PUT /amazon-review-index-nlp
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "nlp-pipeline"
  },
  "mappings": {
    "properties": {
      "review_embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      },
      "review_id": { "type": "keyword" },
      "product_id": { "type": "keyword" },
      "reviewer_id": { "type": "keyword" },
      "stars": { "type": "integer" },
      "review_body": { "type": "text" },
      "review_title": { "type": "text" },
      "language": { "type": "keyword" },
      "product_category": { "type": "keyword" }
    }
  }
}

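With the index created, ingest the data in the same way as for the More like this search, this time targeting the amazon-review-index-nlp index. The k-NN index has a new field named review_embedding, but the ingest pipeline converts the text into vectors as the data is ingested, so you do not need to handle this field yourself when ingesting.

Once the data is ingested, run a search. The query uses the same text as before.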

GET amazon-review-index-nlp/_search?size=5
{
  "query": {
    "neural": {
      "review_embedding": {
        "query_text": "I love this product so much i bought it twice! But their customer service is TERRIBLE. I received the second glassware broken and did not receive a response for one week and STILL have not heard from anyone to receive my refund. I received it on time, but am not happy at the moment., review_title: Would recommend this product, but not the seller if something goes wrong.",
        "model_id": "<model_id>",
        "k": 10
      }
    }
  },
  "fields": ["review_body", "stars"],
  "_source": false
}

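The search results are shown below. Reviews complaining about customer service and shipping again rank at the top, but unlike the More like this search, some results do not necessarily share words with the query. For example, the query contained the word glassware, yet the More like this search returned no results containing related words such as glass. In the k-NN results, on the other hand, the word glass does appear in the responses. Words that are not in the query, such as ship, appear as well.

This is only one example, but k-NN search can be expected to return results that better reflect the meaning of the input text.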

{
  "took" : 76,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 40,
      "relation" : "eq"
    },
    "max_score" : 0.61881816,
    "hits" : [
      {
        "_index" : "amazon-review-index-nlp",
        "_id" : "n3B3GYUBej1j0hjek1ug",
        "_score" : 0.61881816,
        "fields" : {
          "stars" : [
            1
          ],
          "review_body" : [
            """I guess I should have trusted the other reviews, as my arrived and was broken in its box. So now I need to return it for a refund? And ship broken glass..which I'm not really comfortable with 😫"""
          ]
        }
      },
      {
        "_index" : "amazon-review-index-nlp",
        "_id" : "aXB3GYUBej1j0hjeV1N0",
        "_score" : 0.60459375,
        "fields" : {
          "stars" : [
            1
          ],
          "review_body" : [
            "Bought this almost a week ago and it broke on me. Definitely don’t recommend getting this product. We also contacted the buyer and heard nothing back."
          ]
        }
      },
      {
        "_index" : "amazon-review-index-nlp",
        "_id" : "e3B3GYUBej1j0hjeGE7v",
        "_score" : 0.5942412,
        "fields" : {
          "stars" : [
            1
          ],
          "review_body" : [
            "Very very very horrible customer service and product was never seen. Ordered for a christmas gift, then found out they had 8 week shipping, no sooner. When I emailed them, they responded with a very unapologetic response or solution. So yeah I never recieved this product and I would advise no one to order from them!"
          ]
        }
      },
      {
        "_index" : "amazon-review-index-nlp",
        "_id" : "u3B2GYUBej1j0hjetkCU",
        "_score" : 0.5831761,
        "fields" : {
          "stars" : [
            1
          ],
          "review_body" : [
            "Items arrived completely smashed in the box. It was full of broken glass and amazon would not give me a refund without sending the broken pieces back. I did not feel comfortable mailing a broken box full of glass and said no, so will receive no refund."
          ]
        }
      },
      {
        "_index" : "amazon-review-index-nlp",
        "_id" : "SHB3GYUBej1j0hje22Xg",
        "_score" : 0.58303237,
        "fields" : {
          "stars" : [
            1
          ],
          "review_body" : [
            "I bought this for a gift and when opened, the glass was broken and no usable. Very disappointed and embarrassed when my friend opened it up. this was not cheap either, so it should have been packaged better."
          ]
        }
      }
    ]
  }
}
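The k-NN scores above are on a very different scale from the More like this scores. For approximate k-NN with the l2 space type, the k-NN plugin documentation describes the score as 1 / (1 + d), where d is the squared Euclidean distance. Assuming that formula applies to this index, a score can be converted back into a distance as in the sketch below; the function names are hypothetical.

```python
import math


def squared_l2_from_score(score):
    """Invert score = 1 / (1 + d), where d is the squared L2 distance (assumed formula)."""
    return 1.0 / score - 1.0


def l2_from_score(score):
    """Euclidean distance implied by a k-NN score under the assumed formula."""
    return math.sqrt(squared_l2_from_score(score))


top_score = 0.61881816  # max_score from the response above
print(round(l2_from_score(top_score), 3))
```

Under this reading, a score of 1.0 would mean an identical vector, and scores approach 0 as the vectors move apart, so k-NN scores and More like this scores are not directly comparable.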

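Summary

This article introduced two ways of performing similar document search with OpenSearch: search with the More like this query and search with k-NN.

The examples here are only examples; results will vary with tuning, synonyms, custom dictionary settings, and so on. To obtain more suitable results, it is also possible to combine a regular match query with k-NN search, as in the Example request in the Neural search documentation, or to combine the More like this query with k-NN.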
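As a rough sketch of such a combination, the helper below builds a bool query whose should clauses wrap a match query and a neural query in script_score so that each can be weighted. The helper name build_hybrid_query and the default weights are assumptions for illustration, and the weights would need tuning because the two scores are on different scales.

```python
import json


def build_hybrid_query(query_text, model_id, match_weight=1.0, neural_weight=1.0, k=10):
    """Combine a keyword match query and a neural (k-NN) query in one bool query."""

    def weighted(query, weight):
        # script_score multiplies the wrapped query's score by a constant weight.
        return {"script_score": {"query": query, "script": {"source": f"_score * {weight}"}}}

    match_clause = {"match": {"review_body": {"query": query_text}}}
    neural_clause = {
        "neural": {
            "review_embedding": {"query_text": query_text, "model_id": model_id, "k": k}
        }
    }
    return {
        "query": {
            "bool": {
                "should": [
                    weighted(match_clause, match_weight),
                    weighted(neural_clause, neural_weight),
                ]
            }
        },
        "fields": ["review_body", "stars"],
        "_source": False,
    }


body = build_hybrid_query("glassware arrived broken", "<model_id>", neural_weight=20.0)
print(json.dumps(body, indent=2))
```

The same structure could hold a more_like_this clause in place of the match clause to combine More like this with k-NN.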
