How to build an OpenSearch environment that supports Japanese hybrid search (vector + keyword)

Last updated at 2025-01-07, posted at 2024-08-31

Prerequisites

multilingual-E5 is used as the embedding model
OpenSearch runs in a Docker environment

Preparing OpenSearch

docker-compose

Place the following docker-compose.yml in any directory.

Set OPENSEARCH_INITIAL_ADMIN_PASSWORD to a password of your own. You will log in with user admin and the password specified here.

docker-compose.yml

version: '3'
services:
  opensearch-node1: # This is also the hostname of the container within the Docker network (i.e. https://opensearch-node1/)
    image: opensearchproject/opensearch:latest # Specifying the latest available image - modify if you want a specific version
    container_name: opensearch-node1b
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node1 # Name the node that will run in this container
      - discovery.type=single-node # Enable single-node mode
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD={your-strong-password}    # Sets the demo admin user password when using demo configuration, required for OpenSearch 2.12 and later
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network

  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:latest # Make sure the version of opensearch-dashboards matches the version of opensearch installed on other nodes
    container_name: opensearch-dashboards2
    ports:
      - 5601:5601 # Map host port 5601 to container port 5601
    expose:
      - "5601" # Expose port 5601 for web access to OpenSearch Dashboards
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch-node1:9200"]' # Point to the single node
    networks:
      - opensearch-net

volumes:
  opensearch-data1:

networks:
  opensearch-net:

Note: the official sample defines two nodes; this file reduces it to one.

Reference

Try OpenSearch with Docker Compose

docker-compose

docker-compose up -d

Japanese language support

The stock container does not include the components needed for Japanese, so run the following commands inside the container whose container_name is opensearch-node1b.

docker exec -it opensearch-node1b bash

Command 1

/usr/share/opensearch/bin/opensearch-plugin install analysis-kuromoji

Command 2

/usr/share/opensearch/bin/opensearch-plugin install analysis-icu

Output

sh-5.2$ /usr/share/opensearch/bin/opensearch-plugin install analysis-kuromoji
-> Installing analysis-kuromoji
-> Downloading analysis-kuromoji from opensearch
[=================================================] 100%
-> Installed analysis-kuromoji with folder name analysis-kuromoji
sh-5.2$ /usr/share/opensearch/bin/opensearch-plugin install analysis-icu
-> Installing analysis-icu
-> Downloading analysis-icu from opensearch
[=================================================] 100%
-> Installed analysis-icu with folder name analysis-icu
sh-5.2$

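Note: the container must be restarted for the plugin installation to take effect, for example with docker restart opensearch-node1b (I got stuck because I had not restarted it).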
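To confirm that both plugins were picked up after the restart, the output of GET _cat/plugins can be inspected. The Python sketch below only parses such output; the helper name missing_plugins, the sample text, and the version number in it are illustrative assumptions, and the column layout (node name, plugin name, version) is assumed to be the default one.

```python
REQUIRED_PLUGINS = ("analysis-kuromoji", "analysis-icu")


def missing_plugins(cat_plugins_output: str, required=REQUIRED_PLUGINS) -> list[str]:
    """Return the required plugins that do not appear in `GET _cat/plugins` output."""
    installed = set()
    for line in cat_plugins_output.splitlines():
        columns = line.split()
        # Assumed default layout: <node name> <plugin name> <version>
        if len(columns) >= 2:
            installed.add(columns[1])
    return [name for name in required if name not in installed]


# Hypothetical output, shortened to the two plugins of interest.
sample = """\
opensearch-node1 analysis-icu      2.16.0
opensearch-node1 analysis-kuromoji 2.16.0
"""
print(missing_plugins(sample))  # an empty list means both plugins are loaded
```

If the list is not empty, the node has most likely not been restarted since the plugins were installed.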

That completes the Docker setup.

Opening the management console

Open http://localhost:5601/.

Log in as admin with the password specified in docker-compose.yml.

Once logged in, open Interact with the OpenSearch API.
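The same requests can also be sent from a script instead of the console. The sketch below uses only the Python standard library. It assumes that the REST API is reachable at https://localhost:9200 (port 9200 is published in docker-compose.yml), that the demo security configuration serves a self-signed certificate, and that the user is admin. build_request and send are illustrative helper names, and certificate verification is disabled only because this is a local demo.

```python
import base64
import json
import ssl
import urllib.request

BASE_URL = "https://localhost:9200"  # assumption: port mapping from docker-compose.yml


def build_request(method, path, password, body=None, user="admin"):
    """Build an authenticated request for the OpenSearch REST API."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {"Authorization": f"Basic {token}", "Content-Type": "application/json"}
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def send(request):
    """Send the request and return the decoded JSON response."""
    context = ssl.create_default_context()
    context.check_hostname = False  # local demo only: self-signed certificate
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Replace the placeholder with the password from docker-compose.yml.
    request = build_request("GET", "/_cluster/health", password="<your password>")
    print(request.get_method(), request.full_url)
```

Calling send(request) against the running container would return the parsed JSON of the response; the console requests in the rest of this article map onto build_request in the same way.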

Creating the index

Configure Japanese support with the following request.

Create the index

PUT /hello-hybrid
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    },
    "analysis": {
      "char_filter": {
        "normalize": {
          "type": "icu_normalizer",
          "name": "nfkc",
          "mode": "compose"
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "char_filter": [
            "icu_normalizer",
            "kuromoji_iteration_mark"
          ],
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "kuromoji_baseform",
            "kuromoji_part_of_speech",
            "ja_stop",
            "kuromoji_number",
            "kuromoji_stemmer"
          ]
        },
        "kuromoji_analyzer": {
          "type": "custom",
          "char_filter": [
            "icu_normalizer",
            "kuromoji_iteration_mark"
          ],
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "kuromoji_baseform",
            "kuromoji_part_of_speech",
            "ja_stop",
            "kuromoji_number",
            "kuromoji_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_txt_ja": {
        "type": "text",
        "analyzer": "kuromoji_analyzer",
        "search_analyzer": "kuromoji_analyzer",
        "term_vector": "with_positions_offsets",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "vector1024": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}

response

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "hello-hybrid"
}

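This OpenSearch request creates an index named hello-hybrid and defines its settings and mappings.

Index settings
k-NN settings: setting knn to true enables k-nearest-neighbor vector search on this index. knn.algo_param.ef_search is the size of the candidate list explored at search time; it is set to 100 here.

Analysis settings: defines custom text analyzers made up of character filters, a tokenizer, and token filters. The same chain is registered twice, once as default and once as kuromoji_analyzer.

Character filters: icu_normalizer applies ICU Unicode normalization, and kuromoji_iteration_mark expands iteration marks. Note that both analyzers reference the built-in icu_normalizer (whose default form is nfkc_cf), so the custom normalize character filter defined under char_filter is not actually used.

Tokenizer: kuromoji_tokenizer performs Japanese morphological analysis.

Token filters: kuromoji_baseform, kuromoji_part_of_speech, ja_stop, kuromoji_number, and kuromoji_stemmer respectively convert words to their base form, drop tokens by part of speech, remove Japanese stop words, convert kanji numerals to digits, and strip the trailing prolonged sound mark from katakana words.

Mapping settings

Text field: text_txt_ja is a text field analyzed with kuromoji_analyzer. term_vector is set to with_positions_offsets, so token positions and character offsets are stored.

Vector field: vector1024 is a knn_vector field that stores 1024-dimensional vectors. hnsw is an efficient graph-based approximate nearest-neighbor algorithm, and l2 (Euclidean) distance is used here. ef_construction and m are HNSW parameters: ef_construction is the size of the candidate list used while building the graph, and m is the number of links created per node.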
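For the l2 space, the k-NN plugin documentation describes the score as 1 / (1 + d), where d is the squared Euclidean distance, so closer vectors score nearer to 1. A small Python sketch of that formula follows; the function names are illustrative, and this is a plain re-implementation for intuition, not the plugin's code.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_score(a, b):
    """Score as documented for the l2 space: 1 / (1 + squared distance)."""
    return 1.0 / (1.0 + squared_l2(a, b))


print(l2_score([0.0, 0.0], [0.0, 0.0]))  # identical vectors give 1.0
print(l2_score([0.0, 0.0], [3.0, 4.0]))  # squared distance 25 gives 1 / 26
```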

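This request is meant to create an index for hybrid search that handles both text and vector data. The analyzers are tuned for Japanese text, while the vector field holds embeddings from a multilingual model such as multilingual-E5.

Note: the vector size is set to 1024 dimensions; if you just want a quick trial, use something like 2 dimensions instead.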
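To try the index with a smaller vector, only the dimension value needs to change. The Python sketch below builds the vector field mapping with a configurable dimension; build_vector_mapping is an illustrative helper, and all other values are copied from the request above.

```python
import json


def build_vector_mapping(dimension=1024):
    """Return the knn_vector field mapping used in this article for a given dimension."""
    if dimension < 1:
        raise ValueError("dimension must be a positive integer")
    return {
        "type": "knn_vector",
        "dimension": dimension,
        "method": {
            "name": "hnsw",
            "space_type": "l2",
            "engine": "nmslib",
            "parameters": {"ef_construction": 128, "m": 24},
        },
    }


# A 2-dimensional variant for a quick trial.
print(json.dumps(build_vector_mapping(2), indent=2))
```

Every document and every query vector must then have exactly that many elements.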

Reference

Exact k-NN with scoring script

Trying the tokenizer

GET /hello-hybrid/_analyze
{
  "text": "今日はいい天気でしたので歩いて学校に行きました。ＡＢＣ"
}

response

{
  "tokens": [
    {
      "token": "今日", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0
    },
    {
      "token": "いい", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2
    },
    {
      "token": "天気", "start_offset": 5, "end_offset": 7, "type": "word", "position": 3
    },
    {
      "token": "歩く", "start_offset": 12, "end_offset": 14, "type": "word", "position": 7
    },
    {
      "token": "学校", "start_offset": 15, "end_offset": 17, "type": "word", "position": 9
    },
    {
      "token": "行く", "start_offset": 18, "end_offset": 20, "type": "word", "position": 11
    },
    {
      "token": "abc", "start_offset": 24, "end_offset": 27, "type": "word", "position": 14
    }
  ]
}

The verb "行く" is extracted from "行きました", and the full-width letters "ＡＢＣ" are normalized to "abc". (Personally, I wish the POS (part of speech) were included in this output!)
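The "ＡＢＣ" to "abc" step can be reproduced outside OpenSearch. The default form of icu_normalizer is nfkc_cf (NFKC plus case folding); the Python sketch below approximates it with NFKC normalization from unicodedata followed by casefold(). It is an approximation for illustration, not the ICU implementation, and nfkc_casefold is an illustrative name.

```python
import unicodedata


def nfkc_casefold(text):
    """Approximate the nfkc_cf form: NFKC normalization, then case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


print(nfkc_casefold("ＡＢＣ"))                    # abc
print(unicodedata.normalize("NFKC", "ＡＢＣ"))    # ABC (width folded, case kept)
```

The second line shows why plain nfkc would not lowercase the token: only the case-folding variant produces "abc".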

Configuring the search pipeline

PUT /_search/pipeline/nlp-search-pipeline
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.3,
              0.7
            ]
          }
        }
      }
    }
  ]
}

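Request structure
PUT /_search/pipeline/nlp-search-pipeline: this sends a request to OpenSearch to create or update a search pipeline named nlp-search-pipeline. PUT creates the resource if it does not exist and updates it if it does.

Pipeline definition
"description": a free-text description of the pipeline. Here it is "Post processor for hybrid search", that is, a post-processing step for hybrid search.

Processor configuration
"phase_results_processors": the array of processors applied to the results of the query phase, before the final hits are fetched.

Processor details
"normalization-processor": configures the normalization processor, which normalizes the scores returned by each subquery of a hybrid query and combines them into one score per document.
"normalization": defines the normalization technique. "min_max" is used here, which rescales each subquery's scores to the range from 0 to 1.
"combination": defines how the normalized scores are combined.
"technique": "arithmetic_mean" is chosen, that is, the (weighted) arithmetic mean of the subquery scores.
"parameters": parameters for the combination.
"weights": an array with one weight per subquery, in the order the subqueries appear in the hybrid query. In this example the first subquery gets a weight of 0.3 and the second 0.7.
With this request, OpenSearch registers a pipeline that post-processes search results so that keyword and vector scores are merged into a single ranking.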
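To make the two steps concrete, here is a minimal Python sketch of min-max normalization followed by a weighted arithmetic mean. It is a textbook version written for illustration: the function names and the sample scores are assumptions, and the actual processor may treat edge cases (such as the lowest score of a subquery) differently.

```python
def min_max(scores):
    """Rescale one subquery's scores to the 0..1 range (textbook min-max)."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores equal: treat every hit as a full match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def weighted_mean(normalized_per_query, weights):
    """Combine per-subquery scores; a document missing from a subquery counts as 0."""
    doc_ids = set().union(*normalized_per_query)
    total_weight = sum(weights)
    return {
        doc_id: sum(
            w * scores.get(doc_id, 0.0)
            for scores, w in zip(normalized_per_query, weights)
        ) / total_weight
        for doc_id in doc_ids
    }


# Hypothetical raw scores: keyword subquery first, k-NN subquery second.
keyword_scores = {"d1": 2.0, "d2": 1.0}
knn_scores = {"d1": 0.5, "d2": 0.9, "d3": 0.1}

combined = weighted_mean([min_max(keyword_scores), min_max(knn_scores)], [0.3, 0.7])
for doc_id, score in sorted(combined.items(), key=lambda item: -item[1]):
    print(doc_id, round(score, 3))
```

With these sample numbers d2 ranks above d1 even though d1 has the better keyword score, because the vector subquery carries the larger weight.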

Useful pages

Adding documents

Add documents with _doc. The vector is long, so it is abbreviated here; assume it was produced by embedding the text with multilingual-E5-large.

POST /hello-hybrid/_doc/1
{
  "text_txt_ja": "私は学校に歩いて行きます。",
  "vector1024": [
    0.042572115,
    -0.0038737254,
    -0.0010998837,
    ...
    -0.010681156,
    0.030802153
  ]
}
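When preparing documents from a script, two small helpers are handy. The multilingual-E5 model card asks for a "query: " or "passage: " prefix on every input text, and the vector length must match the dimension in the index mapping. The Python sketch below covers only those two checks; e5_text and build_document are illustrative names, and the embedding call itself is deliberately left out.

```python
def e5_text(text, role):
    """Prefix the text the way multilingual-E5 expects ("query: " or "passage: ")."""
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    return f"{role}: {text}"


def build_document(text, vector, dimension=1024):
    """Build the _doc body, rejecting vectors whose length does not match the mapping."""
    if len(vector) != dimension:
        raise ValueError(f"expected {dimension} values, got {len(vector)}")
    return {"text_txt_ja": text, "vector1024": [float(x) for x in vector]}


text = "私は学校に歩いて行きます。"
print(e5_text(text, "passage"))  # the string that would be passed to the embedding model
# Placeholder 2-dimensional vector, only valid for an index created with dimension 2.
print(build_document(text, [0.1, -0.2], dimension=2))
```

Note that the prefixed string is used only as model input; the original text is what gets stored in text_txt_ja.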

Reference

For an easy way to use multilingual-E5-large, the following is helpful.

Trying keyword search

request

GET /hello-hybrid/_search
{
  "_source": {
    "excludes": [
      "vector1024"
    ]
  },
  "query": {
    "term": {
      "text_txt_ja": "学校"
    }
  }
}

response

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "hello-hybrid",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "text_txt_ja": "私は学校に歩いて行きます。",
          "id": "1",
          "items": []
        }
      }
    ]
  }
}

Reference

Keyword search

Trying hybrid search

The key point is to specify the pipeline (and thereby hybrid search) with search_pipeline=nlp-search-pipeline.

request

GET /hello-hybrid/_search?search_pipeline=nlp-search-pipeline
{
  "size": 10,
  "_source": {
    "excludes": [
      "vector1024"
    ]
  },
  "highlight": {
    "fields": {
      "text_txt_ja": {}
    }
  },
  "query": {
    "hybrid": {
      "queries": [
        {
          "term": {
            "text_txt_ja": "行く"
          }
        },
        {
          "knn": {
            "vector1024": {
              "vector": [
                0.018971957,
                -0.01261955,
                -0.017597636,
                ...
                -0.0044123465,
                -0.024646945,
                0.016892584
              ],
              "k": 3
            }
          }
        }
      ]
    }
  }
}
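When the query vector comes from code, the request body can be assembled programmatically. The Python sketch below mirrors the request above; build_hybrid_query is an illustrative helper, and the 2-element vector in the example is a placeholder rather than a real embedding. The order of the two subqueries matters because it must match the order of the weights in nlp-search-pipeline (keyword first, k-NN second).

```python
import json


def build_hybrid_query(keyword, query_vector, k=3, size=10):
    """Build the body of a hybrid search: a term subquery plus a k-NN subquery."""
    return {
        "size": size,
        "_source": {"excludes": ["vector1024"]},
        "highlight": {"fields": {"text_txt_ja": {}}},
        "query": {
            "hybrid": {
                "queries": [
                    # Subquery 1: keyword match, weighted 0.3 by the pipeline.
                    {"term": {"text_txt_ja": keyword}},
                    # Subquery 2: vector similarity, weighted 0.7 by the pipeline.
                    {"knn": {"vector1024": {"vector": list(query_vector), "k": k}}},
                ]
            }
        },
    }


body = build_hybrid_query("行く", [0.018971957, -0.01261955])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The body is sent to /hello-hybrid/_search?search_pipeline=nlp-search-pipeline, exactly as in the console request.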

response (the document IDs differ from the earlier samples)

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "hello-hybrid",
        "_id": "doc_20240831-203405_0",
        "_score": 1.0,
        "_source": {
          "text_txt_ja": "私は学校に歩いて行きます。",
          "id": "doc_20240831-203405_0",
          "items": []
        },
        "highlight": {
          "text_txt_ja": [
            "私は学校に歩いて\u003cem\u003e行き\u003c/em\u003eます。"
          ]
        }
      },
      {
        "_index": "hello-hybrid",
        "_id": "doc_20240831-205750_0",
        "_score": 1.0,
        "_source": {
          "text_txt_ja": "私は学校に歩いて行きます。",
          "id": "doc_20240831-205750_0",
          "items": []
        },
        "highlight": {
          "text_txt_ja": [
            "私は学校に歩いて\u003cem\u003e行き\u003c/em\u003eます。"
          ]
        }
      },
      {
        "_index": "hello-hybrid",
        "_id": "doc_20240831-205750_1",
        "_score": 0.007894406,
        "_source": {
          "text_txt_ja": "私はメジャーリーグが好きです。",
          "id": "doc_20240831-205750_1",
          "items": []
        }
      },
      {
        "_index": "hello-hybrid",
        "_id": "doc_20240831-205750_2",
        "_score": 7.0000003E-4,
        "_source": {
          "text_txt_ja": "私はべーブルースが好きです。",
          "id": "doc_20240831-205750_2",
          "items": []
        }
      }
    ]
  }
}
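The top and bottom scores in this response are consistent with the pipeline weights of 0.3 and 0.7. The Python sketch below checks that arithmetic under one loudly stated assumption: that the lowest-scoring hit of a subquery is normalized to a small floor of 0.001 rather than exactly 0. That floor is inferred from the numbers shown here and is not stated in the article, so treat this as a plausible reading rather than a description of the processor.

```python
WEIGHTS = (0.3, 0.7)    # keyword, k-NN: taken from nlp-search-pipeline
ASSUMED_FLOOR = 0.001   # assumption: normalized score given to a subquery's lowest hit


def combined_score(keyword_norm, knn_norm, weights=WEIGHTS):
    """Weighted arithmetic mean of two normalized subquery scores."""
    return (weights[0] * keyword_norm + weights[1] * knn_norm) / sum(weights)


# Best hit of both subqueries: normalized scores 1.0 and 1.0.
top = combined_score(1.0, 1.0)
# No keyword match, and the lowest of the k-NN hits.
last = combined_score(0.0, ASSUMED_FLOOR)
print(top, last)
```

Under that assumption the two values come out at about 1.0 and 0.0007, which lines up with the _score values 1.0 and 7.0000003E-4 in the response.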

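Reference

Hybrid query

With this, hybrid search should be working. (I would like to verify it a bit more.)

That's all.

Other pages that may be useful

Text analysis

k-NN search with filters

Building OpenSearch + Sudachi from scratch

If you want to write your own plugin, the following looks like a good place to start.

elasticsearch-kuromoji

Enabling Japanese search in OpenSearch (published 2022/05/26)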
