Implementing exact-match search in Elasticsearch

Posted at 2024-05-12

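Introduction

This post is a follow-up to an earlier article.

At the end of that article, I mentioned the following open issue:

I want to be able to do exact-match search.

Here I write down what I considered, the problems I ran into, and how I finally implemented exact-match search.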

Approaches

After some research, I found two ways to implement it.

Wildcard query

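This query searches a keyword field using the wildcard operator ( * ).

The wildcard operator matches any sequence of zero or more characters.
For example, the query below matches documents whose contents.keyword value is Qia, Qiia, Qiita, and so on. Because a keyword field is indexed as a single term, the pattern has to match the entire field value.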

GET my-index/_search

{
  "query": {
    "wildcard": {
      "contents.keyword": {
        "value": "Qi*a",
        "boost": 1.0,
        "rewrite": "constant_score_blended"
      }
    }
  }
}

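Putting * both before and after the search string matches any document whose field value contains that string verbatim, which is the exact-match behavior I was after. Note that the Elasticsearch documentation advises against patterns that begin with * because they can be slow.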

GET my-index/_search

{
  "query": {
    "wildcard": {
      "contents.keyword": {
        "value": "*Qiita*",
        "boost": 1.0,
        "rewrite": "constant_score_blended"
      }
    }
  }
}
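
As a rough illustration, here is a minimal Python sketch of how a client might build that "contains" query from user input. The helper names escape_wildcard and contains_query are hypothetical, and the escaping assumes that a backslash escapes * and ? inside the pattern, as in Lucene's wildcard syntax; I have not checked this against every Elasticsearch version.

```python
import json


def escape_wildcard(term: str) -> str:
    """Prefix wildcard metacharacters with a backslash (assumed escape character)."""
    escaped = []
    for ch in term:
        if ch in ("\\", "*", "?"):
            escaped.append("\\")
        escaped.append(ch)
    return "".join(escaped)


def contains_query(field: str, term: str) -> dict:
    """Build a wildcard query body that matches values containing `term` verbatim."""
    pattern = "*" + escape_wildcard(term) + "*"
    return {"query": {"wildcard": {field: {"value": pattern}}}}


if __name__ == "__main__":
    # Print the request body that would be sent to GET my-index/_search
    print(json.dumps(contains_query("contents.keyword", "Qiita"), indent=2))
```

Without the escaping step, a user who types * or ? would silently widen the search instead of matching those characters literally.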


Match phrase query

This query searches a text field using match_phrase.
It returns documents that contain all of the search terms in the same order as in the query.

For example, the query below matches documents containing "this is a test" but not documents containing "is this a test".

GET my-index/_search

{
  "query": {
    "match_phrase": {
      "contents.ngram": {
        "query": "this is a test"
      }
    }
  }
}
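
To make the ordering rule concrete, here is a simplified Python model of phrase matching. It is only a sketch under loud assumptions: it tokenizes on whitespace rather than with the ngram analyzer used by contents.ngram, it requires the terms to be adjacent (which corresponds to the default slop of 0), and the function names are hypothetical.

```python
def tokenize(text: str) -> list[str]:
    """Toy analyzer: lowercase and split on whitespace (not the real ngram analyzer)."""
    return text.lower().split()


def phrase_match(document: str, phrase: str) -> bool:
    """Return True if the phrase's tokens appear in the document, adjacent and in order."""
    doc_tokens = tokenize(document)
    terms = tokenize(phrase)
    if not terms:
        return False
    n = len(terms)
    # Slide a window of len(terms) over the document tokens.
    return any(doc_tokens[i:i + n] == terms for i in range(len(doc_tokens) - n + 1))


if __name__ == "__main__":
    print(phrase_match("well, this is a test", "this is a test"))  # True
    print(phrase_match("is this a test", "this is a test"))        # False
```

Both documents contain the same four words, but only the first has them in the order given by the query.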

To use it within a multi_match query, set type to phrase.

GET my-index/_search

{
  "query": {
    "multi_match": {
      "query": "this is a test",
      "fields": ["title.ngram^1","contents.ngram^1"],
      "type": "phrase"
    }
  }
}


Trying the Wildcard query

First, I tried the Wildcard query.
To use it, I added a new keyword sub-field to the mapping.

PUT /my-index/_mapping

{
  "properties": {
    "contents": {
      "type": "text",
      "search_analyzer": "my_custom_kuromoji_search_analyzer",
      "analyzer": "my_custom_kuromoji_index_analyzer",
      "fields": {
        "ngram": {
          "type": "text",
          "search_analyzer": "my_custom_ngram_search_analyzer",
          "analyzer": "my_custom_ngram_index_analyzer"
        },
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}

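After that, registering data in the contents field should be all it takes to get exact-match search... or so I would like to say, but a problem came up when I loaded the search data.

I bulk-loaded the data with Logstash, and several documents failed to be ingested.
The log contained messages like the following.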

"reason"=>"Document contains at least one immense term in field=\"contents.keyword\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[-29, -127, -118, -25, -106, -78, -29, -126, -116, -26, -89, -104, -29, -127, -89, -29, -127, -103, -29, -128, -126, -26, -75, -123, -26, -76, -91, -29, -127, -89]...'"

Apparently, a keyword field cannot store a string whose UTF-8 encoding is longer than 32,766 bytes.
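
The limit is counted in UTF-8 bytes, not characters, so Japanese text reaches it much sooner than the character count suggests. The following Python sketch shows one way to check a string before indexing; the constant and function names are hypothetical, and the limit value is simply the one quoted in the error message above.

```python
MAX_TERM_BYTES = 32766  # max term length quoted in the error message


def utf8_length(text: str) -> int:
    """Length of the string in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def exceeds_keyword_limit(text: str) -> bool:
    """True if the value is longer than the limit and would be rejected as an immense term."""
    return utf8_length(text) > MAX_TERM_BYTES


if __name__ == "__main__":
    # Hiragana takes 3 bytes per character in UTF-8.
    at_limit = "あ" * 10922    # 32766 bytes
    over_limit = "あ" * 10923  # 32769 bytes
    print(utf8_length(at_limit), exceeds_keyword_limit(at_limit))
    print(utf8_length(over_limit), exceeds_keyword_limit(over_limit))
```

With 3-byte characters, a document of roughly eleven thousand characters is already enough to hit the limit.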

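In my case, the search targets include documents containing strings longer than 32,766 bytes, and there are quite a few of them.

Splitting documents so that each piece fits within 32,766 bytes is not impossible, but given the effort involved and the added cognitive load, I felt it was not something to jump into lightly.

I was also concerned that adding a keyword field would bloat the index, so I put this approach on hold for the time being.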

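Trying the Match phrase query

Next, I tried the Match phrase query.
It only needs a text field, so there seems to be no need to recreate the mapping or reindex the data.

I confirmed that a query like the following performs exact-match search.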

GET /my-index/_search

{
  "query": {
    "multi_match": {
      "query": "(search phrase)",
      "fields": [
        "title.ngram^1",
        "contents.ngram^1"
      ],
      "type": "phrase"
    }
  }
}

To combine it with partial-match search, adding the exact-match condition to the must clause as shown below should do the job.

{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "(partial-match terms)",
            "fields": [
              "title.ngram^1",
              "contents.ngram^1"
            ],
            "type": "best_fields"
          }
        },
        {
          "multi_match": {
            "query": "(exact-match phrase)",
            "fields": [
              "title.ngram^1",
              "contents.ngram^1"
            ],
            "type": "phrase"
          }
        }
      ],
      "should": [
        {
          "multi_match": {
            "query": "(partial-match terms)",
            "fields": [
              "title^3",
              "contents^3"
            ],
            "type": "best_fields"
          }
        }
      ]
    }
  }
}
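
If the request body is assembled in application code, the exact-match clause can be added only when the user actually supplies a phrase. The Python sketch below is one possible way to do that, not the code I run in production: build_search_body and the other names are hypothetical, while the field names and boosts are copied from the query above.

```python
import json
from typing import Optional

NGRAM_FIELDS = ["title.ngram^1", "contents.ngram^1"]
BOOSTED_FIELDS = ["title^3", "contents^3"]


def multi_match(query: str, fields: list[str], match_type: str) -> dict:
    """Build a single multi_match clause."""
    return {"multi_match": {"query": query, "fields": list(fields), "type": match_type}}


def build_search_body(partial: str, exact: Optional[str] = None) -> dict:
    """Combine partial-match search with an optional exact-match (phrase) condition."""
    must = [multi_match(partial, NGRAM_FIELDS, "best_fields")]
    if exact:
        # The phrase clause goes into must, so documents without the phrase are excluded.
        must.append(multi_match(exact, NGRAM_FIELDS, "phrase"))
    # The should clause only affects scoring, boosting hits on the analyzed fields.
    should = [multi_match(partial, BOOSTED_FIELDS, "best_fields")]
    return {"query": {"bool": {"must": must, "should": should}}}


if __name__ == "__main__":
    print(json.dumps(build_search_body("search engine", "this is a test"), indent=2))
```

Calling build_search_body with only the first argument yields the plain partial-match query, so the same function covers both cases.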

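Summary

I looked into ways to implement exact-match search in Elasticsearch.
Because some of the strings to be searched exceed the size limit of a keyword field, I went with the Match phrase query this time.
It has been about three months since it went into production, and so far I am satisfied with it.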
