Doing Exact-Match-Style Search in Elasticsearch

Posted at 2023-12-06

This article is part of the ディップ Advent Calendar 2023.

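Introduction

When migrating from an RDB to Elasticsearch, you may find yourself struggling with requests like "this free-word search really has to return the same results as the existing implementation (a LIKE search)." (I have struggled with this more than once.)
I hope this article gives you some hints.

The Elasticsearch version used here is 8.5.x.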

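Requirements

We want to search the words stored in freeword_text by exact match, meaning the search word must appear in the field character for character, as it does with the existing LIKE search.
freeword_text contains values such as "text-1" and "text_1".
Naturally, "text-1" and "text_1" must be treated as different values.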
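To make the target behavior concrete, here is a minimal Python sketch of the semantics we are trying to reproduce. The helper name like_contains is hypothetical, and it assumes the search word is treated literally (in real SQL, an unescaped _ inside a LIKE pattern is itself a wildcard).

```python
def like_contains(field_value: str, word: str) -> bool:
    """Rough stand-in for SQL `field LIKE '%word%'` with `word` taken literally."""
    return word in field_value


doc = "【text-1】フリーワード検索用のデータ"

print(like_contains(doc, "text-1"))  # True: the literal string is present
print(like_contains(doc, "text_1"))  # False: "_" is not "-"
```

Any approach below is judged against this behavior: a hit for text-1 and no hit for text_1.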

Approach 1: Use a wildcard query

Official documentation: Wildcard query

As the name suggests, this query lets you use wildcards in the search term.

First, define the mapping.

Mapping

PUT test-index
{
  "mappings": {
    "properties": {
      "freeword_text": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

Index a document.

PUT test-index/_doc/1
{
  "freeword_text":"【text-1】フリーワード検索用のデータ"
}

Let's run a search.

# Query
GET test-index/_search
{
  "query": {
    "wildcard": {
      "freeword_text": {
        "value": "*text-1*"
      }
    }
  }
}

# Result
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test-index",
        "_id": "1",
        "_score": 1,
        "_source": {
          "freeword_text": "【text-1】フリーワード検索用のデータ"
        }
      }
    ]
  }
}

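I'll omit the details, but searching for text_1 also gave the expected result.

This would satisfy the requirement, but the wildcard query is expensive: the official documentation advises against patterns that begin with * or ?, because they increase the number of iterations needed to find matching terms.
Since this search is expected to run frequently in production, I decided not to adopt this approach.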
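The cost difference can be illustrated with a toy model. The sketch below is only a conceptual illustration and is not how Lucene actually evaluates wildcard queries. It assumes a sorted list of terms (the hypothetical sorted_terms) and counts how many terms have to be tested for a given pattern.

```python
from bisect import bisect_left
from fnmatch import fnmatchcase


def candidate_terms(sorted_terms, pattern):
    """Return the terms that must be tested against `pattern` in this toy model."""
    # The literal prefix is everything before the first wildcard character.
    prefix = ""
    for ch in pattern:
        if ch in "*?":
            break
        prefix += ch
    if not prefix:
        # Leading wildcard: nothing narrows the search, so every term is a candidate.
        return list(sorted_terms)
    # With a literal prefix, binary search jumps to the matching range.
    lo = bisect_left(sorted_terms, prefix)
    hi = lo
    while hi < len(sorted_terms) and sorted_terms[hi].startswith(prefix):
        hi += 1
    return sorted_terms[lo:hi]


def wildcard_search(sorted_terms, pattern):
    """Return (matching terms, number of terms that had to be tested)."""
    candidates = candidate_terms(sorted_terms, pattern)
    hits = [t for t in candidates if fnmatchcase(t, pattern)]
    return hits, len(candidates)


terms = sorted([
    "apple",
    "banana",
    "text-1 memo",
    "text_1 memo",
    "【text-1】フリーワード検索用のデータ",
])

hits, tested = wildcard_search(terms, "text-*")
print(len(hits), tested)  # 1 1

hits, tested = wildcard_search(terms, "*text-1*")
print(len(hits), tested)  # 2 5
```

In this model the pattern with a literal prefix tests one term, while the LIKE-style pattern *text-1* tests all five. With a large number of distinct values, that full scan is what makes the query costly.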

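Approach 2: Use a match_phrase query

Official documentation: Match phrase query

Put simply, this is a phrase search.
In a bit more detail, it hits documents that contain the analyzed query terms next to each other and in the same order.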
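As a rough mental model (a sketch that assumes the default slop of 0 and ignores scoring), a phrase match can be thought of as looking for the query tokens as a consecutive run inside the document tokens. The function name phrase_match is hypothetical.

```python
def phrase_match(doc_tokens, query_tokens):
    """True if query_tokens occurs as a consecutive run inside doc_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False  # nothing to match
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


doc_tokens = ["quick", "brown", "fox"]

print(phrase_match(doc_tokens, ["brown", "fox"]))  # True: same order, adjacent
print(phrase_match(doc_tokens, ["fox", "brown"]))  # False: wrong order
print(phrase_match(doc_tokens, ["quick", "fox"]))  # False: not adjacent
```

The important point for what follows is that the comparison happens on tokens, so the result depends entirely on how the analyzer splits the text.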

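First, create the mapping. If test-index from approach 1 still exists, delete it first (DELETE test-index), because the type of an existing field cannot be changed.

Mapping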

PUT test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "kuromoji_analyzer": {
          "tokenizer": "kuromoji_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "freeword_text": {
        "type": "text",
        "analyzer": "kuromoji_analyzer"
      }
    }
  }
}

After indexing the same document as in approach 1 again, let's run a search.

# Query
GET test-index/_search
{
  "query": {
    "match_phrase": {
      "freeword_text": "text-1"
    }
  }
}

# Result
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test-index",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "freeword_text": "【text-1】フリーワード検索用のデータ"
        }
      }
    ]
  }
}

Looks good!
Now let's search for text_1.

# Query
GET test-index/_search
{
  "query": {
    "match_phrase": {
      "freeword_text": "text_1"
    }
  }
}

# Result
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test-index",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "freeword_text": "【text-1】フリーワード検索用のデータ"
        }
      }
    ]
  }
}

...Hmm, I see? The document hits even though it does not contain text_1.

Let's use _analyze to check how the text is being tokenized.

GET test-index/_analyze
{
  "analyzer": "kuromoji_analyzer",
  "text": ["text-1"]
}

# Result
{
  "tokens": [
    {
      "token": "text",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "1",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 1
    }
  ]
}

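The cause is that kuromoji_tokenizer discards symbols (its discard_punctuation option defaults to true), so both text-1 and text_1 are reduced to the same two tokens, text and 1.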
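The effect can be reproduced with a crude stand-in. The sketch below is not kuromoji (it does no morphological analysis and only handles ASCII letters and digits); it simply assumes a tokenizer that keeps runs of letters or digits and drops everything else.

```python
import re


def tokenize_discarding_symbols(text):
    """Keep runs of ASCII letters or digits as tokens and drop all other characters."""
    return re.findall(r"[A-Za-z]+|[0-9]+", text)


print(tokenize_discarding_symbols("text-1"))  # ['text', '1']
print(tokenize_discarding_symbols("text_1"))  # ['text', '1']
```

Once the symbol is gone, the two search words are indistinguishable, so a phrase query for either one matches a document containing the other.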
So I revised the mapping as follows. As before, delete and recreate the index, then index the document again.

Mapping (revised)

PUT test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "bigram_analyzer": {
          "tokenizer": "bigram_tokenizer"
        }
      },
      "tokenizer": {
        "bigram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "freeword_text": {
        "type": "text",
        "analyzer": "bigram_analyzer"
      }
    }
  }
}
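To see why this mapping should distinguish the two search words, here is a small simulation. It assumes the ngram tokenizer keeps every character (its token_chars setting defaults to keeping all characters) and reuses a simple consecutive-run check as a model of match_phrase with slop 0. The names bigrams and phrase_match are hypothetical.

```python
def bigrams(text):
    """Split text into overlapping 2-character tokens (min_gram = max_gram = 2)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def phrase_match(doc_tokens, query_tokens):
    """True if query_tokens occurs as a consecutive run inside doc_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False  # a query that produces no tokens matches nothing
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


doc = "【text-1】フリーワード検索用のデータ"

print(bigrams("text-1"))                               # ['te', 'ex', 'xt', 't-', '-1']
print(phrase_match(bigrams(doc), bigrams("text-1")))   # True
print(phrase_match(bigrams(doc), bigrams("text_1")))   # False
```

Because the symbols survive inside the tokens t- and -1, the query for text_1 produces t_ and _1, which never occur in the document. One caveat of this model is that a one-character search word yields no bigrams at all and therefore matches nothing, which is worth checking against the real analyzer if short search words matter.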

Let's check whether we get the intended result.

# Query
GET test-index/_search
{
  "query": {
    "match_phrase": {
      "freeword_text": "text-1"
    }
  }
}

# Result
{
  "took": 259,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test-index",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "freeword_text": "【text-1】フリーワード検索用のデータ"
        }
      }
    ]
  }
}

That looks OK!
Now, what about the problematic text_1?

# Query
GET test-index/_search
{
  "query": {
    "match_phrase": {
      "freeword_text": "text_1"
    }
  }
}

# Result
{
  "took": 963,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

This looks like it will meet the requirements.
