
Elasticsearch search results were inconsistent, so I fixed it by specifying preference

Elasticsearch

Last updated at 2021-04-10, posted at 2021-03-26

Overview

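In Elasticsearch (hereafter Es) 7.x, when a search had enough hits to require pagination (or when the result size was limited), the documents shown on the first page (before paging) were inconsistent between identical requests.
- Strictly speaking, it looks like the scores changed between requests, which changed the ranking and therefore which documents appeared on the first page.
As described in https://www.elastic.co/guide/en/elasticsearch/reference/current/consistent-scoring.html#_scores_are_not_reproducible, this is expected behavior and cannot be avoided by default.
Setting preference to a suitable custom string at search time fixed it.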

Details (not that there is much to it)

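Cause
- As also described in https://www.elastic.co/guide/en/elasticsearch/reference/current/consistent-scoring.html#_scores_are_not_reproducible, Es routes each query to shards at search time.
- The statistics used for score calculation differ from shard to shard (the linked documentation describes this for a primary and its replicas), so the ranking changed slightly depending on which shard the query was routed to.
  - According to https://medium.com/thron-tech/why-scaling-elasticsearch-broke-our-ranking-and-how-we-fixed-it-e603b60f0c05, these statistics are DF / IDF.
  - Given how full-text search works (inverted indexes and so on), that makes sense.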
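To make the DF / IDF point concrete, here is a minimal sketch that assumes the default BM25 similarity and uses the IDF formula from Lucene's BM25 implementation. The document counts are made-up numbers, not measurements from the cluster discussed here; they stand in for a primary that has merged away its deleted documents and a replica that has not.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF term as computed by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics for two copies of the same shard. The replica still
# counts deleted documents, so both its totals are higher than the primary's.
primary = {"doc_count": 1000, "doc_freq": 50}
replica = {"doc_count": 1100, "doc_freq": 80}

idf_primary = bm25_idf(**primary)
idf_replica = bm25_idf(**replica)

# The same term gets a different weight depending on which copy answers.
print(f"primary idf: {idf_primary:.4f}")
print(f"replica idf: {idf_replica:.4f}")
```

Because the IDF feeds into every matching document's score, even a small difference like this can reorder documents whose scores are close together.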
Fix
- If you pass some string as preference at search time, requests with the same string are routed to the same shards (as long as the cluster state does not change), so scoring becomes consistent and, as a result, so do the search results.
  - https://www.elastic.co/guide/en/elasticsearch/reference/current/consistent-scoring.html#_scores_are_not_reproducible
  The recommended way to work around this issue is to use a string that identifies the user that is logged in (a user id or session id for instance) as a preference. This ensures that all queries of a given user are always going to hit the same shards, so scores remain more consistent across queries.
- As the quote says, the custom string should be something whose values are reasonably well distributed, such as a user id or session id.
  - Otherwise, I assume the load would concentrate on particular shards.
  - On the other hand, the same string has to be passed throughout the scope where you want reproducibility (whether that is per session, per user, or per user group depends on the case).
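As a rough illustration of the fix, the sketch below builds (but does not send) a search request that carries a custom preference string as a query parameter. It uses only the Python standard library; the node URL, the index name "articles", the query, and the session id are all hypothetical values chosen for the example.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_search_request(base_url: str, index: str, query: dict, preference: str) -> Request:
    """Build a _search request whose routing is pinned by a preference string."""
    if preference.startswith("_"):
        # Per the reference, a custom string must not start with "_";
        # built-in values such as _local do.
        raise ValueError("custom preference must not start with '_'")
    params = urlencode({"preference": preference})
    url = f"{base_url.rstrip('/')}/{quote(index)}/_search?{params}"
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


# Hypothetical values: a local node, an index named "articles", a session id.
req = build_search_request(
    "http://localhost:9200",
    "articles",
    {"match": {"title": "elasticsearch"}},
    preference="session-3f9a2c",
)
print(req.full_url)
```

Passing the resulting request to `urllib.request.urlopen` would send it. Reusing the same session id for every page of the same search is what keeps the ranking stable within that session.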
Result
- The results became reproducible (the same search returned the same results).
Open questions
- Are there any restrictions (character set, length) on the custom string that can be passed as custom-string?
  - From the following, it looks like anything is fine as long as it does not start with _.
  - https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html#search-preference
  Any string that does not start with _. If the cluster state and selected shards do not change, searches using the same value are routed to the same shards in the same order.
- As far as I tested, truly arbitrary strings do seem to work, though...
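If the open question about character set and length is a concern, one defensive option is to hash whatever identifier you have into a fixed-length hexadecimal string before using it as the preference. Elasticsearch does not require this; it is only a sketch, and the function name `preference_for` is made up for the example. A hex digest never starts with _, so it always satisfies the one documented rule.

```python
import hashlib


def preference_for(identifier: str) -> str:
    """Map any identifier to a 64-character hex string usable as a preference."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


# Hypothetical identifiers, including ones that would be awkward to pass as-is.
samples = ["user-42", "_starts_with_underscore", "日本語のセッション ID", "x" * 10_000]
for raw in samples:
    pref = preference_for(raw)
    print(len(pref), pref[:12])
```

The same identifier always maps to the same string, so the routing stays stable, and distinct identifiers still spread across shard copies.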
