Collecting the available values of a field with Elasticsearch Aggregations

Last updated at 2019-10-08 / Posted at 2019-10-08

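To build things like dropdowns, I tried to fetch the candidate values for search conditions from Elasticsearch with Aggregations, but with the default settings it does not return all of them.
Official documentation: Aggregation
Official documentation: Composite Aggregation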

Aggregation query

An example query that tries to fetch every category attached to a shop's products

GET _search
{
  "size": 0,
  "aggregations": {
    "category": {
      "terms": {
        "field": "category.keyword"
      }
    }
  },
  "query": {
    "match": {
      "shop": "00000001"
    }
  }
}

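Looking at the result, not everything has been aggregated.
When doc_count_error_upper_bound or sum_other_doc_count is greater than 0, some terms or documents may be missing from the result.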

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 10632,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "category" : {
      "doc_count_error_upper_bound" : 190, // upper bound on the doc count error of the returned terms; we want 0
      "sum_other_doc_count" : 6857, // total doc count of the buckets that were not returned; we want 0
      "buckets" : [
        {
          "key" : "book",
          "doc_count" : 10621
        },
        // omitted (10 buckets in total, the default)
        {
          "key" : "novel",
          "doc_count" : 10580
        }
      ]
    }
  }
}
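
The check described above can be sketched in Python. This is a minimal illustration, not part of the original article: the helper name terms_agg_is_complete is hypothetical, and it only inspects the two fields shown in the response above.

```python
from typing import Any, Dict


def terms_agg_is_complete(agg: Dict[str, Any]) -> bool:
    """Return True when a terms aggregation result shows no sign of truncation.

    `agg` is the object found under response["aggregations"][<name>].
    The function name is hypothetical; only the two field names come
    from the Elasticsearch response.
    """
    return (
        agg["doc_count_error_upper_bound"] == 0
        and agg["sum_other_doc_count"] == 0
    )


# Trimmed-down version of the response above (buckets omitted).
sample_agg = {
    "doc_count_error_upper_bound": 190,
    "sum_other_doc_count": 6857,
    "buckets": [],
}

if not terms_agg_is_complete(sample_agg):
    print("terms aggregation may be missing terms or documents")
```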

How to avoid missing results (as far as possible)

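The terms aggregation returns the top terms, so there is no guarantee that every term is included.
That said, setting size (and with it shard_size) to a large enough value (around 1000 is plenty for this data set) brings doc_count_error_upper_bound and sum_other_doc_count down to 0, at least for now.
Note that raising shard_size alone improves the accuracy of the counts but still returns only size buckets, so sum_other_doc_count stays above 0 whenever the field has more terms than that.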


GET _search
{
  "size": 0,
  "aggregations": {
    "category": {
      "terms": {
        "field": "category.keyword",
        "size": 1000 // default: size=10
      }
    }
  },
  "query": {
    "match": {
      "shop": "00000001"
    }
  }
}

GET _search
{
  "size": 0,
  "aggregations": {
    "category": {
      "terms": {
        "field": "category.keyword",
        "shard_size": 2000 // default: shard_size = size * 1.5 + 10 = 10 * 1.5 + 10 = 25
      }
    }
  },
  "query": {
    "match": {
      "shop": "00000001"
    }
  }
}
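
To make the relationship between size and shard_size concrete, here is a small Python sketch. The helper names default_shard_size and build_terms_query are hypothetical. The formula is the documented default (size * 1.5 + 10), and the request body mirrors the queries above.

```python
import json
from typing import Any, Dict, Optional


def default_shard_size(size: int) -> int:
    """Documented default for shard_size when it is not set explicitly."""
    return int(size * 1.5 + 10)


def build_terms_query(
    shop: str, size: int = 10, shard_size: Optional[int] = None
) -> Dict[str, Any]:
    """Build the request body used above (hypothetical helper)."""
    terms: Dict[str, Any] = {"field": "category.keyword", "size": size}
    if shard_size is not None:
        terms["shard_size"] = shard_size
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggregations": {"category": {"terms": terms}},
        "query": {"match": {"shop": shop}},
    }


print(default_shard_size(10))    # 25
print(default_shard_size(1000))  # 1510
print(json.dumps(build_terms_query("00000001", size=1000), indent=2))
```

With size set to 1000, the default shard_size already exceeds it, which is why setting size alone is enough here.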

The officially recommended approach

According to the official documentation: "you should use the Composite aggregation which allows to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the terms aggregation"
In other words: use the Composite aggregation instead of setting size to a large value.
So the proper way to fetch every term is to paginate with the Composite Aggregation.

Fetch everything page by page by specifying after

GET /_search
{
  "size": 0,
  "aggregations": {
    "category": {
      "composite": {
        "size": 10, // kept small at 10 to test paging
        "after": { "category" : "novel" }, // set to the after_key value from the previous page's response
        "sources": [
          {
            "category": {
              "terms": {
                "field": "category.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "query": {
    "match": {
      "shop": "00000001"
    }
  }
}

After that, just keep paging. Each response looks like this:

{
  "took" : 23,
  "timed_out" : false,
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 10632,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "category" : {
      "after_key" : {
        "category" : "photo" // last value on this page; pass it as after when requesting the next page
      },
      "buckets" : [
        {
          "key" : {
            "category" : "magazine"
          },
          "doc_count" : 5
        },
        // omitted (10 buckets in total)
        {
          "key" : {
            "category" : "photo" // last value on this page; pass it as after when requesting the next page
          },
          "doc_count" : 31
        }
      ]
    }
  }
}
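
The paging loop itself might look like the following Python sketch. It is an illustration under assumptions, not the article's code: iter_composite_buckets is a hypothetical name, and search stands for any callable that sends the request body to Elasticsearch and returns the parsed JSON response, so the sketch does not depend on a particular client library.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical type: takes a request body, returns the parsed response.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def iter_composite_buckets(
    search: SearchFn, shop: str, page_size: int = 10
) -> Iterator[Dict[str, Any]]:
    """Yield every bucket of the composite aggregation, page by page."""
    after: Optional[Dict[str, Any]] = None
    while True:
        composite: Dict[str, Any] = {
            "size": page_size,
            "sources": [
                {"category": {"terms": {"field": "category.keyword"}}}
            ],
        }
        if after is not None:
            # Resume right after the last key of the previous page.
            composite["after"] = after
        body = {
            "size": 0,
            "aggregations": {"category": {"composite": composite}},
            "query": {"match": {"shop": shop}},
        }
        agg = search(body)["aggregations"]["category"]
        buckets = agg["buckets"]
        yield from buckets
        after = agg.get("after_key")
        # Stop when the page is empty or no after_key is returned.
        if not buckets or after is None:
            break
```

Collecting the category names is then a matter of reading bucket["key"]["category"] from each yielded bucket.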

Conclusion

My use case only needed the top terms, so specifying size was enough.
