
[Elasticsearch 2.0] Trying out Pipeline Aggregations - Avg/Max/Min/Sum Bucket Aggregations

Elasticsearch

Last updated at 2015-11-13, posted at 2015-11-13

Theme

"Count Japan's Top 100 Cherry Blossom Spots by prefecture"

Data used

http://linkdata.org/work/rdf1s544i?key=#work_information

Environment

MacBook Pro (Retina, 15-inch, Mid 2014)
2.2 GHz Intel Core i7
16 GB 1600 MHz DDR3
OS X El Capitan 10.11 (15A284)
Elasticsearch 2.0.0

Preparation

Convert the data into a JSON format suitable for bulk indexing (details omitted)
Create the index

curl -s -H 'Content-Type: application/json' -XPUT localhost:9200/100_cherry -d '{
  "settings": {
    "index": {
      "number_of_replicas": 0, // single-node cluster, so no replicas
      "number_of_shards": 1,
      "refresh_interval": -1 // just a personal preference
    }
  }
}'

Mapping settings

curl -X PUT 'localhost:9200/100_cherry/_mapping/doc' -d '{
  "properties": {
    "wikipedia_url": {
      "type": "string",
      "index": "not_analyzed" 
    },
    "location": {
      "type": "string",
      "index": "not_analyzed" 
    },
    "pref": {
      "type": "string",
      "index": "not_analyzed" 
    },
    "geo_point": {
      "type" : "geo_point"
    }
  }
}'
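
The article omits how the data was converted for bulk indexing. As a rough sketch only, the bulk API body could be produced along these lines. The function name `to_bulk_lines` and the sample record are hypothetical and not taken from the dataset; only the index name, type name, and field names come from the commands above. Because `refresh_interval` is set to -1 above, a manual refresh (for example `POST /100_cherry/_refresh`) would presumably be needed after loading before the documents become searchable.

```python
import json


def to_bulk_lines(records, index="100_cherry", doc_type="doc"):
    """Return an NDJSON string for the _bulk API.

    Each record becomes two lines: an action line and the document source.
    """
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(record, ensure_ascii=False))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical placeholder record (assumption: not real data from the source).
sample = [
    {
        "wikipedia_url": "http://example.com/spot",
        "location": "Example Park",
        "pref": "東京都",
        "geo_point": {"lat": 35.0, "lon": 139.0},
    }
]

bulk_body = to_bulk_lines(sample)
print(bulk_body, end="")
```

The resulting text could be saved to a file and sent with `curl --data-binary @file` to the `_bulk` endpoint.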

Query

query.json

curl -XGET "http://localhost:9200/100_cherry/_search?search_type=count" -d'
{
  "query": {
    "match_all": {} // fetch all documents
  },
  "aggs": {
    "pref": {
      "terms": {
        "field": "pref", // count the spots per prefecture
        "size": 5 // return 5 buckets
      }
    },
    "max": {
      "max_bucket": { // maximum
        "buckets_path": "pref._count"
      }
    },
    "min": {
      "min_bucket": { // minimum
        "buckets_path": "pref._count"
      }
    },
    "ave": {
      "avg_bucket": { // average
        "buckets_path": "pref._count"
      }
    },
    "sum": {
      "sum_bucket": { // sum
        "buckets_path": "pref._count"
      }
    }
  }
}'
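
The four sibling pipeline aggregations in the query above share the same `buckets_path`, so the request body can also be generated programmatically. The following is a minimal sketch; the helper name `build_query` and its parameters are assumptions for illustration, while the aggregation names (`pref`, `max`, `min`, `ave`, `sum`) mirror the query above.

```python
import json


def build_query(agg_name="pref", field="pref", size=5):
    """Build a terms aggregation plus four sibling pipeline aggregations."""
    # "<agg_name>._count" points at the doc_count of each terms bucket.
    buckets_path = agg_name + "._count"
    aggs = {agg_name: {"terms": {"field": field, "size": size}}}
    siblings = [
        ("max", "max_bucket"),
        ("min", "min_bucket"),
        ("ave", "avg_bucket"),
        ("sum", "sum_bucket"),
    ]
    for name, kind in siblings:
        aggs[name] = {kind: {"buckets_path": buckets_path}}
    return {"query": {"match_all": {}}, "aggs": aggs}


body = build_query()
print(json.dumps(body, indent=2))
```

Since Japan has 47 prefectures, calling `build_query(size=47)` would make the terms aggregation return a bucket for every prefecture, so the pipeline aggregations would cover all of them.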

Result

response.json

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 100,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "pref": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 81, // documents not included in the buckets returned for the given size
      "buckets": [
        {
          "key": "東京都",
          "doc_count": 5
        },
        {
          "key": "京都府",
          "doc_count": 4
        },
        {
          "key": "愛知県",
          "doc_count": 4
        },
        {
          "key": "兵庫県",
          "doc_count": 3
        },
        {
          "key": "千葉県",
          "doc_count": 3
        }
      ]
    },
    "max": {
      "value": 5,
      "keys": [
        "東京都"
      ]
    },
    "min": {
      "value": 3,
      "keys": [
        "兵庫県",
        "千葉県"
      ]
    },
    "ave": {
      "value": 3.8
    },
    "sum": {
      "value": 19
    }
  }
}
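
As a sanity check, the pipeline values in the response can be recomputed locally from the five returned buckets. This is only an illustrative sketch of what the four aggregations compute, not Elasticsearch's implementation; the function name `pipeline_stats` is an assumption.

```python
# Buckets copied from the response above.
buckets = [
    {"key": "東京都", "doc_count": 5},
    {"key": "京都府", "doc_count": 4},
    {"key": "愛知県", "doc_count": 4},
    {"key": "兵庫県", "doc_count": 3},
    {"key": "千葉県", "doc_count": 3},
]


def pipeline_stats(buckets):
    """Recompute max/min/avg/sum over the bucket doc_counts."""
    counts = [b["doc_count"] for b in buckets]
    highest, lowest = max(counts), min(counts)
    return {
        "max": {
            "value": highest,
            "keys": [b["key"] for b in buckets if b["doc_count"] == highest],
        },
        "min": {
            "value": lowest,
            "keys": [b["key"] for b in buckets if b["doc_count"] == lowest],
        },
        "ave": {"value": sum(counts) / len(counts)},
        "sum": {"value": sum(counts)},
    }


stats = pipeline_stats(buckets)
total_hits = 100  # hits.total from the response
other_docs = total_hits - stats["sum"]["value"]
print(stats)
print(other_docs)
```

The remainder `other_docs` matches `sum_other_doc_count` here, which is what one would expect when each document carries exactly one `pref` value.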

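Summary and impressions

When size is set on the upstream agg, the max, min, average, and sum are computed only over that many buckets
Both the query and the feature are easy to understand and behave as expected
I got to try out some open data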
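
The first point, that the statistics only see the buckets kept by `size`, can be illustrated with a small sketch. The counts below are hypothetical (keys `A` to `G` are placeholders, not prefectures from the dataset), and `top_n_stats` is an assumed helper name.

```python
def top_n_stats(counts, size):
    """Keep the `size` largest counts, then compute stats over only those."""
    top = sorted(counts.values(), reverse=True)[:size]
    return {
        "max": max(top),
        "min": min(top),
        "avg": sum(top) / len(top),
        "sum": sum(top),
    }


# Hypothetical per-key document counts (assumption for illustration).
hypothetical = {"A": 5, "B": 4, "C": 4, "D": 3, "E": 3, "F": 2, "G": 1}

truncated = top_n_stats(hypothetical, 5)  # only the top 5 buckets
complete = top_n_stats(hypothetical, 7)   # every bucket
print(truncated)
print(complete)
```

With the truncated set the minimum is 3 and the sum is 19, while over all buckets the minimum drops to 1 and the sum rises to 22, so `size` should be chosen with the intended scope of the statistics in mind.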
