More than 5 years have passed since last update.

【ウィンドウ関数】各グループごとの最大値行だけを取得する【Athena】

Posted at 2020-09-28

グループ毎に最大値を取ってくるのが難しい

特定カラムの最大値を取ってくるだけであればMAX関数を利用するだけで解決するので特に悩むところはないと思われます。mysql例
しかし、以下のようなテーブルからuser_id毎の最大値を取ってくる場合、一工夫いります。

user_id	score
1	200
3	120
2	140
2	220
3	100
4	50
5	270
4	500
5	10

今回はAthenaで実行したのでcreate文が以下となっています。

create.sql

CREATE EXTERNAL TABLE IF NOT EXISTS tmp.max_test (
  `user_id` int,
  `score` int 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://example_bucket/'
TBLPROPERTIES ('has_encrypted_data'='false');

ウィンドウ関数の利用

この場合、ウィンドウ関数のrow_number関数を利用して各グループ、今回であればuser_id毎にスコアに降順の順位ををつけてその上位１位を取得することで最大値行の一覧を引っ張ってこれます。

select.sql

select
    user_id,
    score
from
    (
        SELECT
            user_id,
            score,
            ROW_NUMBER() over(partition by user_id ORDER BY score desc) AS score_rank
        FROM
            tmp.max_test
    )
where
    score_rank = 1

簡単に解説

サブクエリ内のROW_NUMBER関数のoverの括弧内partitionでグループを分けます。今回はuser_id毎に最大値を取得したいのでuser_idですが、複数のカラムを使ってpartitionを分けることも可能です。その場合はカンマでつなげます。
orderby句でそのpartition内の順番を決めます。今回はscoreを降順で並べたいのでscore_rank descとします。

サブクエリの結果はこのようになります。

user_id	score	score_rank
1	200	1
1	150	2
2	220	1
2	140	2
3	120	1
3	100	2
4	500	1
4	50	2
5	270	1
5	10	2

これをみると後はwhere score_rank = 1で持ってくればいいというのがはっきりわかります。

MySqlでも同様の関数がありました。Window Function Descriptions。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up