DDL
CREATE TABLE test_running_sum(
inventory_id UInt32,
user_id UInt32,
timestamp DateTime
) engine = MergeTree()
ORDER BY (inventory_id, timestamp)
DML
INSERT INTO test_running_sum VALUES
(0, 1, '2021-01-01 10:00:00'),
(0, 2, '2021-01-01 10:00:00'),
(0, 3, '2021-01-01 10:00:00'),
(0, 1, '2021-01-01 11:00:00'),
(0, 4, '2021-01-01 11:00:00');
こうなって
inventory_id | user_id | timestamp |
---|---|---|
0 | 1 | 2021-01-01 10:00:00 |
0 | 2 | 2021-01-01 10:00:00 |
0 | 3 | 2021-01-01 10:00:00 |
0 | 1 | 2021-01-01 11:00:00 |
0 | 4 | 2021-01-01 11:00:00 |
実際に集計するときは uniqCombinedMerge
/ uniqCombinedState
を WINDOW と一緒に使うと
SET allow_experimental_window_functions = 1; -- 2021/05/23 時点では必要
SELECT date_trunc('hour', timestamp) as hour
, uniqCombined(user_id) as UU
, uniqCombinedMerge(uniqCombinedState(user_id)) OVER (ORDER BY date_trunc('hour', timestamp)) as `Running UU`
FROM test_running_sum
GROUP BY date_trunc('hour', timestamp)
ORDER BY date_trunc('hour', timestamp);
ちゃんとこうなる
hour | UU | Running UU |
---|---|---|
2021-01-01 10:00:00 | 3 | 3 |
2021-01-01 11:00:00 | 2 | 4 |
真の UU との比較がしたければ uniqCombined
を uniqExact
に変えたものも出せば良い
SELECT date_trunc('hour', timestamp) as hour
, uniqCombined(user_id) as `UU`
, uniqCombinedMerge(uniqCombinedState(user_id)) OVER (ORDER BY date_trunc('hour', timestamp)) as `Running UU`
, uniqExact(user_id) as `Exact UU`
, uniqExactMerge(uniqExactState(user_id)) OVER (ORDER BY date_trunc('hour', timestamp)) as `Exact Running UU`
FROM test_running_sum
GROUP BY date_trunc('hour', timestamp)
ORDER BY date_trunc('hour', timestamp);
当然このサイズでは差が出ない
hour | UU | Running UU | Exact UU | Exact Running UU |
---|---|---|---|---|
2021-01-01 10:00:00 | 3 | 3 | 3 | 3 |
2021-01-01 11:00:00 | 2 | 4 | 2 | 4 |
Reference
- ClickHouse Documentation
- 2019 年の時点で Sate/Merge に言及していてスゴイ