私が考えたBigQueryでのGROUPING SETS的なこと

Last updated at 2022-12-07Posted at 2022-12-07

BigQueryって cube や grouping setsができないんですよね。(rollup はある。)

cube の代替として unnestを利用して全てのディメンションをクロスする方法があるみたい(stack overflow参照ですが、全部の組み合わせをやってしまうとレコード数爆発の危険性があるので好みではないです。。。

ということで、お好きな組み合わせを自分で指定できる grouping sets 的なことをBigQueryでやってみました。

こんなシンプルなevent_logsテーブルがあったとして、各ディメンションのグループごとにuser_idのユニーク数を求めるとします。

クエリはこんな感じになります。

with 
base as (
  select * from event_logs
  where event_date between '2022-01-01' and '2022-12-31'
),

groupings as (
  select *, ['event_date'] from base
  union all
  select *, ['event_date', 'os_type'] from base
  union all
  select *, ['event_date', 'os_type', 'device_type'] from base
  union all
  select *, ['device_type'] from base
)

select
  if('event_date' in unnest(group_keys), event_date, null) as event_date,
  if('device_type' in unnest(group_keys), device_type, null) as device_type,
  if('os_type' in unnest(group_keys), os_type, null) as os_type,
  array_to_string(group_keys, ",") as group_keys,
  APPROX_COUNT_DISTINCT(user_id) as uu_count
from groupings
group by 1,2,3,4

細かくは説明しないですが、groupingsのCTEで集計したいディメンションのグループを指定しています。

仮にevent_dateとdevice_typeの組み合わせで集計したい、という要望が新たに挙がった場合はgroupingsに union all を１つ追加してあげればいいです。

そしてそれらのディメンションの組み合わせは最終的なクエリ結果に group_keys というカラムに出てくるので、どの行がどんなディメンションの組み合わせで集計された結果なのかが一目瞭然になるという工夫をしています。

そんなに複雑じゃなく、メンテナンスもしやすいと個人的には満足してます。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up