More than 3 years have passed since last update.

BigQueryでcount distinctを近似集計で高速化したい時のメモ

Posted at 2022-02-08

概要

ユニークユーザ数の集計など、count distinctを実行する場合、BigQueryでも処理時間が膨大になってしまうほどのレコード数を扱いたい時、近似集計が有用である。
BigQueryでは2つの方法があり、その簡単な違いをまとめる。

本編

count distinct集計は、集計対象のユニーク数に依存したメモリを必要とするため、大規模なデータほど大きな計算コストが伴う。
そのため、集計を厳密性を捨てて、高速で集計するアルゴリズムが存在する。

BigQueryでは、以下の近似集計関数が用意されている。

APPROX_COUNT_DISTINCT
HyperLogLog++関数 (HLL_COUNT.INIT → HLL_COUNT.MERGE)

近似計算の仕組み

どちらも仕組みは同様で、条件が同じであれば結果も一致する（後述）

c.f. HyperLogLog sketch in practice -異なり数の集計の悩みはほぼ解決！-

サンプル

公開データセットでの例が以下の通り(データをちゃんと確認おらず意味のある集計かは置いておく)

近似計算例

select 
    dt,
    sum(repository_unique_num) as repository_unique_num,
    sum(approx_repository_unique_num) as approx_repository_unique_num,
    hll_count.merge(sketch_repository_unique_num) as hll_repository_unique_num,
from (
    select 
        date(created_at) as dt,
        count(distinct repository_url) as repository_unique_num,
        approx_count_distinct(repository_url) as approx_repository_unique_num,
        hll_count.init(repository_url) as sketch_repository_unique_num,
    from 
        `bigquery-public-data.samples.github_timeline`
    where
        created_at is not null
    group by 1
)
group by 1
order by 1

APPROX_COUNT_DISTINCT
- 通常の集計関数と同様に利用できる
HyperLogLog++関数
- HLL_COUNT.INITでsketchを集計した後、これをHLL_COUNT.MERGEで参照する
- sketchを保存しておくことで集計時の再利用性が高まる

どちらも精度は厳密集計時に対する相対誤差が1~2%ほどで収まる
（HyperLogLog++関数では精度を指定することができる）

HyperLogLog++関数の違い

HyperLogLog++関数(今後HLLと呼ぶ)を採用するメリットは複数存在する

(BigQueryの仕様として)オプションで精度を変更することができる
- init時にオプションを設定することで、近似計算の精度を調整できる
- デフォルトは15
- 大きくするほど精度は高くなる一方、sketchの必要サイズが大きくなり、計算コストが上がる

精度オプション

select 
    dt,
    sum(repository_unique_num) as repository_unique_num,
    hll_count.merge(sketch_repository_unique_num) as hll_repository_unique_num,
    hll_count.merge(sketch20_repository_unique_num) as hll20_repository_unique_num,
from (
    select 
        date(created_at) as dt,
        count(distinct repository_url) as repository_unique_num,
        hll_count.init(repository_url) as sketch_repository_unique_num,
        hll_count.init(repository_url, 20) as sketch20_repository_unique_num,
    from 
        `bigquery-public-data.samples.github_timeline`
    where
        created_at is not null
    group by 1
)
group by 1
order by 1

sketchを利用して、効率よくユニーク数集計ができる
- sketchデータを中間テーブル化し、それを活用することで効率的に集計ができる

sketch再利用の例

with
sketch_data as (
    select 
        date(created_at) as dt,
        hll_count.init(repository_url) as sketch_repository_unique_num,
    from 
        `bigquery-public-data.samples.github_timeline`
    where
        created_at is not null
    group by 1
)

select
    'daily' as type,
    dt,
    hll_count.merge(sketch_repository_unique_num) as repository_unique_num,
from
    sketch_data
group by 1, 2
union all
select
    'monthly' as type,
    date_trunc(dt, month) as dt,
    hll_count.merge(sketch_repository_unique_num) as repository_unique_num,
from
    sketch_data
group by 1, 2

参考

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up