Random sampling with BigQuery Standard SQL

Posted at 2021-04-07

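Overview

In BigQuery Standard SQL, the rand() function does not accept a seed, so reproducible random sampling of an arbitrary number of rows is cumbersome to implement.
This note records a simple workaround for future reference.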

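Approach

Sort the rows by a hash of a unique ID and assign each row a rank
- Concatenate an arbitrary string to the unique ID so that it acts as the seed
Use the rank to take the desired number of rows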
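
As a minimal sketch of why a hash can stand in for a seeded rand(), the query below puts the two side by side. It uses a made-up inline word list instead of a real table, and the column names are only illustrative. The rand() column changes on every run, while the farm_fingerprint() column depends only on its input string, so ordering by it gives the same order each time.

```sql
-- Hypothetical inline data; any column of unique IDs would work the same way.
select
    word,
    rand() as random_value,                                 -- differs on every run
    farm_fingerprint(concat('seed', word)) as hashed_value  -- identical on every run
from
    unnest(['apple', 'banana', 'cherry']) as word
order by
    hashed_value, word
```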

Random sampling in Standard SQL

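with
rand_sample as (
    select
        * except(rank)
    from (
        select
            word,
            -- 'seed' plays the role of the random seed: the hash of seed + ID gives
            -- a pseudo-random but reproducible order; word breaks hash collisions.
            dense_rank() over (order by farm_fingerprint(concat('seed', word)), word) as rank
        from
            -- the unique IDs to sample from
            (select distinct word from `publicdata.samples.shakespeare`)
    )
    where
        rank <= 5  -- sample size
)

select * from rand_sample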
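
Changing the seed string should give a different but equally reproducible ordering. The sketch below ranks the same rows under two seeds and flags which rows fall into each 3-row sample. The seed strings 'seed-a' and 'seed-b', the inline word list, and the column names are all assumptions made for illustration. With so few rows the two samples may overlap heavily.

```sql
with
words as (
    -- Hypothetical inline data standing in for a table of unique IDs.
    select word
    from unnest(['apple', 'banana', 'cherry', 'durian', 'elderberry', 'fig']) as word
),
ranked as (
    select
        word,
        -- Same ranking technique as above, once per seed string.
        row_number() over (order by farm_fingerprint(concat('seed-a', word)), word) as rank_a,
        row_number() over (order by farm_fingerprint(concat('seed-b', word)), word) as rank_b
    from
        words
)

select
    word,
    rank_a <= 3 as in_sample_a,  -- true for the 3 rows picked with 'seed-a'
    rank_b <= 3 as in_sample_b   -- true for the 3 rows picked with 'seed-b'
from
    ranked
order by
    word
```

Re-running the query returns the same flags every time, which is the reproducibility that rand() cannot give.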

References

BigQuery でランダムサンプリング (Random sampling in BigQuery)
