Random sampling with BigQuery

Last updated at 2017-11-09, posted at 2017-11-08

Mostly as a memo to myself, here are the random sampling methods I often use in BigQuery. The hope is that if anything here is wrong, someone will correct it.

Simple random sampling

Idea: generate reproducible pseudo-random numbers, then either sort by them or take MOD.

x% sampling

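SELECT
  id,
  date
FROM table.YYYYMMDD
-- MOD() on the signed 64-bit hash: bucket 0 of 100 keeps about 1% of rows.
-- For x%, keep x buckets instead: ABS(MOD(..., 100)) < x.
WHERE MOD(FARM_FINGERPRINT(CONCAT(id, date)), 100) = 0;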
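
As a hedged sketch of how the 1% filter above generalizes to an arbitrary percentage, the query below builds a small inline table and keeps the rows whose hash bucket falls below a threshold. The `source` CTE, its 10,000 synthetic ids, the `'|'` separator and the 5% threshold are all assumptions made for illustration, not part of the original data.

```sql
-- Synthetic stand-in for table.YYYYMMDD (assumption: id and date are STRING).
WITH source AS (
  SELECT
    CAST(n AS STRING) AS id,
    '20171108' AS date
  FROM UNNEST(GENERATE_ARRAY(1, 10000)) AS n
)

SELECT
  id,
  date
FROM source
-- FARM_FINGERPRINT returns a signed INT64, so MOD can be negative;
-- ABS folds the bucket into 0..99. Buckets 0..4 keep roughly 5% of rows.
-- The '|' separator keeps different (id, date) pairs from concatenating
-- to the same string.
WHERE ABS(MOD(FARM_FINGERPRINT(CONCAT(id, '|', date)), 100)) < 5;
```

Because the bucket depends only on id and date, re-running the query should return the same rows each time.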

Sampling with a fixed sample size

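SELECT
  id,
  date,
  -- Reproducible pseudo-random sort key. HASH is a reserved keyword in
  -- BigQuery standard SQL, so the alias is hash_value.
  FARM_FINGERPRINT(CONCAT(id, date)) AS hash_value
FROM table.YYYYMMDD
ORDER BY hash_value DESC
LIMIT $sample_size;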
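
If several different samples are needed from the same table, one hypothetical extension (not something the query above does) is to mix a fixed seed string into the hash input. Each seed should give a different ordering, while every run with the same seed returns the same rows. The `source` CTE, the seed `'seed-1'`, the `'|'` separator and the sample size of 100 are placeholders chosen for illustration.

```sql
-- Synthetic stand-in for table.YYYYMMDD (assumption: id and date are STRING).
WITH source AS (
  SELECT
    CAST(n AS STRING) AS id,
    '20171108' AS date
  FROM UNNEST(GENERATE_ARRAY(1, 10000)) AS n
)

SELECT
  id,
  date,
  -- Changing 'seed-1' to another string reshuffles the sort order.
  FARM_FINGERPRINT(CONCAT('seed-1', '|', id, '|', date)) AS hash_value
FROM source
ORDER BY hash_value DESC
LIMIT 100;
```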

Weighted random sampling

Idea: use reservoir sampling. Raise a reproducible pseudo-random number to the power 1 / weight (rand(0,1)^(1/weight)).
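
Why the 1 / weight power gives weight-proportional selection can be seen with a short calculation. As a sketch, assume each u_i is an independent uniform draw on (0, 1) (a hash only approximates this), write k_i = u_i^{1/w_i} for the key of item i, and let W be the sum of all weights. These symbols are introduced here for illustration.

```latex
\begin{align*}
P(k_i \le t) &= P\bigl(u_i \le t^{w_i}\bigr) = t^{w_i}, \qquad 0 \le t \le 1, \\
P\bigl(k_i = \max_j k_j\bigr)
  &= \int_0^1 w_i\, t^{w_i - 1} \prod_{j \ne i} t^{w_j} \, dt
   = w_i \int_0^1 t^{W - 1} \, dt
   = \frac{w_i}{W}, \qquad W = \sum_j w_j .
\end{align*}
```

So item i holds the largest key with probability w_i / W. Repeating the argument on the remaining items suggests why taking the top keys behaves like weighted sampling without replacement.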

Weighted sampling with a fixed sample size, without replacement

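SELECT
  id,
  date,
  weight,
  -- FARM_FINGERPRINT returns a signed INT64 in [-2^63, 2^63), so dividing by 2^64
  -- and adding 0.5 maps it to [0, 1] before taking the 1 / weight power.
  POW(FARM_FINGERPRINT(CONCAT(id, date)) / POW(2, 64) + 0.5,
    1 / weight) AS hash_value
FROM table.YYYYMMDD
WHERE weight > 0
ORDER BY hash_value DESC
LIMIT $sample_size;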
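
For very large weights, u^(1/weight) crowds toward 1 and nearby keys may become hard to tell apart in FLOAT64. Since the logarithm is monotonic, ordering by LN(u) / weight gives the same ranking, so a log-form key is one possible variant. The sketch below shows it on a tiny inline table. The `source` CTE, its rows, the `'|'` separator and the names `u` and `sampling_key` are assumptions for illustration, and `GREATEST(..., 1e-300)` is only there to keep LN away from an exact zero.

```sql
-- Tiny synthetic stand-in for table.YYYYMMDD (all rows are made up).
WITH source AS (
  SELECT 'a' AS id, '20171108' AS date, 1.0 AS weight UNION ALL
  SELECT 'b', '20171108', 2.0 UNION ALL
  SELECT 'c', '20171108', 0.0 UNION ALL  -- dropped by weight > 0
  SELECT 'd', '20171108', 7.0
),

uniform AS (
  SELECT
    id,
    date,
    weight,
    -- Signed INT64 hash shifted into [0, 1], then floored away from exact zero.
    GREATEST(FARM_FINGERPRINT(CONCAT(id, '|', date)) / POW(2, 64) + 0.5, 1e-300) AS u
  FROM source
  WHERE weight > 0
)

SELECT
  id,
  date,
  weight,
  LN(u) / weight AS sampling_key  -- same ordering as POW(u, 1 / weight)
FROM uniform
ORDER BY sampling_key DESC
LIMIT 2;
```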

Weighted sampling with a fixed sample size, with replacement

Idea: replicate each candidate observation sample-size times.

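WITH numbers AS (
  SELECT * FROM UNNEST(GENERATE_ARRAY(1, $sample_size)) num
)

SELECT
  id,
  num,
  date,
  weight,
  -- Shift the signed hash into [0, 1] first; the '|' separators keep distinct
  -- (id, num, date) triples from formatting to the same string.
  POW(FARM_FINGERPRINT(FORMAT('%s|%d|%s', id, num, date)) / POW(2, 64) + 0.5,
    1 / weight) AS hash_value
FROM table.YYYYMMDD table, numbers
WHERE weight > 0
-- Largest keys first, as in the query without replacement.
ORDER BY hash_value DESC
LIMIT $sample_size;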
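
The query above ranks all replicated rows together. A different formulation, sketched here as an assumption rather than as the author's method, treats each value of `num` as one independent draw and keeps the single largest key within that draw, so each draw picks an item with probability roughly proportional to its weight. The `source` CTE, its rows, the five draws and the names `keyed`, `ranked`, `sampling_key` and `rank_in_draw` are hypothetical.

```sql
-- Tiny synthetic stand-in for table.YYYYMMDD (all rows are made up).
WITH source AS (
  SELECT 'a' AS id, '20171108' AS date, 1.0 AS weight UNION ALL
  SELECT 'b', '20171108', 2.0 UNION ALL
  SELECT 'c', '20171108', 7.0
),

numbers AS (
  SELECT num FROM UNNEST(GENERATE_ARRAY(1, 5)) AS num  -- 5 draws
),

keyed AS (
  SELECT
    id,
    num,
    date,
    weight,
    -- One key per (item, draw): signed hash shifted into [0, 1],
    -- then raised to the power 1 / weight.
    POW(FARM_FINGERPRINT(FORMAT('%s|%d|%s', id, num, date)) / POW(2, 64) + 0.5,
      1 / weight) AS sampling_key
  FROM source
  CROSS JOIN numbers
  WHERE weight > 0
),

ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY num ORDER BY sampling_key DESC) AS rank_in_draw
  FROM keyed
)

-- Exactly one row per draw; the same id may appear in several draws.
SELECT
  id,
  num,
  date,
  weight
FROM ranked
WHERE rank_in_draw = 1
ORDER BY num;
```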

That's all.
