Notes on queries for computing statistics in BigQuery



What is this?

By making good use of window functions, you can compute statistics in BigQuery concisely without writing UDFs. I keep forgetting how to write these queries, though, so this is a memo.


Basics

-- Mean
AVG(expr)

-- Median (approximate, within 0.1% error)
NTH(501, QUANTILES(expr, 1001))

-- Mode
TOP(expr, 1), COUNT(*) as count

-- Variance
VARIANCE(expr)
-- Population variance (denominator n)
VARIANCE_POP(expr)

-- Standard deviation
STDDEV(expr)
-- Population standard deviation (denominator n)
STDDEV_POP(expr)

-- Coefficient of variation
STDDEV(expr)/AVG(expr)

-- Pearson correlation coefficient
CORR(expr1, expr2)
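
For reference, a sketch that applies these to the NOAA GSOD table used later in this memo (assuming the gsod1929 schema also has a dewp column for the correlation example):

-- Sketch: basic statistics per station.
-- Note: GSOD uses sentinel values (e.g. 9999.9) for missing readings; filtering them is left out here.
select
  stn,
  avg(temp) as mean_temp,
  nth(501, quantiles(temp, 1001)) as median_temp,
  stddev(temp) as std_temp,
  stddev(temp)/avg(temp) as cv_temp,
  corr(temp, dewp) as corr_temp_dewp
from [bigquery-public-data:noaa_gsod.gsod1929]
group by stn

TOP() is an alternative to GROUP BY in legacy SQL and cannot be mixed with other aggregates, so the mode needs its own query, e.g. against the USA Names table used below (most frequent name value by row count):

select top(name, 1) as mode_name, count(*) as count
from [bigquery-public-data:usa_names.usa_1910_2013]
where gender = 'F' and state = 'CA'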


Information content

From the USA Names public dataset: the information content (self-information) of each name, by gender.

select
  gender,
  name,
  number,
  -- Self-information in bits: -log2 of the name's relative frequency within its gender
  -log2(p_name) as bits
from (
  select
    gender,
    name,
    number,
    ratio_to_report(number) over (partition by gender) as p_name
  from [bigquery-public-data:usa_names.usa_1910_2013]
  where state = 'CA'
)


Entropy

From the USA Names public dataset: the entropy of the name distribution for each state and gender.

select
  state,
  gender,
  -- Shannon entropy of the name distribution within each (state, gender) group
  sum(p_name * -log2(p_name)) as entropy
from (
  select
    state,
    gender,
    name,
    number,
    ratio_to_report(number) over (partition by state, gender) as p_name
  from [bigquery-public-data:usa_names.usa_1910_2013]
)
group by state, gender
order by entropy


KL-Divergence

D_{KL}(P \| Q) = \sum_i P(i) \log_2 \frac{P(i)}{Q(i)}

Using the Kaggle Click-Through Rate Prediction data as an example: for each ad placement (site_id), the distance between the hourly distribution of impressions and the hourly distribution of clicks.

select
  site_id,
  -- KL divergence; a small constant (0.0001) is added to avoid division by zero
  sum(p_imp * log2(p_imp/(p_click + 0.0001))) as KLD_imp_and_click
from (
  select
    site_id,
    hour,
    imp,
    click,
    ratio_to_report(imp) over (partition by site_id) as p_imp,
    ratio_to_report(click) over (partition by site_id) as p_click
  from (
    select
      site_id,
      hour,
      count(*) as imp,
      sum(click) as click
    from [hagino3000-bq-sandbox:ctr_competition.train]
    group by site_id, hour
  )
)
group by site_id
order by KLD_imp_and_click


Moving average and differenced series for time-series data

Again with the Kaggle Click-Through Rate Prediction data: a 3-hour moving average per ad placement (site_id). If you want to fill in missing time buckets, build a separate time axis and join against it (a sketch follows the query below).

select
  site_id,
  hour,
  -- 3-hour moving average: the current row plus the 2 preceding hours
  avg(imp) over (partition by site_id
                 order by hour
                 rows between 2 preceding and current row) as rolling_mean
from (
  select
    site_id,
    hour,
    count(*) as imp
  from [hagino3000-bq-sandbox:ctr_competition.train]
  group by site_id, hour
)
order by site_id, hour
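
A sketch of the missing-bucket fill mentioned above: build a (site_id, hour) axis from the distinct values with a cross join, then left join the hourly counts onto it and treat missing hours as zero impressions. This assumes every hour of interest appears for at least one site_id.

select
  axis.site_id as site_id,
  axis.hour as hour,
  -- Hours with no rows get imp = 0 instead of being dropped
  ifnull(agg.imp, 0) as imp
from (
  -- All (site_id, hour) combinations observed anywhere in the data
  select s.site_id as site_id, h.hour as hour
  from (select site_id from [hagino3000-bq-sandbox:ctr_competition.train] group by site_id) as s
  cross join (select hour from [hagino3000-bq-sandbox:ctr_competition.train] group by hour) as h
) as axis
left outer join (
  select site_id, hour, count(*) as imp
  from [hagino3000-bq-sandbox:ctr_competition.train]
  group by site_id, hour
) as agg
on axis.site_id = agg.site_id and axis.hour = agg.hour
order by site_id, hour

With the axis in place, the moving-average window above can be applied to this result, so missing hours contribute zeros instead of silently shrinking the window.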


Differenced series

select
  site_id,
  hour,
  imp,
  -- First-order difference; the first row of each site_id falls back to 0
  imp - ifnull(previous_imp, imp) as residual
from (
  select
    site_id,
    hour,
    imp,
    lag(imp, 1) over (partition by site_id order by hour) as previous_imp
  from (
    select
      site_id,
      hour,
      count(*) as imp
    from [hagino3000-bq-sandbox:ctr_competition.train]
    group by site_id, hour
  )
)
order by site_id, hour


Normalized median absolute deviation (MADN)

A robust estimate of the standard deviation.

MADN = \frac{\mathrm{median}(|x_i - \mathrm{median}(x)|)}{0.6745}

select
  a.stn as stn,
  a.year as year,
  a.mo as month,
  a.da as day,
  a.temp as temp,
  -- Mean and standard deviation, for comparison with the robust estimates
  avg(a.temp) over (partition by a.stn) as temp_mu,
  stddev(a.temp) over (partition by a.stn) as temp_std,
  b.median as temp_median,
  b.madn as temp_madn
from [bigquery-public-data:noaa_gsod.gsod1929] as a
inner join (
  select
    org.stn as stn,
    agg.median as median,
    -- Median absolute deviation around the median, scaled by 0.6745
    nth(501, quantiles(abs(org.temp - agg.median), 1001))/0.6745 as madn
  from [bigquery-public-data:noaa_gsod.gsod1929] as org
  inner join (
    select
      stn, nth(501, quantiles(temp, 1001)) as median
    from [bigquery-public-data:noaa_gsod.gsod1929]
    group by stn
  ) as agg on org.stn = agg.stn
  group by stn, median
) as b on a.stn = b.stn


Z-score

From the NOAA GSOD public dataset (NOAA Weather Data): the Z-score of temperature for each station.

select
  stn,
  year,
  mo,
  da,
  temp,
  -- Standardize each observation against its station's mean and standard deviation
  (temp - mu)/std as Z
from (
  select
    stn,
    year,
    mo,
    da,
    temp,
    avg(temp) over (partition by stn) as mu,
    stddev(temp) over (partition by stn) as std
  from [bigquery-public-data:noaa_gsod.gsod1929]
)
order by stn, year, mo, da
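
Putting the two previous sections together: the median and MADN can stand in for mu and std to get a Z-score that is less sensitive to outliers. A sketch over the same table (assuming madn is non-zero for every station):

select
  a.stn as stn,
  a.year as year,
  a.mo as mo,
  a.da as da,
  a.temp as temp,
  -- Robust Z-score: center on the median, scale by the MADN
  (a.temp - b.median) / b.madn as robust_z
from [bigquery-public-data:noaa_gsod.gsod1929] as a
inner join (
  select
    org.stn as stn,
    agg.median as median,
    nth(501, quantiles(abs(org.temp - agg.median), 1001)) / 0.6745 as madn
  from [bigquery-public-data:noaa_gsod.gsod1929] as org
  inner join (
    select stn, nth(501, quantiles(temp, 1001)) as median
    from [bigquery-public-data:noaa_gsod.gsod1929]
    group by stn
  ) as agg on org.stn = agg.stn
  group by stn, median
) as b on a.stn = b.stn
order by stn, year, mo, da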