これは何か
window関数を駆使するとUDFを使わなくとも簡潔に統計量が出せる。がよく書き方を忘れるのでメモ
基本
-- 平均
AVG(expr)
-- 中央値0.1%誤差
NTH(501, QUANTILES(expr, 1001))
-- 最頻値
TOP(expr, 1), COUNT(*) as count
-- 分散
VARIANCE(expr)
-- 母集団の分散 (分母が n-1 の方)
VARIANCE_POP(expr)
-- 標準偏差
STDDEV(expr)
-- 母集団の標準偏差 (分母が n-1 の方)
STDDEV_POP(expr)
-- 変動係数
STDDEV(expr)/AVG(expr)
-- ピアソン相関係数
CORR(expr1, expr2)
情報量
Public DatasetsのUSA Namesから、ジェンダー別に名前毎の情報量
select
gender,
name,
number,
-log2(p_name) as bits
from (
select
gender,
name,
number,
ratio_to_report(number) over (partition by gender) as p_name,
from [bigquery-public-data:usa_names.usa_1910_2013]
where state = 'CA'
)
エントロピー
Public DatasetsのUSA Namesから、ジェンダー毎に名前のエントロピー
select
state,
gender,
sum(p_name * -log2(p_name)) as entropy
from (
select
state,
gender,
name,
number,
ratio_to_report(number) over (partition by gender) as p_name,
from [bigquery-public-data:usa_names.usa_1910_2013]
)
group by state, gender
order by entropy
KL-Divergence
KLDivergence = \sum_iP(i)log_2\frac{P(i)}{Q(i)}
KaggleのClick-Through Rate Predictionのデータを例に、広告枠毎にhourlyのインプレッション数とクリック数の確率分布の距離を取る
select
site_id,
sum(p_imp * log2(p_imp/(p_click + 0.0001))) as KLD_imp_and_click
from (
select
site_id,
hour,
imp,
click,
ratio_to_report(imp) over (partition by site_id) as p_imp,
ratio_to_report(click) over (partition by site_id) as p_click
from (
select
site_id,
hour,
count(*) as imp,
sum(click) as click,
from [hagino3000-bq-sandbox:ctr_competition.train]
group by site_id, hour
order by site_id, hour
)
)
group by site_id
order by KLD_imp_and_click
時系列データの移動平均と差分系列
同じくKaggleのClick-Through Rate Predictionで広告枠ごとの3時間移動平均。欠損値を補完したい場合は軸を別途作ってJoinする事。
select
site_id,
hour,
avg(imp) over (partition by site_id
order by hour
ROWS BETWEEN 3 preceding AND CURRENT ROW) as rolling_mean
from (
select
site_id,
hour,
count(*) as imp
from [hagino3000-bq-sandbox:ctr_competition.train]
group by site_id, hour
)
order by site_id, hour
差分系列
select
site_id,
hour,
imp,
imp - ifnull(previous_imp, imp) as residual
from (
select
site_id,
hour,
imp,
lag(imp, 1) over (partition by site_id order by hour) as previous_imp
from (
select
site_id,
hour,
count(*) as imp
from [hagino3000-bq-sandbox:ctr_competition.train]
group by site_id, hour
)
)
order by site_id, hour
正規化した中央絶対偏差
ロバストな標準偏差の推定値
MADN = \frac{median(|x_i - mean(x)|)}{0.674}
select
a.stn as stn,
a.year as year,
a.mo as month,
a.da as day,
a.temp as temp,
avg(a.temp) over (partition by a.stn) as temp_mu,
stddev(a.temp) over (partition by a.stn) as temp_std,
b.median as temp_median,
b.madn as temp_madn
from [bigquery-public-data:noaa_gsod.gsod1929] as a
inner join
(
select
org.stn as stn,
agg.median as median,
nth(501, quantiles(abs(org.temp - agg.median), 1001))/0.675 as madn,
from [bigquery-public-data:noaa_gsod.gsod1929] as org
inner join
(
select
stn, nth(501, quantiles(temp, 1001)) as median
from [bigquery-public-data:noaa_gsod.gsod1929]
group by stn
) as agg on org.stn = agg.stn
group by stn, median
) as b on a.stn = b.stn
Z値
BigQuery public datasetのNOAA Weather Dataから、ステーション毎に気温のZ値
select
stn,
year,
mo,
da,
temp,
(temp - mu)/std AS Z
from (
select
stn,
year,
mo,
da,
temp,
avg(temp) over (partition by stn) as mu,
stddev(temp) over (partition by stn) as std
from [bigquery-public-data:noaa_gsod.gsod1929]
)
order by stn, year, mo, da