This article walks through the steps for submitting a baseline result to Kaggle's Criteo Display Advertising Challenge using Treasure Data and Hivemall.
This post is the December 8 entry of the Treasure Data Advent Calendar 2015.
About the Criteo Display Advertising Challenge
The Criteo Display Advertising Challenge is a task of estimating the click-through rate of display ads. With 45 million training examples and roughly 11 GB of data, it is fairly large as machine learning datasets go. The explanatory variables consist of 13 quantitative features and 26 categorical features.
The evaluation metric is log loss, so predicting probabilities with logistic regression is a natural fit.
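For reference, log loss over $N$ examples with true labels $y_i \in \{0,1\}$ and predicted click probabilities $p_i$ is

$$\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\right]$$

so a model that directly outputs calibrated probabilities, such as logistic regression, optimizes something very close to the competition metric.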
Beat the Benchmark
The author of MLWave describes how to get a baseline score with Vowpal Wabbit's logistic regression at the following site:
https://github.com/MLWave/kaggle-criteo
Logistic regression appears to give a baseline of about 0.48427 on the public leaderboard and 0.48396 on the private leaderboard. This article shows how to reach comparable accuracy with Hivemall's logistic regression on TD.
Loading the data into Treasure Data
# Create the criteo database
td db:create criteo
# Assign row IDs to the training data; IDs start at 10000000
awk 'BEGIN{FS=OFS="\t"} {print 10000000+NR-1, $0}' train.txt > train2.txt
# Load the training examples into the train_raw table
td import:auto --format tsv \
--error-records-handling abort \
--columns rowid,label,i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26 \
--column-types long,int,int,int,int,int,int,int,int,int,int,int,int,int,int,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string \
--time-value `date +%s` --auto-create criteo.train_raw train2.txt
# Assign row IDs to the test data; IDs start at 60000000
awk 'BEGIN{FS=OFS="\t"} {print 60000000+NR-1, $0}' test.txt > test2.txt
# Load the test examples into the test_raw table
td import:auto --format tsv \
--error-records-handling abort \
--columns rowid,i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26 \
--column-types long,int,int,int,int,int,int,int,int,int,int,int,int,int,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string \
--time-value `date +%s` --auto-create criteo.test_raw test2.txt
Feature Engineering
Computing statistics for data normalization
For the quantitative variables, we compute min/max values for each of i1-i13 so that they can be rescaled with min-max normalization1.
Here we use min/max values computed over the training data and the test data combined2.
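The `rescale` function used in the next section performs this min-max normalization; assuming the usual definition (which is consistent with the sample feature values shown later, e.g. i10:0.0909... = 1/11):

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$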
td query -w --type presto -d criteo "
select
min(min1) as min1, max(max1) as max1,
min(min2) as min2, max(max2) as max2,
min(min3) as min3, max(max3) as max3,
min(min4) as min4, max(max4) as max4,
min(min5) as min5, max(max5) as max5,
min(min6) as min6, max(max6) as max6,
min(min7) as min7, max(max7) as max7,
min(min8) as min8, max(max8) as max8,
min(min9) as min9, max(max9) as max9,
min(min10) as min10, max(max10) as max10,
min(min11) as min11, max(max11) as max11,
min(min12) as min12, max(max12) as max12,
min(min13) as min13, max(max13) as max13
from (
select
min(i1) as min1, max(i1) as max1,
min(i2) as min2, max(i2) as max2,
min(i3) as min3, max(i3) as max3,
min(i4) as min4, max(i4) as max4,
min(i5) as min5, max(i5) as max5,
min(i6) as min6, max(i6) as max6,
min(i7) as min7, max(i7) as max7,
min(i8) as min8, max(i8) as max8,
min(i9) as min9, max(i9) as max9,
min(i10) as min10, max(i10) as max10,
min(i11) as min11, max(i11) as max11,
min(i12) as min12, max(i12) as max12,
min(i13) as min13, max(i13) as max13
from
train_raw
union all
select
min(i1) as min1, max(i1) as max1,
min(i2) as min2, max(i2) as max2,
min(i3) as min3, max(i3) as max3,
min(i4) as min4, max(i4) as max4,
min(i5) as min5, max(i5) as max5,
min(i6) as min6, max(i6) as max6,
min(i7) as min7, max(i7) as max7,
min(i8) as min8, max(i8) as max8,
min(i9) as min9, max(i9) as max9,
min(i10) as min10, max(i10) as max10,
min(i11) as min11, max(i11) as max11,
min(i12) as min12, max(i12) as max12,
min(i13) as min13, max(i13) as max13
from
test_raw
) t
+------+------+------+--------+------+-------+------+------+------+----------+------+--------+------+-------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| min1 | max1 | min2 | max2 | min3 | max3 | min4 | max4 | min5 | max5 | min6 | max6 | min7 | max7 | min8 | max8 | min9 | max9 | min10 | max10 | min11 | max11 | min12 | max12 | min13 | max13 |
+------+------+------+--------+------+-------+------+------+------+----------+------+--------+------+-------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| 0 | 5775 | -3 | 257675 | 0 | 65535 | 0 | 969 | 0 | 23159456 | 0 | 431037 | 0 | 56311 | 0 | 6184 | 0 | 29019 | 0 | 11 | 0 | 231 | 0 | 4008 | 0 | 7393 |
+------+------+------+--------+------+-------+------+------+------+----------+------+--------+------+-------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
Feature Scaling
First, we create the table `train` that Hivemall will train on.
The `quantitative_features` function is a UDF that builds quantitative features, and the `categorical_features` function is a UDF that builds categorical features.
Here, the quantitative variables are min-max normalized with the `rescale` function. `array('bias')` adds a bias (intercept) term to the linear model.
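As an optional sanity check of the feature format (a minimal sketch; the choice of columns is arbitrary), the two UDFs emit array elements of the form name:value for quantitative features and name#value for categorical features, and columns whose value is NULL appear to be dropped, as the sample rows further below show:

# Illustrative only: inspect the generated feature format on a single row
td query -w --type hive -d criteo "
select
  quantitative_features(array('i1','i2'), rescale(i1, 0, 5775), rescale(i2, -3, 257675)),
  categorical_features(array('c1','c2'), c1, c2)
from train_raw limit 1
"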
td table:create criteo train
td query -w --type hive -d criteo "
INSERT OVERWRITE TABLE train
select
rowid,
concat_array(
array('bias'),
quantitative_features(
array('i1','i2','i3','i4','i5','i6','i7','i8','i9','i10','i11','i12','i13'),
rescale(i1, 0, 5775),
rescale(i2, -3, 257675),
rescale(i3, 0, 65535),
rescale(i4, 0, 969),
rescale(i5, 0, 23159456),
rescale(i6, 0, 431037),
rescale(i7, 0, 56311),
rescale(i8, 0, 6184),
rescale(i9, 0, 29019),
rescale(i10, 0, 11),
rescale(i11, 0, 231),
rescale(i12, 0, 4008),
rescale(i13, 0, 7393)
),
categorical_features(
array('c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12','c13','c14','c15','c16','c17','c18','c19','c20','c21','c22','c23','c24','c25','c26'),
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26
)
) as features,
label
from
train_raw
"
After running the query above, the `train` table looks like the following3.
td query -w --type presto -d criteo "
select rowid, features, label from train limit 5
"
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
| rowid | features | label |
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
| 38141107 | ["bias","i2:1.552325011289213E-5","i5:2.522943541407585E-4","i6:6.495961861219257E-5","i7:3.551703775883652E-5","i8:6.468305364251137E-4","i9:4.8244252684526145E-4","i11:0.008658008649945259","i13:4.057892656419426E-4","c1#241546e0","c2#ae46a29d","c3#c56edf2f","c4#f922efad","c5#4cf72387","c6#fe6b92e5","c7#6362d73a","c8#0b153874","c9#a73ee510","c10#4b28ae0c","c11#ba0f9e8a","c12#66a76a26","c13#4e4dd817","c14#1adce6ef","c15#8ff4b403","c16#01adbab4","c17#e5ba7672","c18#26b3c7a7","c21#21c9516a","c22#ad3062eb","c23#bcdee96c","c24#b34f3128"] | 1 |
| 38141108 | ["bias","i1:1.7316016601398587E-4","i2:3.609155537560582E-4","i4:0.0020639835856854916","i5:8.981212886283174E-6","i6:6.959959137020633E-6","i7:3.551703775883652E-5","i8:6.468305364251137E-4","i9:3.4460180904716253E-4","i10:0.09090909361839294","i11:0.008658008649945259","i13:2.7052618679590523E-4","c1#05db9164","c2#09e68b86","c3#1637edfd","c4#b0ed6de7","c5#25c83c98","c7#90cedade","c8#985e3fcb","c9#a73ee510","c10#2e834f9a","c11#c17b047a","c12#0e52e7ba","c13#2e7b87f1","c14#64c94865","c15#91126f30","c16#7804892a","c17#07c540c4","c18#5aed7436","c19#d9aa05dc","c20#5840adea","c21#92791411","c23#32c7478e","c24#1793a828","c25#e8b83407","c26#fa3124de"] | 1 |
| 38141109 | ["bias","i2:2.4837200180627406E-4","i3:3.051804378628731E-5","i4:0.02270381897687912","i5:1.611004990991205E-4","i6:1.9023889035452157E-4","i7:2.663777850102633E-4","i8:0.004204398486763239","i9:0.015920603647828102","i11:0.05627705529332161","i13:0.002975787967443466","c1#05db9164","c2#8084ee93","c3#02cf9876","c4#c18be181","c5#4cf72387","c6#13718bbd","c7#5e64ce5f","c8#0b153874","c9#a73ee510","c10#9de7c14e","c11#8b94178b","c12#8fe001f4","c13#025225f2","c14#b28479f6","c15#b2ff8c6b","c16#36103458","c17#e5ba7672","c18#52e44668","c21#e587c466","c23#32c7478e","c24#3b183c5c"] | 0 |
| 38141110 | ["bias","i2:1.1642437129921746E-5","i3:7.019150070846081E-4","i4:0.0010319917928427458","i5:0.002615907695144415","i8:4.8512290231883526E-4","i9:3.446017944952473E-5","i13:4.057892656419426E-4","c1#68fd1e64","c2#287130e0","c3#257f29b7","c4#89526cd2","c5#0942e0a7","c6#fbad5c96","c7#9115f277","c8#f504a6f4","c9#7cc72ec2","c10#5612701e","c11#1d125cbb","c12#921c8174","c13#7f9db67d","c14#1adce6ef","c15#310d155b","c16#7cccf7ff","c17#e5ba7672","c18#891589e7","c19#9653bb65","c20#a458ea53","c21#af4ea70d","c22#ad3062eb","c23#3a171ecb","c24#17b5e269","c25#e8b83407","c26#98aebbef"] | 0 |
| 38141111 | ["bias","i2:7.916857721284032E-4","i3:3.051804378628731E-5","i4:0.0010319917928427458","i5:2.1792393818031996E-4","i6:1.4383916277438402E-4","i7:1.775851887941826E-5","i8:3.2341526821255684E-4","i9:0.0011027257423847914","i11:0.0043290043249726295","i13:1.3526309339795262E-4","c1#68fd1e64","c2#2c8c5f5d","c3#13cd0697","c4#352cefe6","c5#25c83c98","c6#fe6b92e5","c7#aa0d873c","c8#c8ddd494","c9#a73ee510","c10#3b08e48b","c11#6a447eb3","c12#c3cdaf85","c13#9dfda2b9","c14#1adce6ef","c15#5edc1a28","c16#08514295","c17#d4bb7bd8","c18#f5f4ae5b","c21#6387fda4","c23#3a171ecb","c24#d36c7dbf"] | 0 |
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
Create the test table in the same way.
td table:create criteo test
td query -w --type hive -d criteo "
INSERT OVERWRITE TABLE test
select
rowid,
concat_array(
array('bias'),
quantitative_features(
array('i1','i2','i3','i4','i5','i6','i7','i8','i9','i10','i11','i12','i13'),
rescale(i1, 0, 5775),
rescale(i2, -3, 257675),
rescale(i3, 0, 65535),
rescale(i4, 0, 969),
rescale(i5, 0, 23159456),
rescale(i6, 0, 431037),
rescale(i7, 0, 56311),
rescale(i8, 0, 6184),
rescale(i9, 0, 29019),
rescale(i10, 0, 11),
rescale(i11, 0, 231),
rescale(i12, 0, 4008),
rescale(i13, 0, 7393)
),
categorical_features(
array('c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12','c13','c14','c15','c16','c17','c18','c19','c20','c21','c22','c23','c24','c25','c26'),
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26
)
) as features
from
test_raw
"
Training a model with logistic regression
We build a model trained with logistic regression and store it in the `logress_model` table.
In the WITH clause `train_x3`, the `amplify` function replicates the training examples 3 times and `CLUSTER BY rand(1)` shuffles them, simulating iterative training before the examples are fed to the logistic regression learner.
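Each parallel map task trains its own logistic regression model on its share of the shuffled, amplified examples; the outer `group by feature` with `avg(weight)` in the query below then mixes these models by simple weight averaging, roughly

$$w_j = \frac{1}{M}\sum_{m=1}^{M} w_j^{(m)}$$

where $M$ is the number of parallel learners and $w_j^{(m)}$ is the weight learner $m$ assigned to feature $j$.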
td table:create criteo logress_model
td query -w -x --type hive -d criteo "
WITH train_x3 as (
select
features, label
from (
select
amplify(3, features, label) as (features, label)
from
train
) t
CLUSTER BY rand(1)
)
INSERT OVERWRITE TABLE logress_model
select
feature,
avg(weight) as weight
from (
select
logress(features, label) as (feature, weight)
from
train_x3
) t
group by feature;
"
Predicting click probabilities with the logistic regression model
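The query below explodes each test row's feature vector, joins every feature against its learned weight, and applies the sigmoid to the resulting linear predictor, i.e. it computes

$$p(\mathrm{click} \mid \mathbf{x}) = \sigma\Bigl(\sum_j w_j x_j\Bigr) = \frac{1}{1 + \exp\bigl(-\sum_j w_j x_j\bigr)}$$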
td query -w --type hive -d criteo -f csv --column-header -o logress_predict.csv "
WITH test_exploded as (
select
rowid,
extract_feature(fv) as feature,
extract_weight(fv) as value
from
test t1 LATERAL VIEW explode(features) t2 as fv
)
select
t.rowid as Id,
sigmoid(sum(m.weight * t.value)) as Predicted
from
test_exploded t
LEFT OUTER JOIN logress_model m ON (t.feature = m.feature)
group by
t.rowid
"
Submitting to Kaggle
Submitting `logress_predict.csv` to Kaggle gives a slightly better score than the baseline.
| | Public leaderboard | Private leaderboard |
|---|---|---|
| Baseline (VW) | 0.48427 | 0.48396 |
| Hivemall | 0.48253 | 0.48229 |
Closing remarks
This article showed a basic example of CTR prediction with plain logistic regression.
I had intended to write about Factorization Machines, a new feature in v0.4, but for some reason the accuracy with Factorization Machines was underwhelming, so I fell back to logistic regression... orz
Looking at the competition forum and other write-ups, feature pairing seems to be effective4, so a good next step for improving accuracy is to build such higher-order features. See here for an example of doing feature pairing with Hivemall; a rough sketch of the idea follows below.
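As an illustration only (this is not the linked example, and the choice of c1 and c2 is arbitrary), one simple way to form a pairwise feature in plain HiveQL is to concatenate two categorical columns and pass the result to categorical_features alongside the originals:

# Hypothetical sketch: add a c1 x c2 combination feature to the categorical features
td query -w --type hive -d criteo "
select
  rowid,
  categorical_features(
    array('c1','c2','c1_x_c2'),
    c1, c2, concat(c1, '_', c2)
  ) as paired_features
from
  train_raw
limit 5
"

In the real pipeline you would add such combinations inside the feature-generation queries for the train and test tables above.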
- Is the -3 in i2 an outlier? To improve accuracy, it is worth considering excluding such values; digging through the discussions on the competition forum should help. ↩
- Hivemall's feature format follows LibSVM and Vowpal Wabbit; see this page for the details. ↩
- https://github.com/swapniel99/criteo Since there is a fair number of training examples, I suspect the adaXXX family of gradient optimizers would not help as much as advertised, though I have not tried them. Hivemall also supports AdaGrad with logistic loss. ↩