More than 1 year has passed since last update.

本記事では、KaggleのCriteo Display Advertising ChallengeにTreasure DataとHivemallを使ってbaselineの結果を提出までの手順を紹介します。

なお、本記事はTreasure Data Advent Calendar 2015の12/8の投稿になります。

Criteo Display Advertising Challengeについて

Criteo Display Advertising Challengeは広告のクリック率を推定するタスクです。訓練事例数は45 millionでサイズが11GBと機械学習のデータセットとしては大きなデータセットです。説明変数は、13の量的変数と26の質的変数からなります。

評価手法はLog lossなので、ロジスティック回帰による予測が有効です。

Beat the Benchmark

MLWaveの人がVowpal Wabbitのロジスティック回帰を使ったベースラインスコアの出し方を以下のサイトで紹介しています。
https://github.com/MLWave/kaggle-criteo

Public leader boardで0.48427、Private leader boardで0.48396程度がロジスティック回帰のベースラインのようです。今回はHivemallのロジスティック回帰 on TDを使って同等の精度を出す手順を紹介します。

トレジャーデータへのデータ投入

# データベースcriteoを作成
td db:create criteo

# 訓練データにIDを振ります。10000000から始まる数字がIDになります。
awk '{IFS="\t"; OFS="\t"} {print 10000000+NR-1,$0}' train.txt > train2.txt

# 訓練事例をtrain_rawテーブルに投入する
td import:auto --format tsv \
--error-records-handling abort \
--columns rowid,label,i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26 \
--column-types long,int,int,int,int,int,int,int,int,int,int,int,int,int,int,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string \
--time-value `date +%s` --auto-create criteo.train_raw train2.txt

# テスト事例にIDを振ります。60000000から始まる数字がIDになります。
awk '{IFS="\t"; OFS="\t"} {print 60000000+NR-1,$0}' test.txt > test2.txt

# テスト事例をtest_rawテーブルに投入する
td import:auto --format tsv \
--error-records-handling abort \
--columns rowid,i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26 \
--column-types long,int,int,int,int,int,int,int,int,int,int,int,int,int,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string,string \
--time-value `date +%s` --auto-create criteo.test_raw test2.txt

Feature Engineering

データの正規化のための統計値の算出

量的変数について、Min-Max正規化を使ったrescalingを行うために、i1-i13についてmin/max値を算出します1

今回は訓練データとテストデータ全体のmin/max値を使います2

td query -w --type presto -d criteo "
select
  min(min1) as min1, max(max1) as max1,
  min(min2) as min2, max(max2) as max2,
  min(min3) as min3, max(max3) as max3,
  min(min4) as min4, max(max4) as max4,
  min(min5) as min5, max(max5) as max5,
  min(min6) as min6, max(max6) as max6,
  min(min7) as min7, max(max7) as max7,
  min(min8) as min8, max(max8) as max8,
  min(min9) as min9, max(max9) as max9,
  min(min10) as min10, max(max10) as max10,
  min(min11) as min11, max(max11) as max11,
  min(min12) as min12, max(max12) as max12,
  min(min13) as min13, max(max13) as max13
from (
  select
    min(i1) as min1, max(i1) as max1,
    min(i2) as min2, max(i2) as max2,
    min(i3) as min3, max(i3) as max3,
    min(i4) as min4, max(i4) as max4,
    min(i5) as min5, max(i5) as max5,
    min(i6) as min6, max(i6) as max6,
    min(i7) as min7, max(i7) as max7,
    min(i8) as min8, max(i8) as max8,
    min(i9) as min9, max(i9) as max9,
    min(i10) as min10, max(i10) as max10,
    min(i11) as min11, max(i11) as max11,
    min(i12) as min12, max(i12) as max12,
    min(i13) as min13, max(i13) as max13
  from
    train_raw
  union all
  select
    min(i1) as min1, max(i1) as max1,
    min(i2) as min2, max(i2) as max2,
    min(i3) as min3, max(i3) as max3,
    min(i4) as min4, max(i4) as max4,
    min(i5) as min5, max(i5) as max5,
    min(i6) as min6, max(i6) as max6,
    min(i7) as min7, max(i7) as max7,
    min(i8) as min8, max(i8) as max8,
    min(i9) as min9, max(i9) as max9,
    min(i10) as min10, max(i10) as max10,
    min(i11) as min11, max(i11) as max11,
    min(i12) as min12, max(i12) as max12,
    min(i13) as min13, max(i13) as max13
  from
    test_raw
) t

+------+------+------+--------+------+-------+------+------+------+----------+------+--------+------+-------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| min1 | max1 | min2 | max2   | min3 | max3  | min4 | max4 | min5 | max5     | min6 | max6   | min7 | max7  | min8 | max8 | min9 | max9  | min10 | max10 | min11 | max11 | min12 | max12 | min13 | max13 |
+------+------+------+--------+------+-------+------+------+------+----------+------+--------+------+-------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| 0    | 5775 | -3   | 257675 | 0    | 65535 | 0    | 969  | 0    | 23159456 | 0    | 431037 | 0    | 56311 | 0    | 6184 | 0    | 29019 | 0     | 11    | 0     | 231   | 0     | 4008  | 0     | 7393  |
+------+------+------+--------+------+-------+------+------+------+----------+------+--------+------+-------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

Feature Scaling

まず、Hivemallの訓練用のテーブルtrainを作成します。

quantitative_features関数は量的変数を作るUDF、categorical_features関数は質的変数を作るUDFです。

ここでは、量的変数についてrescale関数でmin-max正規化を行います。array('bias')は線形回帰の切片を付与するものです。

td table:create criteo train

td query -w --type hive -d criteo "
INSERT OVERWRITE TABLE train
select
  rowid, 
  concat_array(
    array('bias'),
    quantitative_features(
      array('i1','i2','i3','i4','i5','i6','i7','i8','i9','i10','i11','i12','i13'),
      rescale(i1, 0, 5775),
      rescale(i2, -3, 257675),
      rescale(i3, 0, 65535),
      rescale(i4, 0, 969),
      rescale(i5, 0, 23159456),
      rescale(i6, 0, 431037),
      rescale(i7, 0, 56311),
      rescale(i8, 0, 6184),
      rescale(i9, 0, 29019),
      rescale(i10, 0, 11),
      rescale(i11, 0, 231),
      rescale(i12, 0, 4008),
      rescale(i13, 0, 7393)
    ),
    categorical_features(
      array('c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12','c13','c14','c15','c16','c17','c18','c19','c20','c21','c22','c23','c24','c25','c26'),
      c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26
    )
  ) as features,
  label
from 
  train_raw
"

上のクエリを実行するとtrainテーブルは次のような形となります3

td query -w --type presto -d criteo "
select rowid, features, label from train limit 5
"

+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
| rowid    | features                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | label |
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
| 38141107 | ["bias","i2:1.552325011289213E-5","i5:2.522943541407585E-4","i6:6.495961861219257E-5","i7:3.551703775883652E-5","i8:6.468305364251137E-4","i9:4.8244252684526145E-4","i11:0.008658008649945259","i13:4.057892656419426E-4","c1#241546e0","c2#ae46a29d","c3#c56edf2f","c4#f922efad","c5#4cf72387","c6#fe6b92e5","c7#6362d73a","c8#0b153874","c9#a73ee510","c10#4b28ae0c","c11#ba0f9e8a","c12#66a76a26","c13#4e4dd817","c14#1adce6ef","c15#8ff4b403","c16#01adbab4","c17#e5ba7672","c18#26b3c7a7","c21#21c9516a","c22#ad3062eb","c23#bcdee96c","c24#b34f3128"]                                                                                                                 | 1     |
| 38141108 | ["bias","i1:1.7316016601398587E-4","i2:3.609155537560582E-4","i4:0.0020639835856854916","i5:8.981212886283174E-6","i6:6.959959137020633E-6","i7:3.551703775883652E-5","i8:6.468305364251137E-4","i9:3.4460180904716253E-4","i10:0.09090909361839294","i11:0.008658008649945259","i13:2.7052618679590523E-4","c1#05db9164","c2#09e68b86","c3#1637edfd","c4#b0ed6de7","c5#25c83c98","c7#90cedade","c8#985e3fcb","c9#a73ee510","c10#2e834f9a","c11#c17b047a","c12#0e52e7ba","c13#2e7b87f1","c14#64c94865","c15#91126f30","c16#7804892a","c17#07c540c4","c18#5aed7436","c19#d9aa05dc","c20#5840adea","c21#92791411","c23#32c7478e","c24#1793a828","c25#e8b83407","c26#fa3124de"] | 1     |
| 38141109 | ["bias","i2:2.4837200180627406E-4","i3:3.051804378628731E-5","i4:0.02270381897687912","i5:1.611004990991205E-4","i6:1.9023889035452157E-4","i7:2.663777850102633E-4","i8:0.004204398486763239","i9:0.015920603647828102","i11:0.05627705529332161","i13:0.002975787967443466","c1#05db9164","c2#8084ee93","c3#02cf9876","c4#c18be181","c5#4cf72387","c6#13718bbd","c7#5e64ce5f","c8#0b153874","c9#a73ee510","c10#9de7c14e","c11#8b94178b","c12#8fe001f4","c13#025225f2","c14#b28479f6","c15#b2ff8c6b","c16#36103458","c17#e5ba7672","c18#52e44668","c21#e587c466","c23#32c7478e","c24#3b183c5c"]                                                                             | 0     |
| 38141110 | ["bias","i2:1.1642437129921746E-5","i3:7.019150070846081E-4","i4:0.0010319917928427458","i5:0.002615907695144415","i8:4.8512290231883526E-4","i9:3.446017944952473E-5","i13:4.057892656419426E-4","c1#68fd1e64","c2#287130e0","c3#257f29b7","c4#89526cd2","c5#0942e0a7","c6#fbad5c96","c7#9115f277","c8#f504a6f4","c9#7cc72ec2","c10#5612701e","c11#1d125cbb","c12#921c8174","c13#7f9db67d","c14#1adce6ef","c15#310d155b","c16#7cccf7ff","c17#e5ba7672","c18#891589e7","c19#9653bb65","c20#a458ea53","c21#af4ea70d","c22#ad3062eb","c23#3a171ecb","c24#17b5e269","c25#e8b83407","c26#98aebbef"]                                                                              | 0     |
| 38141111 | ["bias","i2:7.916857721284032E-4","i3:3.051804378628731E-5","i4:0.0010319917928427458","i5:2.1792393818031996E-4","i6:1.4383916277438402E-4","i7:1.775851887941826E-5","i8:3.2341526821255684E-4","i9:0.0011027257423847914","i11:0.0043290043249726295","i13:1.3526309339795262E-4","c1#68fd1e64","c2#2c8c5f5d","c3#13cd0697","c4#352cefe6","c5#25c83c98","c6#fe6b92e5","c7#aa0d873c","c8#c8ddd494","c9#a73ee510","c10#3b08e48b","c11#6a447eb3","c12#c3cdaf85","c13#9dfda2b9","c14#1adce6ef","c15#5edc1a28","c16#08514295","c17#d4bb7bd8","c18#f5f4ae5b","c21#6387fda4","c23#3a171ecb","c24#d36c7dbf"]                                                                      | 0     |
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+

同様にテスト用のテーブルを作成します。

td table:create criteo test

td query -w --type hive -d criteo "
INSERT OVERWRITE TABLE test
select
  rowid, 
  concat_array(
    array('bias'),
    quantitative_features(
      array('i1','i2','i3','i4','i5','i6','i7','i8','i9','i10','i11','i12','i13'),
      rescale(i1, 0, 5775),
      rescale(i2, -3, 257675),
      rescale(i3, 0, 65535),
      rescale(i4, 0, 969),
      rescale(i5, 0, 23159456),
      rescale(i6, 0, 431037),
      rescale(i7, 0, 56311),
      rescale(i8, 0, 6184),
      rescale(i9, 0, 29019),
      rescale(i10, 0, 11),
      rescale(i11, 0, 231),
      rescale(i12, 0, 4008),
      rescale(i13, 0, 7393)
    ),
    categorical_features(
      array('c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12','c13','c14','c15','c16','c17','c18','c19','c20','c21','c22','c23','c24','c25','c26'),
      c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26
    )
  ) as features
from
  test_raw
"

ロジスティック回帰による学習モデルの作成

logress_modelテーブルにロジスティック回帰を使った学習モデルを構築します。
WITH句のtrain_x3では、繰り返し学習を模擬するために、amplify関数で訓練事例を3倍に増幅した上でCLUSTER BY rand(1)でシャッフルしてロジスティック回帰の学習の入力としています。

td table:create criteo logress_model
td query -w -x --type hive -d criteo "
WITH train_x3 as (
  select
    features, label
  from (
    select 
      amplify(3, features, label) as (features, label)
    from 
      train
  ) t
  CLUSTER BY rand(1)
)
INSERT OVERWRITE TABLE logress_model 
select
  feature,
  avg(weight) as weight
from (
  select 
    logress(features, label) as (feature, weight)
  from
    train_x3
) t
group by feature;
"

ロジスティック回帰の予測モデルを利用してクリック確率を予測

td query -w --type hive -d criteo -f csv --column-header -o logress_predict.csv "
WITH test_exploded as (
select 
  rowid,
  extract_feature(fv) as feature,
  extract_weight(fv) as value
from
  test t1 LATERAL VIEW explode(features) t2 as fv
)
select
  t.rowid as Id,
  sigmoid(sum(m.weight * t.value)) as Predicted
from 
  test_exploded t
  LEFT OUTER JOIN logress_model m ON (t.feature = m.feature)
group by
  t.rowid
"

Kaggleへのsubmit

logress_predict.csvをkaggleにsubmitしますと、Beselineよりちょっと良いスコアが出ました。

Public Leader board Private Leader board
Baseline(VW) 0.48427 0.48396
Hivemall 0.48253 0.48229

さいごに

今回は基本的なロジスティック回帰でCTR predictionを行う例を出しました。

v0.4の新機能のFactorization Machinesの記事を書こうと思ってましたが、Factorization Machinesではなぜか精度がいまいちでしたので、ロジスティック回帰にfallbackしました...orz

コンペのforumやをみるとfeature pairingが有効なよう4なので、精度を上げるための次のステップとしては高次特徴量を作ってみると良いと思います。Hivemallでfeature pairingを行う例はここを参考にどうぞ。


  1. このデータセットだとzscoreを使った方が精度が良いかもしれませんが、今回は試してません。 

  2. i3の-3って異常値でしょうか(?)精度をあげるためには除外するなどの対応も要検討です。コンペのforumの議論を探ると良いと思います。 

  3. Hivemallの特徴量のフォーマットについては、LibSVMやVowpal Wabbitに準じておりますが、詳細はこのページを参照ください。 

  4. https://github.com/swapniel99/criteo 訓練事例数がそれなりにあるので、勾配の最適化のadaXXXはいうほど効かないと思ってますが試していません。Hivemallにもadagrad+logistic lossは使えます。