
Trying out Hivemall on TreasureData

TreasureData

Posted at 2014-06-23

Introduction

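Hivemall is a library that lets you run machine learning with Hive queries.
It is developed by Mr. Yui of AIST (the National Institute of Advanced Industrial Science and Technology).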

TreasureData's Release Note 20140617 quietly mentioned "Backend: Hivemall Library v0.2 Upgraded the Hivemall Library to version v0.2. Its capabilities are immediately available to all users querying using the Hive engine.",
so I decided to try it out.

In this article I follow the processing flow introduced in item 1 of the References as closely as possible, replacing each step with its TreasureData equivalent.

The dataset is the one for the CTR estimation task of KDD Cup 2012, Track 2.
It is reportedly real search data from soso.com, one of the three major search engines in China.
From the Download section of the site, download track2.zip, test.zip, and KDD_Track2_solution.csv.

References

Preparing the data

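I upload each data file with BulkImport.
For the table schemas, I referred to an existing schema definition for this dataset.
There are three things to note. First, rowid does not appear to be included in the data, so I omit it at upload time. Second, column names cannot contain uppercase letters. Third, spreading the time values across records (here with --time-value 0,24) matters for processing efficiency.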

td db:create t_hivemall

td table:create t_hivemall training

td import:upload training.txt --auto-create t_hivemall.training --format tsv --columns clicks,impression,displayurl,adid,advertiserid,depth,position,queryid,keywordid,titleid,descriptionid,userid --column-types int,int,string,int,int,int,int,int,int,int,int,int --time-value 0,24 -o pre/ --auto-perform --auto-commit --parallel 8 --prepare-parallel 8 --error-records-output err/

td table:create t_hivemall user

td import:upload userid_profile.txt --auto-create t_hivemall.user --format tsv --columns userid,gender,age --column-types int,int,int --time-value 0,24 -o pre/ --auto-perform --auto-commit --parallel 8 --prepare-parallel 8 --error-records-output err/

td table:create t_hivemall query

td import:upload queryid_tokensid.txt --auto-create t_hivemall.query --format tsv --columns queryid,tokens --column-types int,string --time-value 0,24 -o pre/ --auto-perform --auto-commit --parallel 8 --prepare-parallel 8 --error-records-output err/

td table:create t_hivemall keyword

td import:upload purchasedkeywordid_tokensid.txt --auto-create t_hivemall.keyword --format tsv --columns keywordid,tokens --column-types int,string --time-value 0,24 -o pre/ --auto-perform --auto-commit --parallel 8 --prepare-parallel 8 --error-records-output err/

td table:create t_hivemall title

td import:upload titleid_tokensid.txt --auto-create t_hivemall.title --format tsv --columns titleid,tokens --column-types int,string --time-value 0,24 -o pre/ --auto-perform --auto-commit --parallel 8 --prepare-parallel 8 --error-records-output err/

td table:create t_hivemall description

td import:upload descriptionid_tokensid.txt --auto-create t_hivemall.description --format tsv --columns descriptionid,tokens --column-types int,string --time-value 0,24 -o pre/ --auto-perform --auto-commit --parallel 8 --prepare-parallel 8 --error-records-output err/

solution.sh

td table:create t_hivemall solution

td import:upload KDD_Track2_solution.csv --auto-create t_hivemall.solution --format csv --columns clicks,impressions,private --column-types int,int,string --time-value 0,24 -o pre/ --auto-perform --auto-commit --parallel 8 --prepare-parallel 8 --error-records-output err/

td table:create t_hivemall test

td import:upload test.txt --auto-create t_hivemall.test --format tsv --columns displayurl,adid,advertiserid,depth,position,queryid,keywordid,titleid,descriptionid,userid --column-types string,int,int,int,int,int,int,int,int,int --time-value 0,24 -o pre/ --auto-perform --auto-commit --parallel 8 --prepare-parallel 8 --error-records-output err/
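
The four token tables (query, keyword, title, description) differ only in file name, table name, and ID column, so the repeated commands above could be generated by a small loop. The following is a minimal sketch and not part of the original procedure: the helper names `print_token_import` and `print_all_token_imports` are hypothetical, and the script only prints the commands (a dry run) so that they can be reviewed before being executed.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article): prints the
# table:create / import:upload pair for one "<id>,tokens" table.
# It only echoes the commands; nothing is sent to TreasureData.
print_token_import() {
  file=$1    # local TSV file, e.g. queryid_tokensid.txt
  table=$2   # target table inside the t_hivemall database
  idcol=$3   # name of the ID column (lowercase only)
  echo "td table:create t_hivemall $table"
  echo "td import:upload $file --auto-create t_hivemall.$table" \
       "--format tsv --columns $idcol,tokens --column-types int,string" \
       "--time-value 0,24 -o pre/ --auto-perform --auto-commit" \
       "--parallel 8 --prepare-parallel 8 --error-records-output err/"
}

# Print the commands for all four token tables used in this article.
print_all_token_imports() {
  print_token_import queryid_tokensid.txt            query       queryid
  print_token_import purchasedkeywordid_tokensid.txt keyword     keywordid
  print_token_import titleid_tokensid.txt            title       titleid
  print_token_import descriptionid_tokensid.txt      description descriptionid
}

print_all_token_imports
```

Once the printed commands look right, the output could be piped to `sh` to actually run them.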

Assigning row_id

For solution, training, and test, I use TD_X_RANK to assign a row_id.

solution.sql

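-- Assign a row_id to each row with TD_X_RANK.
-- Every row gets the same constant key '1', so all rows are ranked as one group.
SELECT
  TD_X_RANK(id) AS row_id,
  clicks, impressions, private, time
FROM (
  SELECT
    '1' AS id,
    clicks, impressions, private, time
  FROM
    solution
  ORDER BY
    id
) t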
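
Because every row shares the same constant key, the query above should, as far as I understand TD_X_RANK, number the rows 1, 2, 3, and so on. To see what that numbering looks like on a small sample, the same effect can be reproduced locally. This is only an illustrative sketch and not part of the original procedure: `add_row_id` is a hypothetical helper name, the three input rows are made up, and the numbering relies on awk's built-in `NR` record counter.

```shell
#!/bin/sh
# Hypothetical helper: prefix each input line with a 1-based row number,
# separated by a tab. Reads stdin and writes stdout.
add_row_id() {
  awk 'BEGIN { OFS = "\t" } { print NR, $0 }'
}

# Example on three made-up CSV rows (not real KDD Cup data).
printf '0,1,x\n2,5,y\n0,3,z\n' | add_row_id
```

Unlike the Hive query, this local numbering follows the order of the lines in the input.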

Hashing trick and bias term

hashing.sql

-- Collect the hashed columns and the bias term into one feature array.
select
  row_id, time,
  array(displayurl, adid, advertiserid, depth, position, queryid, keywordid, titleid, descriptionid, userid, gender, age, bias)
    as features
from (
  -- Hashing trick: prefix each value with its column number, then hash it with mhash.
  select
    row_id, time,
    mhash(concat("1:", displayurl)) as displayurl,
    mhash(concat("2:", adid)) as adid,
    mhash(concat("3:", advertiserid)) as advertiserid,
    mhash(concat("4:", depth)) as depth,
    mhash(concat("5:", position)) as position,
    mhash(concat("6:", queryid)) as queryid,
    mhash(concat("7:", keywordid)) as keywordid,
    mhash(concat("8:", titleid)) as titleid,
    mhash(concat("9:", descriptionid)) as descriptionid,
    mhash(concat("10:", userid)) as userid,
    -- Users without a profile get the default values "0" (gender) and "-1" (age).
    mhash(concat("11:", COALESCE(gender, "0"))) as gender,
    mhash(concat("12:", COALESCE(age, "-1"))) as age,
    -- Constant entry for the bias term.
    -1 as bias
  from (
    -- Attach gender and age from the user table; they stay NULL when no profile exists.
    select
      t.*,
      u.gender,
      u.age
    from
      testing_row t
      LEFT OUTER JOIN user u
        on t.userid = u.userid
  ) t1
) t2
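
The inner query applies the hashing trick: each value is prefixed with its column number, so that the same raw value in different columns becomes a different feature, and is then hashed to an integer feature index. The sketch below illustrates the idea outside Hive. It is not Hivemall's implementation: as far as I know `mhash` is based on MurmurHash3, whereas this stand-in uses the POSIX `cksum` CRC, so the resulting numbers will differ. The feature-space size of 2^24 and the helper name `hash_feature` are assumptions made for illustration.

```shell
#!/bin/sh
# Assumed size of the hashed feature space (2^24); adjust as needed.
NUM_FEATURES=16777216

# Hypothetical stand-in for mhash(concat("<field>:", value)).
# Uses the cksum CRC, NOT MurmurHash3, so values differ from Hivemall.
hash_feature() {
  field=$1   # column number used as a namespace prefix
  value=$2   # raw column value
  crc=$(printf '%s' "$field:$value" | cksum | cut -d ' ' -f 1)
  # Map into 1..NUM_FEATURES so that index 0 is never produced.
  echo $(( crc % NUM_FEATURES + 1 ))
}

# The same raw value is hashed under two different column prefixes,
# so the two calls hash different strings ("2:12345" and "3:12345").
hash_feature 2 12345
hash_feature 3 12345
```

The design point is that the feature space has a fixed size no matter how many distinct IDs appear in the data, at the cost of occasional collisions between unrelated values.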

* I plan to clean up this article and post it on another blog once the Hivemall processing works correctly.
