今週1週間に新規登録して昨日コンバージョン達成した人の、今週1週間のアクティビティを出したい。
select * from
activity a
LEFT SEMI JOIN
(select data_user_id from activity b
LEFT SEMI JOIN
(select data_user_id from activity
where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
and data_type = 'premium_registration_complete'
) c
ON b.data_user_id = c.data_user_id
where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
and data_type = 'signup'
) d
ON a.data_user_id = d.data_user_id
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
;
かなり時間かかる。
サブクエリだけで実行
select data_user_id from activity b
LEFT SEMI JOIN
(select data_user_id from activity
where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
and data_type = 'premium_registration_complete'
) c
ON b.data_user_id = c.data_user_id
where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
and data_type = 'signup'
;
started at 2014-03-18T17:50:50Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1622188538.txt
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
finished at 2014-03-18T17:52:43Z
処理は終わるけど警告は残ったまま。
MAPJOIN
上のケースは1日以内にプレミアム登録した人という小さなテーブルとJOINしているので、MAPJOINを使ってみる。
select /*+ MAPJOIN(c) */ b.data_user_id from activity b
JOIN
(select data_user_id from activity
where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
and data_type = 'premium_registration_complete'
) c
ON b.data_user_id = c.data_user_id
where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
and data_type = 'signup'
;
started at 2014-03-18T17:58:48Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__243448987.txt
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/4080/4080_20140318175858_9bf02d3e-4672-46bb-9bdf-e2f4a6edef14.log
2014-03-18 05:58:59 Starting to launch local task to process map join; maximum memory = 763363328
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
2014-03-18 05:59:34 Processing rows: 1556 Hashtable size: 1556 Memory usage: 30751296 rate: 0.04
2014-03-18 05:59:34 Dump the hashtable into file: file:/tmp/4080/hive_2014-03-18_17-58-56_440_7019263692592835974/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable
2014-03-18 05:59:35 Upload 1 File to: file:/tmp/4080/hive_2014-03-18_17-58-56_440_7019263692592835974/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable File size: 60967
2014-03-18 05:59:35 End of local task; Time Taken: 36.034 sec.
**
** Time indices:
** Time index: [2014-03-11 17:58:48 +0000, 2014-03-18 17:58:47 +0000)
**
finished at 2014-03-18T18:00:40Z
まだ警告出てるけど、クエリが終わらないということはなくなった。
更にサブクエリだけで実行
select data_user_id from activity
where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
and data_type = 'premium_registration_complete'
;
これだときちんとtimeインデックスを使ってくれてるみたい。
started at 2014-03-18T18:05:52Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1081435809.txt
**
** Time indices:
** Time index: [2014-03-17 18:05:50 +0000, 2014-03-18 18:05:49 +0000)
**
finished at 2014-03-18T18:06:33Z
JOINの中にSELECTを書くのをやめてみる
select b.data_user_id
from activity b
JOIN activity c
ON b.data_user_id = c.data_user_id
where b.data_type = 'signup'
and c.data_type = 'premium_registration_complete'
and TD_TIME_RANGE(b.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
and TD_TIME_RANGE(c.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
;
started at 2014-03-18T18:13:40Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1362281004.txt
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
finished at 2014-03-18T18:14:52Z
うーむ。JOINがだめなのかな。
これをMAPJOINにすると…
select /*+ MAPJOIN(c) */ b.data_user_id
from activity b
JOIN activity c
ON b.data_user_id = c.data_user_id
where b.data_type = 'signup'
and c.data_type = 'premium_registration_complete'
and TD_TIME_RANGE(b.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
and TD_TIME_RANGE(c.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
;
started at 2014-03-18T18:17:02Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1019712899.txt
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/4080/4080_20140318181717_7ccdbb60-39da-4c24-9a74-5cd8c8c852d4.log
2014-03-18 06:17:13 Starting to launch local task to process map join; maximum memory = 763363328
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
2014-03-18 06:17:31 Processing rows: 1560 Hashtable size: 1560 Memory usage: 17915440 rate: 0.023
2014-03-18 06:17:31 Dump the hashtable into file: file:/tmp/4080/hive_2014-03-18_18-17-11_312_384746705393357863/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable
2014-03-18 06:17:31 Upload 1 File to: file:/tmp/4080/hive_2014-03-18_18-17-11_312_384746705393357863/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable File size: 123562
2014-03-18 06:17:31 End of local task; Time Taken: 17.905 sec.
**
** Time indices:
** Time index: [2014-03-11 18:17:00 +0000, 2014-03-18 18:16:59 +0000)
**
finished at 2014-03-18T18:18:06Z
なんか変わった。bのほうだけインデックスが使われてるみたい。
TD_TIME_ADDをやめてみる
select data_user_id from activity b
LEFT SEMI JOIN
(select data_user_id from activity
where TD_TIME_RANGE(time, '2013-03-17 00:00:00', '2013-03-18 00:00:00', 'JST')
and data_type = 'premium_registration_complete'
) c
ON b.data_user_id = c.data_user_id
where TD_TIME_RANGE(time, '2013-03-11 00:00:00', '2013-03-18 00:00:00', 'JST')
and data_type = 'signup'
;
started at 2014-03-18T18:31:00Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__361703617.txt
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
finished at 2014-03-18T18:32:07Z
変わらない。