More than 5 years have passed since last update.

WARNING: time index filtering is not set!

hive

Last updated at 2014-03-18Posted at 2014-03-18

今週1週間に新規登録して昨日コンバージョン達成した人の、今週1週間のアクティビティを出したい。

select * from
activity a
LEFT SEMI JOIN
  (select data_user_id from activity b 
  LEFT SEMI JOIN
    (select data_user_id from activity
      where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
      and data_type = 'premium_registration_complete'
    ) c
  ON b.data_user_id = c.data_user_id
  where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
  and data_type = 'signup'
  ) d
ON a.data_user_id = d.data_user_id
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
;

かなり時間かかる。

サブクエリだけで実行

select data_user_id from activity b 
  LEFT SEMI JOIN
    (select data_user_id from activity
      where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
      and data_type = 'premium_registration_complete'
    ) c
  ON b.data_user_id = c.data_user_id
  where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
  and data_type = 'signup'
;

started at 2014-03-18T17:50:50Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1622188538.txt
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
finished at 2014-03-18T17:52:43Z

処理は終わるけど警告は残ったまま。

MAPJOIN

上のケースは1日以内にプレミアム登録した人という小さなテーブルとJOINしているので、MAPJOINを使ってみる。

select /*+ MAPJOIN(c) */ b.data_user_id from activity b 
  JOIN
    (select data_user_id from activity
      where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
      and data_type = 'premium_registration_complete'
    ) c
  ON b.data_user_id = c.data_user_id
  where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
  and data_type = 'signup'
;

started at 2014-03-18T17:58:48Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__243448987.txt
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/4080/4080_20140318175858_9bf02d3e-4672-46bb-9bdf-e2f4a6edef14.log
2014-03-18 05:58:59	Starting to launch local task to process map join;	maximum memory = 763363328
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
2014-03-18 05:59:34	Processing rows:	1556	Hashtable size:	1556	Memory usage:	30751296	rate:	0.04
2014-03-18 05:59:34	Dump the hashtable into file: file:/tmp/4080/hive_2014-03-18_17-58-56_440_7019263692592835974/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable
2014-03-18 05:59:35	Upload 1 File to: file:/tmp/4080/hive_2014-03-18_17-58-56_440_7019263692592835974/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable File size: 60967
2014-03-18 05:59:35	End of local task; Time Taken: 36.034 sec.
**
** Time indices:
**    Time index: [2014-03-11 17:58:48 +0000, 2014-03-18 17:58:47 +0000)
**
finished at 2014-03-18T18:00:40Z

まだ警告出てるけど、クエリが終わらないということはなくなった。

更にサブクエリだけで実行

select data_user_id from activity
  where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
  and data_type = 'premium_registration_complete'
;

これだときちんとtimeインデックスを使ってくれてるみたい。

started at 2014-03-18T18:05:52Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1081435809.txt
**
** Time indices:
**    Time index: [2014-03-17 18:05:50 +0000, 2014-03-18 18:05:49 +0000)
**
finished at 2014-03-18T18:06:33Z

JOINの中にSELECTを書くのをやめてみる

select b.data_user_id
from activity b 
JOIN activity c
  ON b.data_user_id = c.data_user_id
  where b.data_type = 'signup'
  and c.data_type = 'premium_registration_complete'
  and TD_TIME_RANGE(b.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
  and TD_TIME_RANGE(c.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
;

started at 2014-03-18T18:13:40Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1362281004.txt
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
finished at 2014-03-18T18:14:52Z

うーむ。JOINがだめなのかな。

これをMAPJOINにすると…

select /*+ MAPJOIN(c) */ b.data_user_id
from activity b 
JOIN activity c
  ON b.data_user_id = c.data_user_id
  where b.data_type = 'signup'
  and c.data_type = 'premium_registration_complete'
  and TD_TIME_RANGE(b.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
  and TD_TIME_RANGE(c.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
;

started at 2014-03-18T18:17:02Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1019712899.txt
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/4080/4080_20140318181717_7ccdbb60-39da-4c24-9a74-5cd8c8c852d4.log
2014-03-18 06:17:13	Starting to launch local task to process map join;	maximum memory = 763363328
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
2014-03-18 06:17:31	Processing rows:	1560	Hashtable size:	1560	Memory usage:	17915440	rate:	0.023
2014-03-18 06:17:31	Dump the hashtable into file: file:/tmp/4080/hive_2014-03-18_18-17-11_312_384746705393357863/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable
2014-03-18 06:17:31	Upload 1 File to: file:/tmp/4080/hive_2014-03-18_18-17-11_312_384746705393357863/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable File size: 123562
2014-03-18 06:17:31	End of local task; Time Taken: 17.905 sec.
**
** Time indices:
**    Time index: [2014-03-11 18:17:00 +0000, 2014-03-18 18:16:59 +0000)
**
finished at 2014-03-18T18:18:06Z

なんか変わった。bのほうだけインデックスが使われてるみたい。

TD_TIME_ADDをやめてみる

select data_user_id from activity b 
  LEFT SEMI JOIN
    (select data_user_id from activity
      where TD_TIME_RANGE(time, '2013-03-17 00:00:00', '2013-03-18 00:00:00', 'JST')
      and data_type = 'premium_registration_complete'
    ) c
  ON b.data_user_id = c.data_user_id
  where TD_TIME_RANGE(time, '2013-03-11 00:00:00', '2013-03-18 00:00:00', 'JST')
  and data_type = 'signup'
;

started at 2014-03-18T18:31:00Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__361703617.txt
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
finished at 2014-03-18T18:32:07Z

変わらない。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up