1
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 5 years have passed since last update.

WARNING: time index filtering is not set!

Last updated at Posted at 2014-03-18

今週1週間に新規登録して昨日コンバージョン達成した人の、今週1週間のアクティビティを出したい。

select * from
activity a
LEFT SEMI JOIN
  (select data_user_id from activity b 
  LEFT SEMI JOIN
    (select data_user_id from activity
      where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
      and data_type = 'premium_registration_complete'
    ) c
  ON b.data_user_id = c.data_user_id
  where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
  and data_type = 'signup'
  ) d
ON a.data_user_id = d.data_user_id
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
;

かなり時間かかる。

サブクエリだけで実行

select data_user_id from activity b 
  LEFT SEMI JOIN
    (select data_user_id from activity
      where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
      and data_type = 'premium_registration_complete'
    ) c
  ON b.data_user_id = c.data_user_id
  where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
  and data_type = 'signup'
;
started at 2014-03-18T17:50:50Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1622188538.txt
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
finished at 2014-03-18T17:52:43Z

処理は終わるけど警告は残ったまま。

MAPJOIN

上のケースは1日以内にプレミアム登録した人という小さなテーブルとJOINしているので、MAPJOINを使ってみる。

select /*+ MAPJOIN(c) */ b.data_user_id from activity b 
  JOIN
    (select data_user_id from activity
      where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
      and data_type = 'premium_registration_complete'
    ) c
  ON b.data_user_id = c.data_user_id
  where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
  and data_type = 'signup'
;
started at 2014-03-18T17:58:48Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__243448987.txt
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/4080/4080_20140318175858_9bf02d3e-4672-46bb-9bdf-e2f4a6edef14.log
2014-03-18 05:58:59	Starting to launch local task to process map join;	maximum memory = 763363328
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
2014-03-18 05:59:34	Processing rows:	1556	Hashtable size:	1556	Memory usage:	30751296	rate:	0.04
2014-03-18 05:59:34	Dump the hashtable into file: file:/tmp/4080/hive_2014-03-18_17-58-56_440_7019263692592835974/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable
2014-03-18 05:59:35	Upload 1 File to: file:/tmp/4080/hive_2014-03-18_17-58-56_440_7019263692592835974/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable File size: 60967
2014-03-18 05:59:35	End of local task; Time Taken: 36.034 sec.
**
** Time indices:
**    Time index: [2014-03-11 17:58:48 +0000, 2014-03-18 17:58:47 +0000)
**
finished at 2014-03-18T18:00:40Z

まだ警告出てるけど、クエリが終わらないということはなくなった。

更にサブクエリだけで実行

select data_user_id from activity
  where TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
  and data_type = 'premium_registration_complete'
;

これだときちんとtimeインデックスを使ってくれてるみたい。

started at 2014-03-18T18:05:52Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1081435809.txt
**
** Time indices:
**    Time index: [2014-03-17 18:05:50 +0000, 2014-03-18 18:05:49 +0000)
**
finished at 2014-03-18T18:06:33Z

JOINの中にSELECTを書くのをやめてみる

select b.data_user_id
from activity b 
JOIN activity c
  ON b.data_user_id = c.data_user_id
  where b.data_type = 'signup'
  and c.data_type = 'premium_registration_complete'
  and TD_TIME_RANGE(b.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
  and TD_TIME_RANGE(c.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
;
started at 2014-03-18T18:13:40Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1362281004.txt
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
finished at 2014-03-18T18:14:52Z

うーむ。JOINがだめなのかな。

これをMAPJOINにすると…

select /*+ MAPJOIN(c) */ b.data_user_id
from activity b 
JOIN activity c
  ON b.data_user_id = c.data_user_id
  where b.data_type = 'signup'
  and c.data_type = 'premium_registration_complete'
  and TD_TIME_RANGE(b.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-7d'), TD_SCHEDULED_TIME(), 'JST')
  and TD_TIME_RANGE(c.time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME(), 'JST')
;
started at 2014-03-18T18:17:02Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__1019712899.txt
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/4080/4080_20140318181717_7ccdbb60-39da-4c24-9a74-5cd8c8c852d4.log
2014-03-18 06:17:13	Starting to launch local task to process map join;	maximum memory = 763363328
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
2014-03-18 06:17:31	Processing rows:	1560	Hashtable size:	1560	Memory usage:	17915440	rate:	0.023
2014-03-18 06:17:31	Dump the hashtable into file: file:/tmp/4080/hive_2014-03-18_18-17-11_312_384746705393357863/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable
2014-03-18 06:17:31	Upload 1 File to: file:/tmp/4080/hive_2014-03-18_18-17-11_312_384746705393357863/-local-10002/HashTable-Stage-1/MapJoin-c-01--.hashtable File size: 123562
2014-03-18 06:17:31	End of local task; Time Taken: 17.905 sec.
**
** Time indices:
**    Time index: [2014-03-11 18:17:00 +0000, 2014-03-18 18:16:59 +0000)
**
finished at 2014-03-18T18:18:06Z

なんか変わった。bのほうだけインデックスが使われてるみたい。

TD_TIME_ADDをやめてみる

select data_user_id from activity b 
  LEFT SEMI JOIN
    (select data_user_id from activity
      where TD_TIME_RANGE(time, '2013-03-17 00:00:00', '2013-03-18 00:00:00', 'JST')
      and data_type = 'premium_registration_complete'
    ) c
  ON b.data_user_id = c.data_user_id
  where TD_TIME_RANGE(time, '2013-03-11 00:00:00', '2013-03-18 00:00:00', 'JST')
  and data_type = 'signup'
;
started at 2014-03-18T18:31:00Z
Hive history file=/mnt/hive/tmp/4080/hive_job_log__361703617.txt
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
finished at 2014-03-18T18:32:07Z

変わらない。

1
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
1
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?