LoginSignup
22
8

More than 5 years have passed since last update.

SQL再入門

Last updated at Posted at 2016-12-09
1 / 27

@Opt Technologies 社内勉強会 2016/12/09

こちらは、社内勉強会で使った資料となります。

近況

  • BigQueryすごい

皆さんSQL好きですか?

よくある声

  • DBMSによって方言があってつらい
  • コード書かせろ

でも時代はSQL

  • BigQuery / Redshift等、SQLをクエリ言語とする分散処理システムの台頭
  • Sparkも2.0でSQL2003をサポートするなど、SQLの機能が強化
  • 逃げられない

今日やること

  1. JOIN再入門
  2. Window Functions
  3. Common Table Expressions
  4. Example
    • Sessionization

実行計画についてはふれません


環境

  • Postgresql 9.6.1

1. JOIN再入門

create table account(
  id integer not null,
  name varchar(100) not null,
  email varchar(100) not null
);

create table post(
  account_id integer not null,
  message text
);

insert into account values(0, 'john', 'john@example.com');
insert into account values(1, 'clifford', 'clifford@example.com');

insert into post values(0, 'hello');
insert into post values(0, 'hi');

insert into post values(2, 'no');

CROSS

行の組み合わせ全部

select * from account cross join post;
 id |   name   |        email         | account_id | message
----+----------+----------------------+------------+---------
  0 | john     | john@example.com     |          0 | hello
  1 | clifford | clifford@example.com |          0 | hello
  0 | john     | john@example.com     |          0 | hi
  1 | clifford | clifford@example.com |          0 | hi
  0 | john     | john@example.com     |          2 | no
  1 | clifford | clifford@example.com |          2 | no
(6 rows)

ちなみに、INNER/OUTERは、CROSS JOIN(直積)の部分集合かどうか、から来ているそうだ。


INNER

CROSS JOINの結果を、ON句でfilterするイメージ

select * from account a inner join post p on a.id = p.account_id;
 id | name |      email       | account_id | message
----+------+------------------+------------+---------
  0 | john | john@example.com |          0 | hello
  0 | john | john@example.com |          0 | hi
(2 rows)

LEFT OUTER

inner joinの結果に加え、左側のテーブルのうち、on条件を満たさないものも残る

select * from account a left join post p on a.id = p.account_id;
 id |   name   |        email         | account_id | message
----+----------+----------------------+------------+---------
  0 | john     | john@example.com     |          0 | hello
  0 | john     | john@example.com     |          0 | hi
  1 | clifford | clifford@example.com |            |
(3 rows)

RIGHT OUTER

inner joinの結果に加え、右側のテーブルのうち、on条件を満たさないものも残る

select * from account a right join post p on a.id = p.account_id;
 id | name |      email       | account_id | message
----+------+------------------+------------+---------
  0 | john | john@example.com |          0 | hello
  0 | john | john@example.com |          0 | hi
    |      |                  |          2 | no
(3 rows)

FULL OUTER

inner joinの結果に加え、右側 & 左側のテーブルのうちon条件を満たさないものも残る

select * from account a full join post p on a.id = p.account_id;
 id |   name   |        email         | account_id | message
----+----------+----------------------+------------+---------
  0 | john     | john@example.com     |          0 | hello
  0 | john     | john@example.com     |          0 | hi
    |          |                      |          2 | no
  1 | clifford | clifford@example.com |            |
(4 rows)

2. Window Functions

  • 集約しないGROUP BYみたいなノリ
    • GROUP BYと違って結果の行数は変わらない
  • SELECT句に書く
  • いろんな関数があって便利

先のpostテーブルには今こんな行が入っている

select * from post;
 account_id | message
------------+---------
          0 | hello
          0 | hi
          2 | no
(3 rows)

account_idごとのpost数 (GROUP BY)

select account_id, count(*) from post group by account_id;
 account_id | count
------------+-------
          0 |     2
          2 |     1
(2 rows)

messageも含めたaccount_idごとのpost数(Window関数)

select account_id, message, count(*) over (partition by account_id) from post;
 account_id | message | count
------------+---------+-------
          0 | hello   |     2
          0 | hi      |     2
          2 | no      |     1
(3 rows)

Windowには名前をつけられる

いくつかの列で同じWindowを使うとき便利

select account_id, message, count(*) over w
from post window w as (partition by account_id);

3. Common Table Expressions

  • 共通テーブル表式
  • 一時テーブル的なもの
  • サブクエリと似てるとも言える
with account_with_post(id, name, email, message) as
(
  select u.id, u.name, u.email, p.message
  from account u left join post p on account.id = post.account_id
)
select * from account_with_post;

再帰CTE

create table linked_list(
  id integer not null,
  next_id integer,
  value varchar(100)
);

-- 'H' -> 'e' -> 'l' -> 'l' -> 'o'
insert into linked_list values(0, 2, 'H');
insert into linked_list values(2, 4, 'e');
insert into linked_list values(4, 6, 'l');
insert into linked_list values(6, 8, 'l');
insert into linked_list values(8, null, 'o');

-- 's' -> 'q' -> 'l'
insert into linked_list values(1, 3, 's');
insert into linked_list values(3, 5, 'q');
insert into linked_list values(5, null, 'l');

構文

  • 以下の2つをUNION ALLしたものを、CTEの中身に書く
    1. スタートを表すselect(再帰を含まない)
    2. 再帰を表すselect
  • 1.から初めて再帰的に2.が行を返す
  • 2.が行を返さなくなったら終了
with recursive traverse(last_id, current_id, path) as
(
  select id, id, value from linked_list where next_id is null
  union all
  select t.last_id, l.id, (l.value || '->' || t.path)::varchar(100)
  from traverse t inner join linked_list l on l.next_id = t.current_id
)
select * from traverse;
 last_id | current_id |     path
---------+------------+---------------
       8 |          8 | o
       5 |          5 | l
       8 |          6 | l->o
       5 |          3 | q->l
       8 |          4 | l->l->o
       5 |          1 | s->q->l
       8 |          2 | e->l->l->o
       8 |          0 | H->e->l->l->o
(8 rows)

last_idごとにまとめれば、linked_listに含まれる連結リスト全体がわかる。

with recursive traverse(last_id, current_id, path) as
(
  select id, id, value from linked_list where next_id is null
  union all
  select t.last_id, l.id, (l.value || '->' || t.path)::varchar(100)
  from traverse t inner join linked_list l on l.next_id = t.current_id
)
select path from
  (select
    path,
    row_number() over (partition by last_id order by length(path) desc) as rn
  from traverse) l
where l.rn = 1;
         path
-----------------------
 s->q->l
 H->e->l->l->o
(2 rows)

4. Example

Sessionization

アクセス解析では、広告クリックや検索エンジンからのサイト流入 ~> サイト内を回遊 ~> 離脱の流れをセッションという単位で呼び、セッション単位で集計することがよくある。

ここでは、サイトのPVログがpage_viewテーブル、サイトをLPとした広告のクリックログがclickテーブルに入っているとき、これらをセッションに分割してsessionテーブルに格納することを考える。

なお、このSessionizationの例はこちらの記事をほぼそのまま参考にさせてもらってます。。

create table page_view(
  uid varchar(100) not null, -- ブラウザに割り振られるcookie idを表す
  epochmillis bigint not null,
  referer_domain varchar(1024)
);

create table click(
  uid varchar(100) not null,
  epochmillis bigint not null
);

create table session(
  uid varchar(100) not null,
  epochmillis bigint not null,
  page_views integer not null,
  start_by_click boolean not null
);

ここでのセッションの定義

  • セッションとは、page_viewclickの集まりで、以下を満たすもの
    • clickはセッションの始まりである(サイトへの流入と同義なため)
    • 前回のpage_view or clickから30分時間が空いたら、セッションが切れる(離脱)
    • page_viewreferer_domainexample.com(アクセス解析対象サイトのドメインと思ってください)でない場合、そのpage_viewからセッションが始まる(外部サイトからの流入を表すため)

page_view, clickログは以下のようなものが入ってるとする。

-- session 1
insert into click values('a', 0);
insert into page_view values('a', 1, 'example.com');
insert into page_view values('a', 2, 'example.com');

-- session 2

-- 30分経った
insert into page_view values('a', 1800003, 'example.com');
insert into page_view values('a', 1800004, 'example.com');

-- session 3

-- 外部ドメインからの流入
insert into page_view values('a', 1800005, 'google.com');
insert into page_view values('a', 1800006, 'example.com');
insert into page_view values('a', 1800007, 'example.com');

完成系

-- 1. clickとpage_viewをまとめる
with event(uid, epochmillis, referer_domain, event_type) as
(
  select uid, epochmillis, referer_domain, 'page_view' from page_view
  union all
  select uid, epochmillis, null, 'click' from click

-- 2. 各行に、uid単位の時系列で一つ前の行の情報をくっつける
), event_with_previous(uid, epochmillis, referer_domain, event_type, prev_epochmillis) as
(select
  uid,
  epochmillis,
  referer_domain,
  event_type,
  lag(epochmillis) over (partition by uid order by epochmillis rows between 1 preceding and current row)
from event

-- 3. 各行がsessionの開始かどうか判定し、それぞれに対しsession_idをふっていく
), event_with_session_id(uid, epochmillis, event_type, session_id) as
(select
  uid,
  epochmillis,
  event_type,
  sum(case
    when prev_epochmillis is null then 1
    when epochmillis - prev_epochmillis >= 30 * 60 * 1000 then 1
    when event_type = 'click' then 1
    when referer_domain <> 'example.com' then 1
    else 0
  end) over (partition by uid order by epochmillis) as session_id
from event_with_previous)

-- 4. 各セッションに含まれるpv数を集計し、セッションを出力
select uid, epochmillis, page_views, event_type = 'click' as start_by_click from
  (select
    uid,
    epochmillis,
    event_type,
    session_id,
    row_number() over (partition by uid, session_id order by epochmillis) as rn,
    sum(case when event_type = 'page_view' then 1 else 0 end) over (partition by uid, session_id) as page_views
  from event_with_session_id
) s where s.rn = 1;
 uid | epochmillis | page_views | start_by_click
-----+-------------+------------+----------------
 a   |           0 |          2 | t
 a   |     1800003 |          2 | f
 a   |     1800005 |          3 | f
(3 rows)

それぞれのパートを見ていく

1
select uid, epochmillis, referer_domain, 'page_view' from page_view
union all
select uid, epochmillis, null, 'click' from click;
 uid | epochmillis | referer_domain | ?column?
-----+-------------+----------------+-----------
 a   |           1 | example.com    | page_view
 a   |           2 | example.com    | page_view
 a   |     1800003 | example.com    | page_view
 a   |     1800004 | example.com    | page_view
 a   |     1800005 | google.com     | page_view
 a   |     1800006 | example.com    | page_view
 a   |     1800007 | example.com    | page_view
 a   |           0 |                | click
(8 rows)

2
-- 1. clickとpage_viewをまとめる
with event(uid, epochmillis, referer_domain, event_type) as
(
  select uid, epochmillis, referer_domain, 'page_view' from page_view
  union all
  select uid, epochmillis, null, 'click' from click
)
select
  uid,
  epochmillis,
  referer_domain,
  event_type,
  lag(epochmillis) over (partition by uid order by epochmillis rows between 1 preceding and current row)
from event;
 uid | epochmillis | referer_domain | event_type |   lag
-----+-------------+----------------+------------+---------
 a   |           0 |                | click      |
 a   |           1 | example.com    | page_view  |       0
 a   |           2 | example.com    | page_view  |       1
 a   |     1800003 | example.com    | page_view  |       2
 a   |     1800004 | example.com    | page_view  | 1800003
 a   |     1800005 | google.com     | page_view  | 1800004
 a   |     1800006 | example.com    | page_view  | 1800005
 a   |     1800007 | example.com    | page_view  | 1800006
(8 rows)

3
-- 1. clickとpage_viewをまとめる
with event(uid, epochmillis, referer_domain, event_type) as
(
  select uid, epochmillis, referer_domain, 'page_view' from page_view
  union all
  select uid, epochmillis, null, 'click' from click

-- 2. 各行に、uid単位の時系列で一つ前の行の情報をくっつける
), event_with_previous(uid, epochmillis, referer_domain, event_type, prev_epochmillis) as
(select
  uid,
  epochmillis,
  referer_domain,
  event_type,
  lag(epochmillis) over (partition by uid order by epochmillis rows between 1 preceding and current row)
from event
)
select
  uid,
  epochmillis,
  event_type,
  sum(case
    when prev_epochmillis is null then 1
    when epochmillis - prev_epochmillis >= 30 * 60 * 1000 then 1
    when event_type = 'click' then 1
    when referer_domain <> 'example.com' then 1
    else 0
  end) over (partition by uid order by epochmillis) as session_id
from event_with_previous;
 uid | epochmillis | event_type | session_id
-----+-------------+------------+------------
 a   |           0 | click      |          1
 a   |           1 | page_view  |          1
 a   |           2 | page_view  |          1
 a   |     1800003 | page_view  |          2
 a   |     1800004 | page_view  |          2
 a   |     1800005 | page_view  |          3
 a   |     1800006 | page_view  |          3
 a   |     1800007 | page_view  |          3
(8 rows)

まとめ

  • SQLで色々解ける
  • まずSQLでなんとかできないか考えよう

参考

22
8
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
22
8