これは何
SQLテーブルにてユニークな日付の入ったカラムから、連続した日を抽出しその開始と終了の両端の日を取得したい。そのための作業メモ。
方針
- 自己結合
- Window関数
先に結論
SQL文のわかりやすさで自己結合一択。データ量が多い場合はWindow関数の方がスキャンが1回で済むがそもそも日付連続するデータは有史以来多くないためその恩恵に意味がない。
以後 bigquery 上で実行できる(ユニーク、ソート済みのデータから)サンプルを書く。
インプット
行 _date
1 2021-01-01
2 2021-01-02
3 2021-01-04
4 2021-01-07
5 2021-01-08
6 2021-01-09
7 2021-01-15
8 2021-01-16
9 2021-01-20
方針1
- pros: 自己結合し
d1._date - COUNT(d2._date)
で連続する対象を同一グループをするところがテクニカル。そこ以外は素直なSQL文でわかりやすい - cons: 自己結合なのでスキャンが2回実行される
WITH d AS (
SELECT * FROM
UNNEST(ARRAY<DATE>[
"2021-01-01", "2021-01-02",
"2021-01-04",
"2021-01-07", "2021-01-08", "2021-01-09",
"2021-01-15", "2021-01-16",
"2021-01-20"
] ) AS _date ),
o1 AS (
SELECT
d1._date, d1._date - COUNT(d2._date) AS grp
FROM d AS d1 INNER JOIN d AS d2 ON
d1._date >= d2._date
GROUP BY
d1._date ),
out1 AS (
SELECT
MIN(_date) AS _from, MAX(_date) AS _to
FROM
o1
GROUP BY
grp )
SELECT * FROM out1
結果
行 low high
1 2021-01-01 2021-01-02
2 2021-01-04 2021-01-04
3 2021-01-07 2021-01-09
4 2021-01-15 2021-01-16
5 2021-01-20 2021-01-20
方針2
pros: あるのか・・・?SQLパズル力が高められるぐらい(実践には不要)
cons: SQL文が複雑で解読しずらい
WITH d AS (
SELECT * FROM
UNNEST(ARRAY<DATE>[
"2021-01-01", "2021-01-02",
"2021-01-04",
"2021-01-07", "2021-01-08", "2021-01-09",
"2021-01-15", "2021-01-16",
"2021-01-20"
] ) AS _date ),
o1 AS (
SELECT
_date,
LAG(_date) OVER(ORDER BY _date) AS prev_diff,
LEAD(_date) OVER(ORDER BY _date) AS next_diff,
FROM d
),
o2 AS (
SELECT
_date,
CASE WHEN COALESCE(EXTRACT(DAY FROM _date - prev_diff), 0) <> 1 THEN _date ELSE NULL END AS low,
CASE WHEN COALESCE(EXTRACT(DAY FROM next_diff - _date), 0) <> 1 THEN _date ELSE NULL END AS high
FROM o1
),
o3 AS (
SELECT
low,
CASE
WHEN high IS NULL THEN min(high) OVER(ORDER BY _date ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
ELSE high
END AS high
FROM o2
),
out2 as (
SELECT * FROM o3 WHERE low IS NOT NULL
)
SELECT * FROM out2
結果: 同上のため略
参考文献
SQL実践入門 第8章