0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 1 year has passed since last update.

Visualize SQL | Merging Time Periods(part 1)

Posted at

Beginning

This technique is based on SQL Puzzle and Answers # 59 MERGING TIME PERIODS
It is better with 『SQL Puzzle and Answers』 to read this article.

It explains the idea of algorithm and
You can check what each puzzle is intended for ・・・
But there is almost no commentary about this puzzle

This puzzle handles time, but it's not easy.
I mean you might feel dizzy to read SQL even it's not so long

What we are going to do is to find
contiguous or overlapping time periods from job table

Let's start !!

See blue lines as job time
There are 10 jobs(task_id)
 ⇒ task_id 1 : Starts at 1 and ends at 3
 ⇒ task_id 2 : Starts at 2 and ends at 4
 ⇒ task_id 3 : Starts at 4 and ends at 5
 ⇒ task_id 4 : Starts at 6 and ends at 9
 ⇒ task_id 5 : Starts at 9 and ends at 9
    ・ ・ ・ ・ ・
    ・ ・ ・ ・ ・
 ⇒ task_id 10 : Starts at 17 and ends at 17

『contiguous or overlapping time periods』 will be orange lines on the bottom
We are going to get start and end date of these orange lines by SQL

re_1.jpg

I have checked that it works with PostgreSQL

Table and data

SQL
CREATE TABLE Timesheets(
	task_id CHAR(10) NOT NULL PRIMARY KEY,
	start_date DATE NOT NULL,
	end_date DATE NOT NULL,
	CHECK(start_date <= end_date)
);

INSERT INTO Timesheets
VALUES (1, '1997-01-01', '1997-01-03'),
       (2, '1997-01-02', '1997-01-04'),
       (3, '1997-01-04', '1997-01-05'),
       (4, '1997-01-06', '1997-01-09'),
       (5, '1997-01-09', '1997-01-09'),
       (6, '1997-01-09', '1997-01-09'),
       (7, '1997-01-12', '1997-01-15'),
       (8, '1997-01-13', '1997-01-14'),
       (9, '1997-01-14', '1997-01-14'),
       (10, '1997-01-17', '1997-01-17');

▼ Output(left side of image)

The image used above is displayed by side.
The blue line numbers are the same as the "day" of
start_date and end_date.

re_2.jpg

Creating Original Data ①

SQL
SELECT T1.*, T2.*
FROM Timesheets AS T1, Timesheets AS T2
WHERE T1.start_date <= T2.end_date

Creating original data to take out 『contiguous or overlapping time periods』
from CROSS JOIN Timesheets table.
 ⇒ Taking out 『start date』from T1 and 『end date』from T2
 ⇒ Add a condition of T1.start_date <= T2.end_date
   because 『end date』should be later than 『start date』
 ⇒ Comparison operator is『<=』to cover jos which start_date and
   end_date are the same like "task_id 6"
 ⇒ Combinations that fit this condition are left

▼ Output(Original Data ①)

re_3.jpg

This is the original data to take out『start date』and『end date』of
『contiguous or overlapping time periods』
 ⇒ the orange lines on the bottom

re_4.jpg

Creating Original Data ②

SQL
SELECT T3.*, T4.*
FROM Timesheets AS T3, Timesheets AS T4
WHERE T3.end_date < T4.start_date

Creating original data using CROSS JOIN Timesheets table
but a condition is not the same as 『Original Data ①』
 ⇒ T3.end_date < T4.start_date
 ⇒ 『start date』of T4 is later than『end date』 of T3
   Combinations of T3 and T4 do not overlap

▼ image of 『Original Data ②』

re_5.jpg

▼ Output(Original Data ②)

re_6.jpg


Original Data ① : a set of times that have『start date』and『end date』
Original Data ② : a set of non-overlapping times

Our goal is to find 『start date』and『end date』 of
『contiguous or overlapping time periods』
So what we are going to do is ・・・
 [1] Exclude 『Original Data ②』from『Original Data ①』
 [2] Find start and end dates from the rest of the data

▼ Output(Original Data ① & Original Data ②)

re_7.JPG

Edit Original Data ①

SQL
SELECT T1.*, T2.*
FROM Timesheets AS T1, Timesheets AS T2
WHERE T1.start_date <= T2.end_date
AND NOT EXISTS
(
   SELECT *
   FROM Timesheets AS T3, Timesheets AS T4
   WHERE
     T3.end_date < T4.start_date AND
     T1.start_date <= T3.start_date AND T3.end_date <= T2.end_date
     AND
     T1.start_date <= T4.start_date AND T4.end_date <= T2.end_date
)

We want to exclude『Original Data ②』from『Original Data ①』.
However『T3 & T4』and『T1 & T2』are different Timesheets table data.
So they cannot be excluded as they are.

Add『NOT EXISTS』to the WHERE clause of『Original Data ①』.
『Original data ②』is in the『NOT EXISTS』but there are other conditions
 ⇒ T1.start_date <= T3.start_date AND T3.end_date <= T2.end_date
   AND
   T1.start_date <= T4.start_date AND T4.end_date <= T2.end_date
 ⇒ This defines『T3 & T4』are a set of times
   that are inside T1.start_date and T2.end_date
 ⇒ If 『T3 & T4』 which are inside T1.start_date and T2.end_date
   fit condition of 『T3.end_date < T4.start_date』
   they are excluded(NOT EXISTS)
 ⇒ 『contiguous or overlapping time periods』are left in『T1 & T2』

▼ (image of how this SQL works)

re_8.JPG


▼ Output

『T1 & T2』after removing the non-overlapping time
 ⇒ Edit this output then take out『start_date』and『end_date』

re_9.jpg

Edit Original Data ②

SQL
SELECT T1.start_date, MIN(T2.start_date), MAX(T2.end_date)
FROM Timesheets AS T1, Timesheets AS T2
WHERE T1.start_date <= T2.end_date
AND NOT EXISTS
(
  SELECT *
  FROM Timesheets AS T3, Timesheets AS T4
  WHERE
    T3.end_date < T4.start_date AND
    T3.start_date >= T1.start_date AND T3.end_date <= T2.end_date
    AND
    T4.start_date >= T1.start_date AND T4.end_date <= T2.end_date
)
GROUP BY T1.start_date

T1.start_date become unique because of『GROUP BY T1.start_date』
 ⇒ Some of these unique would be『start date』

MAX(T2.end_date) is the latest T2.end_date among the combinations
that T1.star_date has.
 ⇒ Some of these unique would be『end date』

MIN(T2.start_date) is the earliest T2.start_date among the combinations
that T1.star_date has.
 ⇒ Find 『start date』 using these

▼ Output(right side of image)

re_10.jpg

Edit Original Data ③

SQL
SELECT T1.start_date, MIN(T2.start_date), MAX(T2.end_date)
FROM Timesheets AS T1, Timesheets AS T2
WHERE T1.start_date <= T2.end_date
AND NOT EXISTS
(
  SELECT *
  FROM Timesheets AS T3, Timesheets AS T4
  WHERE
    T3.end_date < T4.start_date AND
    T3.start_date >= T1.start_date AND T3.end_date <= T2.end_date
    AND
    T4.start_date >= T1.start_date AND T4.end_date <= T2.end_date
)
GROUP BY T1.start_date
HAVING T1.start_date = MIN(T2.start_date)  -- Add

▼ Output(bottom of image)

re_11.jpg

Add T1.start_date = MIN(T2.start_date)
 ⇒ 『start_date』that we want to pick fits this condition
 ⇒ pick『start_date』come with 『end_date』
 ⇒ Why・・・? We will check in detail

▼ ie:Why T1.start_date(1997-01-02) can not be『start date』?

re_12.jpg

▼ ie:Why T1.start_date(1997-01-14) can not be『start date』?

re_13.jpg

▼ ie:Why T1.start_date(1997-01-12) can be『start date』?

re_14.jpg

▼ ie:Why T1.start_date(1997-01-06) can be『start date』?

re_15.jpg

This SQL has a missing condition

We've checked this SQL so far ・・・
『end date』cannot be taken correctly with this SQL

Fix the problem this SQL has and check why next

start date(T1.start_date) end date(T2.end_date) correct end date
1997-01-01 1997-01-04 1997-01-05
1997-01-06 1997-01-09 ← OK
1997-01-12 1997-01-15 ← OK
1997-01-17 1997-01-17 ← OK

re_16.jpg

Modify 『Edit Original Data①』

SQL
SELECT T1.*, T2.*
FROM Timesheets AS T1, Timesheets AS T2
WHERE T1.start_date <= T2.end_date
AND NOT EXISTS
(
   SELECT *
   FROM Timesheets AS T3, Timesheets AS T4
   WHERE
     T3.end_date < T4.start_date AND
     T1.start_date <= T3.start_date AND T3.end_date <= T2.end_date
     AND
     T1.start_date <= T4.start_date AND T4.end_date <= T2.end_date
     AND --- Add from here ↓ ↓ ↓ ↓ ↓ ↓
     NOT EXISTS(
       SELECT * FROM Timesheets AS T5
       WHERE
         T5.start_date BETWEEN T3.start_date AND T3.end_date
         AND
         T5.end_date BETWEEN T4.start_date AND T4.end_date
     )
)

▼ Output(right side of image)
Add another 『NOT EXISTS』 inside of 『NOT EXISTS』
then add conditions to 『T3 & T4』 using a copy of Timesheets(T5)
 ⇒ This will output one more row

re_17.jpg

This condition is ・・・

Although『T3 & T4』are used in the outer "NOT EXISTS"
to exclude non-overlapping combinations,
leaving required combinations using the inner "NOT EXISTS"

leaving required combinations・・・means
If 『start date』and『end date』 in Timesheets(T5) table are
in the non-overlapping combinations defined using 『T3 & T4』,
they shuould not be in the non-overlapping combinations.
That is why they are left

▼ image of this process

re_18.jpg


Let's see in detail
Get the removed time combination with 『NOT EXISTS』
 ⇒ Change 『NOT EXISTS』to 『EXISTS』then show removed combinations
 ⇒ It's the green dotted frame

re_19.jpg


Whether『start date』and『end date』in Timesheets(T5) table are
in the removed time combination or not?
Pick some 『start_date』and 『end_date』in Timesheets(T5)
then check the condition
 ⇒ 『task_id is 2』fits the condition
    start_date(1997-01-02) and end_date(1997-01-04)
 ⇒ the red frame row is left
 ⇒ the rest are excluded

check below images

re_20.jpg

re_21.jpg

outer and inner NOT EXISTS are pair

Since we want to find『contiguous or overlapping time periods』
it makes sence to exclude time combinations where 『T3.end_date < T4.start_date』
but it is lack of consideration

In this case ・・・
when end date of 『job①』and star date of 『job②』do not overlap
we can see they are not 『contiguous or overlapping』

If 『job③』is inside of『job①』and 『job②』
we can see『job①』『job③』and『job②』are『contiguous or overlapping』

These process is done with the outer NOT EXISTS and
the inner NOT EXISTS

▼ below is the image of how this works

re_22.jpg

The output of this SQL is the original data
for retrieving『start date』and『end date』

We are going to do the same process using this output

Edit Original Data ④

SQL
SELECT T1.start_date, MIN(T2.start_date), MAX(T2.end_date)
FROM Timesheets AS T1, Timesheets AS T2
WHERE T1.start_date <= T2.end_date
AND NOT EXISTS
(
  SELECT *
  FROM Timesheets AS T3, Timesheets AS T4
  WHERE
    T3.end_date < T4.start_date AND
    T3.start_date >= T1.start_date AND T3.end_date <= T2.end_date
    AND
    T4.start_date >= T1.start_date AND T4.end_date <= T2.end_date
    AND NOT EXISTS
    (
       SELECT * FROM Timesheets AS T5
       WHERE T5.start_date BETWEEN T3.start_date AND T3.end_date
         AND T5.end_date BETWEEN T4.start_date AND T4.end_date
    )
)
GROUP BY T1.start_date
--HAVING T1.start_date = MIN(T2.start_date)

▼ Output(right side of image)

I explained before ・・・please check there

re_23.jpg

Answer SQL

SQL
SELECT T1.start_date, MIN(T2.start_date), MAX(T2.end_date)
FROM Timesheets AS T1, Timesheets AS T2
WHERE T1.start_date <= T2.end_date
AND NOT EXISTS
(
  SELECT *
  FROM Timesheets AS T3, Timesheets AS T4
  WHERE
    T3.end_date < T4.start_date AND
    T3.start_date >= T1.start_date AND T3.end_date <= T2.end_date
    AND
    T4.start_date >= T1.start_date AND T4.end_date <= T2.end_date
    AND NOT EXISTS
    (
       SELECT * FROM Timesheets AS T5
       WHERE T5.start_date BETWEEN T3.start_date AND T3.end_date
         AND T5.end_date BETWEEN T4.start_date AND T4.end_date
    )
)
GROUP BY T1.start_date
HAVING T1.start_date = MIN(T2.start_date)

▼ Output

I explained before ・・・please check there

re_24.jpg

Finally, we can get 『start date』and『end date』 of
『contiguous or overlapping time periods』

start date(T1.start_date) end date MAX(T2.end_date)
1997-01-01 1997-01-05
1997-01-06 1997-01-09
1997-01-12 1997-01-15
1997-01-17 1997-01-17

It's been a little long ・・・
I explained this puzzle with incomplete SQL on purpose at the beginning
because I thought it would help to understand how to handle data with SQL

I used many pictures and visualized date flow to explain
but still some explanations may be difficult to catch

Because my words are poor ・・・ or
takes some time to get used to understand this kind of SQL

This is a good SQL puzzle for treating data as a set

References

Joe Celko's SQL Puzzles and Answers

0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?