Databricksとdbt Cloudの連携のステップ4: より複雑なモデルを作成して実行するとステップ5: テストを作成して実行するを実践した内容です。ステップ3以前の内容についてはDatabricksとdbt Cloudの連携(実践編その1)をご覧ください。
テーブルの作成
ここではデータベースtaka_dbt
に以下のテーブルを作成します。
DROP TABLE IF EXISTS zzz_game_opponents;
DROP TABLE IF EXISTS zzz_game_scores;
DROP TABLE IF EXISTS zzz_games;
DROP TABLE IF EXISTS zzz_teams;
CREATE TABLE zzz_game_opponents (
game_id INT,
home_team_id INT,
visitor_team_id INT
) USING DELTA;
INSERT INTO zzz_game_opponents VALUES (1, 1, 2);
INSERT INTO zzz_game_opponents VALUES (2, 1, 3);
INSERT INTO zzz_game_opponents VALUES (3, 2, 1);
INSERT INTO zzz_game_opponents VALUES (4, 2, 3);
INSERT INTO zzz_game_opponents VALUES (5, 3, 1);
INSERT INTO zzz_game_opponents VALUES (6, 3, 2);
CREATE TABLE zzz_game_scores (
game_id INT,
home_team_score INT,
visitor_team_score INT
) USING DELTA;
INSERT INTO zzz_game_scores VALUES (1, 4, 2);
INSERT INTO zzz_game_scores VALUES (2, 0, 1);
INSERT INTO zzz_game_scores VALUES (3, 1, 2);
INSERT INTO zzz_game_scores VALUES (4, 3, 2);
INSERT INTO zzz_game_scores VALUES (5, 3, 0);
INSERT INTO zzz_game_scores VALUES (6, 3, 1);
CREATE TABLE zzz_games (
game_id INT,
game_date DATE
) USING DELTA;
INSERT INTO zzz_games VALUES (1, '2020-12-12');
INSERT INTO zzz_games VALUES (2, '2021-01-09');
INSERT INTO zzz_games VALUES (3, '2020-12-19');
INSERT INTO zzz_games VALUES (4, '2021-01-16');
INSERT INTO zzz_games VALUES (5, '2021-01-23');
INSERT INTO zzz_games VALUES (6, '2021-02-06');
CREATE TABLE zzz_teams (
team_id INT,
team_city VARCHAR(15)
) USING DELTA;
INSERT INTO zzz_teams VALUES (1, "San Francisco");
INSERT INTO zzz_teams VALUES (2, "Seattle");
INSERT INTO zzz_teams VALUES (3, "Amsterdam");
上のSQLを実行すると、このように新たにテーブルが作成されます。
モデルの定義
こちらで説明されているようにモデルを作成していきます。データベースを独自のものにしていますので、若干SQL文を変更しています。参照先のデータベースを一括で設定する方法ご存知の方いらしたら教えていただけると幸いです。
{{ config(
materialized='table',
file_format='delta'
) }}
-- Step 4 of 4: Replace the visitor team IDs with their city names.
select
game_id,
home,
t.team_city as visitor,
home_score,
visitor_score,
-- Step 3 of 4: Display the city name for each game's winner.
case
when
home_score > visitor_score
then
home
when
visitor_score > home_score
then
t.team_city
end as winner,
game_date as date
from (
-- Step 2 of 4: Replace the home team IDs with their actual city names.
select
game_id,
t.team_city as home,
home_score,
visitor_team_id,
visitor_score,
game_date
from (
-- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).
select
g.game_id,
go.home_team_id,
gs.home_team_score as home_score,
go.visitor_team_id,
gs.visitor_team_score as visitor_score,
g.game_date
from
taka_dbt.zzz_games as g,
taka_dbt.zzz_game_opponents as go,
taka_dbt.zzz_game_scores as gs
where
g.game_id = go.game_id and
g.game_id = gs.game_id
) as all_ids,
taka_dbt.zzz_teams as t
where
all_ids.home_team_id = t.team_id
) as visitor_ids,
taka_dbt.zzz_teams as t
where
visitor_ids.visitor_team_id = t.team_id
order by game_date desc
{{ config(
database="taka_dbt"
) }}
-- Create a view that summarizes the season's win and loss records by team.
-- Step 2 of 2: Calculate the number of wins and losses for each team.
select
winner as team,
count(winner) as wins,
-- Each team played in 4 games.
(4 - count(winner)) as losses
from (
-- Step 1 of 2: Determine the winner and loser for each game.
select
game_id,
winner,
case
when
home = winner
then
visitor
else
home
end as loser
from taka_dbt.zzz_game_details
)
group by winner
order by wins desc
モデルの実行
画面下のRunsの右のテキストボックスに以下のコマンドを入力し、Enterをクリックして実行します。
dbt run --model models/zzz_game_details.sql models/zzz_win_loss_records.sql
処理が完了するとテーブルが追加されます。
テストの定義
プロジェクトにmodels/schema.yml
を追加します。以下の内容を貼り付けます。
version: 2
models:
- name: zzz_game_details
columns:
- name: game_id
tests:
- unique
- not_null
- name: home
tests:
- not_null
- accepted_values:
values: ['Amsterdam', 'San Francisco', 'Seattle']
- name: visitor
tests:
- not_null
- accepted_values:
values: ['Amsterdam', 'San Francisco', 'Seattle']
- name: home_score
tests:
- not_null
- name: visitor_score
tests:
- not_null
- name: winner
tests:
- not_null
- accepted_values:
values: ['Amsterdam', 'San Francisco', 'Seattle']
- name: date
tests:
- not_null
- name: zzz_win_loss_records
columns:
- name: team
tests:
- unique
- not_null
- relationships:
to: ref('zzz_game_details')
field: home
- name: wins
tests:
- not_null
- name: losses
tests:
- not_null
以下のデータテストを記述します。
-- This season's games happened between 2020-12-12 and 2021-02-06.
-- For this test to pass, this query must return no results.
select date
from taka_dbt.zzz_game_details
where date < '2020-12-12'
or date > '2021-02-06'
-- This sport allows no negative scores or tie games.
-- For this test to pass, this query must return no results.
select home_score, visitor_score
from taka_dbt.zzz_game_details
where home_score < 0
or visitor_score < 0
or home_score = visitor_score
-- Each team participated in 4 games this season.
-- For this test to pass, this query must return no results.
select wins, losses
from taka_dbt.zzz_win_loss_records
where wins < 0 or wins > 4
or losses < 0 or losses > 4
or (wins + losses) > 4
筆者注
2021/2/28時点で筆者が確認したところ、dbt Cloudのdbt test
コマンドは--schema
、--data
オプションを受け付けませんでした。コマンドdbt test
のみでスキーマテスト、データテストが実行されることは確認しています。
画面下のRunsの右のテキストボックスに以下のコマンドを入力し、Enterをクリックして実行します。スキーマテストとデータテストが実行されます。
dbt test