0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 1 year has passed since last update.

Databricksとdbt Cloudの連携(実践編その2)

Last updated at Posted at 2022-02-27

Databricksとdbt Cloudの連携ステップ4: より複雑なモデルを作成して実行するステップ5: テストを作成して実行するを実践した内容です。ステップ3以前の内容についてはDatabricksとdbt Cloudの連携(実践編その1)をご覧ください。

テーブルの作成

ここではデータベースtaka_dbtに以下のテーブルを作成します。

SQL
DROP TABLE IF EXISTS zzz_game_opponents;
DROP TABLE IF EXISTS zzz_game_scores;
DROP TABLE IF EXISTS zzz_games;
DROP TABLE IF EXISTS zzz_teams;

CREATE TABLE zzz_game_opponents (
game_id INT,
home_team_id INT,
visitor_team_id INT
) USING DELTA;

INSERT INTO zzz_game_opponents VALUES (1, 1, 2);
INSERT INTO zzz_game_opponents VALUES (2, 1, 3);
INSERT INTO zzz_game_opponents VALUES (3, 2, 1);
INSERT INTO zzz_game_opponents VALUES (4, 2, 3);
INSERT INTO zzz_game_opponents VALUES (5, 3, 1);
INSERT INTO zzz_game_opponents VALUES (6, 3, 2);

CREATE TABLE zzz_game_scores (
game_id INT,
home_team_score INT,
visitor_team_score INT
) USING DELTA;

INSERT INTO zzz_game_scores VALUES (1, 4, 2);
INSERT INTO zzz_game_scores VALUES (2, 0, 1);
INSERT INTO zzz_game_scores VALUES (3, 1, 2);
INSERT INTO zzz_game_scores VALUES (4, 3, 2);
INSERT INTO zzz_game_scores VALUES (5, 3, 0);
INSERT INTO zzz_game_scores VALUES (6, 3, 1);

CREATE TABLE zzz_games (
game_id INT,
game_date DATE
) USING DELTA;

INSERT INTO zzz_games VALUES (1, '2020-12-12');
INSERT INTO zzz_games VALUES (2, '2021-01-09');
INSERT INTO zzz_games VALUES (3, '2020-12-19');
INSERT INTO zzz_games VALUES (4, '2021-01-16');
INSERT INTO zzz_games VALUES (5, '2021-01-23');
INSERT INTO zzz_games VALUES (6, '2021-02-06');

CREATE TABLE zzz_teams (
team_id INT,
team_city VARCHAR(15)
) USING DELTA;

INSERT INTO zzz_teams VALUES (1, "San Francisco");
INSERT INTO zzz_teams VALUES (2, "Seattle");
INSERT INTO zzz_teams VALUES (3, "Amsterdam");

上のSQLを実行すると、このように新たにテーブルが作成されます。
Screen Shot 2022-02-28 at 7.30.37.png

モデルの定義

こちらで説明されているようにモデルを作成していきます。データベースを独自のものにしていますので、若干SQL文を変更しています。参照先のデータベースを一括で設定する方法ご存知の方いらしたら教えていただけると幸いです。

models/zzz_game_details.sql
{{ config(
  materialized='table',
  file_format='delta'
) }}

-- Step 4 of 4: Replace the visitor team IDs with their city names.
select
  game_id,
  home,
  t.team_city as visitor,
  home_score,
  visitor_score,
  -- Step 3 of 4: Display the city name for each game's winner.
  case
    when
      home_score > visitor_score
        then
          home
    when
      visitor_score > home_score
        then
          t.team_city
  end as winner,
  game_date as date
from (
  -- Step 2 of 4: Replace the home team IDs with their actual city names.
  select
    game_id,
    t.team_city as home,
    home_score,
    visitor_team_id,
    visitor_score,
    game_date
  from (
    -- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).
    select
      g.game_id,
      go.home_team_id,
      gs.home_team_score as home_score,
      go.visitor_team_id,
      gs.visitor_team_score as visitor_score,
      g.game_date
    from
      taka_dbt.zzz_games as g,
      taka_dbt.zzz_game_opponents as go,
      taka_dbt.zzz_game_scores as gs
    where
      g.game_id = go.game_id and
      g.game_id = gs.game_id
  ) as all_ids,
    taka_dbt.zzz_teams as t
  where
    all_ids.home_team_id = t.team_id
) as visitor_ids,
  taka_dbt.zzz_teams as t
where
  visitor_ids.visitor_team_id = t.team_id
order by game_date desc
models/zzz_win_loss_records.sql
{{ config(
  database="taka_dbt"
) }}

-- Create a view that summarizes the season's win and loss records by team.

-- Step 2 of 2: Calculate the number of wins and losses for each team.
select
  winner as team,
  count(winner) as wins,
  -- Each team played in 4 games.
  (4 - count(winner)) as losses
from (
  -- Step 1 of 2: Determine the winner and loser for each game.
  select
    game_id,
    winner,
    case
      when
        home = winner
          then
            visitor
      else
        home
    end as loser
  from taka_dbt.zzz_game_details
)
group by winner
order by wins desc

モデルの実行

画面下のRunsの右のテキストボックスに以下のコマンドを入力し、Enterをクリックして実行します。

Bash
dbt run --model models/zzz_game_details.sql models/zzz_win_loss_records.sql

Screen Shot 2022-02-28 at 7.29.54.png

処理が完了するとテーブルが追加されます。

Screen Shot 2022-02-28 at 7.32.07.png

ビューが追加されていることを確認します。
Screen Shot 2022-02-28 at 7.34.45.png

追加されたテーブルの中身を確認します。
Screen Shot 2022-02-28 at 7.34.57.png
Screen Shot 2022-02-28 at 7.35.06.png

テストの定義

プロジェクトにmodels/schema.ymlを追加します。以下の内容を貼り付けます。

models/schema.yml
version: 2

models:
  - name: zzz_game_details
    columns:
      - name: game_id
        tests:
          - unique
          - not_null
      - name: home
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: visitor
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: home_score
        tests:
          - not_null
      - name: visitor_score
        tests:
          - not_null
      - name: winner
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: date
        tests:
          - not_null
  - name: zzz_win_loss_records
    columns:
      - name: team
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('zzz_game_details')
              field: home
      - name: wins
        tests:
          - not_null
      - name: losses
        tests:
          - not_null

以下のデータテストを記述します。

tests/zzz_game_details_check_dates.sql
-- This season's games happened between 2020-12-12 and 2021-02-06.
-- For this test to pass, this query must return no results.

select date
from taka_dbt.zzz_game_details
where date < '2020-12-12'
or date > '2021-02-06'
tests/zzz_game_details_check_scores.sql
-- This sport allows no negative scores or tie games.
-- For this test to pass, this query must return no results.

select home_score, visitor_score
from taka_dbt.zzz_game_details
where home_score < 0
or visitor_score < 0
or home_score = visitor_score
tests/zzz_win_loss_records_check_records.sql
-- Each team participated in 4 games this season.
-- For this test to pass, this query must return no results.

select wins, losses
from taka_dbt.zzz_win_loss_records
where wins < 0 or wins > 4
or losses < 0 or losses > 4
or (wins + losses) > 4

筆者注
2021/2/28時点で筆者が確認したところ、dbt Cloudのdbt testコマンドは--schema--dataオプションを受け付けませんでした。コマンドdbt testのみでスキーマテスト、データテストが実行されることは確認しています。

画面下のRunsの右のテキストボックスに以下のコマンドを入力し、Enterをクリックして実行します。スキーマテストとデータテストが実行されます。

Bash
dbt test

Screen Shot 2022-02-28 at 8.22.49.png

Databricks 無料トライアル

Databricks 無料トライアル

0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?