LoginSignup
1
0

More than 1 year has passed since last update.

dbt with Databricksで始めるDataOps(2)

Posted at

こちらの記事の続きになります。

検証の流れ

  1. 環境準備
  2. dbtプロジェクトの作成
  3. dbtのモデル作成と実行
  4. より複雑なモデルの作成と実行 ←この記事はここから
  5. dbtのテスト実行

4. より複雑なモデル作成と実行

1. Databricks SQLでのテーブル準備

まずは、Databricks SQLでテーブルとデータを準備します。

DROP TABLE IF EXISTS zzz_game_opponents;
DROP TABLE IF EXISTS zzz_game_scores;
DROP TABLE IF EXISTS zzz_games;
DROP TABLE IF EXISTS zzz_teams;

CREATE TABLE zzz_game_opponents (
  game_id INT,
  home_team_id INT,
  visitor_team_id INT
) USING DELTA
;

INSERT INTO zzz_game_opponents
  VALUES
    (1, 1, 2),
    (2, 1, 3),
    (3, 2, 1),
    (4, 2, 3),
    (5, 3, 1),
    (6, 3, 2)
;
CREATE TABLE zzz_game_scores (
  game_id INT,
  home_team_score INT,
  visitor_team_score INT
) USING DELTA
;

INSERT INTO zzz_game_scores
  VALUES
    (1, 4, 2),
    (2, 0, 1),
    (3, 1, 2),
    (4, 3, 2),
    (5, 3, 0),
    (6, 3, 1)
;
CREATE TABLE zzz_games (
  game_id INT,
  game_date DATE
) USING DELTA
;

INSERT INTO zzz_games
  VALUES
    (1, '2020-12-12'),
    (2, '2021-01-09'),
    (3, '2020-12-19'),
    (4, '2021-01-16'),
    (5, '2021-01-23'),
    (6, '2021-02-06')
;
CREATE TABLE zzz_teams (
  team_id INT,
  team_city VARCHAR(15)
) USING DELTA
;

INSERT INTO zzz_teams
  VALUES
    (1, "San Francisco"),
    (2, "Seattle"),
    (3, "Amsterdam")
;

2. dbtのモデル作成

modelsディレクトリ配下に以下のようなzzz_game_details.sqlを作成します。

-- 各試合詳細情報を提供するテーブルを作成する
-- スキーマは以下の通り
-- game_id int -- ゲームID
-- home varchar(15) -- ホームチーム
-- visitor varchar(15) -- アウェイチーム
-- home_score int -- ホームチームの点数
-- visitor_score int -- アウェイチームの点数
-- winner varchar(15) -- 勝利チーム
-- date date -- 試合日

{{ config(
  materialized='table',
  file_format='delta'
) }}

with all_ids as (
  -- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).
  select
    go.home_team_id,
    g.game_id,
    gs.home_team_score as home_score,
    go.visitor_team_id,
    gs.visitor_team_score as visitor_score,
    g.game_date
  from
    zzz_games as g,
    zzz_game_opponents as go,
    zzz_game_scores as gs
  where
    g.game_id = go.game_id and
    g.game_id = gs.game_id
), visitor_ids as (
  -- Step 2 of 4: Replace the home team IDs with their actual city names.
  select
    game_id,
    t.team_city as home,
    home_score,
    visitor_team_id,
    visitor_score,
    game_date
  from
    all_ids,
    zzz_teams as t
)
select
  game_id,
  home,
  t.team_city as visitor, -- Step 4 of 4: Replace the visitor team IDs with their city names.
  home_score,
  visitor_score,
  -- Step 3 of 4: Display the city name for each game's winner.
  case
    when
      home_score > visitor_score
        then
          home
    when
      visitor_score > home_score
        then
          t.team_city
  end as winner,
  game_date as date
from
  visitor_ids,
  zzz_teams as t
where
  visitor_ids.visitor_team_id = t.team_id
order by game_date desc
;

同様に、zzz_win_loss_records.sqlも作成します。

-- チームごとの勝敗数サマリのビューを作成します。

with sub as (
  -- Step 1 of 2: Determine the winner and loser for each game.
  select
    game_id,
    winner,
    case
      when
        home = winner
          then
            visitor
      else
        home
    end as loser
  from {{ ref('zzz_game_details') }}
)
-- Step 2 of 2: Calculate the number of wins and losses for each team.
select
  winner as team,
  count(winner) as wins,
  -- Each team played in 4 games.
  (4 - count(winner)) as losses
from 
  sub
group by winner
order by wins desc

以下dbtコマンドを実行してテーブルとビューのマテリアライズします。

dbt run --model models/zzz_game_details.sql models/zzz_win_loss_records.sql

00:00:00  Running with dbt=1.0.0
00:00:00  Found 7 models, 4 tests, 0 snapshots, 0 analyses, 190 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 
metrics
00:00:00
00:00:00  Concurrency: 1 threads (target='dev')
00:00:00
00:00:00  1 of 2 START table model dbt_schema.zzz_game_details............................ [RUN]
00:00:00  1 of 2 OK created table model dbt_schema.zzz_game_details....................... [OK in 5.81s]
00:00:00  2 of 2 START view model dbt_schema.zzz_win_loss_records......................... [RUN]
00:00:00  2 of 2 OK created view model dbt_schema.zzz_win_loss_records.................... [OK in 1.47s]
00:00:00
00:00:00  Finished running 1 table model, 1 view model in 14.66s.
00:00:00
00:00:00  Completed successfully
00:00:00
00:00:00  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

■zzz_game_detailテーブル
image.png

■zzz_win_loss_recordsビュー
image.png

5. dbtのテスト実行

dbtのテストは、singular testsという定型の(パラメータを含まない)SELECT文によるワンオフなテストと、
generic testsと呼ばれるパラメータ化されたクエリによるテストの2種類があります。
(generic testsの詳細例)

singular testsは、testsディレクトリ配下にSQL文を配置して定義します。
generic testsは、testブロックでパラメータされたSQL文を、tests/genericmacrosディレクトリに配置して
さらに、modelsディレクトリ等にYAMLファイルで呼び出して利用します。
いずれのテストでもSELECT文の結果が0件であれば合格(pass)と判定します。

generic testsの定義例

{% test not_null(model, column_name) %}

    select *
    from {{ model }}
    where {{ column_name }} is null

{% endtest %}

また、generic testsには、以下4つの組み込みテストがあり、すぐ利用できるようになっています。
1. unique: 重複する値が存在しない
1. not_null: NULL値が存在しない
1. accepted_value: リストで提供される値のみ取る
1. relationships: 参照先テーブルの参照カラムに値が存在する(参照整合性)

1. generic tests

以下schema.ymlをmodelsディレクトリ配下に配置します。
testsの項目にgeneric testsの名前を指定することで呼び出しを行えます。

version: 2

models:
  - name: zzz_game_details
    columns:
      - name: game_id
        tests:
          - unique
          - not_null
      - name: home
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: visitor
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: home_score
        tests:
          - not_null
      - name: visitor_score
        tests:
          - not_null
      - name: winner
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: date
        tests:
          - not_null
  - name: zzz_win_loss_records
    columns:
      - name: team
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('zzz_game_details')
              field: home
      - name: wins
        tests:
          - not_null
      - name: losses
        tests:
          - not_null

2. singular tests

testsディレクトリにテスト用のSQLファイルを配置します。

zzz_game_details_check_dates.sql
日付の範囲が正しいかチェックする

-- This season's games happened between 2020-12-12 and 2021-02-06.
-- For this test to pass, this query must return no results.

select date
from {{ ref('zzz_game_details') }}
where date < '2020-12-12'
or date > '2021-02-06'

zzz_game_details_check_scores.sql
点数が負値を取ることがないことをチェックする

-- This sport allows no negative scores or tie games.
-- For this test to pass, this query must return no results.

select home_score, visitor_score
from {{ ref('zzz_game_details') }}
where home_score < 0
or visitor_score < 0
or home_score = visitor_score

zzz_game_details_check_records.sql
勝敗数は0-4の範囲であることをチェックする

-- Each team participated in 4 games this season.
-- For this test to pass, this query must return no results.

select wins, losses
from {{ ref('zzz_win_loss_records') }}
where wins < 0 or wins > 4
or losses < 0 or losses > 4
or (wins + losses) > 4

3. dbt testコマンド

dbt testコマンドでテストを実行します。
何もオプションを指定しない場合は、すべてのテストを実行しますが、--selectオプションなどでテスト対象をフィルタすることも可能です。

モデルに「zzz_win_loss_records」を指定、テストタイプはsingular testsのみを指定
(zzz_game_details_check_records.sqlのみ実行される)

dbt test --select zzz_win_loss_records,test_type:singular
00:00:00  Running with dbt=1.0.0
00:00:00  Found 7 models, 23 tests, 0 snapshots, 0 analyses, 191 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
00:00:00
00:00:00  Concurrency: 1 threads (target='dev')
00:00:00
00:00:00  1 of 1 START test zzz_game_detals_check_records................................. [RUN]
00:00:00  1 of 1 FAIL 2 zzz_game_detals_check_records..................................... [FAIL 2 in 1.72s]
00:00:00
00:00:00  Finished running 1 test in 5.84s.
00:00:00
00:00:00  Completed with 1 error and 0 warnings:
00:00:00
00:00:00  Failure in test zzz_game_detals_check_records (tests\zzz_game_detals_check_records.sql)
00:00:00    Got 2 results, configured to fail if != 0
00:00:00
00:00:00    compiled SQL at target\compiled\dbt_demo_project\tests\zzz_game_detals_check_records.sql
00:00:00
00:00:00  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

zzz_win_loss_recordsビューは以下のような結果を返すので、lossesが0未満の値を含むので失敗します。
image.png

次にzzz_game_detailsテーブルのgeneric testsを実行してみます。
以下コマンドでzzz_win_loss_recordsモデルの参照整合性テストを含む12のgeneric testsが実行されます。

dbt test --select zzz_game_details,test_type:generic

00:00:00  Running with dbt=1.0.0
00:00:00  Found 7 models, 23 tests, 0 snapshots, 0 analyses, 191 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
00:00:00  
00:00:00  Concurrency: 1 threads (target='dev')
00:00:00  
00:00:00  1 of 12 START test accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [RUN]
00:00:00  1 of 12 PASS accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [PASS in 1.15s]
00:00:00  2 of 12 START test accepted_values_zzz_game_details_visitor__Amsterdam__San_Francisco__Seattle [RUN]
00:00:00  2 of 12 PASS accepted_values_zzz_game_details_visitor__Amsterdam__San_Francisco__Seattle [PASS in 1.15s]
00:00:00  3 of 12 START test accepted_values_zzz_game_details_winner__Amsterdam__San_Francisco__Seattle [RUN]
00:00:00  3 of 12 PASS accepted_values_zzz_game_details_winner__Amsterdam__San_Francisco__Seattle [PASS in 1.03s]
00:00:00  4 of 12 START test not_null_zzz_game_details_date............................... [RUN]
00:00:00  4 of 12 PASS not_null_zzz_game_details_date..................................... [PASS in 1.03s]
00:00:00  5 of 12 START test not_null_zzz_game_details_game_id............................ [RUN]
00:00:00  5 of 12 PASS not_null_zzz_game_details_game_id.................................. [PASS in 0.99s]
00:00:00  6 of 12 START test not_null_zzz_game_details_home............................... [RUN]
00:00:00  6 of 12 PASS not_null_zzz_game_details_home..................................... [PASS in 1.01s]
00:00:00  7 of 12 START test not_null_zzz_game_details_home_score......................... [RUN]
00:00:00  7 of 12 PASS not_null_zzz_game_details_home_score............................... [PASS in 1.01s]
00:00:00  8 of 12 START test not_null_zzz_game_details_visitor............................ [RUN]
00:00:00  8 of 12 PASS not_null_zzz_game_details_visitor.................................. [PASS in 0.99s]
00:00:00  9 of 12 START test not_null_zzz_game_details_visitor_score...................... [RUN]
00:00:00  9 of 12 PASS not_null_zzz_game_details_visitor_score............................ [PASS in 1.07s]
00:00:00  10 of 12 START test not_null_zzz_game_details_winner............................ [RUN]
00:00:00  10 of 12 PASS not_null_zzz_game_details_winner.................................. [PASS in 2.20s]
00:00:00  11 of 12 START test relationships_zzz_win_loss_records_team__home__ref_zzz_game_details_ [RUN]
00:00:00  11 of 12 PASS relationships_zzz_win_loss_records_team__home__ref_zzz_game_details_ [PASS in 1.02s]
00:00:00  12 of 12 START test unique_zzz_game_details_game_id............................. [RUN]
00:00:00  12 of 12 FAIL 6 unique_zzz_game_details_game_id................................. [FAIL 6 in 1.02s]
00:00:00
00:00:00  Finished running 12 tests in 17.01s.
00:00:00
00:00:00  Completed with 1 error and 0 warnings:
00:00:00
00:00:00  Failure in test unique_zzz_game_details_game_id (models\schema.yml)
00:00:00    Got 6 results, configured to fail if != 0
00:00:00
00:00:00    compiled SQL at target\compiled\dbt_demo_project\models\schema.yml\unique_zzz_game_details_game_id.sql
00:00:00
00:00:00  Done. PASS=11 WARN=0 ERROR=1 SKIP=0 TOTAL=12
1
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
1
0