ZOZO Advent Calendar 2024

Code validation with dbt-checkpoint

Last updated at 2024-12-11, posted at 2024-12-11

This article is the Day 12 entry in Series 4 of the ZOZO Advent Calendar 2024.

Overview

This article describes how we use dbt-checkpoint to maintain code quality in our dbt project, and how we implemented a custom validation for model configuration that the standard hooks cannot check.
Note: this article assumes dbt-core = "1.8.6" and dbt-bigquery = "1.8.2".

A concrete example of using dbt-checkpoint

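Our dbt project was likely to grow large, so we wanted to automatically check the items that are frequently raised in code review: SQL and model naming conventions, consistency of configuration, completeness of documentation, and so on.
dbt-checkpoint makes this possible, so we decided to adopt it.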

For the concrete setup steps, @takemikami has published a very clear article, so please refer to that article!

For example, in our dbt project, the .pre-commit-config.yaml file that defines the checks looks like this.

.pre-commit-config.yaml

repos:
- repo: https://github.com/dbt-checkpoint/dbt-checkpoint
  rev: v1.2.0
  hooks:
  # Check that the script does not contain a semicolon
  - id: check-script-semicolon
    files: ^models
  # Check that the script uses the source() or ref() macro instead of direct table names
  # Table names without a `.` are treated as CTEs (common table expressions)
  - id: check-script-has-no-table-name
    files: ^models
    args: ["--ignore-dotless-table"]
  # Check that the script only references sources and refs that exist
  - id: check-script-ref-and-source
    files: ^models
  # Check that models under models/marts have a description
  - id: check-model-has-description
    files: ^models/marts
  # Check that models under models/marts have a properties file (.yml)
  - id: check-model-has-properties-file
    files: ^models/marts

Based on this configuration, we set up GitHub Actions to run the checks, triggered by pull requests against the main branch.

main-pr.yaml

name: Validate dbt project
on:
  pull_request:
    branches:
      - main
    types: [opened, synchronize]

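jobs:
  check-pre-commit:
    runs-on: ubuntu-22.04
    steps:
    - name: Check out repository code
      uses: actions/checkout@v4

    # Steps that install Python, pipenv, and the project dependencies are omitted here.

    - name: Run dbt compile
      run: |
        pipenv run dbt compile

    - name: Run pre-commit
      run: |
        pipenv run pre-commit run --color never | tee -a pre-commit.txt

    - name: Check pre-commit status
      run: |
        # Scan the saved log, because the exit status of the pipeline above is that of tee.
        RESULT=$(cat pre-commit.txt)
        if [[ $RESULT == *Failed* ]]; then
          echo "❌ Failed: rule violations were detected"
          exit 1
        fi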
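
Because the pre-commit output is piped through `tee`, the exit status of that pipeline is `tee`'s unless `pipefail` is enabled, which is presumably why the workflow inspects the log text instead. The log scan can be tried in isolation with a minimal sketch like the following. The function name `log_has_failed_hook` and the sample log lines are hypothetical, and the pattern is anchored to the end of the line, which is slightly stricter than the substring match used in the workflow.

```shell
#!/usr/bin/env bash
# Sketch: decide pass/fail from a saved pre-commit log rather than from the pipeline's exit status.

# Returns 0 (true) when any line of the log ends with a "Failed" hook status.
# log_has_failed_hook is a hypothetical helper name, not part of pre-commit.
log_has_failed_hook() {
  local log_file="$1"
  grep -Eq 'Failed[[:space:]]*$' "$log_file"
}

# Hypothetical log content, shaped like pre-commit's "<hook name>....<status>" lines.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
Check the script does not contain a semicolon............Passed
Check the model has description..........................Failed
EOF

if log_has_failed_hook "$sample_log"; then
  echo "rule violations found"
else
  echo "all hooks passed"
fi
```

With the sample log above, the script takes the first branch, since the second line ends in `Failed`.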

Why we needed a custom validation check

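dbt-checkpoint can check many aspects of a dbt project, but some cases are not covered by its standard hooks. One example is verifying the configuration of a dbt model. In particular, for models whose materialized is incremental, we wanted to check that at least one of the following is configured.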

unique_key is set
incremental_strategy="insert_overwrite" is set, and partition_by is also set

The important point here is that the behavior of incremental_strategy differs depending on the database adapter.

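For example, on BigQuery, if incremental_strategy is not specified, the merge strategy is applied by default. In that case, when unique_key is set, the merge checks the existing rows in the table and updates those that match unique_key. When unique_key is not set, however, there is no way to identify which rows to update, so the new rows are simply inserted and unintended duplicates are produced. This leads to incorrect aggregation results, particularly when handling incremental data.
Likewise, even with the insert_overwrite strategy, a similar problem can occur if partition_by is not set properly.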

To guarantee idempotency on the model side, we wanted to detect model configurations like these. That was our motivation.

Implementing the custom validation check

In the end, we could not find a rule configurable in dbt-checkpoint that satisfies this requirement, so we implemented the check ourselves in bash, as follows.

main-pr.yaml

name: Validate dbt project
on:
  pull_request:
    branches:
      - main
    types: [opened, synchronize]

jobs:
  changed-files:
    runs-on: ubuntu-22.04
    outputs:
      changed_files: ${{ steps.set-env.outputs.changed_files }}

    steps:
    - name: Check out repository code
      uses: actions/checkout@v4
      with:
        fetch-depth: 0
        ref: ${{ github.head_ref }}

    - name: Get changed files
      id: changed-files
      uses: tj-actions/changed-files@v41
      with:
        json: true
        quotepath: false
        files: |
          models/**/*.sql
          models/**/*.yml

    - name: List changed files
      if: steps.changed-files.outputs.any_changed == 'true'
      id: set-env
      run: |
        echo "${{ steps.changed-files.outputs.all_changed_files }}"
        echo "changed_files=${{ steps.changed-files.outputs.all_changed_files }}" >> "$GITHUB_OUTPUT"
        
  check-model-configuration:
    needs: [changed-files]
    if: ${{ needs.changed-files.outputs.changed_files }}
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        changed_file: ${{ fromJSON(needs.changed-files.outputs.changed_files) }}
      fail-fast: false

    steps:
    - name: Check out repository code
      uses: actions/checkout@v4
      with:
        fetch-depth: 0
        ref: ${{ github.head_ref }}

    - name: Check model configuration
      env:
        # Pass the path through an environment variable so the script can quote it safely.
        CHANGED_FILE: ${{ matrix.changed_file }}
      run: |
        # Accept either quote style and optional spaces around "=".
        if grep -Eq "materialized[[:space:]]*=[[:space:]]*[\"']incremental[\"']" "$CHANGED_FILE"; then
            has_unique_key=$(grep -Eq "unique_key[[:space:]]*=" "$CHANGED_FILE" && echo "true" || echo "false")
            has_insert_overwrite=$(grep -Eq "incremental_strategy[[:space:]]*=[[:space:]]*[\"']insert_overwrite[\"']" "$CHANGED_FILE" && echo "true" || echo "false")
            has_partition_by=$(grep -Eq "partition_by[[:space:]]*=" "$CHANGED_FILE" && echo "true" || echo "false")

            echo "has_unique_key: $has_unique_key"
            echo "has_insert_overwrite: $has_insert_overwrite"
            echo "has_partition_by: $has_partition_by"

            # Pass when unique_key is set, or when insert_overwrite is combined with partition_by.
            if ! { [ "$has_unique_key" = "true" ] || { [ "$has_insert_overwrite" = "true" ] && [ "$has_partition_by" = "true" ]; } }; then
                echo "❌ Failed $CHANGED_FILE: to guarantee idempotency, an incremental model must either set unique_key or use the insert_overwrite strategy with partition_by."
                exit 1
            fi
        fi

We use tj-actions/changed-files to get the changed files, and the GitHub Actions matrix feature to run the check on each file in parallel.
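
The same rule can also be tried locally before opening a PR. The following is a minimal sketch under some assumptions: the function name `check_incremental_config` and the three sample model files (with their column names) are hypothetical and exist only for this demonstration. It is a text-based check like the workflow step, so it cannot tell an active config from a commented-out one.

```shell
#!/usr/bin/env bash
# Sketch: the incremental-model rule as a reusable function for local runs.

# Returns 0 when the file satisfies the rule (or is not an incremental model), 1 otherwise.
# check_incremental_config is a hypothetical helper name.
check_incremental_config() {
  local file="$1"
  local quote="[\"']"                      # either quote style
  local eq='[[:space:]]*=[[:space:]]*'     # "=" with optional surrounding spaces

  # Non-incremental models are out of scope for this rule.
  grep -Eq "materialized${eq}${quote}incremental${quote}" "$file" || return 0

  # Condition 1: unique_key is set.
  if grep -Eq "unique_key${eq}" "$file"; then
    return 0
  fi

  # Condition 2: the insert_overwrite strategy together with partition_by.
  if grep -Eq "incremental_strategy${eq}${quote}insert_overwrite${quote}" "$file" \
     && grep -Eq "partition_by${eq}" "$file"; then
    return 0
  fi

  return 1
}

# Hypothetical model files used only for this demonstration.
demo_dir="$(mktemp -d)"
cat > "$demo_dir/with_unique_key.sql" <<'EOF'
{{ config(materialized='incremental', unique_key='order_id') }}
select 1 as order_id
EOF
cat > "$demo_dir/overwrite_partitioned.sql" <<'EOF'
{{ config(
    materialized="incremental",
    incremental_strategy="insert_overwrite",
    partition_by={"field": "order_date", "data_type": "date"}
) }}
select current_date() as order_date
EOF
cat > "$demo_dir/missing_key.sql" <<'EOF'
{{ config(materialized='incremental') }}
select 1 as order_id
EOF

for f in "$demo_dir"/*.sql; do
  if check_incremental_config "$f"; then
    echo "OK      $(basename "$f")"
  else
    echo "FAILED  $(basename "$f")"
  fi
done
```

Of the three sample files, only missing_key.sql is reported as FAILED, because it is incremental but sets neither unique_key nor the insert_overwrite and partition_by pair.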

Summary

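This article introduced how we use dbt-checkpoint to maintain code quality in our dbt project, and how we check what it cannot cover.
As in this case, there are items that dbt-checkpoint cannot check, but our impression is that it covers the basic ones. Delegating as many of these checks as possible to CI/CD greatly reduces what reviewers need to keep in mind and lets them concentrate on reviewing data modeling and logic, which is very convenient.
Give it a try!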
