「dbt-core入門」dbtを使ってRenderにデプロイしたPostgreSQLのデータ変換する

Last updated at 2025-01-06Posted at 2025-01-03

はじめに

dbt（Data Build Tool）は、SQLでデータの変換や集計を効率的に管理するツールです。この記事では、Pythonの仮想環境を使ってdbtをインストールし、RenderにデプロイしたPostgreSQLデータベースに接続してデータ変換を行う方法を紹介します。

1. 必要なツールのインストール

まず、dbtを使うために必要なツールをインストールします。

1.1 Python仮想環境の作成

Pythonの仮想環境を作成し、必要なパッケージをインストールします。

python3 -m venv dbtenv

仮想環境をアクティブ化します。

macOS/Linux:

source dbtenv/bin/activate

1.2 `pip` のアップグレード

仮想環境が有効になった後、pipを最新のバージョンにアップグレードします。

pip install --upgrade pip

これで、pipが最新の安定版に更新されます。

1.3 dbt Core のインストール

次に、dbt CoreとPostgreSQL用のdbtアダプターをインストールします。

pip install dbt-core dbt-postgres

インストール後、dbt --version を実行して、インストールされたバージョンを確認します。

dbt --version

2. PostgreSQL データベースの準備

dbtは、SQLで記述した変換処理を実行します。そのため、まずはPostgreSQLデータベースを準備します。以下のSQLスクリプトを使って、workout_recordsテーブルを作成します。今回はすでにRenderで作成しているものを使用していきます。
Renderに PostgreSQL データベースを作成する

CREATE TABLE IF NOT EXISTS workout_records (
    id SERIAL PRIMARY KEY,
    exercise VARCHAR(255) NOT NULL,
    weight INT NOT NULL,
    reps INT NOT NULL,
    sets INT NOT NULL,
    date DATE NOT NULL
);

--データを入れる
INSERT INTO workout_records (exercise, weight, reps, sets, date) 
VALUES
  ('ベンチプレス', 90, 6, 5, '2024-12-30'),
  ('インクラインベンチプレス', 40, 10, 4, '2024-12-30'),
  ('アームカール', 30, 7, 5, '2025-01-03'),
  ('ベンチプレス', 100, 3, 5, '2025-01-03'),
  ('ペクトラルフライ', 30, 7, 5, '2025-01-03'),
  ('スクワット', 70, 5, 5, '2025-01-03'),
  ('ワイパー', 0, 20, 4, '2025-01-03'),
  ('デッドリフト', 120, 5, 4, '2025-01-04'),
  ('ラットプルダウン', 50, 8, 4, '2025-01-04'),
  ('バーベルスクワット', 80, 6, 5, '2025-01-05');

PSQL Commandでターミナルから実際のテーブルを確認してみます。

psql -h *******.oregon-postgres.render.com -U root workoutdb_k0zf
↓
以下のようにパスを要求されるので、PGPASSWORDの値を入力します
Password for user root: 
↓
PGPASSWORD=*****

以下workoutdb_k0zfデータベース内

workoutdb_k0zf=> \dt

            List of relations
 Schema |      Name       | Type  | Owner 
--------+-----------------+-------+-------
 public | workout_records | table | root
(1 row)

workoutdb_k0zf=> SELECT * FROM public.workout_records LIMIT 10;

 id |         exercise         | weight | reps | sets |    date    
----+--------------------------+--------+------+------+------------
  4 | ベンチプレス             |     90 |    6 |    5 | 2024-12-30
  5 | インクラインベンチプレス |     40 |   10 |    4 | 2024-12-30
  6 | アームカール             |     30 |    7 |    5 | 2025-01-03
  7 | ベンチプレス             |    100 |    3 |    5 | 2025-01-03
  8 | ペクトラルフライ         |     30 |    7 |    5 | 2025-01-03
  9 | スクワット               |     70 |    5 |    5 | 2025-01-03
 10 | ワイパー                 |      0 |   20 |    4 | 2025-01-03
(7 rows)


\q　--でデータベースを出ることができます。

このテーブルは、ワークアウトの記録を保存するために必要なカラム（種目、重量、レップ数、セット数、日付）を持っています。

3. dbt プロジェクトの作成

dbtをセットアップするためには、まずdbtプロジェクトを作成します。

dbt init workout_project

これでworkout_projectという名前のdbtプロジェクトが作成されます。このプロジェクトには、モデルや設定ファイルが含まれています。

3.1 プロジェクトディレクトリへ移動

cd workout_project

4. PostgreSQL接続設定

次に、dbtがPostgreSQLに接続できるように設定します。dbtではprofiles.ymlという設定ファイルを使って接続情報を管理します。

profiles.ymlファイルは、通常~/.dbt/ディレクトリ内にあります。もし存在しない場合は手動で作成できます。以下の内容を設定します。

workout_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: your_postgresql_host
      user: your_user
      password: your_password
      dbname: your_database
      schema: public
      port: 5432
      threads: 1

your_postgresql_host、your_user、your_password、your_databaseには実際の接続情報を入力してください。

設定後、接続確認のために以下のコマンドを実行します。

dbt debug

接続に問題がない場合、正常にセットアップが完了します。

5. dbt モデルの作成

ここでは、dbtのモデルをstaging、intermediate、martsの3つの層に分けて作成します。

5.1 Staging モデル: `stg_workout_records.sql`

staging層では、生のデータをクリーニングしたり、基本的な変換を行います。この層では、workout_recordsテーブルをそのまま読み込み、後続の処理に備えます。

models/staging/stg_workout_records.sqlを以下のように作成します。

{{ config(
    materialized='table'
) }}

select
    id,
    exercise,
    weight,
    reps,
    sets,
    date
from {{ source('public', 'workout_records') }}

作ったdbtモデルはPower User for dbtを使うとクエリ結果を確認することもできます。

5.2 Intermediate モデル: `int_workout_summary.sql`

intermediate層では、集計や中間的な処理を行います。ここでは、ワークアウトの重量平均とレップ数、セット数を集計します。

models/intermediate/int_workout_summary.sqlを以下のように作成します。

{{ config(
    materialized='table'
) }}

with base as (
    select
        id,
        exercise,
        weight as avg_weight,  -- 各種目の平均重量を計算
        reps as total_reps,
        sets as total_sets
    from {{ ref('stg_workout_records') }}
)

select
    exercise,
    avg(avg_weight) as average_weight,  -- 種目ごとの平均重量
    sum(total_reps) as total_reps,
    sum(total_sets) as total_sets
from base
group by exercise

5.3 Marts モデル: `marts_workout_summary.sql`

marts層では、最終的なビジネス指標やレポーティング用のデータを作成します。最終的なワークアウトの集計結果を取得します。

models/marts/marts_workout_summary.sqlを以下のように作成します。

{{ config(
    materialized='table'
) }}

with base as (
    select
        exercise,
        average_weight,  -- 正しいカラム名を使用（ここでは例として average_weight）
        total_reps,
        total_sets
    from {{ ref('int_workout_summary') }}
)

-- マート向けにさらに加工
select
    exercise,
    average_weight,
    total_reps,
    total_sets,
    case 
        when average_weight > 100 then 'Heavy'
        when average_weight > 50 then 'Medium'
        else 'Light'
    end as weight_category
from base

6. モデル用の`schema.yml`ファイルの作成

各モデルに対応するschema.ymlファイルを作成し、モデルに関するメタデータやテストを定義します。

6.1 `models/staging/stg_workout_records.yml`

version: 2

sources:
  - name: public
    tables:
      - name: workout_records

models:
  - name: stg_workout_records
    description: "Raw workout records from the workout_records table."
    columns:
      - name: exercise
        description: "Exercise name."
      - name: weight
        description: "Weight lifted."
      - name: reps
        description: "Repetitions performed."
      - name: sets
        description: "Number of sets performed."
      - name: date
        description: "Date of the workout."

6.2 `models/intermediate/int_workout_summary.yml`

version: 2

models:
  - name: int_workout_summary
    description: "Workout summary with total weight lifted and sets."
    columns:
      - name: total_weight_lifted
        description: "Total weight lifted during the workout."
      - name: total_sets
        description: "Total number of sets performed."

6.3 `models/marts/marts_workout_summary.yml`

version: 2

models:
  - name: marts_workout_summary
    description: "Final summary of workouts by exercise, including average weight and performance category."
    columns:
      - name: exercise
        description: "Name of the exercise."
      - name: average_weight
        description: "Average weight lifted for the exercise."
      - name: total_reps
        description: "Total number of repetitions performed for the exercise."
      - name: total_sets
        description: "Total number of sets performed for the exercise."
      - name: weight_category
        description: "Categorization of the exercise based on average weight (e.g., Light, Medium, Heavy)."

7. `dbt_project.yml`とディレクトリ構造の説明

7.1 `dbt_project.yml`の設定

dbt_project.ymlは、dbtプロジェクトの設定ファイルで、モデルの実行や依存関係などを指定します。このファイルはプロジェクトのルートディレクトリに配置されます。以下はdbt_project.ymlの基本的な設定例です。

# dbt_project.yml
name: 'workout_project'  # プロジェクト名。小文字とアンダースコアのみ使用。
version: '1.0.0'  # プロジェクトのバージョン

# どのプロファイルを使用するか設定
profile: 'workout_project'

# モデルのパス設定
model-paths: ["models"]  # モデルが格納されているディレクトリ

# 分析やテスト、シードファイルのパス設定
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

# dbtが削除するディレクトリを指定
clean-targets:
  - "target"
  - "dbt_packages"

# モデルの設定（materialized）
models:
  workout_project:
    staging:
      +materialized: table  # Stagingモデルはテーブルとしてマテリアライズ
    intermediate:
      +materialized: table  # Intermediateモデルもテーブルとしてマテリアライズ
    marts:
      +materialized: table  # Martsモデルもテーブルとしてマテリアライズ

設定項目の説明：

name: プロジェクト名。小文字とアンダースコアを使って指定します。
version: プロジェクトのバージョン番号。
profile: 使用する接続プロファイルを指定。profiles.ymlで定義された接続設定を使います。
model-paths: dbtがSQLモデルを探すディレクトリを指定。通常はmodels/ディレクトリです。
clean-targets: dbt cleanコマンドで削除するターゲットディレクトリを指定します。通常はtargetやdbt_packagesです。
models: 各モデルの設定を記述します。ここで、staging、intermediate、martsディレクトリにあるモデルに対して、マテリアライズ方法（テーブルやビューなど）を指定します。

7.2 ディレクトリ構造

models/ディレクトリの下にstaging、intermediate、martsディレクトリを配置することが一般的です。これにより、データの処理フローを段階的に分けて管理できます。

ディレクトリ構造例：

workout_project/
  ├── models/
  │   ├── staging/
  │   │   ├── workout_records.sql      # Stagingモデル
  │   │   └── another_staging_model.sql
  │   ├── intermediate/
  │   │   ├── workout_summary.sql      # Intermediateモデル
  │   │   └── another_intermediate_model.sql
  │   └── marts/
  │       ├── final_report.sql         # Martsモデル
  │       └── another_mart_model.sql
  ├── dbt_project.yml  # プロジェクト設定ファイル
  └── profiles.yml     # データベース接続設定ファイル

各ディレクトリとその役割：

staging:
- 生データの準備やクリーニングを行うモデルを格納します。
- 通常、stagingモデルは外部ソースからのデータをロードして、標準的な形式に整形します。
- 例: workout_records.sql（生のワークアウトデータを整理）
intermediate:
- stagingモデルのデータを基に、ビジネスロジックに合った処理を行うモデルを格納します。
- ここでは、データの集計や変換が行われ、分析に必要な中間的なデータを作成します。
- 例: workout_summary.sql（種目ごとの総重量やセット数を集計）
marts:
- 最終的なビジネスインテリジェンス用のデータ（レポートやダッシュボードなど）を格納します。
- ここでは、intermediateモデルから取得したデータを基に、最終的なレポートや集計結果を出力します。
- 例: final_report.sql（ユーザー向けのレポートやダッシュボード）

8. モデルの実行

テストが問題なく通過したら、以下のコマンドでモデルを実行し、データ変換を行います。

dbt run

実行が成功すると以下のようにターミナルに表示されるはずです。

05:46:52  Running with dbt=1.9.1
05:46:53  Registered adapter: postgres=1.9.0
05:46:53  Found 3 models, 1 source, 431 macros
05:46:53  
05:46:53  Concurrency: 1 threads (target='dev')
05:46:53  
05:46:57  1 of 3 START sql table model public.stg_workout_records ........................ [RUN]
05:46:58  1 of 3 OK created sql table model public.stg_workout_records ................... [SELECT 7 in 1.48s]
05:46:58  2 of 3 START sql table model public.int_workout_summary ........................ [RUN]
05:46:59  2 of 3 OK created sql table model public.int_workout_summary ................... [SELECT 7 in 1.35s]
05:46:59  3 of 3 START sql table model public.marts_workout_summary ...................... [RUN]
05:47:01  3 of 3 OK created sql table model public.marts_workout_summary ................. [SELECT 6 in 1.58s]
05:47:02  
05:47:02  Finished running 3 table models in 0 hours 0 minutes and 9.33 seconds (9.33s).
05:47:02  
05:47:02  Completed successfully
05:47:02  
05:47:02  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

これで、dbtがPostgreSQLに接続し、定義したSQLモデルを実行して結果を保存されます。以下Power User for dbtからみたデータモデルのリネージになります。

はじめに

このように、dbtを使ってPostgreSQLに接続し、ワークアウト記録を管理するために、staging、intermediate、marts層を活用して効率的にデータを変換・集計する方法を紹介しました。dbtを利用することで、データの品質管理が簡単に行えるとともに、データ変換や集計処理の再利用性も向上します。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

「dbt-core入門」dbtを使ってRenderにデプロイしたPostgreSQLのデータ変換する

はじめに

1. 必要なツールのインストール

1.1 Python仮想環境の作成

1.2 pip のアップグレード

1.3 dbt Core のインストール

2. PostgreSQL データベースの準備

3. dbt プロジェクトの作成

3.1 プロジェクトディレクトリへ移動

4. PostgreSQL接続設定

5. dbt モデルの作成

5.1 Staging モデル: stg_workout_records.sql

5.2 Intermediate モデル: int_workout_summary.sql

5.3 Marts モデル: marts_workout_summary.sql

6. モデル用のschema.ymlファイルの作成

6.1 models/staging/stg_workout_records.yml

6.2 models/intermediate/int_workout_summary.yml

6.3 models/marts/marts_workout_summary.yml

7. dbt_project.ymlとディレクトリ構造の説明

7.1 dbt_project.ymlの設定

設定項目の説明：

7.2 ディレクトリ構造

ディレクトリ構造例：

各ディレクトリとその役割：

8. モデルの実行

はじめに

1.2 `pip` のアップグレード

5.1 Staging モデル: `stg_workout_records.sql`

5.2 Intermediate モデル: `int_workout_summary.sql`

5.3 Marts モデル: `marts_workout_summary.sql`

6. モデル用の`schema.yml`ファイルの作成

6.1 `models/staging/stg_workout_records.yml`

6.2 `models/intermediate/int_workout_summary.yml`

6.3 `models/marts/marts_workout_summary.yml`

7. `dbt_project.yml`とディレクトリ構造の説明

7.1 `dbt_project.yml`の設定