How to Fetch dbt Cloud Docs via the API and Save Them to S3

Posted at 2025-08-20

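Introduction

The documentation generated by dbt Cloud (dbt docs) is very useful, but it can only be viewed inside dbt Cloud. This article shows how to use the dbt Cloud API to fetch the docs artifacts (HTML and JSON) and save them to S3, so that dbt docs can be served from another hosting service.
The configuration for actually hosting the artifacts once they are in S3 is covered in the following article:
Creating a CloudFront distribution (OAC + S3 bucket integration)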

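What this enables

Automatically saving the dbt docs files to S3
Using dbt metadata from external systems
Keeping the artifacts as a backup
Fetching the latest documentation through scheduled updates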

Prerequisites

A project already configured in dbt Cloud
A job that includes dbt docs generate and runs on a schedule
An AWS account (S3, Lambda)
A dbt Cloud API token

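Architecture

dbt Cloud → Lambda (fetch via API) → S3

The files saved in S3 can be used, for example, to:

Download them directly for inspection
Retrieve them from other systems via an API
Host them later with CloudFront or a similar service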

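Gathering the required information

First, collect the following from dbt Cloud:

1. API Token

Generate one under dbt Cloud → Account Settings → API Tokens.

2. Account ID, Project ID, Job ID

Read them from the URL of the job's page in the dbt Cloud UI:

https://abc123.us1.dbt.com/deploy/123456789012345/projects/123456789054321/jobs/123456789098765
                                  ↑Account ID              ↑Project ID          ↑Job ID

Use the job's page rather than the page of an individual run. A run URL has /runs/<Run ID> in the last position, and a Run ID cannot be used in place of the Job ID.

3. Account Prefix

The first part of the URL's hostname (for example, abc123).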
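
As a small illustration of the URL layout described above, the following sketch pulls the account prefix and IDs out of a copied dbt Cloud URL. The helper name parse_dbt_cloud_url and the regular expression are hypothetical (they are not part of dbt Cloud), and the sketch assumes the /deploy/<account>/projects/<project>/... path shape shown in this section.

```python
import re
from urllib.parse import urlparse

# Hypothetical helper: assumes the /deploy/<account>/projects/<project>/(jobs|runs)/<id> layout.
_PATH_RE = re.compile(
    r"^/deploy/(?P<account_id>\d+)/projects/(?P<project_id>\d+)"
    r"(?:/(?P<kind>jobs|runs)/(?P<resource_id>\d+))?"
)


def parse_dbt_cloud_url(url):
    """Split a dbt Cloud UI URL into account prefix, account ID, project ID and job/run ID."""
    parsed = urlparse(url)
    match = _PATH_RE.match(parsed.path)
    if match is None:
        raise ValueError(f"Unrecognized dbt Cloud URL: {url}")
    info = {
        "account_prefix": parsed.hostname.split(".")[0],
        "account_id": match["account_id"],
        "project_id": match["project_id"],
        "job_id": None,
        "run_id": None,
    }
    # The last path segment is a Job ID on a job page and a Run ID on a run page.
    if match["kind"] == "jobs":
        info["job_id"] = match["resource_id"]
    elif match["kind"] == "runs":
        info["run_id"] = match["resource_id"]
    return info
```

Keeping job_id and run_id as separate keys makes it harder to paste a Run ID into the Job ID setting by mistake.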

Implementing the Lambda function

The following Python code fetches the artifacts from the dbt Cloud API and saves them to S3:

from urllib import request, error
import boto3
import json
from datetime import datetime

# dbt Cloud API settings
DBT_CLOUD_API_TOKEN = 'dbtc_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
DBT_CLOUD_ACCOUNT_ID = '123456789012345'
DBT_CLOUD_ACCOUNT_PREFIX = 'abc123'
DBT_CLOUD_JOB_ID = '123456789098765'
DBT_CLOUD_PROJECT_ID = '123456789054321'

# S3 settings
S3_BUCKET_NAME = 'my-dbt-docs-bucket'
S3_KEY_PREFIX = 'dbt-docs/'

def get_content_type(filename):
    """Determine the Content-Type from the file extension."""
    if filename.endswith('.html'):
        return 'text/html'
    elif filename.endswith('.js'):
        return 'application/javascript'
    elif filename.endswith('.css'):
        return 'text/css'
    elif filename.endswith('.json'):
        return 'application/json'
    else:
        return 'application/octet-stream'

def save_to_s3(content, bucket, key, content_type='application/json'):
    """Save a file to S3."""
    s3 = boto3.client('s3')
    
    extra_args = {
        'ContentType': content_type
    }
    
    if isinstance(content, str):
        content = content.encode('utf-8')
    
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=content,
        **extra_args
    )

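def get_project_artifacts():
    """Fetch the latest dbt docs artifacts."""
    base_url = f'https://{DBT_CLOUD_ACCOUNT_PREFIX}.us1.dbt.com/api/v2/'
    headers = {'Authorization': f'Token {DBT_CLOUD_API_TOKEN}'}
    
    # List this job's runs, newest first, so that the first match below is the latest run.
    # (job_definition_id / order_by / limit are query parameters of the v2 "list runs"
    # endpoint; verify them against the API version of your dbt Cloud account.)
    runs_url = (
        f'{base_url}accounts/{DBT_CLOUD_ACCOUNT_ID}/runs/'
        f'?job_definition_id={DBT_CLOUD_JOB_ID}&order_by=-id&limit=100'
    )
    req = request.Request(runs_url, headers=headers)
    
    with request.urlopen(req) as response:
        runs_data = json.loads(response.read().decode())
        all_runs = runs_data['data']
        
        # Find the latest successful run of the target job (status 10 = Success)
        target_run = None
        for run in all_runs:
            run_job_id = run.get('job_definition_id', run.get('job_id'))
            if str(run_job_id) == str(DBT_CLOUD_JOB_ID) and run.get('status') == 10:
                target_run = run
                break
        
        if not target_run:
            raise Exception(f"No successful run found for job {DBT_CLOUD_JOB_ID}")
    
    run_id = target_run['id']
    print(f"Using run ID: {run_id}")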
    
    # Fetch the required artifacts
    artifacts = ['catalog.json', 'manifest.json', 'index.html']
    uploaded_files = []
    
    for artifact_name in artifacts:
        artifact_url = f'{base_url}accounts/{DBT_CLOUD_ACCOUNT_ID}/runs/{run_id}/artifacts/{artifact_name}'
        
        try:
            print(f"Fetching: {artifact_name}")
            req = request.Request(artifact_url, headers=headers)
            
            with request.urlopen(req) as response:
                content = response.read()
                
                # Save to S3
                s3_key = f"{S3_KEY_PREFIX}{artifact_name}"
                content_type = get_content_type(artifact_name)
                save_to_s3(content, S3_BUCKET_NAME, s3_key, content_type)
                uploaded_files.append(artifact_name)
                print(f"Uploaded to S3: {artifact_name}")
                
        except Exception as e:
            print(f"Error: {artifact_name} - {str(e)}")
            # catalog.json and manifest.json are required
            if artifact_name in ['catalog.json', 'manifest.json']:
                raise
    
    return uploaded_files

def lambda_handler(event, context):
    try:
        print("=== Starting dbt docs fetch ===")
        uploaded_files = get_project_artifacts()
        
        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': 'dbt docs files uploaded to S3 successfully!',
                'uploaded_files': uploaded_files,
                's3_bucket': S3_BUCKET_NAME,
                's3_prefix': S3_KEY_PREFIX,
                'timestamp': datetime.now().isoformat()
            })
        }
        
    except Exception as e:
        print(f"Error: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': str(e),
                'timestamp': datetime.now().isoformat()
            })
        }

AWS setup

1. Create the S3 bucket

aws s3 mb s3://my-dbt-docs-bucket

2. Create the Lambda function

Runtime: Python 3.9 or later
Timeout: 30 seconds
Memory: 256 MB

3. Configure the IAM role

Add the following policy to the Lambda execution role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::my-dbt-docs-bucket/*"
        }
    ]
}

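4. Verify the result

In the S3 console, confirm that the following files have been saved:

dbt-docs/catalog.json - metadata for the dbt models (~150 KB)
dbt-docs/manifest.json - structure of the dbt project (~1.4 MB)
dbt-docs/index.html - the dbt docs HTML (~1.7 MB)

If the objects are publicly readable, they can also be accessed through the direct S3 URL (with S3's default Block Public Access settings, this request returns 403 Access Denied):

https://my-dbt-docs-bucket.s3.amazonaws.com/dbt-docs/index.html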

Important: In a real production environment, manage sensitive values such as the API token and Account ID with environment variables or AWS Secrets Manager.
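
A minimal sketch of the environment-variable approach might look like the following. The variable names simply mirror the constants in the Lambda code and are assumptions, as is the load_settings helper; nothing here is prescribed by dbt Cloud or AWS.

```python
import os

# Assumed environment variable names, mirroring the constants used in the Lambda code.
REQUIRED_VARS = (
    "DBT_CLOUD_API_TOKEN",
    "DBT_CLOUD_ACCOUNT_ID",
    "DBT_CLOUD_ACCOUNT_PREFIX",
    "DBT_CLOUD_JOB_ID",
    "S3_BUCKET_NAME",
)


def load_settings(environ=None):
    """Read the settings from environment variables and fail fast if any are missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings = {name: environ[name] for name in REQUIRED_VARS}
    # The key prefix is optional and falls back to the value used in this article.
    settings["S3_KEY_PREFIX"] = environ.get("S3_KEY_PREFIX", "dbt-docs/")
    return settings
```

Calling load_settings() at the top of lambda_handler would replace the hard-coded constants. For the API token specifically, the lookup could instead read from AWS Secrets Manager (boto3's get_secret_value), which keeps the token out of the Lambda configuration.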

Automation

Scheduled execution with EventBridge

{
  "ScheduleExpression": "cron(0 2 * * ? *)",
  "Targets": [
    {
      "Id": "1",
      "Arn": "arn:aws:lambda:region:account:function:dbt-docs-sync"
    }
  ]
}

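This example runs the function every day at 02:00. Cron expressions in EventBridge rules are evaluated in UTC, so this corresponds to 11:00 JST.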
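
The schedule above could be created with boto3 roughly as follows. This is a sketch under assumptions: the rule name, statement ID, and helper names are hypothetical, and the function ARN must be replaced with the real one.

```python
def build_schedule_requests(function_arn, rule_name="dbt-docs-sync-daily",
                            schedule="cron(0 2 * * ? *)"):
    """Build the keyword arguments for events.put_rule and events.put_targets."""
    put_rule = {"Name": rule_name, "ScheduleExpression": schedule, "State": "ENABLED"}
    put_targets = {"Rule": rule_name, "Targets": [{"Id": "1", "Arn": function_arn}]}
    return put_rule, put_targets


def create_schedule(function_arn):
    """Create the rule, allow EventBridge to invoke the function, and attach the target."""
    import boto3  # imported lazily so the builder above can be used without AWS access

    events = boto3.client("events")
    lambda_client = boto3.client("lambda")
    put_rule, put_targets = build_schedule_requests(function_arn)

    rule_arn = events.put_rule(**put_rule)["RuleArn"]
    # Without this permission the rule fires but the invocation is rejected.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId="allow-eventbridge-dbt-docs-sync",  # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    events.put_targets(**put_targets)
    return rule_arn
```

Scheduling the sync some time after the dbt Cloud job normally finishes helps ensure that the latest successful run already contains the freshly generated docs.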

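Troubleshooting

Common problems

404 Not Found
- Check that the Job ID is correct
- Check that dbt docs generate has been run
401 Unauthorized
- Check that the API token is valid
- Check that the token has the appropriate permissions
Empty artifacts
- Check that the latest run succeeded
- Check that "Generate docs on run" is enabled in the job settings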
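
To surface these hints directly in the Lambda logs, the HTTP status can be mapped to the checklist above when a request fails. The following is an illustrative sketch; explain_http_error and fetch are hypothetical helpers, not part of the original code.

```python
from urllib import error, request

# Hints taken from the checklist above; the mapping itself is illustrative.
HINTS = {
    401: "Check that the API token is valid and has the appropriate permissions.",
    404: "Check that the Job ID is correct and that dbt docs generate has been run.",
}


def explain_http_error(status_code):
    """Return a troubleshooting hint for an HTTP status code."""
    return HINTS.get(
        status_code,
        f"Unexpected HTTP status {status_code}; inspect the response body.",
    )


def fetch(url, headers, timeout=30):
    """Fetch a URL and re-raise HTTP errors with a hint attached."""
    req = request.Request(url, headers=headers)
    try:
        with request.urlopen(req, timeout=timeout) as response:
            return response.read()
    except error.HTTPError as exc:
        hint = explain_http_error(exc.code)
        raise RuntimeError(f"HTTP {exc.code} for {url}: {hint}") from exc
```

Passing an explicit timeout also keeps a stalled request from consuming the whole Lambda timeout.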

Debugging code

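# List the available runs
def debug_runs():
    # base_url and headers are local to get_project_artifacts(), so rebuild them here
    base_url = f'https://{DBT_CLOUD_ACCOUNT_PREFIX}.us1.dbt.com/api/v2/'
    headers = {'Authorization': f'Token {DBT_CLOUD_API_TOKEN}'}
    runs_url = f'{base_url}accounts/{DBT_CLOUD_ACCOUNT_ID}/runs/'
    req = request.Request(runs_url, headers=headers)
    with request.urlopen(req) as response:
        data = json.loads(response.read().decode())
        for run in data['data'][:5]:
            job_id = run.get('job_definition_id', run.get('job_id'))
            print(f"Run ID: {run['id']}, Job ID: {job_id}, Status: {run['status']}")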

Note: All IDs in the sample code above are dummy values. Replace them with your actual values before use.

Operational tips

Security

Manage the API token in AWS Secrets Manager
Restrict access to the S3 bucket
Control access with CloudFront (as needed)

Monitoring

Monitor the Lambda function's execution logs
Set up CloudWatch alarms
Send notifications on S3 upload success or failure
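
One way to make the logs easier to monitor is to print one JSON line per upload, since anything a Lambda function writes to stdout ends up in CloudWatch Logs. The sketch below is an assumption about how that could look; the field names and the upload_log_line helper are hypothetical.

```python
import json
from datetime import datetime, timezone


def upload_log_line(artifact_name, succeeded, size_bytes=None, error_message=None):
    """Return one JSON log line describing the outcome of a single upload."""
    record = {
        "event": "dbt_docs_upload",  # hypothetical event name to filter on
        "artifact": artifact_name,
        "succeeded": succeeded,
        "size_bytes": size_bytes,
        "error": error_message,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)


# Example: log a successful upload
print(upload_log_line("catalog.json", True, size_bytes=150_000))
```

A metric filter or alarm can then match on the succeeded field instead of parsing free-form messages.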

Performance

Optimize the CloudFront cache settings
Exclude artifacts you do not need
Consider incremental (diff-based) updates
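
As one possible form of incremental update, an upload can be skipped when the content has not changed. The sketch below compares the MD5 of the downloaded artifact with the ETag of the existing S3 object. It assumes the object was written with a single put_object call and is not encrypted with SSE-KMS or SSE-C, because only then is the ETag the MD5 of the content. The helper names are hypothetical.

```python
import hashlib


def content_md5_hex(content):
    """Return the hex MD5 digest of the content (str is encoded as UTF-8)."""
    if isinstance(content, str):
        content = content.encode("utf-8")
    return hashlib.md5(content).hexdigest()


def needs_upload(content, existing_etag):
    """Return True when the object is missing or its ETag differs from the content's MD5."""
    if existing_etag is None:
        return True
    # S3 returns the ETag wrapped in double quotes.
    return existing_etag.strip('"') != content_md5_hex(content)


def current_etag(s3_client, bucket, key):
    """Return the ETag of an existing object, or None if the object does not exist."""
    from botocore.exceptions import ClientError  # ships with boto3

    try:
        return s3_client.head_object(Bucket=bucket, Key=key)["ETag"]
    except ClientError as exc:
        if exc.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
            return None
        raise
```

Note that head_object requires s3:GetObject on the bucket objects, so the IAM policy would need that action in addition to s3:PutObject.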

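Summary

With this approach, the docs generated by dbt Cloud can be saved to S3 automatically. It is particularly useful in the following situations:

Backup - periodic backups of dbt docs
External integration - using dbt metadata from other systems
Archiving - preserving past states of dbt docs
Future use - hosting or data analysis built on top of S3

The retrieved files are the genuine dbt docs artifacts, so they can also be downloaded from S3 and viewed locally. index.html loads manifest.json and catalog.json from the same directory, so keep the three files together; if the browser refuses to load them from a file:// path, serve the directory with a local HTTP server such as python -m http.server. The bucket can also serve as the foundation for hosting with CloudFront, Nginx, or similar in the future.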
