dbt-osmosisとdbt persist_docs でメタデータ管理がらくちんになりそう。でも起点のメタデータはどうするべき？

Last updated at 2025-05-20Posted at 2025-05-20

「SnowflakeのCopilotちゃんを賢く使うには、Table/Viewのカラムにきちんとコメントを入れて、検索されやすいようにしておくことが重要」と聞きまして、
カラムのdescriptionをいい感じに伝播してくれるdbt-osmosisとかを試してみました。

dbt-osmosis を使ってみる

dbtCoreをインストールして適当なモデルを作成しておく。
(dbt-osmosisはdbtCloudで使うのがちょっと大変なので、今回はdbtCore前提。)
dbt-osmosisをインストールする。

dbt_project.ymlにdbt-osmosisの情報を追加する

dbt_project.yml

models:
  your_project_name:
    +dbt-osmosis: "your_project_name.yml" ## モデルの詳細記述用ドキュメントファイル

"your_project_name.yml"という空ファイルを追加する。
dbt run を実行してカラムを確定させた後に、dbt-osmosis を実行する
```
$ dbt run
$ dbt-osmosis yaml refactor
```

"your_project_name.yml" の内容を確認する。

your_project_name.yml

  - name: my_third_dbt_model
    columns:
      - name: ID
        description: The primary key for this table
      - name: TEST
        description: second add にほんご

日本語も出てきたので大丈夫そうです。

dbt persist_docs を使ってみる

"your_project_name.yml"がdbt docs に使えるのはもちろんですが、カラムのコメントにも適用してくれればいいのにな、と思ってたらありました。

dbt_project.yml にpersist_docsを使うぜ、と追記する。

dbt_project.yml

    models:
      your_project_name:
        +dbt-osmosis: "your_project_name.yml"
        +materialized: view
        +persist_docs:
          relation: true
          columns: true

dbt-osmosis を実行後に、再度 dbt run を実行する。
```
$ dbt run
$ dbt-osmosis yaml refactor
$ dbt run
```
作成された Table/Viewを確認する。

descriptionに入ってきました。いい感じです。

起点(sources)のメタデータを取得する方法を知りたい！

dbtといえばsourceのTable/Viewを設定して、これをもとにパイプラインを構築していくわけですが、
「sourceのymlファイルにカラムのdescriptionが入ってれば、全部dbt-osmosisが伝播させていくので楽なのでは？」
と誰しも考えると思うんですよ。でも、sourceのTable/Viewにコメントが入っていても、dbt-osimosisでは拾ってくれなさそう。
(dbt-osmosis yaml documentで hoge_souces.ymlにカラムを書き出してはくれるが、descriptionは''。)
コメントが入っていれば、information_schemaから力技で持ってこれるのではないかと、下記リンクを参照にやってみました。

packages.ymlにcodegenとdbt_expectationsを追加し、パッケージを反映させる。

packages.yml

packages:
  - package: dbt-labs/dbt_utils
    version:  1.3.0
  - package: dbt-labs/codegen
    version: 0.13.1
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]

$ dbt deps

macroファイル作成

macros/generate_source.sql

{% macro generate_source(database_name, schema_name, source_name) %}

{% set sql %}
    with "columns" as (
        select '- name: ' || lower(column_name) || '\n            description: "'|| lower(comment) || '"'
            as column_statement,
            table_name,
            column_name
        from {{ database_name }}.information_schema.columns
        where table_schema = '{{ schema_name | upper }}' and table_name not in ('FIVETRAN_AUDIT', 'SCHEMA_MIGRATIONS')
            and lower(column_name) not in ('_fivetran_deleted', '_fivetran_synced')
    ),
    tables as (
        select table_name,
        '\n      - name: ' || lower(table_name) || '\n        columns:' || listagg('\n          ' || column_statement || '\n') within group ( order by column_name ) as table_desc
        from "columns"
        group by table_name
    )

    select listagg(table_desc) within group ( order by table_name )
    from tables;
{% endset %}

{%- call statement('generator', fetch_result=True) -%}
{{ sql }}
{%- endcall -%}

{%- set states=load_result('generator') -%}
{%- set states_data=states['data'] -%}
{%- set states_status=states['response'] -%}

{% set sources_yaml=[] %}
{% do sources_yaml.append('version: 2') %}
{% do sources_yaml.append('') %}
{% do sources_yaml.append('sources:') %}
{% do sources_yaml.append('  - name: ' ~ source_name | lower) %}
{% do sources_yaml.append('    description: "macro generate source"' ) %}
{% do sources_yaml.append('    database: ' ~ database_name | lower) %}
{% do sources_yaml.append('    schema: ' ~ schema_name | lower) %}

{% do sources_yaml.append('    tables:' ~ states_data[0][0] ) %}

{% if execute %}

{% set joined = sources_yaml | join ('\n') %}
{{ log(joined, info=True) }}
{% do return(joined) %}

{% endif %}

{% endmacro %}

macros/properties.yml

version: 2

macros:
  - name: generate_source
    description: A macro to generate source tables and columns from database_name.information_schema.columns
    arguments:
      - name: database_name
        type: string
        description: The database name
      - name: schema_name
        type: string
        description: The schema name
      - name: source_name
        type: string
        description: The name of the source. For example, `raw_accounting`

作成したマクロを実行する

$ dbt run-operation generate_source --args '{"database_name": "your_snowflake_database_name", "schema_name": "your_snowflake_schema_name", "source_name": "source_name_you_want_to_name_with_your_source"}'

実行結果を source.ymlに貼る。

$ (略)
version: 2

sources:
  - name: source_name_you_want_to_name_with_your_source
    description: "macro generate source"
    database: your_snowflake_database_name
    schema: your_snowflake_schema_name
    tables:
      - name: raw_example_1
        columns:
          - name: col1
            description: "col1 column comment"

      - name: raw_example_2
        columns:
          - name: col2_1
            description: "col2_1 column comment"

      - name: raw_example_3
        columns:
          - name: col1
            description: "example_3 col1 column comment"

でも、絶対これよりいい方法があると思う。何か一撃で行けそうなやつがいそう。
何かアイディアをお持ちの方、教えてください。よろしくお願いします。

おまけの疑問 dbt-osmosisのjoinの挙動

dbt-osmosisには、--add-progenitor-to-metaというオプションを指定すると、そのカラムの元モデル（おまえ、どっから来たの？）を追記してくれるのですが、
同一のカラム名をjoinした場合、SELECT文にモデル名を書いても、yamlファイルにはdbt-osmosisの基準でしかメタデータを取ってきてくれない（作成するモデルからの距離？が近い方を取る）ので、ここは自力で書き換えてあげるしかないのでしょうか？
具体的には、

example_join.sql

select r_1.col1
from {{ ref('stg_raw_example_1') }} r_1 
left join {{ ref('stg_raw_example_3') }} r_3 on r_3.col1 = r_1.col1

でr_1の情報を取ってきてほしいのですが、yamlはこうなってしまうんですね。

your_project_name.yml

  - name: example_join
    columns:
      - name: COL1
        meta:
          osmosis_progenitor: source.your_project_name.raw.raw_example_3
        description: example_3 col1 column comment

これは手作業で直していくしかないんでしょうか？？？

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up