More than 1 year has passed since last update.

dbtでTreasureData

Posted at 2023-12-10

前書き

前職場でpresto/trino開発環境にdbtを導入し、大幅にモデリングの効率をあげることができた。今年の７月、新しい職場でtrinoではなく、TreasureDataが使われているので同じことができるか考えた。
TreasureDataについてゼロから勉強するので、まず本当の基礎のところからdbtとの相性を確認し、今後本格的に展開できるかどうか考えたいと思う。
初期の検討した結果を共有し、少しでも役に立てれば嬉しいと思う。

そして、結論からいうとと、初期の段階では使えそう！

最初から使えそうと感じたので、TreasureDataがPrestoのクエリエンジンがあるので、trino adapterをそのまま使えそうかなと思っていた。一方、TreasureDataが対応しているのは、trinoよりもprestoであることがわかったが、今後のためprestoアダプターよりもtrinoアダプターにフォーカスしたかった

試した環境

dbtインストールなどを省略し、プロジェクト作成からスタートする

$ dbt --version
Core:
  - installed: 1.7.3
  - latest:    1.7.3 - Up to date!

Plugins:
  - trino: 1.7.0 - Up to date!

試した内容

まず、dbtプロジェクトを作成する

$ dbt init -s test01
01:44:27  Running with dbt=1.7.3
01:44:27
Your new dbt project "test01" was created!

For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:

  https://docs.getdbt.com/docs/configure-your-profile

One more thing:

Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:

  https://community.getdbt.com/

Happy modeling!

以下のプロファイルを使った。host,user, schema以外は、環境に合わせる必要ある

profiles.yml

...
  outputs:
    cdp:
      type: trino
      host: api-presto.treasuredata.co.jp
      port: 443
      method: none
      user: <TreasureData API key>
      password: dummy
      database: td-presto
      schema: lab__test01
      threads: 1
      http_scheme: https
...

上のprofileを用い、接続をためしたところ、問題なく繋がることが確認できた

$ dbt debug --target=cdp  --profile=test01 --profiles-dir=/home/user/.dbt
09:32:16  Running with dbt=1.7.3
09:32:16  dbt version: 1.7.3
09:32:16  python version: 3.8.13
09:32:16  python path: /usr/bin/python3
09:32:16  os info: Linux-5.4.0-162-generic-x86_64-with-glibc2.2.5
09:32:16  Using profiles dir at /home/user/.dbt
09:32:16  Using profiles.yml file at /home/user/.dbt/profiles.yml
09:32:16  Using dbt_project.yml file at /home/user/work/cdp/test01/dbt_project.yml
09:32:16  adapter type: trino
09:32:16  adapter version: 1.7.0
09:32:16  Configuration:
09:32:16    profiles.yml file [OK found and valid]
09:32:16    dbt_project.yml file [OK found and valid]
09:32:16  Required dependencies:
09:32:16   - git [OK found]

09:32:16  Connection:
09:32:16    host: api-presto.treasuredata.co.jp
09:32:16    port: 443
09:32:16    user: <TreasureData API key>
09:32:16    database: td-presto
09:32:16    schema: lab__test01
09:32:16    cert: None
09:32:16    prepared_statements_enabled: True
09:32:16  Registered adapter: trino=1.7.0
09:32:17    Connection test: [OK connection ok]

早速、sampleのモデルを作成してみる。すると、以下のエラーが出た。あまくないですね＾＾

  TrinoUserError(type=USER_ERROR, name=TABLE_NOT_FOUND, message="line 13:27: Table 'system.metadata.materialized_views' does not exist", query_id=20231210_093627_10217_ym6wm)

調べたとろこ、trino-dbt は system.metadataにVIEW情報をアクセスしようとしている。自分の権限が足りない（もしくは、TreasureDataではそもそも system.metadaが使っていない？）かもしれないが、TreasureData自体はViewをサポートしていないので、該当のマクロをオーバライドしてみた。

macrosフォルダーに以下のadapters.sqlを追加し、sampleモデルをviewからtableへ変更した

adapters.sql

macros/adapters.sql

{% macro trino__list_relations_without_caching(relation) %}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    select
      t.table_catalog as database,
      t.table_name as name,
      t.table_schema as schema,
      'table' as table_type
    from {{ relation.information_schema() }}.tables t
  {% endcall %}
  {{ return(load_result('list_relations_without_caching').table) }}
{% endmacro %}

もう一回、dbt run を実行すると無事にモデル作成ができた

$ dbt  run --target=cdp  --profile=test01 --profiles-dir=/home/user/.dbt
09:42:22  Running with dbt=1.7.3
09:42:23  Registered adapter: trino=1.7.0
09:42:25  Found 2 models, 4 tests, 0 sources, 0 exposures, 0 metrics, 421 macros, 0 groups, 0 semantic models
09:42:25
09:42:28  Concurrency: 1 threads (target='cdp')
09:42:28
09:42:28  1 of 2 START sql table model lab__test01.my_first_dbt_model .................... [RUN]
09:42:30  1 of 2 OK created sql table model lab__test01.my_first_dbt_model ............... [SUCCESS in 2.33s]
09:42:30  2 of 2 START sql table model lab__test01.my_second_dbt_model ................... [RUN]
09:42:32  2 of 2 OK created sql table model lab__test01.my_second_dbt_model .............. [SUCCESS in 2.31s]
09:42:32
09:42:32  Finished running 2 table models in 0 hours 0 minutes and 7.88 seconds (7.88s).
09:42:32
09:42:32  Completed successfully
09:42:32
09:42:32  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

ここまで良さそう。仮に、catalogを作成するため、dbt docs generate を実行すると失敗した。やはり、同じ system.metadataにアクセスしようとした。
もう一つmacroをオーバライドしてみた

catalogs.sql

macros/catalogs.ql

{% macro trino__get_catalog_table_comment_schemas_sql(information_schema, schemas) -%}
    {%- for schema in schemas %}
    select
        table_catalog as "table_database",
        table_schema as "table_schema",
        table_name as "table_name",
        'n/a' as "table_comment"
    from information_schema.tables
    where
        table_catalog = '{{ information_schema.database }}'
        and
        table_schema != 'information_schema'
        and
        table_schema = '{{ schema | lower }}'
    {%- if not loop.last %}
    union all
    {% endif -%}
    {%- endfor -%}
{%- endmacro %}


{% macro trino__get_catalog_table_comment_relations_sql(information_schema, relations) -%}
    {%- for relation in relations %}
    select
        table_catalog as "table_database",
        table_schema as "table_schema",
        table_name as "table_name",
        'n/a' as "table_comment"
    from information_schema.tables
    where
        table_catalog = '{{ information_schema.database }}'
        and
        table_schema != 'information_schema'
        and
        {% if relation.schema and relation.identifier %}
                (
                    table_schema = '{{ relation.schema | lower }}'
                    and table_name = '{{ relation.identifier | lower }}'
                )
            {% elif relation.schema %}
                (
                    table_schema = '{{ relation.schema | lower }}'
                )
            {% else %}
                {% do exceptions.raise_compiler_error(
                    '`get_catalog_relations` requires a list of relations, each with a schema'
                ) %}
        {% endif %}
    {%- if not loop.last %}
    union all
    {% endif -%}
    {%- endfor -%}
{%- endmacro %}

これで、catalogも無事に作成できた

$ dbt docs generate --target=cdp  --profile=test01 --profiles-dir=/home/user/.dbt
09:48:13  Running with dbt=1.7.3
09:48:13  Registered adapter: trino=1.7.0
09:48:15  Found 2 models, 4 tests, 0 sources, 0 exposures, 0 metrics, 423 macros, 0 groups, 0 semantic models
09:48:15
09:48:17  Concurrency: 1 threads (target='cdp')
09:48:17
09:48:17  Building catalog
09:48:18  Catalog written to /home/user/work/cdp/test01/target/catalog.json

ここまで、基本動作が確認できた。

まとめ

adapters.sql, catalog.sql　で既存のmacroをオーバライドすることによって、TreasureDataで接続、モデル作成、カタログ作成が可能ということが確認できた。

今後、仕事に本当に使えそうかどうかもっと調べる必要があるが、第一歩として良さそう。

参考

https://satoshihirose.hateblo.jp/entry/2022/02/26/214456

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up