
BrainPad Advent Calendar 2021

@yasudakn (Kunihiro Yasuda) in BrainPad Inc.

A fast ML model-building pipeline with dbt and BQML

Last updated at 2021-12-23, posted at 2021-12-23

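Motivation

dbt looks like an efficient way to maintain a DWH and data marts, test data quality, and track lineage.
Rather than data mart processing, I wanted to look at it as one approach to MLOps and check how it works as an ML pipeline.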

About BQML

BQML, also called BigQuery ML, is a BigQuery feature that lets you build models easily with nothing but queries.

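About dbt

dbt comes in two forms: a self-hosted CLI version (dbt Core) and a hosted cloud version (dbt Cloud).

It is billed as a workflow for analytics engineering,
and I am curious whether it will become the de facto tool for analytics engineers.

The concept of dbt is "dbt is the T in ELT."

The background and other details are covered at this link.

For a tutorial, this article was helpful.

It is an OSS-based tool with a concept similar to GCP's Dataform.
Someone has written up their impressions after trying both, which I read with interest.

From here on, the content is more of a memo of the steps I actually worked through.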

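Installing dbt

No installation is needed for the Cloud version, but I tried the CLI version.

Following the instructions here, I installed it with Homebrew on macOS. Installing only dbt-bigquery is enough.

Check the installed version and plugins. If the output looks like this, you are set.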

dbt --version

% dbt --version
installed version: 1.0.0
   latest version: 1.0.0

Up to date!

Plugins:
  - bigquery: 1.0.0

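Common pitfalls

If an old version is installed or the plugin is missing, either switch to the newer one with Homebrew's link command, or uninstall and reinstall.

Configuring the dbt profile

There are four methods; the first is the easiest and the one I recommend.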

1. oauth via gcloud
2. oauth token-based
3. service account file
4. service account json

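Getting an overview of the dbt_ml sample code

Reference article

About dbt_ml

Rather than creating a new table for each model's evaluation results, dbt_ml writes to a table called audit on every training run, so the output is designed for viewing experiment results side by side.

The output table has the following columns. There does not appear to be a column for ml.evaluate results.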

macros/hooks/model_audit.sql

{% macro _audit_table_columns() %}

{% do return ({
    'model': 'string',
    'schema': 'string',
    'created_at': dbt_utils.type_timestamp(),
    'training_info': 'array<struct<training_run int64, iteration int64, loss float64, eval_loss float64, learning_rate float64, duration_ms int64, cluster_info array<struct<centroid_id int64, cluster_radius float64, cluster_size int64>>>>',
    'feature_info': 'array<struct<input string, min float64, max float64, mean float64, median float64, stddev float64, category_count int64, null_count int64>>',
    'weights': 'array<struct<processed_input string, weight float64, category_weights array<struct<category string, weight float64>>>>',
}) %}

{% endmacro %}
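
Because the audit columns above carry no ML.EVALUATE output, evaluation metrics have to be queried separately. A minimal sketch, assuming the trained model lives at `my-project.dbt_ml_example.model` (the project ID is a placeholder; the dataset and model names are taken from the run log later in this article):

```sql
-- Evaluation metrics for a trained BQML model.
-- `my-project` is a placeholder project ID; replace it with your own.
-- With no input table, ML.EVALUATE returns the metrics computed during training.
SELECT
  *
FROM
  ML.EVALUATE(MODEL `my-project.dbt_ml_example.model`);
```

To evaluate against other data, ML.EVALUATE also accepts a table or a subquery as its second argument.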

Setting the macro in post_hook as follows appends training_info, feature_info, and so on to the table above.

features.sql

{{
    config(
        materialized='model',
        
        ml_config={
            'model_type': 'logistic_reg',
            'early_stop': true,
            'ls_init_learn_rate': 0.1,
        },
        
        post_hook="{{ dbt_ml.model_audit() }}" 
    )
}}

select * from {{ ref('features') }}
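
For orientation, the model materialization ends up issuing a BigQuery ML CREATE MODEL statement built from ml_config. The following is a hand-written sketch of roughly what that statement looks like, not the package's actual compiled output. The project ID is a placeholder, and it assumes the features view exposes a column named label, which BQML uses as the label column when input_label_cols is not specified.

```sql
-- Approximate BigQuery ML statement corresponding to the config above.
-- `my-project` is a placeholder project ID.
CREATE OR REPLACE MODEL `my-project.dbt_ml_example.model`
OPTIONS (
  model_type = 'LOGISTIC_REG',  -- from ml_config.model_type
  early_stop = TRUE,            -- from ml_config.early_stop
  ls_init_learn_rate = 0.1      -- from ml_config.ls_init_learn_rate
) AS
SELECT
  *
FROM
  `my-project.dbt_ml_example.features`;
```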

audit table

![スクリーンショット 2021-12-23 22.15.36.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/636318/aebc855b-f764-1cb0-748e-8b9d2baca0d4.png)
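
To line up training runs from the audit table, the nested training_info array can be flattened with UNNEST. A sketch, assuming the audit table is `my-project.audit.ml_models`: the project ID is a placeholder, and the dataset and table names are my assumption about dbt_ml's defaults, so adjust them to your configuration. The column names come from _audit_table_columns() above.

```sql
-- One row per model, training run, and iteration, newest training first.
-- Table path is an assumption; column names follow the audit schema above.
SELECT
  a.`model`,
  a.created_at,
  t.training_run,
  t.iteration,
  t.loss,
  t.eval_loss
FROM
  `my-project.audit.ml_models` AS a,
  UNNEST(a.training_info) AS t
ORDER BY
  a.created_at DESC,
  t.iteration;
```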

Running dbt

Configuring dbt_project

In dbt_project.yml, specify the profile you configured.

profile: my-bigquery-db

Execution

deps loads the required packages, dbt_utils and dbt_ml, and seed loads the data sources into BigQuery.
Finally, run executes the models.
Below is the output of each command I ran.

dbt deps

% dbt deps
14:44:40  No profile "dbt_bqml_example" found, continuing with no target
14:44:40  [WARNING]: Deprecated functionality
The `source-paths` config has been renamed to `model-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
14:44:40  [WARNING]: Deprecated functionality
The `data-paths` config has been renamed to `seed-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
14:44:40  Running with dbt=1.0.0
14:44:42  Installing version 0.5.1
14:44:42    Installed from version 0.5.1
14:44:42    Up to date!
14:44:42  Installing version 0.8.0
14:44:43    Installed from version 0.8.0
14:44:43    Up to date!

dbt seed

% dbt seed
14:55:06  [WARNING]: Deprecated functionality
The `source-paths` config has been renamed to `model-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
14:55:06  [WARNING]: Deprecated functionality
The `data-paths` config has been renamed to `seed-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
14:55:06  Running with dbt=1.0.0
14:55:06  Found 5 models, 0 tests, 0 snapshots, 0 analyses, 391 macros, 2 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
14:55:06  
14:55:08  
14:55:08  Running 2 on-run-start hooks
14:55:09  1 of 2 START hook: dbt_ml_example.on-run-start.0................................ [RUN]
14:55:09  1 of 2 OK hook: dbt_ml_example.on-run-start.0................................... [OK in 0.00s]
14:55:09  2 of 2 START hook: dbt_ml_example.on-run-start.1................................ [RUN]
14:55:09  2 of 2 OK hook: dbt_ml_example.on-run-start.1................................... [OK in 0.00s]
14:55:09  
14:55:09  Concurrency: 8 threads (target='dev')
14:55:09  
14:55:09  1 of 3 START seed file dbt_ml_example.predict_me................................ [RUN]
14:55:09  2 of 3 START seed file dbt_ml_example.raw_titanic............................... [RUN]
14:55:09  3 of 3 START seed file dbt_ml_example.test...................................... [RUN]
14:55:12  3 of 3 OK loaded seed file dbt_ml_example.test.................................. [INSERT 418 in 2.52s]
14:55:12  1 of 3 OK loaded seed file dbt_ml_example.predict_me............................ [INSERT 7 in 2.71s]
14:55:13  2 of 3 OK loaded seed file dbt_ml_example.raw_titanic........................... [INSERT 891 in 3.88s]
14:55:13  
14:55:13  Finished running 3 seeds, 2 hooks in 6.78s.
14:55:13  
14:55:13  Completed successfully
14:55:13  
14:55:13  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

dbt run

% dbt run
02:58:22  [WARNING]: Deprecated functionality
The `source-paths` config has been renamed to `model-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
02:58:22  [WARNING]: Deprecated functionality
The `data-paths` config has been renamed to `seed-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
02:58:23  Running with dbt=1.0.0
02:58:23  Unable to do partial parsing because profile has changed
02:58:23  Unable to do partial parsing because a project config has changed
02:58:24  [WARNING]: Did not find matching node for patch with name 'ml_logreg' in the 'models' section of file 'models/schema.yml'
02:58:24  [WARNING]: Did not find matching node for patch with name 'hparam_tune' in the 'models' section of file 'models/schema.yml'
02:58:24  Found 3 models, 0 tests, 0 snapshots, 0 analyses, 391 macros, 2 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
02:58:24  
02:58:26  
02:58:26  Running 2 on-run-start hooks
02:58:26  1 of 2 START hook: dbt_ml_example.on-run-start.0................................ [RUN]
02:58:26  1 of 2 OK hook: dbt_ml_example.on-run-start.0................................... [OK in 0.00s]
02:58:26  2 of 2 START hook: dbt_ml_example.on-run-start.1................................ [RUN]
02:58:26  2 of 2 OK hook: dbt_ml_example.on-run-start.1................................... [OK in 0.00s]
02:58:26  
02:58:26  Concurrency: 8 threads (target='dev')
02:58:26  
02:58:26  1 of 3 START view model dbt_ml_example.features................................. [RUN]
02:58:27  1 of 3 OK created view model dbt_ml_example.features............................ [OK in 1.03s]
02:58:27  2 of 3 START model model dbt_ml_example.model................................... [RUN]
02:59:00  2 of 3 OK created model model dbt_ml_example.model.............................. [OK in 32.70s]
02:59:00  3 of 3 START table model dbt_ml_example.ml_predicted............................ [RUN]
02:59:02  3 of 3 OK created table model dbt_ml_example.ml_predicted....................... [CREATE TABLE (418.0 rows, 23.8 KB processed) in 2.57s]
02:59:02  
02:59:02  Finished running 1 view model, 1 model model, 1 table model, 2 hooks in 38.76s.
02:59:02  
02:59:02  Completed successfully
02:59:02  
02:59:02  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

The model evaluation results are shown here.

Evaluation results

audit table

![スクリーンショット 2021-12-23 22.13.51.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/636318/896d365f-fb63-9f61-a045-de52e7ada793.png) ![スクリーンショット 2021-12-23 22.14.11.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/636318/0b6c877f-36ce-5a5f-e18f-787b95cc93f4.png) ![スクリーンショット 2021-12-23 22.14.33.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/636318/8f8ffa27-8b46-e715-5f66-f00a96b8f590.png)

Common pitfalls

Version compatibility with dbt_ml: the dbt_ml version in the sample is old, so update it to one that works with the latest dbt.

packages.yml

packages:
  - package: kristeligt-dagblad/dbt_ml
    version: 0.5.1

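GCP OAuth login and the profiles settings
The default BigQuery location in the profile: get the location wrong and you hit the familiar "dataset not found" error
Forgetting to switch the profile in dbt_project.yml

Tweaking it a little and running it

There was an article on using BQML with Dataform, so I tried to see how the same thing would go with dbt.

The subject is the public dataset new_york_taxi_trips, predicting tip_amount (the tip amount) as a regression problem.

Feature SQL

Change what features.sql selects.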

models/features.sql

WITH final AS (
    SELECT
        tip_amount AS label,
        SUBSTR(CAST(pickup_datetime AS string), 12, 2) pickup_datetime,
        SUBSTR(CAST(dropoff_datetime AS string), 12, 2) dropoff_datetime,
        passenger_count,
        trip_distance,
        rate_code,
        fare_amount,
        pickup_location_id,
        dropoff_location_id
    FROM
        `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
    TABLESAMPLE SYSTEM (20 PERCENT)
)
SELECT * FROM final

Changing the ML model type

models/ml_linear_reg.sql

{{
    config(
        materialized='model', 

        ml_config={ 
            'model_type': 'linear_reg',
            'early_stop': true,
            'ls_init_learn_rate': 0.1
        },
        
        post_hook="{{ dbt_ml.model_audit() }}" 
    )
}}
SELECT * FROM {{ ref('features') }}

The detailed CREATE MODEL reference is at the URL below; it is a good guide when experimenting with the options.
https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create

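All you need to do is add the SQL required for BQML training under models.
The dependency on features appears to be resolved through the ref placeholder, which also controls the execution order.

With AUTOML_REGRESSOR, I got an error saying that ml.weights is not supported. It looks like this path has not been tested much. I tried it based on the Google blog post published when AutoML Tables became generally available in BigQuery ML.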

models/ml_automl_reg.sql

{{
    config(
        materialized='model', 

        ml_config={ 
            'model_type': 'AUTOML_REGRESSOR',
            'budget_hours': 1.0
        },
        
        post_hook="{{ dbt_ml.model_audit() }}" 
    )
}}
SELECT * FROM {{ ref('features') }}

For reference, here are the model evaluation results:

model info

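Hyperparameter tuning

The README says it is supported, but I got an error when writing to audit.

An error says to use ml.trial_info instead of ml.training_info. This also looks like a bug.
An error occurs when the number of trials is too large. This also looks like a bug.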

models/ml_dnn_class_hparam_tune.sql

{{
    config(
        materialized='model',
        ml_config={
            'model_type': 'dnn_classifier',
            'auto_class_weights': true,
            'learn_rate': dbt_ml.hparam_range(0.01, 0.1),
            'early_stop': true,
            'max_iterations': 50,
            'num_trials': 10,
            'max_parallel_trials': 2,
            'dropout': dbt_ml.hparam_candidates([0, 0.1, 0.25, 0.4]),
            'optimizer': dbt_ml.hparam_candidates(['adam', 'sgd'])
        }
    )
}}
SELECT * FROM {{ ref('features') }}

Both the audit write and the prediction SQL failed with errors. The results did, however, show up in the table.
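
Even when the audit hook fails for a tuned model, the per-trial results can still be read directly from BigQuery with ML.TRIAL_INFO, the function BigQuery provides for hyperparameter tuning trials. A sketch, assuming the model was created as `my-project.dbt_ml_example.ml_dnn_class_hparam_tune` (placeholder project ID; dataset name taken from the run log, model name from the file name):

```sql
-- One row per tuning trial, including its hyperparameters and status.
-- `my-project` is a placeholder project ID.
SELECT
  *
FROM
  ML.TRIAL_INFO(MODEL `my-project.dbt_ml_example.ml_dnn_class_hparam_tune`)
ORDER BY
  trial_id;
```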

hparam model info

Prediction SQL

Using a table of test data, the prediction model looks like this:

models/ml_predicted.sql

{{
    config(
        materialized='table'
    )
}}

WITH predict_features AS (
    SELECT
        tip_amount,
        SUBSTR(CAST(pickup_datetime AS string), 12, 2) pickup_datetime,
        SUBSTR(CAST(dropoff_datetime AS string), 12, 2) dropoff_datetime,
        passenger_count,
        trip_distance,
        rate_code,
        fare_amount,
        pickup_location_id,
        dropoff_location_id
    FROM
        `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018` LIMIT 100
)

SELECT * FROM {{ dbt_ml.predict(ref('ml_logreg'), 'predict_features') }}
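
The dbt_ml.predict macro wraps BigQuery's ML.PREDICT table function. Written by hand, a roughly equivalent query might look like the sketch below. The project ID is a placeholder, only a few feature columns are shown for brevity, and the exact SQL the macro generates may differ.

```sql
-- Hand-written approximation of the prediction query.
-- `my-project` is a placeholder project ID.
WITH predict_features AS (
  SELECT
    tip_amount,
    SUBSTR(CAST(pickup_datetime AS STRING), 12, 2) AS pickup_datetime,
    SUBSTR(CAST(dropoff_datetime AS STRING), 12, 2) AS dropoff_datetime,
    passenger_count,
    trip_distance,
    fare_amount
  FROM
    `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
  LIMIT 100
)
SELECT
  *
FROM
  ML.PREDICT(
    MODEL `my-project.dbt_ml_example.ml_logreg`,
    (SELECT * FROM predict_features)
  );
```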

The prediction results table is shown here; for each record, the per-class scores are output as a RECORD type.

bigquery ml_predicted

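Summary

Being able to go from model building to evaluation quickly with a single dbt run command makes this dbt pipeline very attractive.
For simple feature engineering that can be written in SQL, it is wonderfully easy: you just write it in features.sql.
With a recent BigQuery release, complex logic that runs in Python can be written as a Cloud Function and executed as an external (remote) function, so calling such functions from features.sql should also work.

For day-to-day operations, it seems convenient to take the training and evaluation data periods as parameters and use the Cloud version, which supports scheduled runs.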
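
One way to take the training period as a parameter is dbt's var() function. Below is a sketch of a features.sql variant; train_start and train_end are hypothetical variable names introduced for illustration, the default dates are arbitrary, and only a few feature columns are shown.

```sql
-- models/features.sql (variant): limit training data to a configurable period.
-- train_start / train_end are hypothetical dbt variables with arbitrary defaults.
SELECT
  tip_amount AS label,
  passenger_count,
  trip_distance,
  fare_amount
FROM
  `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
WHERE
  DATE(pickup_datetime) >= DATE '{{ var("train_start", "2018-01-01") }}'
  AND DATE(pickup_datetime) < DATE '{{ var("train_end", "2018-07-01") }}'
```

The period can then be overridden per run, for example with dbt run --vars '{"train_start": "2018-03-01", "train_end": "2018-04-01"}'.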

Things I may or may not get to

Next, I would like to try Dataform with BQML as well
I would also like to try this with more practical feature design and modeling
https://github.com/shimacos37/bqml-tutorial
