More than 5 years have passed since last update.

Digdagの_export時のエラーは_errorのタスクが実行されない

digdag

Last updated at 2019-08-18Posted at 2019-08-18

Digdagワークフローのエラーを検知するためにトップレベルの _error を定義しslackに通知しています。
しかし _export 時にエラーになると _error のタスクが実行されないということがあったのでそれを紹介します。
こちらの記事のサンプルで利用しているDigdagのバージョンはすべて 0.9.39 です。

_errorについて

_error を定義することで、Digdagのタスクが失敗した時に _error に定義されているタスクが実行されます。
以下が例で、task1の中で　sh>: ech "hoge" とわざとエラーを起こしています。

qiita1.dig

+task1:
  sh>: ech "hoge"

_error:
  sh>: echo error!!!!!!!!!

以下が実行例です。一部ログを省略してます。

$ digdag run qiita1.dig
(...略...)
2019-08-18 18:38:09 +0900 [INFO] (main): Starting a new session project id=1 workflow name=qiita1 session_time=2019-08-18T00:00:00+00:00
2019-08-18 18:38:10 +0900 [INFO] (0018@[0:default]+qiita1+task1): sh>: ech "hoge"
/bin/sh: line 1: ech: command not found
2019-08-18 18:38:10 +0900 [ERROR] (0018@[0:default]+qiita1+task1): Task failed with unexpected error: Command failed with code 127
java.lang.RuntimeException: Command failed with code 127
        at io.digdag.standards.operator.ShOperatorFactory$ShOperator.runTask(ShOperatorFactory.java:143)
        at io.digdag.util.BaseOperator.run(BaseOperator.java:35)
        at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:315)
        at io.digdag.cli.Run$OperatorManagerWithSkip.callExecutor(Run.java:705)
        at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:257)
        at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$2(OperatorManager.java:137)
        at io.digdag.core.agent.LocalWorkspaceManager.withExtractedArchive(LocalWorkspaceManager.java:25)
        at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:135)
        at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:119)
        at io.digdag.cli.Run$OperatorManagerWithSkip.run(Run.java:687)
        at io.digdag.core.agent.MultiThreadAgent.lambda$null$0(MultiThreadAgent.java:127)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2019-08-18 18:38:10 +0900 [INFO] (0018@[0:default]+qiita1^failure-alert): type: notify
2019-08-18 18:38:10 +0900 [INFO] (0020@[0:default]+qiita1^error): sh>: echo error!!!!!!!!!
error!!!!!!!!!
error:
  * +qiita1+task1:
    Command failed with code 127 (runtime)

Task state is saved at /Users/katsuya.tajima/src/github.com/katsuyan/digdag-example/notificaiton/.digdag/status/20190818T000000+0000 directory.
  * Use --session <daily | hourly | "yyyy-MM-dd[ HH:mm:ss]"> to not reuse the last session time.
  * Use --rerun, --start +NAME, or --goal +NAME argument to rerun skipped tasks.

以上のように _error で定義されている sh>: echo error!!!!!!!!! が実行され、error!!!!!!!!! が出力されているのがわかります。

_exportでのエラーは _errorのタスクが実行されない

今度は _export の中でエラーを起こしてやります。変数としてまだ定義されていない huga を _export の中で呼び出します。

qiita2.dig

+task1:
  _export:
    hoge: ${huga}
  sh>: echo "hoge"

_error:
  sh>: echo error!!!!!!!!!

では実行してみましょう。

$ digdag run qiita2.dig
(...略...)
2019-08-18 18:38:43 +0900 [INFO] (main): Starting a new session project id=1 workflow name=qiita2 session_time=2019-08-18T00:00:00+00:00
2019-08-18 18:38:44 +0900 [ERROR] (0018@[0:default]+qiita2+task1): Configuration error at task +qiita2+task1: Failed to evaluate a variable ${huga} (ReferenceError: "huga" is not defined) (config)
2019-08-18 18:38:44 +0900 [ERROR] (0019@[0:default]+qiita2^error): Configuration error at task +qiita2^error: Failed to evaluate a variable Failed to evaluate a variable ${huga} (ReferenceError: "huga" is not defined) (config) (ReferenceError: "huga" is not defined) (config)
2019-08-18 18:38:44 +0900 [INFO] (0018@[0:default]+qiita2^failure-alert): type: notify
error:
  * +qiita2+task1:
    Failed to evaluate a variable ${huga} (ReferenceError: "huga" is not defined) (config)
  * +qiita2^error:
    Failed to evaluate a variable Failed to evaluate a variable ${huga} (ReferenceError: "huga" is not defined) (config) (ReferenceError: "huga" is not defined) (config)

Task state is saved at /Users/katsuya.tajima/src/github.com/katsuyan/digdag-example/notificaiton/.digdag/status/20190818T000000+0000 directory.
  * Use --session <daily | hourly | "yyyy-MM-dd[ HH:mm:ss]"> to not reuse the last session time.
  * Use --rerun, --start +NAME, or --goal +NAME argument to rerun skipped tasks.

以上のように sh>: echo error!!!!!!!!! が実行されていないのがわかります。
代わりに ReferenceError: "huga" is not defined のエラーが2回発生しているようです。
qiita2^error 実行時にも _export の中身を処理しようとして sh>: echo error!!!!!!!!! の前にエラーとなってしまうようです。

どうやて全体のエラーキャッチをする？

トップレベルの _error だけではワークフロー全体のエラーハンドリングをすることがで来ないことがわかりました。
こちら (https://github.com/treasure-data/digdag/issues/669) のissueを見ていただくと分かる通り、

_error is not reliable enough for notification by its design

とあります。

そこでこちらのissueで紹介されている notification.shell.command を利用します。
以下のようなconfigファイルを用意しqiita2.digを実行してみます。

qiita.conf

notification.type = shell
notification.shell.command = echo "error from config!!!" > error.log

$ digdag run -c qiita.conf qiita2.dig
(...略...)
2019-08-18 19:00:11 +0900 [INFO] (main): Using session /Users/katsuya.tajima/src/github.com/katsuyan/digdag-example/notificaiton/.digdag/status/20190818T000000+0000.
2019-08-18 19:00:12 +0900 [INFO] (main): Starting a new session project id=1 workflow name=qiita2 session_time=2019-08-18T00:00:00+00:00
2019-08-18 19:00:12 +0900 [ERROR] (0018@[0:default]+qiita2+task1): Configuration error at task +qiita2+task1: Failed to evaluate a variable ${huga} (ReferenceError: "huga" is not defined) (config)
2019-08-18 19:00:13 +0900 [ERROR] (0019@[0:default]+qiita2^error): Configuration error at task +qiita2^error: Failed to evaluate a variable Failed to evaluate a variable ${huga} (ReferenceError: "huga" is not defined) (config) (ReferenceError: "huga" is not defined) (config)
2019-08-18 19:00:13 +0900 [INFO] (0018@[0:default]+qiita2^failure-alert): type: notify
error:
  * +qiita2+task1:
    Failed to evaluate a variable ${huga} (ReferenceError: "huga" is not defined) (config)
  * +qiita2^error:
    Failed to evaluate a variable Failed to evaluate a variable ${huga} (ReferenceError: "huga" is not defined) (config) (ReferenceError: "huga" is not defined) (config)

Task state is saved at /Users/katsuya.tajima/src/github.com/katsuyan/digdag-example/notificaiton/.digdag/status/20190818T000000+0000 directory.
  * Use --session <daily | hourly | "yyyy-MM-dd[ HH:mm:ss]"> to not reuse the last session time.
  * Use --rerun, --start +NAME, or --goal +NAME argument to rerun skipped tasks.

$ cat error.log
error from config!!!

ちゃんと echo "error from config!!!" > error.log が実行され error from config!!! が error.log に出力されました。
これを使うことで、workflow内で発生したエラーを確実に通知をできるようになりました。

おまけ

_export エラー時に _error のタスクが実行されないことに関してはissueとしてお伺いを立ててみました。

また、お気づきになったかもしれませんがエラーハンドリングで notification.shell.command をわざわざ echo "error from config!!!" > error.log とリダイレクトしています。
これは、現在 notification.shell.command で呼ばれる stdout/stderr が /dev/null にリダイレクトされるようになっていて echo の出力がされないことを考慮した結果です。

こちらについてもissue化して、ログを出せるように提案しています。

issue: https://github.com/treasure-data/digdag/issues/1204
PR: https://github.com/treasure-data/digdag/pull/1205

また、紹介した #669 のissueにもありましたが、エラーが実際に発生したtask_nameが取れないという問題もあるようです。

まとめ

_export で発生するエラーは _error のタスクが実行されないことを紹介しました
notification.shell.command を利用することでワークフローのエラーキャッチができることを紹介しました

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up