What I want to do
I connect to Treasure Data in several ways (the td command, the embulk command, Digdag's td> operator, and Digdag's embulk> operator).
I want to avoid writing the same Treasure Data connection information (endpoint and API key) in multiple places as much as possible.
Result
I tried putting the settings in ~/.td/td.conf; as shown below, some tools honor the file and some do not.
- Honored
  - td command
  - Digdag's td> (local mode)
- Not honored
  - Digdag's td> (server mode)
  - Embulk's embulk-input-td plugin
  - Embulk's embulk-output-td plugin
That it has no effect in Digdag server mode makes sense.
That it has no effect in the Embulk td plugins is a shame.
I would be glad to see this improved.
What I did
Creating ~/.td/td.conf
The endpoint must be prefixed with https:// (omitting it causes an error).
$ id
uid=500(vagrant) gid=500(vagrant) groups=500(vagrant) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
$ pwd
/home/vagrant
$ td --version
0.15.0
$ ls .td
ls: cannot access .td: No such file or directory
$ td -e https://<endpoint> account
Enter your Treasure Data credentials. For Google SSO user, please see https://docs.treasuredata.com/articles/command-line#google-sso-users
Email: <username>
Password (typing will be hidden): <password>
Authenticated successfully.
Use 'td -e https://<endpoint> db:create <db_name>' to create a database.
$ ls .td
td.conf
$ cat .td/td.conf
[account]
user = <username>
apikey = <API key>
endpoint = https://<endpoint>
In fact, when running Vagrant I create this td.conf directly during provisioning.
(It could also be created by running td account, but I was more reluctant to write the password into the Vagrantfile than the API key.)
config.vm.provision "shell", privileged: false, inline: <<-EOT
  mkdir ~/.td
  echo '[account]' > ~/.td/td.conf
  echo 'user = <username>' >> ~/.td/td.conf
  echo 'apikey = <API key>' >> ~/.td/td.conf
  echo 'endpoint = https://<endpoint>' >> ~/.td/td.conf
EOT
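As an aside, the API key itself can be kept out of the Vagrantfile by passing it in from the host via the shell provisioner's env option. A minimal sketch, assuming TD_APIKEY is exported on the host before vagrant up (the variable name and this approach are mine, not part of the setup above):
config.vm.provision "shell", privileged: false,
  env: { "TD_APIKEY" => ENV["TD_APIKEY"] },   # forward the host's TD_APIKEY into the guest
  inline: <<-EOT
    mkdir -p ~/.td
    # Write td.conf; ${TD_APIKEY} is expanded by the guest shell, not by Ruby
    cat > ~/.td/td.conf <<EOF
[account]
user = <username>
apikey = ${TD_APIKEY}
endpoint = https://<endpoint>
EOF
  EOT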
td command
Works, as shown below: with td.conf in place, the query runs without passing -e or -k, because the endpoint and API key are picked up from the file.
$ cat xxx.sql
SELECT COUNT(*) AS count FROM xxx
$ td query -d xxx -w -q xxx.sql
Job 9999999 is queued.
Use 'td job:show 9999999' to show the status.
queued...
... (output omitted) ...
Status : success
Result :
+-------+
| count |
+-------+
| 42 |
+-------+
1 row in set
Digdag's td> (local mode)
Works, as shown below.
$ cat xxx.dig
+task1:
  td>: xxx.sql
  database: xxx
  store_last_results: true
+task2:
  echo>: ${td.last_results.count}
$ digdag run xxx.dig
2016-10-26 10:56:52 +0900: Digdag v0.8.17
2016-10-26 10:56:55 +0900 [WARN] (main): Using a new session time 2016-10-26T00:00:00+00:00.
2016-10-26 10:56:55 +0900 [INFO] (main): Using session /tmp/test/.digdag/status/20161026T000000+0000.
2016-10-26 10:56:55 +0900 [INFO] (main): Starting a new session project id=1 workflow name=xxx session_time=2016-10-26T00:00:00+00:00
2016-10-26 10:56:58 +0900 [INFO] (0016@+xxx+task1): td>: xxx.sql
2016-10-26 10:56:59 +0900 [INFO] (0016@+xxx+task1): td-client version: 0.7.26
2016-10-26 10:56:59 +0900 [INFO] (0016@+xxx+task1): Logging initialized @6699ms
2016-10-26 10:57:00 +0900 [INFO] (0016@+xxx+task1): td>: xxx.sql
2016-10-26 10:57:01 +0900 [INFO] (0016@+xxx+task1): Started presto job id=9999999:
SELECT COUNT(*) AS count FROM xxx
2016-10-26 10:57:04 +0900 [INFO] (0016@+xxx+task1): td>: xxx.sql
2016-10-26 10:57:06 +0900 [INFO] (0016@+xxx+task2): echo>: 42
42
Success. Task state is saved at /tmp/test/.digdag/status/20161026T000000+0000 directory.
* Use --session <daily | hourly | "yyyy-MM-dd[ HH:mm:ss]"> to not reuse the last session time.
* Use --rerun, --start +NAME, or --goal +NAME argument to rerun skipped tasks.
Digdag's td> (server mode)
Does not work, as shown below.
$ cat ~/.config/digdag/config
client.http.endpoint = http://<Digdag server IP address>:<port>/
$ digdag push proj1
2016-10-27 12:27:30 +0900: Digdag v0.8.17
Creating .digdag/tmp/archive-7184579153809927090.tar.gz...
Archiving xxx.dig
Archiving xxx.sql
Workflows:
  xxx
Uploaded:
  id: 10
  name: proj1
  revision: a5b35e9e-8ae5-4942-af2a-7b0a4ed12c3d
  archive type: db
  project created at: 2016-10-27T03:27:33Z
  revision updated at: 2016-10-27T03:27:33Z
Use `digdag workflows` to show all workflows.
$ digdag start proj1 xxx --session now
2016-10-27 12:28:16 +0900: Digdag v0.8.17
Started a session attempt:
  session id: 112
  attempt id: 111
  uuid: 4ed676b1-01bf-4dee-ba5d-d9b5e032588d
  project: proj1
  workflow: xxx
  session time: 2016-10-27 03:28:19 +0000
  retry attempt name:
  params: {}
  created at: 2016-10-27 12:28:19 +0900
* Use `digdag session 112` to show session status.
* Use `digdag task 111` and `digdag log 111` to show task status and logs.
$ digdag log 111
2016-10-27 12:29:32 +0900: Digdag v0.8.17
2016-10-27 12:28:22.488 +0900 [INFO] (0074@+xxx+task1) io.digdag.core.agent.OperatorManager: td>: xxx.sql
2016-10-27 12:28:23.146 +0900 [INFO] (0074@+xxx+task1) com.treasuredata.client.TDClient: td-client version: 0.7.26
2016-10-27 12:28:23.161 +0900 [ERROR] (0074@+xxx+task1) io.digdag.core.agent.OperatorManager: Configuration error at task +xxx+task1: The 'td.apikey' secret is missing (config)
2016-10-27 12:28:23.971 +0900 [INFO] (0074@+xxx^failure-alert) io.digdag.core.agent.OperatorManager: type: notify
$ ssh <Digdag server IP address> ps -ef|grep digdag
vagrant@<Digdag server IP address>'s password:
vagrant 1524 1 2 12:18 ? 00:00:16 java -XX:+AggressiveOpts -XX:+TieredCompilation -XX:TieredStopAtLevel=1 -Xverify:none -jar /usr/local/bin/digdag server -c /home/vagrant/.config/digdag/config -O /home/vagrant/digdag-server/task-log
$ ssh <Digdag server IP address> cat ~/.td/td.conf
vagrant@<Digdag server IP address>'s password:
[account]
user = <username>
apikey = <API key>
endpoint = https://<endpoint>
$ digdag version
2016-10-27 12:34:41 +0900: Digdag v0.8.17
Client version: 0.8.17
Server version: 0.8.17
Even if ~/.td/td.conf exists on the client machine, it is not consulted when the workflow runs on the server.
Even if the server process user's ~/.td/td.conf exists on the Digdag server machine, that is not consulted at execution time either.
Presumably this is because a server receives pushes of multiple projects from multiple users, and those projects should not all share one endpoint and API key.
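For completeness: the error message points at Digdag's secrets mechanism, which is the intended way to hand an API key to td> in server mode. A minimal sketch, assuming a Digdag version whose client ships the digdag secrets command (I have not confirmed it on the 0.8.17 used here); it prompts for the value, so the key ends up neither in the project archive nor in shell history:
# Register the TD API key as a secret scoped to the proj1 project on the server
# (the client reaches the server via client.http.endpoint in ~/.config/digdag/config)
$ digdag secrets --project proj1 --set td.apikey
# Then start the session again
$ digdag start proj1 xxx --session now
Note this only covers the API key; whether a non-default endpoint also has to be supplied separately on the server side is something I did not verify.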
Embulk's embulk-input-td plugin
First, an example that works.
Here the endpoint must NOT be prefixed with https:// (adding it causes an error).
$ cat input.yml
in:
  type: td
  apikey: <API key>
  endpoint: <endpoint>
  database: xxx
  query: SELECT * FROM xxx
out:
  type: file
  path_prefix: xxx
  file_ext: csv
  formatter:
    type: csv
    header_line: true
$ embulk run input.yml
2016-10-27 12:43:01.916 +0900: Embulk v0.8.14
2016-10-27 12:43:06.817 +0900 [INFO] (0001:transaction): Loaded plugin embulk-input-td (0.1.0)
2016-10-27 12:43:06.925 +0900 [INFO] (0001:transaction): td-client version: 0.7.24
2016-10-27 12:43:06.938 +0900 [INFO] (0001:transaction): Reading configuration file: /home/vagrant/.td/td.conf
2016-10-27 12:43:07.006 +0900 [INFO] (0001:transaction): Logging initialized @13514ms
2016-10-27 12:43:07.648 +0900 [INFO] (0001:transaction): Submit a query for database 'xxx': SELECT * FROM xxx
2016-10-27 12:43:08.650 +0900 [INFO] (0001:transaction): Job 8065368 is queued.
2016-10-27 12:43:08.650 +0900 [INFO] (0001:transaction): Confirm that job 8065368 finished
2016-10-27 12:43:14.317 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=2 / tasks=1
2016-10-27 12:43:14.460 +0900 [INFO] (0001:transaction): {done: 0 / 1, running: 0}
2016-10-27 12:43:14.668 +0900 [INFO] (0023:task-0000): Writing local file 'xxx000.00.csv'
2016-10-27 12:43:15.141 +0900 [INFO] (0001:transaction): {done: 1 / 1, running: 0}
2016-10-27 12:43:15.168 +0900 [INFO] (main): Committed.
2016-10-27 12:43:15.170 +0900 [INFO] (main): Next config diff: {"in":{},"out":{}}
The log contains
Reading configuration file: /home/vagrant/.td/td.conf
so it looks as though the API key and endpoint would be taken from td.conf.
However, commenting out endpoint and running again results in an error.
Presumably the request went to the default api.treasuredata.com, as described at https://github.com/muga/embulk-input-td#configuration.
$ embulk run input.yml
2016-10-27 12:46:48.242 +0900: Embulk v0.8.14
2016-10-27 12:46:52.945 +0900 [INFO] (0001:transaction): Loaded plugin embulk-input-td (0.1.0)
2016-10-27 12:46:53.061 +0900 [INFO] (0001:transaction): td-client version: 0.7.24
2016-10-27 12:46:53.068 +0900 [INFO] (0001:transaction): Reading configuration file: /home/vagrant/.td/td.conf
2016-10-27 12:46:53.158 +0900 [INFO] (0001:transaction): Logging initialized @13498ms
2016-10-27 12:46:53.743 +0900 [INFO] (0001:transaction): Submit a query for database 'xxx': SELECT * FROM xxx
2016-10-27 12:46:55.023 +0900 [WARN] (0001:transaction): API request failed
java.util.concurrent.ExecutionException: org.eclipse.jetty.client.HttpResponseException: HTTP protocol violation: Authentication challenge without WWW-Authenticate header
at org.eclipse.jetty.client.util.FutureResponseListener.getResult(FutureResponseListener.java:118) ~[jetty-client-9.2.2.v20140723.jar:9.2.2.v20140723]
at org.eclipse.jetty.client.util.FutureResponseListener.get(FutureResponseListener.java:101) ~[jetty-client-9.2.2.v20140723.jar:9.2.2.v20140723]
... (stack trace omitted) ...
Restoring endpoint and commenting out apikey instead also results in an error.
According to https://github.com/muga/embulk-input-td#configuration, apikey is required.
$ embulk run input.yml
2016-10-27 12:49:09.632 +0900: Embulk v0.8.14
2016-10-27 12:49:14.507 +0900 [INFO] (0001:transaction): Loaded plugin embulk-input-td (0.1.0)
org.embulk.exec.PartialExecutionException: org.embulk.config.ConfigException: com.fasterxml.jackson.databind.JsonMappingException: Field 'apikey' is required but not set
at [Source: N/A; line: -1, column: -1]
at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(org/embulk/exec/BulkLoader.java:363)
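A workaround for the Embulk side, rather than a fix: Embulk can render configs as Liquid templates, so the key and endpoint can at least live in environment variables instead of being copied into every YAML file. A rough sketch, assuming an Embulk version whose Liquid templates expose env (the TD_APIKEY / TD_ENDPOINT names are mine):
$ cat input.yml.liquid
in:
  type: td
  apikey: {{ env.TD_APIKEY }}
  endpoint: {{ env.TD_ENDPOINT }}
  database: xxx
  query: SELECT * FROM xxx
out:
  type: file
  path_prefix: xxx
  file_ext: csv
  formatter:
    type: csv
    header_line: true
$ TD_APIKEY=<API key> TD_ENDPOINT=<endpoint> embulk run input.yml.liquid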
Embulk's embulk-output-td plugin
The result is the same as with the embulk-input-td plugin.
$ cat output_for_guess.yml
in:
  type: file
  path_prefix: xxx
out:
  type: td
  apikey: <API key>
  endpoint: <endpoint>
  database: xxx
  table: xxx2
  mode: truncate
$ embulk guess output_for_guess.yml -o output.yml
2016-10-27 13:01:49.899 +0900: Embulk v0.8.14
2016-10-27 13:01:51.835 +0900 [INFO] (0001:guess): Listing local files at directory '.' filtering filename by prefix 'xxx'
2016-10-27 13:01:51.839 +0900 [INFO] (0001:guess): Loading files [xxx000.00.csv]
2016-10-27 13:01:52.038 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/gzip from a load path
2016-10-27 13:01:52.062 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/bzip2 from a load path
2016-10-27 13:01:52.102 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/json from a load path
2016-10-27 13:01:52.119 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/csv from a load path
in:
  type: file
  path_prefix: xxx
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: col1, type: string}
    - {name: col2, type: string}
    - {name: time, type: long}
out: {type: td, apikey: <API key>, endpoint: <endpoint>,
  database: xxx, table: xxx2, mode: truncate}
Created 'output.yml' file.
$ embulk run output.yml
2016-10-27 13:02:22.364 +0900: Embulk v0.8.14
2016-10-27 13:02:27.395 +0900 [INFO] (0001:transaction): Loaded plugin embulk-output-td (0.3.8)
2016-10-27 13:02:27.511 +0900 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'xxx'
2016-10-27 13:02:27.522 +0900 [INFO] (0001:transaction): Loading files [xxx000.00.csv]
2016-10-27 13:02:27.690 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=2 / tasks=1
2016-10-27 13:02:27.803 +0900 [INFO] (0001:transaction): td-client version: 0.7.24
2016-10-27 13:02:27.815 +0900 [INFO] (0001:transaction): Reading configuration file: /home/vagrant/.td/td.conf
2016-10-27 13:02:27.887 +0900 [INFO] (0001:transaction): Logging initialized @13895ms
2016-10-27 13:02:29.793 +0900 [INFO] (0001:transaction): Using time:long column as the data partitioning key
2016-10-27 13:02:29.796 +0900 [INFO] (0001:transaction): Create bulk_import session embulk_20161027_040227_014000000
2016-10-27 13:02:30.176 +0900 [INFO] (0001:transaction): {done: 0 / 1, running: 0}
2016-10-27 13:02:30.540 +0900 [INFO] (0022:task-0000): {uploading: {rows: 20, size: 1,166 bytes (compressed)}}
2016-10-27 13:02:30.974 +0900 [INFO] (0001:transaction): {done: 1 / 1, running: 0}
2016-10-27 13:02:31.761 +0900 [INFO] (0001:transaction): Performing bulk import session 'embulk_20161027_040227_014000000'
2016-10-27 13:03:12.793 +0900 [INFO] (0001:transaction): job id: 8065734
2016-10-27 13:03:13.262 +0900 [INFO] (0001:transaction): Committing bulk import session 'embulk_20161027_040227_014000000'
2016-10-27 13:03:13.263 +0900 [INFO] (0001:transaction): valid records: 20
2016-10-27 13:03:13.263 +0900 [INFO] (0001:transaction): error records: 0
2016-10-27 13:03:13.263 +0900 [INFO] (0001:transaction): valid parts: 1
2016-10-27 13:03:13.263 +0900 [INFO] (0001:transaction): error parts: 0
2016-10-27 13:03:13.263 +0900 [INFO] (0001:transaction): new columns:
2016-10-27 13:03:13.265 +0900 [INFO] (0001:transaction): - col1: string
2016-10-27 13:03:13.266 +0900 [INFO] (0001:transaction): - col2: string
2016-10-27 13:03:20.469 +0900 [INFO] (0001:transaction): Deleting bulk import session 'embulk_20161027_040227_014000000'
2016-10-27 13:03:20.876 +0900 [INFO] (main): Committed.
2016-10-27 13:03:20.877 +0900 [INFO] (main): Next config diff: {"in":{"last_path":"xxx000.00.csv"},"out":{"last_session":"embulk_20161027_040227_014000000"}}
$ vi output.yml # delete the endpoint entry
$ embulk run output.yml
2016-10-27 13:07:21.973 +0900: Embulk v0.8.14
2016-10-27 13:07:26.872 +0900 [INFO] (0001:transaction): Loaded plugin embulk-output-td (0.3.8)
2016-10-27 13:07:26.993 +0900 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'xxx'
2016-10-27 13:07:27.004 +0900 [INFO] (0001:transaction): Loading files [xxx000.00.csv]
2016-10-27 13:07:27.159 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=2 / tasks=1
2016-10-27 13:07:27.271 +0900 [INFO] (0001:transaction): td-client version: 0.7.24
2016-10-27 13:07:27.277 +0900 [INFO] (0001:transaction): Reading configuration file: /home/vagrant/.td/td.conf
2016-10-27 13:07:27.345 +0900 [INFO] (0001:transaction): Logging initialized @13797ms
2016-10-27 13:07:29.389 +0900 [WARN] (0001:transaction): API request failed
java.util.concurrent.ExecutionException: org.eclipse.jetty.client.HttpResponseException: HTTP protocol violation: Authentication challenge without WWW-Authenticate header
at org.eclipse.jetty.client.util.FutureResponseListener.getResult(FutureResponseListener.java:118) ~[jetty-client-9.2.2.v20140723.jar:9.2.2.v20140723]
... (stack trace omitted) ...
$ vi output.yml # delete the apikey entry
$ embulk run output.yml
2016-10-27 13:08:38.456 +0900: Embulk v0.8.14
2016-10-27 13:08:43.483 +0900 [INFO] (0001:transaction): Loaded plugin embulk-output-td (0.3.8)
2016-10-27 13:08:43.607 +0900 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'xxx'
2016-10-27 13:08:43.617 +0900 [INFO] (0001:transaction): Loading files [xxx000.00.csv]
2016-10-27 13:08:43.788 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=2 / tasks=1
org.embulk.exec.PartialExecutionException: org.embulk.config.ConfigException: com.fasterxml.jackson.databind.JsonMappingException: Field 'apikey' is required but not set
at [Source: N/A; line: -1, column: -1]
at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(org/embulk/exec/BulkLoader.java:363)
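Until the plugins read td.conf themselves, a stopgap that keeps ~/.td/td.conf as the single source of truth is to parse it in the shell and export the values before running Embulk, combined with the .yml.liquid approach above. A rough sketch, assuming td.conf has exactly the layout shown earlier (the https:// prefix is stripped because the Embulk td plugins expect a bare hostname):
# Pull apikey/endpoint out of ~/.td/td.conf and export them for the Liquid templates
export TD_APIKEY=$(awk -F' *= *' '$1 == "apikey" {print $2}' ~/.td/td.conf)
export TD_ENDPOINT=$(awk -F' *= *' '$1 == "endpoint" {print $2}' ~/.td/td.conf | sed 's|^https://||')
embulk run input.yml.liquid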