
Embulk Basics

More than 3 years have passed since last update.

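Overview

  • Runs batch data transfers from a variety of inputs to a variety of outputs
  • Essentially a batch version of fluentd
  • A rich set of plugins covers many input sources and output destinations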

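Problems it aims to solve
- Transforming and cleansing data during transfer is hard
- Error handling, re-running, and resuming are hard
- When data volume grows, jobs no longer finish in time, and adding parallel execution is hard
- Supporting each additional input source or output destination is hard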

Installation

For production (Linux) and macOS, installing as follows seems preferable.

sudo wget http://dl.embulk.org/embulk-latest.jar -O /usr/local/bin/embulk
sudo chmod +x /usr/local/bin/embulk

The tutorial gives the following instead, but installing into ~/.embulk feels wrong to me, especially in production.

curl --create-dirs -o ~/.embulk/bin/embulk -L "http://dl.embulk.org/embulk-latest.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc

Replace ~/.zshrc with the rc file for the shell you use (~/.bashrc, for example).
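As a rough sketch of that substitution, the helper below maps a shell path such as the value of $SHELL to an rc file. The function name rc_file_for_shell and the ~/.profile fallback are assumptions made for illustration, not part of Embulk or its tutorial.

```shell
# Hypothetical helper: map a shell path (e.g. the value of $SHELL) to its rc file.
rc_file_for_shell() {
  # ${1##*/} strips everything up to the last slash, leaving the shell name.
  case "${1##*/}" in
    zsh)  echo "$HOME/.zshrc" ;;
    bash) echo "$HOME/.bashrc" ;;
    *)    echo "$HOME/.profile" ;;   # assumed fallback for other shells
  esac
}

# Usage (not executed here): append the PATH line to the matching rc file.
# echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> "$(rc_file_for_shell "$SHELL")"
```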

Trying it out

Generate a sample. This creates several files.

% embulk example ./try1
2016-08-16 00:00:36.506 +0900: Embulk v0.8.13
Creating ./try1 directory...
  Creating ./try1/
  Creating ./try1/csv/
  Creating ./try1/csv/sample_01.csv.gz
  Creating ./try1/seed.yml

Run following subcommands to try embulk:

   1. embulk guess ./try1/seed.yml -o config.yml
   2. embulk preview config.yml
   3. embulk run config.yml

% cd try1

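seed.yml is the seed definition used to generate config.yml, which defines the INPUT and OUTPUT.
In effect it says: use these files as the reference for the input, and write the output to standard output.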

seed.yml
in:
  type: file
  path_prefix: "/Users/taga/embulk-sample/try1/csv/sample_"
out:
  type: stdout

Run guess against seed.yml to infer the INPUT and OUTPUT definitions and write them to config.yml.

% embulk guess ./seed.yml -o config.yml
2016-08-16 00:06:08.892 +0900: Embulk v0.8.13
2016-08-16 00:06:10.360 +0900 [INFO] (0001:guess): Listing local files at directory '/Users/taga/embulk-sample/try1/csv' filtering filename by prefix 'sample_'
2016-08-16 00:06:10.365 +0900 [INFO] (0001:guess): Loading files [/Users/taga/embulk-sample/try1/csv/sample_01.csv.gz]
2016-08-16 00:06:10.424 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/gzip from a load path
2016-08-16 00:06:10.438 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/bzip2 from a load path
2016-08-16 00:06:10.459 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/json from a load path
2016-08-16 00:06:10.466 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/csv from a load path
in:
  type: file
  path_prefix: /Users/taga/embulk-sample/try1/csv/sample_
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout}
Created 'config.yml' file.

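Looking at the source CSV file, you can see that the columns and other settings were guessed from data like the following. It resembles the way Treasure Data defines a schema for you automatically.

% gunzip -c csv/sample_01.csv.gz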
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL

Preview

% embulk preview config.yml
2016-08-16 00:18:50.917 +0900: Embulk v0.8.13
2016-08-16 00:18:52.494 +0900 [INFO] (0001:preview): Listing local files at directory '/Users/taga/embulk-sample/try1/csv' filtering filename by prefix 'sample_'
2016-08-16 00:18:52.498 +0900 [INFO] (0001:preview): Loading files [/Users/taga/embulk-sample/try1/csv/sample_01.csv.gz]
+---------+--------------+-------------------------+-------------------------+----------------------------+
| id:long | account:long |          time:timestamp |      purchase:timestamp |             comment:string |
+---------+--------------+-------------------------+-------------------------+----------------------------+
|       1 |       32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC |                     embulk |
|       2 |       14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC |               embulk jruby |
|       3 |       27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
|       4 |       11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC |                            |
+---------+--------------+-------------------------+-------------------------+----------------------------+

Actual run

% embulk run config.yml
2016-08-16 00:19:51.474 +0900: Embulk v0.8.13
2016-08-16 00:19:53.602 +0900 [INFO] (0001:transaction): Listing local files at directory '/Users/taga/embulk-sample/try1/csv' filtering filename by prefix 'sample_'
2016-08-16 00:19:53.608 +0900 [INFO] (0001:transaction): Loading files [/Users/taga/embulk-sample/try1/csv/sample_01.csv.gz]
2016-08-16 00:19:53.696 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=8 / output tasks 4 = input tasks 1 * 4
2016-08-16 00:19:53.706 +0900 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,
2016-08-16 00:19:53.919 +0900 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
2016-08-16 00:19:53.927 +0900 [INFO] (main): Committed.
2016-08-16 00:19:53.928 +0900 [INFO] (main): Next config diff: {"in":{"last_path":"/Users/taga/embulk-sample/try1/csv/sample_01.csv.gz"},"out":{}}

Plugins

Search for plugins here:
http://www.embulk.org/plugins/

A plugin can be installed as follows, but for production it is better to manage plugins with a Gemfile, as described later.

% embulk gem install embulk-output-command
2016-08-16 00:29:26.465 +0900: Embulk v0.8.13
Fetching: embulk-output-command-0.1.4.gem (100%)
Successfully installed embulk-output-command-0.1.4
1 gem installed

Generate a Gemfile and related files for plugins in a directory named embulk_bundle.

% embulk mkbundle embulk_bundle
2016-08-16 00:36:37.273 +0900: Embulk v0.8.13
Initializing embulk_bundle...
  Creating embulk_bundle/Gemfile
  Creating embulk_bundle/.ruby-version
  Creating embulk_bundle/.bundle/config
  Creating embulk_bundle/embulk/input/example.rb
  Creating embulk_bundle/embulk/output/example.rb
  Creating embulk_bundle/embulk/filter/example.rb
Fetching gem metadata from https://rubygems.org/...............
Fetching version metadata from https://rubygems.org/..
Resolving dependencies...
Using bundler 1.10.6
Installing liquid 3.0.6
Installing msgpack 0.7.6
Installing rjack-icu 4.54.1.1
Installing embulk 0.8.13
Bundle complete! 1 Gemfile dependency, 5 gems now installed.
Bundled gems are installed into ..

For example, to install the embulk-output-command plugin from earlier,
add gem 'embulk-output-command' to embulk_bundle/Gemfile and run embulk bundle in the embulk_bundle directory.

% vim embulk_bundle/Gemfile
% cd embulk_bundle
% embulk bundle
2016-08-16 00:42:14.610 +0900: Embulk v0.8.13
Fetching gem metadata from https://rubygems.org/...............
Fetching version metadata from https://rubygems.org/..
Resolving dependencies...
Using bundler 1.10.6
Using liquid 3.0.6
Using msgpack 0.7.6
Using rjack-icu 4.54.1.1
Using embulk 0.8.13
Installing embulk-output-command 0.1.4
Bundle complete! 2 Gemfile dependencies, 6 gems now installed.
Bundled gems are installed into ..
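The manual Gemfile edit above could also be scripted, which may be handy when provisioning a production host. The sketch below appends the gem line only if it is not already present. The function name add_plugin_to_gemfile is hypothetical, and the check assumes the single-quoted gem 'name' form used in this article.

```shell
# Hypothetical helper: add a plugin gem to a Gemfile unless it is already listed.
# Only the single-quoted form (gem 'name') at the start of a line is detected.
add_plugin_to_gemfile() {
  gemfile="$1"
  plugin="$2"
  if ! grep -q "^gem '$plugin'" "$gemfile" 2>/dev/null; then
    echo "gem '$plugin'" >> "$gemfile"
  fi
}

# Usage (not executed here):
# add_plugin_to_gemfile embulk_bundle/Gemfile embulk-output-command
# (cd embulk_bundle && embulk bundle)
```

Making the step idempotent means it can be re-run safely, since a second call leaves the Gemfile unchanged.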

It is a little tedious, but the trial commands from earlier now have to be run as follows.

% embulk guess -b ./embulk_bundle ./seed.yml -o config.yml
% embulk preview -b ./embulk_bundle config.yml
% embulk run -b ./embulk_bundle config.yml
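To reduce the repetition of -b, a small wrapper function is one option. This is only a sketch: the function name embulk_b and the EMBULK_BUNDLE_DIR variable are made up for illustration, and the wrapper simply inserts -b right after the subcommand, in the same position as in the commands above.

```shell
# Hypothetical wrapper: run an embulk subcommand with the bundle directory added.
# EMBULK_BUNDLE_DIR is an assumed variable name, defaulting to ./embulk_bundle.
embulk_b() {
  subcommand="$1"
  shift
  embulk "$subcommand" -b "${EMBULK_BUNDLE_DIR:-./embulk_bundle}" "$@"
}

# Usage (not executed here):
# embulk_b guess ./seed.yml -o config.yml
# embulk_b preview config.yml
# embulk_b run config.yml
```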

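Resume

With the -r option pointing to a resume file (YAML), a run that fails partway can be resumed by running the same command again. I wonder how far this goes (for example, if a table copy fails partway through, does it write something like "succeeded up to row X" to the resume file?).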

embulk run config.yml -r resume-state.yml
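Because resuming means re-running the same command line, a retry loop is a natural fit. The sketch below is a generic helper, not an Embulk feature: the name run_with_retry is hypothetical, and it makes no assumption about how much work Embulk actually skips on each retry.

```shell
# Hypothetical helper: run a command up to <max_attempts> times, stopping at the
# first success. Each retry re-runs exactly the same command line.
run_with_retry() {
  max_attempts="$1"
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt failed" >&2
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage (not executed here):
# run_with_retry 3 embulk run config.yml -r resume-state.yml
```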