Running EmbulkEmbed from JRuby

Posted at 2018-02-11

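Goal

  • I want to use EmbulkEmbed to cut the processing time when Embulk is run many times in a row
  • However, I don't want to write Java

So let's try calling Embulk from JRuby.

This article continues an earlier write-up about a mismatch in how Embulk was being used; the details are covered there.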

Embulk version

$ embulk --version
embulk 0.9.1

Installation

Installing Embulk itself

Official instructions (https://github.com/embulk/embulk#linux--mac--bsd)

$ curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
$ chmod +x ~/.embulk/bin/embulk
$ echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
$ source ~/.bashrc
$ embulk --version
embulk 0.9.1

Trying Embulk with the sample data

$ embulk example ./work
$ embulk guess work/seed.yml -o config.yml
$ embulk preview config.yml

+---------+--------------+-------------------------+-------------------------+----------------------------+
| id:long | account:long |          time:timestamp |      purchase:timestamp |             comment:string |
+---------+--------------+-------------------------+-------------------------+----------------------------+
|       1 |       32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC |                     embulk |
|       2 |       14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC |               embulk jruby |
|       3 |       27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
|       4 |       11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC |                            |
+---------+--------------+-------------------------+-------------------------+----------------------------+

Writing CSV output with Embulk

config.yml.diff
--- config.yml.orig  2018-02-11 13:57:36.000000000 +0900
+++ config.yml 2018-02-11 14:11:55.000000000 +0900
@@ -21,4 +21,13 @@
     - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
     - {name: purchase, type: timestamp, format: '%Y%m%d'}
     - {name: comment, type: string}
-out: {type: stdout}
+
+out:
+  type: file
+  path_prefix: work/sample_
+  file_ext: csv
+  formatter:
+    type: csv
+
+exec:
+  min_output_tasks: 1
$ embulk run config.yml
$ ls work/*.csv
sample_000.00.csv
$ head -999 work/*.csv
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49.000000 +0000,2015-01-27 00:00:00.000000 +0000,embulk
2,14824,2015-01-27 19:01:23.000000 +0000,2015-01-27 00:00:00.000000 +0000,embulk jruby
3,27559,2015-01-28 02:20:02.000000 +0000,2015-01-28 00:00:00.000000 +0000,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36.000000 +0000,2015-01-29 00:00:00.000000 +0000,
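The same change to config.yml can also be made programmatically instead of by hand. The following is a minimal sketch in plain Ruby using only the standard `yaml` library. The helper name `with_csv_file_output` and the abbreviated input config are hypothetical and not part of Embulk; the real guessed config.yml contains more keys.

```ruby
require "yaml"

# Hypothetical helper (not part of Embulk): returns a copy of a parsed
# Embulk config whose `out` section writes CSV files, mirroring the diff above.
def with_csv_file_output(config, path_prefix)
  config.merge(
    "out" => {
      "type" => "file",
      "path_prefix" => path_prefix,
      "file_ext" => "csv",
      "formatter" => { "type" => "csv" }
    },
    "exec" => { "min_output_tasks" => 1 }
  )
end

# Abbreviated stand-in for the guessed config.yml (assumption: the real file
# also has parser and column settings under `in`).
original = YAML.safe_load(<<~CONFIG)
  in:
    type: file
    path_prefix: work/csv/sample_
  out: {type: stdout}
CONFIG

updated = with_csv_file_output(original, "work/sample_")
puts updated.to_yaml
```

Because `Hash#merge` returns a new Hash, the original parsed config keeps its `stdout` output and only the copy is changed.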

Running JRuby

I don't want to install JRuby separately, so let's invoke the JRuby that is bundled inside Embulk.

$ java -classpath ~/.embulk/bin/embulk org.jruby.Main -v -e 'puts "Hello, Embulk"'
jruby 9.1.15.0 (2.3.3) 2017-12-07 929fde8 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [darwin-x86_64]
Hello, Embulk

Running EmbulkEmbed from JRuby

embulk_runner.rb
#! ruby

java_import "org.embulk.EmbulkEmbed"

JavaFile = java.io.File

bootstrap = EmbulkEmbed::Bootstrap.new
embulk = bootstrap.initializeCloseable

config_yml = ARGV.shift || "config.yml"

begin
  loader = embulk.newConfigLoader
  config = loader.fromYamlFile JavaFile.new(config_yml)
  embulk.run config
ensure
  embulk.destroy
end
$ java -classpath ~/.embulk/bin/embulk org.jruby.Main embulk_runner.rb config.yml
(output omitted)

$ head -999 work/sample_000.00.csv
(output omitted)

Note: On Windows, Java's default character encoding is MS932. To change it, pass -Dfile.encoding=UTF-8 to java, or set the environment variable beforehand with set JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8.
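The java invocation used in this article, including the optional encoding flag, can be assembled by a small helper. This is only a sketch in plain Ruby: the helper name `embulk_jruby_command` and its keyword arguments are hypothetical, and the default jar path assumes Embulk was installed to ~/.embulk/bin/embulk as shown earlier.

```ruby
# Hypothetical helper (not part of Embulk or JRuby): builds the argv for
# running a Ruby script on the JRuby bundled inside the Embulk jar.
def embulk_jruby_command(script, *script_args,
                         jar: File.join(Dir.home, ".embulk", "bin", "embulk"),
                         encoding: nil)
  cmd = ["java"]
  # -Dfile.encoding overrides the platform default (for example MS932 on Windows).
  cmd << "-Dfile.encoding=#{encoding}" if encoding
  cmd += ["-classpath", jar, "org.jruby.Main", script]
  cmd + script_args.map(&:to_s)
end

cmd = embulk_jruby_command("embulk_runner.rb", "config.yml", encoding: "UTF-8")
puts cmd.join(" ")
# To actually launch it (requires Java and the Embulk jar): system(*cmd)
```

Returning an array rather than a single string lets `system(*cmd)` pass each argument as is, without shell quoting.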

Running Embulk repeatedly

embulk_runner2.rb
--- embulk_runner.rb    2018-02-11 14:38:19.000000000 +0900
+++ embulk_runner2.rb   2018-02-11 14:35:17.000000000 +0900
@@ -5,14 +5,29 @@
 JavaFile = java.io.File

 bootstrap = EmbulkEmbed::Bootstrap.new
+
+if File.exist?('systemConfig.yml')
+  systemConfig = bootstrap.getSystemConfigLoader().fromYamlFile(JavaFile.new("systemConfig.yml"))
+  bootstrap.setSystemConfig(systemConfig)
+end
+
 embulk = bootstrap.initializeCloseable

 config_yml = ARGV.shift || "config.yml"
+loop_count = (ARGV.shift || 10).to_i

 begin
   loader = embulk.newConfigLoader
   config = loader.fromYamlFile JavaFile.new(config_yml)
-  embulk.run config
+
+  config_in = config.getNested("in")
+  config_out = config.getNested("out")
+  config_out.set("type","file")
+  loop_count.times {|i|
+    config_in.set("path_prefix", "./work/csv/sample_")
+    config_out.set("path_prefix", "./work/sample_output_#{i}_")
+    embulk.run config
+  }
 ensure
   embulk.destroy
 end
embulk_runner2.rb
#! ruby

java_import "org.embulk.EmbulkEmbed"

JavaFile = java.io.File

bootstrap = EmbulkEmbed::Bootstrap.new

if File.exist?('systemConfig.yml')
  systemConfig = bootstrap.getSystemConfigLoader().fromYamlFile(JavaFile.new("systemConfig.yml"))
  bootstrap.setSystemConfig(systemConfig)
end

embulk = bootstrap.initializeCloseable

config_yml = ARGV.shift || "config.yml"
loop_count = (ARGV.shift || 10).to_i

begin
  loader = embulk.newConfigLoader
  config = loader.fromYamlFile JavaFile.new(config_yml)

  config_in = config.getNested("in")
  config_out = config.getNested("out")
  config_out.set("type","file")
  loop_count.times {|i|
    config_in.set("path_prefix", "./work/csv/sample_")
    config_out.set("path_prefix", "./work/sample_output_#{i}_")
    embulk.run config
  }
ensure
  embulk.destroy
end
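
The loop depends on config_in and config_out sharing state with config, so that values set on the nested objects are seen by embulk.run config. Here is a plain-Ruby analogy using nested Hashes, offered only as a sketch. The FakeRunner class is hypothetical and stands in for EmbulkEmbed, and the analogy assumes that getNested behaves like a reference to the nested section, as the script above implies.

```ruby
# Hypothetical stand-in for EmbulkEmbed: records the output prefix seen per run.
class FakeRunner
  attr_reader :seen_prefixes

  def initialize
    @seen_prefixes = []
  end

  def run(config)
    @seen_prefixes << config["out"]["path_prefix"]
  end
end

config = {
  "in"  => { "type" => "file", "path_prefix" => "./work/csv/sample_" },
  "out" => { "type" => "stdout" }
}

# Plays the role of config.getNested("out") (assumed to share state with the parent).
config_out = config["out"]
config_out["type"] = "file"

runner = FakeRunner.new
3.times do |i|
  config_out["path_prefix"] = "./work/sample_output_#{i}_"
  runner.run(config) # the parent config reflects the change made via config_out
end

puts runner.seen_prefixes
```

Each run sees a different output prefix even though only the nested object was modified, which is why the real script can reuse one config object across all iterations.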

10 runs

$ time java -classpath ~/.embulk/bin/embulk org.jruby.Main embulk_runner2.rb config.yml 10 >/dev/null

real    0m5.889s
user    0m16.196s
sys     0m0.579s
$ ls -l work/sample_output* | wc -l
     10

100 runs

$ time java -classpath ~/.embulk/bin/embulk org.jruby.Main embulk_runner2.rb config.yml 100 >/dev/null

real    0m7.408s
user    0m21.114s
sys     0m0.814s

$ ls -l work/sample_output* | wc -l
     100

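Going from 10 to 100 runs added only about 1.5 seconds of wall-clock time (5.889 s to 7.408 s), so the result is fast enough regardless of the number of runs.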
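
The two measurements can be turned into a rough cost estimate. The sketch below assumes a simple linear model (a fixed startup cost plus a constant cost per run), which is an assumption made for illustration; two data points are not enough to verify it.

```ruby
# Wall-clock ("real") times taken from the two measurements above.
runs_a, real_a = 10, 5.889
runs_b, real_b = 100, 7.408

# Assumption: total time = fixed startup cost + runs * constant per-run cost.
per_run = (real_b - real_a) / (runs_b - runs_a)
startup = real_a - runs_a * per_run

puts format("per-run cost: about %.3f s", per_run)
puts format("fixed startup cost: about %.2f s", startup)
```

Under that model each additional run costs roughly 17 ms, and about 5.7 s of each invocation goes to JVM and Embulk startup. That fixed part is what EmbulkEmbed pays only once.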
