
embulk v0.8: More Than 3x Faster Through Parallelization

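It looks like nobody is going to write the last Advent Calendar entry, so I'd like to share what I found when I looked into the performance of the upcoming embulk release.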

I tried the scatter-local-executor feature, which is slated for the next embulk release (probably v0.8), in my own environment. Compared with embulk-0.7.10, a filter plugin ran about three times faster.

I did not change any settings; the new version appears to make effective use of threads on its own.

I'm looking forward to the new embulk release.

Comparison results

embulk version            Elapsed time
embulk-0.7.10             62 s
scatter-local-executor    21 s

Explanation

For a detailed description, see the following link.

Local executor plugin

With the new version, embulk prints a message like the following at run time.

2015-12-25 17:17:48.617 +0900 [INFO] (transaction): Using local thread executor with max_threads=48 / output tasks 24 = input tasks 1 * 24

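These values appear to mean the following.

  • input tasks 1: the number of input tasks (for the file input, the number of files read)
  • 24: the value of min_output_tasks (default: the number of CPUs)
  • output tasks 24: the number of input tasks x min_output_tasks
  • max_threads=48: the value of max_threads (default: the number of CPUs x 2)

Note: "the number of CPUs" above is the number of processor entries in /proc/cpuinfo on Linux.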
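The relationship between these four numbers can be checked with a few lines of Python. This is only a sketch: the regular expression is written against the one log line shown above, and the default rules (min_output_tasks = CPU count, max_threads = CPU count x 2) are taken from the observations in this article, not from embulk's source code. The function names are made up for illustration.

```python
import re

# Log line copied from the run described in this article.
LOG_LINE = ("2015-12-25 17:17:48.617 +0900 [INFO] (transaction): "
            "Using local thread executor with max_threads=48 / "
            "output tasks 24 = input tasks 1 * 24")

# Pattern written against the single line above (an assumption about its format).
PATTERN = re.compile(
    r"max_threads=(?P<max_threads>\d+) / "
    r"output tasks (?P<output_tasks>\d+) = "
    r"input tasks (?P<input_tasks>\d+) \* (?P<min_output_tasks>\d+)"
)


def parse_executor_line(line):
    """Return the four numbers in the executor log line as a dict of ints."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not an executor log line")
    return {key: int(value) for key, value in match.groupdict().items()}


def expected_defaults(cpu_count, input_tasks):
    """Defaults as observed in this article (assumption):
    min_output_tasks = CPU count, max_threads = CPU count * 2."""
    min_output_tasks = cpu_count
    return {
        "max_threads": cpu_count * 2,
        "output_tasks": input_tasks * min_output_tasks,
        "input_tasks": input_tasks,
        "min_output_tasks": min_output_tasks,
    }


parsed = parse_executor_line(LOG_LINE)
print(parsed)
# Compare with what a 24-CPU machine reading one file would give.
print(parsed == expected_defaults(cpu_count=24, input_tasks=1))
```

With 24 logical CPUs and a single input file, the numbers in the log line match these assumed defaults.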

Verification

Test hardware

  • CPU: Intel Xeon E5-2620 v2 (6 cores, 12 threads) x 2

cpuinfo

The system appears to have 24 CPUs.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
stepping        : 4
microcode       : 0x428
cpu MHz         : 2100.000
cache size      : 15360 KB
physical id     : 0
siblings        : 12
core id         : 0
cpu cores       : 6
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 4200.15
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

...

processor       : 23
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
stepping        : 4
microcode       : 0x428
cpu MHz         : 1199.953
cache size      : 15360 KB
physical id     : 1
siblings        : 12
core id         : 5
cpu cores       : 6
apicid          : 43
initial apicid  : 43
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 4205.38
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
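As a rough cross-check of the "24 CPUs" figure, the sketch below counts processor entries in /proc/cpuinfo-style text and also multiplies out the hardware description (2 sockets x 6 cores x 2 threads). The sample text is a trimmed excerpt of the listing above, so it contains only two of the 24 entries; the helper name is made up for illustration.

```python
def count_logical_cpus(cpuinfo_text):
    """Count 'processor : N' entries in /proc/cpuinfo-style text."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


# Trimmed excerpt of the listing above (only the first and last entries).
SAMPLE = """processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
physical id     : 0

processor       : 23
model name      : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
physical id     : 1
"""

# Logical CPUs implied by the hardware description:
# 2 sockets, 6 cores per socket, 2 hardware threads per core.
SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE = 2, 6, 2
logical_cpus = SOCKETS * CORES_PER_SOCKET * THREADS_PER_CORE

print(count_logical_cpus(SAMPLE))  # 2 entries in the trimmed excerpt
print(logical_cpus)                # 24
```

On a Linux machine, passing the contents of /proc/cpuinfo to count_logical_cpus should give the same number that embulk uses as its "number of CPUs", assuming the observation in this article holds.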

Test data

wc -l aaa.txt
500000 aaa.txt

I extracted only the json field from data.tsv in the embulk-filter-expand_json examples and used it to create a 500,000-line data set.

{"phone_numbers":"1-276-220-7263","app_id":0,"point":-1601.6890336884562,"created_at":"2015-10-07 20:23:57 +0900","profile":{"like_words":["maiores","eum","aut"],"anniversary":{"voluptatem":"dolor","et":"ullam"}}}
{"phone_numbers":"553.980.4072","app_id":9,"point":119.36847814392212,"created_at":"2015-10-06 03:43:52 +0900","profile":{"like_words":["nobis","ad","est"],"anniversary":{"rerum":"inventore","deleniti":"numquam"}}}
{"phone_numbers":"267-437-9081","app_id":14,"point":-272.41081832483815,"created_at":"2015-10-07 11:53:53 +0900","profile":{"like_words":["itaque","aut","in"],"anniversary":{"eveniet":"in","id":"sit"}}}
{"phone_numbers":"639.217.7325","app_id":0,"point":3928.1109653627186,"created_at":"2015-10-09 08:43:59 +0900","profile":{"like_words":["a","molestiae","iure"],"anniversary":{"non":"harum","dolorem":"provident"}}}
{"phone_numbers":"590.289.2473","app_id":36,"point":-4153.016432643382,"created_at":"2015-10-06 13:15:47 +0900","profile":{"like_words":["quisquam","quasi","a"],"anniversary":{"ducimus":"veritatis","vel":"in"}}}
{"phone_numbers":"(196) 116-8976","app_id":0,"point":852.1366833369701,"created_at":"2015-10-09 11:06:36 +0900","profile":{"like_words":["maxime","ad","sunt"],"anniversary":{"molestiae":"architecto","temporibus":"quia"}}}
{"phone_numbers":"597.359.0615","app_id":48,"point":-2299.7034880139254,"created_at":"2015-10-07 08:14:01 +0900","profile":{"like_words":["eum","dolor","beatae"],"anniversary":{"fugit":"incidunt","provident":"doloremque"}}}
{"phone_numbers":"1-108-635-0095","app_id":28,"point":3999.8707818258176,"created_at":"2015-10-11 14:53:52 +0900","profile":{"like_words":["ea","minus","sit"],"anniversary":{"accusamus":"voluptate","omnis":"odio"}}}
{"phone_numbers":"282.908.3908","app_id":48,"point":-2690.3089130787753,"created_at":"2015-10-11 21:20:47 +0900","profile":{"like_words":["voluptas","consequatur","occaecati"],"anniversary":{"delectus":"ipsa","dolorem":"qui"}}}
{"phone_numbers":"1-397-425-1652","app_id":63,"point":989.609974671026,"created_at":"2015-10-06 13:00:24 +0900","profile":{"like_words":["facere","optio","veniam"],"anniversary":{"perferendis":"quas","cupiditate":"est"}}}
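A data set like this could be produced with a short script along the following lines. This is a sketch under two assumptions that the article does not confirm: the JSON payload is the first tab-separated field that starts with "{" and parses as JSON, and the 500,000 lines are reached by repeating the extracted payloads in order. The function names are hypothetical.

```python
import itertools
import json


def extract_json_fields(tsv_lines):
    """Yield, per line, the first tab-separated field that parses as JSON
    and starts with '{' (assumed to be the json column)."""
    for line in tsv_lines:
        for field in line.rstrip("\n").split("\t"):
            if not field.startswith("{"):
                continue
            try:
                json.loads(field)
            except ValueError:
                continue  # looked like JSON but was not; try the next field
            yield field
            break


def build_dataset(tsv_lines, total_lines):
    """Repeat the extracted payloads in order until total_lines are produced."""
    payloads = list(extract_json_fields(tsv_lines))
    if not payloads:
        raise ValueError("no JSON fields found")
    return list(itertools.islice(itertools.cycle(payloads), total_lines))


# Tiny in-memory stand-in for data.tsv (made-up rows, not the real file).
demo_tsv = [
    '1\t{"app_id":0,"point":1.5}\tfoo\n',
    '2\t{"app_id":9,"point":2.5}\tbar\n',
]
for row in build_dataset(demo_tsv, 5):
    print(row)
```

To build the real file, the same functions could be fed the lines of data.tsv with total_lines=500000 and the result written out one payload per line.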

Configuration

To keep input overhead as low as possible, I used embulk-parser-none, so that the measurement reflects mainly the speed of the filter plugin.

in:
  type: file
  path_prefix: example/aaa.txt
  parser:
    type: none
    message_key: json_payload

filters:
  - type: expand_json
    json_column_name: json_payload
    root: "$."
    expanded_columns:
      - {name: "phone_numbers", type: string}
      - {name: "app_id", type: long}
      - {name: "point", type: double}
      - {name: "created_at", type: timestamp, format: "%Y-%m-%d"}
      - {name: "profile.anniversary.et", type: string}
      - {name: "profile.anniversary", type: string}
      - {name: "profile.like_words[1]", type: string}
      - {name: "profile.like_words[2]", type: string}
      - {name: "profile.like_words", type: string}

out:
#  type: stdout
  type: "null"

#exec:
#  min_output_tasks: 96
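To make the column names in expanded_columns concrete, here is a small Python illustration of what paths such as profile.like_words[1] refer to in one of the sample records. It is not the plugin's implementation (embulk-filter-expand_json is a Java plugin with its own path handling and type conversion); the resolve helper is hypothetical and only supports dotted keys and [index] steps.

```python
import json
import re

# One token is either a key name or a bracketed list index.
TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve(record, path):
    """Follow a dotted path with optional [index] steps; None if missing."""
    current = record
    for key, index in TOKEN.findall(path):
        try:
            current = current[key] if key else current[int(index)]
        except (KeyError, IndexError, TypeError):
            return None
    return current


# First sample line from the test data above.
LINE = ('{"phone_numbers":"1-276-220-7263","app_id":0,'
        '"point":-1601.6890336884562,'
        '"created_at":"2015-10-07 20:23:57 +0900",'
        '"profile":{"like_words":["maiores","eum","aut"],'
        '"anniversary":{"voluptatem":"dolor","et":"ullam"}}}')
record = json.loads(LINE)

for path in ("app_id", "profile.anniversary.et",
             "profile.like_words[1]", "profile.like_words[2]"):
    print(path, "->", resolve(record, path))
```

Paths that point at a nested object or array, such as profile.anniversary, resolve to the whole sub-structure here; how the plugin renders those as string columns is not covered by this sketch.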

embulk-0.7.10 results

Running embulk-0.7.10 took 62 seconds.

time ../embulk run example/config2.yml 
2015-12-25 17:16:00.083 +0900: Embulk v0.7.10
2015-12-25 17:16:01.447 +0900 [INFO] (transaction): Loaded plugin embulk-filter-expand_json (0.0.4)
2015-12-25 17:16:01.481 +0900 [INFO] (transaction): Loaded plugin embulk-parser-none (0.1.0)
2015-12-25 17:16:01.501 +0900 [INFO] (transaction): Listing local files at directory 'example' filtering filename by prefix 'aaa.txt'
2015-12-25 17:16:01.508 +0900 [INFO] (transaction): Loading files [example/aaa.txt]
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): removed column: name: json_payload, type: string, index: 0
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: phone_numbers, type: string, options: {}, index: 0
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: app_id, type: long, options: {}, index: 1
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: point, type: double, options: {}, index: 2
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: created_at, type: timestamp, options: {"format":"%Y-%m-%d"}, index: 3
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: profile.anniversary.et, type: string, options: {}, index: 4
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: profile.anniversary, type: string, options: {}, index: 5
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: profile.like_words[1], type: string, options: {}, index: 6
2015-12-25 17:16:01.550 +0900 [INFO] (transaction): added column: name: profile.like_words[2], type: string, options: {}, index: 7
2015-12-25 17:16:01.550 +0900 [INFO] (transaction): added column: name: profile.like_words, type: string, options: {}, index: 8
2015-12-25 17:16:01.557 +0900 [INFO] (transaction): {done:  0 / 1, running: 0}
2015-12-25 17:16:59.490 +0900 [INFO] (transaction): {done:  1 / 1, running: 0}
2015-12-25 17:16:59.505 +0900 [INFO] (main): Committed.
2015-12-25 17:16:59.505 +0900 [INFO] (main): Next config diff: {"in":{"last_path":"example/aaa.txt"},"out":{}}

real    1m2.738s
user    1m33.478s
sys     0m2.224s

scatter-local-executor

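With exactly the same configuration, embulk with the scatter-local-executor feature finished in 21 seconds. Note that this build still identifies itself as Embulk v0.7.10 in the log below.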

time ../embulk-0.7.10.jar run example/config2.yml 
2015-12-25 17:17:47.134 +0900: Embulk v0.7.10
2015-12-25 17:17:48.515 +0900 [INFO] (transaction): Loaded plugin embulk-filter-expand_json (0.0.4)
2015-12-25 17:17:48.549 +0900 [INFO] (transaction): Loaded plugin embulk-parser-none (0.1.0)
2015-12-25 17:17:48.565 +0900 [INFO] (transaction): Listing local files at directory 'example' filtering filename by prefix 'aaa.txt'
2015-12-25 17:17:48.569 +0900 [INFO] (transaction): Loading files [example/aaa.txt]
2015-12-25 17:17:48.610 +0900 [INFO] (transaction): removed column: name: json_payload, type: string, index: 0
2015-12-25 17:17:48.610 +0900 [INFO] (transaction): added column: name: phone_numbers, type: string, options: {}, index: 0
2015-12-25 17:17:48.610 +0900 [INFO] (transaction): added column: name: app_id, type: long, options: {}, index: 1
2015-12-25 17:17:48.610 +0900 [INFO] (transaction): added column: name: point, type: double, options: {}, index: 2
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: created_at, type: timestamp, options: {"format":"%Y-%m-%d"}, index: 3
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: profile.anniversary.et, type: string, options: {}, index: 4
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: profile.anniversary, type: string, options: {}, index: 5
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: profile.like_words[1], type: string, options: {}, index: 6
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: profile.like_words[2], type: string, options: {}, index: 7
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: profile.like_words, type: string, options: {}, index: 8
2015-12-25 17:17:48.617 +0900 [INFO] (transaction): Using local thread executor with max_threads=48 / output tasks 24 = input tasks 1 * 24
2015-12-25 17:17:48.620 +0900 [INFO] (transaction): {done:  0 / 1, running: 0}
2015-12-25 17:18:05.004 +0900 [INFO] (transaction): {done:  1 / 1, running: 0}
2015-12-25 17:18:05.015 +0900 [INFO] (main): Committed.
2015-12-25 17:18:05.016 +0900 [INFO] (main): Next config diff: {"in":{"last_path":"example/aaa.txt"},"out":{}}

real    0m21.205s
user    2m38.686s
sys     0m10.679s
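The two `time` outputs can be compared numerically. The sketch below parses the real/user/sys lines from both runs and computes the wall-clock speedup and the average number of busy CPUs ((user + sys) / real). The helper name is made up, and the parser assumes the XmY.ZZZs format shown above.

```python
import re


def parse_time_output(text):
    """Parse `time` output (real/user/sys in XmY.ZZZs form) into seconds."""
    result = {}
    for name, minutes, seconds in re.findall(
            r"^(real|user|sys)\s+(\d+)m([\d.]+)s$", text, flags=re.MULTILINE):
        result[name] = int(minutes) * 60 + float(seconds)
    return result


# Timings copied from the two runs in this article.
BEFORE = "real    1m2.738s\nuser    1m33.478s\nsys 0m2.224s"
AFTER = "real    0m21.205s\nuser    2m38.686s\nsys 0m10.679s"

before = parse_time_output(BEFORE)
after = parse_time_output(AFTER)

# Wall-clock speedup of the scatter-local-executor run.
speedup = before["real"] / after["real"]

# Average number of CPUs kept busy: CPU time divided by wall-clock time.
busy_before = (before["user"] + before["sys"]) / before["real"]
busy_after = (after["user"] + after["sys"]) / after["real"]

print(f"speedup: {speedup:.2f}x")
print(f"busy CPUs before: {busy_before:.2f}")
print(f"busy CPUs after:  {busy_after:.2f}")
```

The wall-clock speedup comes out at about 2.96x, while total CPU time rises from roughly 96 to 169 seconds. In other words, the new executor keeps about 8 CPUs busy on average instead of about 1.5, at the cost of some extra total CPU work.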