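Since nobody seemed likely to write the final Advent Calendar entry, I would like to share what I found when investigating the performance of an upcoming Embulk release.
I tried the scatter-local-executor feature, which is planned for a future Embulk release (probably v0.8), in my own environment, and a filter plugin ran about 3 times faster than with embulk-0.7.10.
I did not change the configuration at all; the new version appears to make effective use of threads.
I am looking forward to the new Embulk release.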
Comparison results
Embulk version | Run time (wall clock) |
---|---|
embulk-0.7.10 | 62 seconds |
scatter-local-executor | 21 seconds |
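The "about 3 times" figure follows directly from the two wall-clock times in the table. A quick check of the arithmetic, using only the rounded values reported here (the variable names are illustrative):

```python
# Wall-clock times from the comparison table, in seconds.
baseline_seconds = 62  # embulk-0.7.10
scatter_seconds = 21   # build with scatter-local-executor

# Speedup is the ratio of the old run time to the new one.
speedup = baseline_seconds / scatter_seconds
print(f"speedup: {speedup:.2f}x")
```

The ratio comes out slightly under 3, which is why the result is described as roughly a threefold improvement.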
Explanation
For a detailed explanation, see the linked page.
With the new version, the following message is displayed when Embulk runs.
2015-12-25 17:17:48.617 +0900 [INFO] (transaction): Using local thread executor with max_threads=48 / output tasks 24 = input tasks 1 * 24
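These values appear to mean the following.
- input tasks 1: the number of input tasks (for the file input, the number of files read)
- 24: the value of min_output_tasks (default: the number of CPUs)
- output tasks 24: the number of input tasks x min_output_tasks
- max_threads=48: the value of max_threads (default: the number of CPUs x 2)
Note: "the number of CPUs" here means the number of processor entries in /proc/cpuinfo on Linux.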
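The relationships in the list above can be sketched as a small calculation. This follows this article's reading of the log line only; it is not taken from Embulk's source, and the real executor may derive the multiplier differently when there is more than one input task. The function name `executor_defaults` is hypothetical.

```python
import os


def executor_defaults(input_tasks, cpu_count=None):
    """Reproduce the numbers in the log line, as interpreted in this article.

    Assumption: both defaults are derived from the logical CPU count,
    as described in the list above.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    min_output_tasks = cpu_count        # default: number of CPUs
    max_threads = cpu_count * 2         # default: number of CPUs x 2
    output_tasks = input_tasks * min_output_tasks
    return {
        "min_output_tasks": min_output_tasks,
        "output_tasks": output_tasks,
        "max_threads": max_threads,
    }


# One input file on a machine that reports 24 logical CPUs.
print(executor_defaults(input_tasks=1, cpu_count=24))
```

With one input task and 24 CPUs, this yields max_threads=48 and 24 output tasks, matching the log message shown above.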
Verification
Hardware used for verification
- CPU: Xeon E5-2620 v2 (6 cores, 12 threads) x 2
cpuinfo
The machine appears to have 24 CPUs.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
stepping : 4
microcode : 0x428
cpu MHz : 2100.000
cache size : 15360 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 4200.15
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
...
processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
stepping : 4
microcode : 0x428
cpu MHz : 1199.953
cache size : 15360 KB
physical id : 1
siblings : 12
core id : 5
cpu cores : 6
apicid : 43
initial apicid : 43
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 4205.38
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
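As a rough cross-check of the figure of 24, the processor entries can be counted programmatically. This is a minimal sketch: `count_processors` is a hypothetical helper, and the sample text is a trimmed-down stand-in for /proc/cpuinfo with only two entries.

```python
def count_processors(cpuinfo_text):
    """Count logical CPUs: one 'processor : N' line exists per logical CPU."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":", 1)[0].strip() == "processor"
    )


# Trimmed-down stand-in for /proc/cpuinfo (only two entries shown).
sample = """processor : 0
model name : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
siblings : 12
cpu cores : 6

processor : 23
model name : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
siblings : 12
cpu cores : 6
"""
print(count_processors(sample))

# The same total from the hardware description:
# 2 sockets x 6 cores per socket x 2 hardware threads per core.
sockets, cores_per_socket, threads_per_core = 2, 6, 2
logical_cpus = sockets * cores_per_socket * threads_per_core
print(logical_cpus)
```

On the actual machine, `count_processors(open("/proc/cpuinfo").read())` would be expected to return 24, since the dump above runs from processor 0 to processor 23.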
Test data
wc -l aaa.txt
500000 aaa.txt
I created 500,000 lines of data by extracting only the json field from data.tsv in the embulk-filter-expand_json example.
{"phone_numbers":"1-276-220-7263","app_id":0,"point":-1601.6890336884562,"created_at":"2015-10-07 20:23:57 +0900","profile":{"like_words":["maiores","eum","aut"],"anniversary":{"voluptatem":"dolor","et":"ullam"}}}
{"phone_numbers":"553.980.4072","app_id":9,"point":119.36847814392212,"created_at":"2015-10-06 03:43:52 +0900","profile":{"like_words":["nobis","ad","est"],"anniversary":{"rerum":"inventore","deleniti":"numquam"}}}
{"phone_numbers":"267-437-9081","app_id":14,"point":-272.41081832483815,"created_at":"2015-10-07 11:53:53 +0900","profile":{"like_words":["itaque","aut","in"],"anniversary":{"eveniet":"in","id":"sit"}}}
{"phone_numbers":"639.217.7325","app_id":0,"point":3928.1109653627186,"created_at":"2015-10-09 08:43:59 +0900","profile":{"like_words":["a","molestiae","iure"],"anniversary":{"non":"harum","dolorem":"provident"}}}
{"phone_numbers":"590.289.2473","app_id":36,"point":-4153.016432643382,"created_at":"2015-10-06 13:15:47 +0900","profile":{"like_words":["quisquam","quasi","a"],"anniversary":{"ducimus":"veritatis","vel":"in"}}}
{"phone_numbers":"(196) 116-8976","app_id":0,"point":852.1366833369701,"created_at":"2015-10-09 11:06:36 +0900","profile":{"like_words":["maxime","ad","sunt"],"anniversary":{"molestiae":"architecto","temporibus":"quia"}}}
{"phone_numbers":"597.359.0615","app_id":48,"point":-2299.7034880139254,"created_at":"2015-10-07 08:14:01 +0900","profile":{"like_words":["eum","dolor","beatae"],"anniversary":{"fugit":"incidunt","provident":"doloremque"}}}
{"phone_numbers":"1-108-635-0095","app_id":28,"point":3999.8707818258176,"created_at":"2015-10-11 14:53:52 +0900","profile":{"like_words":["ea","minus","sit"],"anniversary":{"accusamus":"voluptate","omnis":"odio"}}}
{"phone_numbers":"282.908.3908","app_id":48,"point":-2690.3089130787753,"created_at":"2015-10-11 21:20:47 +0900","profile":{"like_words":["voluptas","consequatur","occaecati"],"anniversary":{"delectus":"ipsa","dolorem":"qui"}}}
{"phone_numbers":"1-397-425-1652","app_id":63,"point":989.609974671026,"created_at":"2015-10-06 13:00:24 +0900","profile":{"like_words":["facere","optio","veniam"],"anniversary":{"perferendis":"quas","cupiditate":"est"}}}
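The article does not say how the 500,000 lines were produced, so the following is only one possible way to do it. It assumes the JSON field is the first tab-separated field that looks like a JSON object, and that the extracted records are simply repeated until the target count is reached. The helper names and the two-line sample TSV layout are hypothetical.

```python
import itertools


def extract_json_fields(tsv_lines):
    """Yield the first tab-separated field on each line that looks like a JSON object."""
    for line in tsv_lines:
        for field in line.rstrip("\n").split("\t"):
            if field.startswith("{") and field.endswith("}"):
                yield field
                break


def repeat_to(records, total):
    """Cycle through the records until `total` lines have been produced."""
    records = list(records)
    if not records:
        raise ValueError("no JSON fields found")
    return itertools.islice(itertools.cycle(records), total)


# Made-up two-line TSV sample; the real data.tsv layout may differ.
sample = ['2015-10-11\t1\t{"app_id":0}\n', '2015-10-12\t2\t{"app_id":9}\n']
lines = list(repeat_to(extract_json_fields(sample), 5))
print("\n".join(lines))
```

For the real file, the same functions could be applied to the lines of data.tsv with `total=500000`, writing the result to example/aaa.txt.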
Configuration
To keep input overhead as low as possible, I used embulk-parser-none so that the measurement reflects the speed of the filter plugin.
in:
  type: file
  path_prefix: example/aaa.txt
  parser:
    type: none
    message_key: json_payload
filters:
  - type: expand_json
    json_column_name: json_payload
    root: "$."
    expanded_columns:
      - {name: "phone_numbers", type: string}
      - {name: "app_id", type: long}
      - {name: "point", type: double}
      - {name: "created_at", type: timestamp, format: "%Y-%m-%d"}
      - {name: "profile.anniversary.et", type: string}
      - {name: "profile.anniversary", type: string}
      - {name: "profile.like_words[1]", type: string}
      - {name: "profile.like_words[2]", type: string}
      - {name: "profile.like_words", type: string}
out:
  # type: stdout
  type: "null"
#exec:
#  min_output_tasks: 96
embulk-0.7.10 results
Running embulk-0.7.10 took 62 seconds.
time ../embulk run example/config2.yml
2015-12-25 17:16:00.083 +0900: Embulk v0.7.10
2015-12-25 17:16:01.447 +0900 [INFO] (transaction): Loaded plugin embulk-filter-expand_json (0.0.4)
2015-12-25 17:16:01.481 +0900 [INFO] (transaction): Loaded plugin embulk-parser-none (0.1.0)
2015-12-25 17:16:01.501 +0900 [INFO] (transaction): Listing local files at directory 'example' filtering filename by prefix 'aaa.txt'
2015-12-25 17:16:01.508 +0900 [INFO] (transaction): Loading files [example/aaa.txt]
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): removed column: name: json_payload, type: string, index: 0
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: phone_numbers, type: string, options: {}, index: 0
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: app_id, type: long, options: {}, index: 1
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: point, type: double, options: {}, index: 2
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: created_at, type: timestamp, options: {"format":"%Y-%m-%d"}, index: 3
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: profile.anniversary.et, type: string, options: {}, index: 4
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: profile.anniversary, type: string, options: {}, index: 5
2015-12-25 17:16:01.549 +0900 [INFO] (transaction): added column: name: profile.like_words[1], type: string, options: {}, index: 6
2015-12-25 17:16:01.550 +0900 [INFO] (transaction): added column: name: profile.like_words[2], type: string, options: {}, index: 7
2015-12-25 17:16:01.550 +0900 [INFO] (transaction): added column: name: profile.like_words, type: string, options: {}, index: 8
2015-12-25 17:16:01.557 +0900 [INFO] (transaction): {done: 0 / 1, running: 0}
2015-12-25 17:16:59.490 +0900 [INFO] (transaction): {done: 1 / 1, running: 0}
2015-12-25 17:16:59.505 +0900 [INFO] (main): Committed.
2015-12-25 17:16:59.505 +0900 [INFO] (main): Next config diff: {"in":{"last_path":"example/aaa.txt"},"out":{}}
real 1m2.738s
user 1m33.478s
sys 0m2.224s
scatter-local-executor
With exactly the same configuration, the Embulk build with the scatter-local-executor feature took 21 seconds.
time ../embulk-0.7.10.jar run example/config2.yml
2015-12-25 17:17:47.134 +0900: Embulk v0.7.10
2015-12-25 17:17:48.515 +0900 [INFO] (transaction): Loaded plugin embulk-filter-expand_json (0.0.4)
2015-12-25 17:17:48.549 +0900 [INFO] (transaction): Loaded plugin embulk-parser-none (0.1.0)
2015-12-25 17:17:48.565 +0900 [INFO] (transaction): Listing local files at directory 'example' filtering filename by prefix 'aaa.txt'
2015-12-25 17:17:48.569 +0900 [INFO] (transaction): Loading files [example/aaa.txt]
2015-12-25 17:17:48.610 +0900 [INFO] (transaction): removed column: name: json_payload, type: string, index: 0
2015-12-25 17:17:48.610 +0900 [INFO] (transaction): added column: name: phone_numbers, type: string, options: {}, index: 0
2015-12-25 17:17:48.610 +0900 [INFO] (transaction): added column: name: app_id, type: long, options: {}, index: 1
2015-12-25 17:17:48.610 +0900 [INFO] (transaction): added column: name: point, type: double, options: {}, index: 2
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: created_at, type: timestamp, options: {"format":"%Y-%m-%d"}, index: 3
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: profile.anniversary.et, type: string, options: {}, index: 4
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: profile.anniversary, type: string, options: {}, index: 5
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: profile.like_words[1], type: string, options: {}, index: 6
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: profile.like_words[2], type: string, options: {}, index: 7
2015-12-25 17:17:48.611 +0900 [INFO] (transaction): added column: name: profile.like_words, type: string, options: {}, index: 8
2015-12-25 17:17:48.617 +0900 [INFO] (transaction): Using local thread executor with max_threads=48 / output tasks 24 = input tasks 1 * 24
2015-12-25 17:17:48.620 +0900 [INFO] (transaction): {done: 0 / 1, running: 0}
2015-12-25 17:18:05.004 +0900 [INFO] (transaction): {done: 1 / 1, running: 0}
2015-12-25 17:18:05.015 +0900 [INFO] (main): Committed.
2015-12-25 17:18:05.016 +0900 [INFO] (main): Next config diff: {"in":{"last_path":"example/aaa.txt"},"out":{}}
real 0m21.205s
user 2m38.686s
sys 0m10.679s
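The `time` output of the two runs also gives a rough indication of how many cores were kept busy: dividing CPU time (user + sys) by wall-clock time (real) approximates the average number of busy cores. A minimal sketch using the values reported above; `parse_time` is a hypothetical helper for the `XmY.YYYs` format that bash prints.

```python
import re


def parse_time(value):
    """Convert a bash `time` value such as '1m2.738s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value)
    if match is None:
        raise ValueError(f"unexpected time format: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Values copied from the two `time` outputs in this article.
runs = {
    "embulk-0.7.10": {"real": "1m2.738s", "user": "1m33.478s", "sys": "0m2.224s"},
    "scatter-local-executor": {"real": "0m21.205s", "user": "2m38.686s", "sys": "0m10.679s"},
}

ratios = {}
for name, t in runs.items():
    real = parse_time(t["real"])
    cpu = parse_time(t["user"]) + parse_time(t["sys"])
    ratios[name] = cpu / real
    print(f"{name}: real={real:.1f}s cpu={cpu:.1f}s cpu/real={ratios[name]:.1f}")
```

The ratio is about 1.5 for embulk-0.7.10 and about 8 for the scatter-local-executor build. The new build spends more total CPU time but spreads it across many more threads, which is consistent with the shorter wall-clock time.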