More than 3 years have passed since last update.

株式会社バンダイナムコスタジオ

Go と Rust で実装されたライブラリを最適化してみた

Posted at 2022-07-19

以前に移植した Go と Rust のライブラリを最適化したので手順や結果をまとめてみました。

この記事では pprof などによるプロファイリングをしていません。
最適化の手順や測定方法についてもより良いものがあると思います。

※ ベンチマークの値は記事作成時 (Go 1.18.4, Rust 1.64.0-nightly) のものなのでコミットログと一致していません。

短い結論

今回のケースではメモリ割り当て回数の削減が重要だった。

最適化対象のライブラリ

最適化対象は日本語の文章を与えると適切な改行位置を推測してくれる BudouX というライブラリを Go と Rust に移植したものです。

ベンチマークを追加する

最適化する前に測定手段を用意しなければなりません。
今回は "日本語の文章をいい感じに分割します。" という文章を分割するベンチマークを作成しました。

func BenchmarkParse(b *testing.B) {
	model := models.DefaultJapaneseModel()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		budoux.Parse(model, "日本語の文章をいい感じに分割します。")
	}
}

#[bench]
fn bench_parse(b: &mut Bencher) {
    let model = budoux::models::default_japanese_model();
    b.iter(|| budoux::parse(model, "日本語の文章をいい感じに分割します。"))
}

両方ともテストと同じようにベンチマークを作成でき、コードも似たような感じになっています。

Go は内部関数に対するベンチマークも作成したのですが Rust は公式のベンチマーク機能が nightly でしかサポートされていないため公開関数のみ測定対象にしています。

初期状態での結果 (一部加工しています) は次のような感じです。

(Windows)
$ go test -bench . -benchmem
BenchmarkGetUnicodeBlockAndFeature-8     6723034               176.9 ns/op             8 B/op          2 allocs/op
BenchmarkGetGetFeature-8                  490796              2496 ns/op            1200 B/op         43 allocs/op
BenchmarkParse-8                           16420             73301 ns/op           19182 B/op        832 allocs/op

(Linux)
$ go test -bench . -benchmem
BenchmarkGetUnicodeBlockAndFeature-8     6430723               183.8 ns/op             8 B/op          2 allocs/op
BenchmarkGetGetFeature-8                  445887              2476 ns/op            1200 B/op         43 allocs/op
BenchmarkParse-8                           17287             69469 ns/op           19182 B/op        832 allocs/op

(Windows)
$ cargo +nightly bench
test bench_parse ... bench:     174,294 ns/iter (+/- 19,143)

(Linux)
$ cargo +nightly bench
test bench_parse ... bench:      67,879 ns/iter (+/- 5,158)

Go はあまり差が出なかったのですが Rust は Windows と Linux (Docker コンテナ内での実行) で大きく差が出ました。

Go

最適化はベンチマークの取得回数を 30 回にして結果を benchstat で比較しながら行いました。

文字列リスト生成を省略する

map の key として使う文字列のリストを生成してから対応するスコアを取得していましたがこれをやめて、直接スコアを取得して加算するようにしました。
これによりメモリ割り当て回数が大きく減っています。

(Linux)
$ go test -bench . -benchmem
BenchmarkGetUnicodeBlockAndFeature-8     6083128               183.3 ns/op             8 B/op          2 allocs/op
BenchmarkParse-8                           23545             49793 ns/op            1096 B/op        190 allocs/op

試しに Escape analysis の表示を有効 ¹ ² にしてビルドすると違いが分かります。

変更前のエスケープ分析結果

$ go build -gcflags "-m -l"
# github.com/sg0hsmt/budoux-go
./budoux.go:26:10: leaking param: model
./budoux.go:27:9: &Parser{...} escapes to heap
./parse.go:70:32: in does not escape
./parse.go:76:39: func literal does not escape
./parse.go:80:15: string(v) escapes to heap
./parse.go:80:31: ... argument does not escape
./parse.go:80:32: i escapes to heap
./parse.go:84:17: w1 does not escape
./parse.go:84:21: w2 does not escape
./parse.go:84:25: w3 does not escape
./parse.go:84:29: w4 does not escape
./parse.go:84:33: w5 does not escape
./parse.go:84:37: w6 does not escape
./parse.go:84:41: b1 does not escape
./parse.go:84:45: b2 does not escape
./parse.go:84:49: b3 does not escape
./parse.go:84:53: b4 does not escape
./parse.go:84:57: b5 does not escape
./parse.go:84:61: b6 does not escape
./parse.go:84:65: p1 does not escape
./parse.go:84:69: p2 does not escape
./parse.go:84:73: p3 does not escape
./parse.go:85:17: []string{...} escapes to heap
./parse.go:87:10: "UP1:" + p1 escapes to heap
./parse.go:88:10: "UP2:" + p2 escapes to heap
./parse.go:89:10: "UP3:" + p3 escapes to heap
./parse.go:91:15: "BP1:" + p1 + p2 escapes to heap
./parse.go:92:15: "BP2:" + p2 + p3 escapes to heap
./parse.go:94:10: "UW1:" + w1 escapes to heap
./parse.go:95:10: "UW2:" + w2 escapes to heap
./parse.go:96:10: "UW3:" + w3 escapes to heap
./parse.go:97:10: "UW4:" + w4 escapes to heap
./parse.go:98:10: "UW5:" + w5 escapes to heap
./parse.go:99:10: "UW6:" + w6 escapes to heap
./parse.go:101:15: "BW1:" + w2 + w3 escapes to heap
./parse.go:102:15: "BW2:" + w3 + w4 escapes to heap
./parse.go:103:15: "BW3:" + w4 + w5 escapes to heap
./parse.go:105:20: "TW1:" + w1 + w2 + w3 escapes to heap
./parse.go:106:20: "TW2:" + w2 + w3 + w4 escapes to heap
./parse.go:107:20: "TW3:" + w3 + w4 + w5 escapes to heap
./parse.go:108:20: "TW4:" + w4 + w5 + w6 escapes to heap
./parse.go:110:10: "UB1:" + b1 escapes to heap
./parse.go:111:10: "UB2:" + b2 escapes to heap
./parse.go:112:10: "UB3:" + b3 escapes to heap
./parse.go:113:10: "UB4:" + b4 escapes to heap
./parse.go:114:10: "UB5:" + b5 escapes to heap
./parse.go:115:10: "UB6:" + b6 escapes to heap
./parse.go:117:15: "BB1:" + b2 + b3 escapes to heap
./parse.go:118:15: "BB2:" + b3 + b4 escapes to heap
./parse.go:119:15: "BB3:" + b4 + b5 escapes to heap
./parse.go:121:20: "TB1:" + b1 + b2 + b3 escapes to heap
./parse.go:122:20: "TB2:" + b2 + b3 + b4 escapes to heap
./parse.go:123:20: "TB3:" + b3 + b4 + b5 escapes to heap
./parse.go:124:20: "TB4:" + b4 + b5 + b6 escapes to heap
./parse.go:126:15: "UQ1:" + p1 + b1 escapes to heap
./parse.go:127:15: "UQ2:" + p2 + b2 escapes to heap
./parse.go:128:15: "UQ3:" + p3 + b3 escapes to heap
./parse.go:130:20: "BQ1:" + p2 + b2 + b3 escapes to heap
./parse.go:131:20: "BQ2:" + p2 + b3 + b4 escapes to heap
./parse.go:132:20: "BQ3:" + p3 + b2 + b3 escapes to heap
./parse.go:133:20: "BQ4:" + p3 + b3 + b4 escapes to heap
./parse.go:135:25: "TQ1:" + p2 + b1 + b2 + b3 escapes to heap
./parse.go:136:25: "TQ2:" + p2 + b2 + b3 + b4 escapes to heap
./parse.go:137:25: "TQ3:" + p3 + b1 + b2 + b3 escapes to heap
./parse.go:138:25: "TQ4:" + p3 + b2 + b3 + b4 escapes to heap
./parse.go:15:25: model does not escape
./parse.go:15:38: leaking param: in
./parse.go:16:17: ([]rune)(in) does not escape
./parse.go:18:18: []string{...} escapes to heap
./parse.go:21:17: []string{} escapes to heap
./parse.go:22:15: string(runes[:3]) escapes to heap
./budoux.go:34:7: s does not escape
./budoux.go:34:24: leaking param: in
./parse.go:10:12: model does not escape
./parse.go:10:25: leaking param: in

変更後のエスケープ分析結果

$ go build -gcflags "-m -l"
# github.com/sg0hsmt/budoux-go
./budoux.go:26:10: leaking param: model
./budoux.go:27:9: &Parser{...} escapes to heap
./parse.go:63:32: in does not escape
./parse.go:69:39: func literal does not escape
./parse.go:73:15: string(v) escapes to heap
./parse.go:73:31: ... argument does not escape
./parse.go:73:32: i escapes to heap
./parse.go:77:15: model does not escape
./parse.go:77:28: w1 does not escape
./parse.go:77:32: w2 does not escape
./parse.go:77:36: w3 does not escape
./parse.go:77:40: w4 does not escape
./parse.go:77:44: w5 does not escape
./parse.go:77:48: w6 does not escape
./parse.go:77:52: b1 does not escape
./parse.go:77:56: b2 does not escape
./parse.go:77:60: b3 does not escape
./parse.go:77:64: b4 does not escape
./parse.go:77:68: b5 does not escape
./parse.go:77:72: b6 does not escape
./parse.go:77:76: p1 does not escape
./parse.go:77:80: p2 does not escape
./parse.go:77:84: p3 does not escape
./parse.go:81:23: "UP1:" + p1 does not escape
./parse.go:82:23: "UP2:" + p2 does not escape
./parse.go:83:23: "UP3:" + p3 does not escape
./parse.go:86:26: "BP1:" + p1 + p2 does not escape
./parse.go:87:26: "BP2:" + p2 + p3 does not escape
./parse.go:90:23: "UW1:" + w1 does not escape
./parse.go:91:23: "UW2:" + w2 does not escape
./parse.go:92:23: "UW3:" + w3 does not escape
./parse.go:93:23: "UW4:" + w4 does not escape
./parse.go:94:23: "UW5:" + w5 does not escape
./parse.go:95:23: "UW6:" + w6 does not escape
./parse.go:98:26: "BW1:" + w2 + w3 does not escape
./parse.go:99:26: "BW2:" + w3 + w4 does not escape
./parse.go:100:26: "BW3:" + w4 + w5 does not escape
./parse.go:103:29: "TW1:" + w1 + w2 + w3 does not escape
./parse.go:104:29: "TW2:" + w2 + w3 + w4 does not escape
./parse.go:105:29: "TW3:" + w3 + w4 + w5 does not escape
./parse.go:106:29: "TW4:" + w4 + w5 + w6 does not escape
./parse.go:109:23: "UB1:" + b1 does not escape
./parse.go:110:23: "UB2:" + b2 does not escape
./parse.go:111:23: "UB3:" + b3 does not escape
./parse.go:112:23: "UB4:" + b4 does not escape
./parse.go:113:23: "UB5:" + b5 does not escape
./parse.go:114:23: "UB6:" + b6 does not escape
./parse.go:117:26: "BB1:" + b2 + b3 does not escape
./parse.go:118:26: "BB2:" + b3 + b4 does not escape
./parse.go:119:26: "BB3:" + b4 + b5 does not escape
./parse.go:122:29: "TB1:" + b1 + b2 + b3 does not escape
./parse.go:123:29: "TB2:" + b2 + b3 + b4 does not escape
./parse.go:124:29: "TB3:" + b3 + b4 + b5 does not escape
./parse.go:125:29: "TB4:" + b4 + b5 + b6 does not escape
./parse.go:128:26: "UQ1:" + p1 + b1 does not escape
./parse.go:129:26: "UQ2:" + p2 + b2 does not escape
./parse.go:130:26: "UQ3:" + p3 + b3 does not escape
./parse.go:133:29: "BQ1:" + p2 + b2 + b3 does not escape
./parse.go:134:29: "BQ2:" + p2 + b3 + b4 does not escape
./parse.go:135:29: "BQ3:" + p3 + b2 + b3 does not escape
./parse.go:136:29: "BQ4:" + p3 + b3 + b4 does not escape
./parse.go:139:32: "TQ1:" + p2 + b1 + b2 + b3 does not escape
./parse.go:140:32: "TQ2:" + p2 + b2 + b3 + b4 does not escape
./parse.go:141:32: "TQ3:" + p3 + b1 + b2 + b3 does not escape
./parse.go:142:32: "TQ4:" + p3 + b2 + b3 + b4 does not escape
./parse.go:15:25: model does not escape
./parse.go:15:38: leaking param: in
./parse.go:16:17: ([]rune)(in) does not escape
./parse.go:18:18: []string{...} escapes to heap
./parse.go:21:17: []string{} escapes to heap
./parse.go:22:15: string(runes[:3]) escapes to heap
./budoux.go:34:7: s does not escape
./budoux.go:34:24: leaking param: in
./parse.go:10:12: model does not escape
./parse.go:10:25: leaking param: in

fmt.Sprintf() の代わりに strconv.Itoa() を使う

ユニコードからスコア取得用の文字列に変換する処理で fmt.Sprintf() の代わりに strconv.Itoa() を使うようにしました。
これにより処理速度が向上しています。

(Linux)
$ go test -bench . -benchmem
BenchmarkGetUnicodeBlockAndFeature-8    13663488                80.26 ns/op            8 B/op          2 allocs/op
BenchmarkParse-8                           28521             41188 ns/op            1096 B/op        190 allocs/op

結果文字列の生成に strings.Builder を使う

結果文字列の生成に strings.Builder を使うことでメモリ割り当てを削減しています。
ただし、効果が小さかったうえに最終的に不要な処理になりました。

(Linux)
$ go test -bench . -benchmem
BenchmarkGetUnicodeBlockAndFeature-8    14262181                77.06 ns/op            8 B/op          2 allocs/op
BenchmarkParse-8                           29480             40875 ns/op            1032 B/op        184 allocs/op

計算した特徴量を再利用する

スコア計算に使用する特徴量などを何度も計算しなおしていたので再利用するようにしました。
単純に処理が減っています。

(Linux)
$ go test -bench . -benchmem
BenchmarkGetUnicodeBlockAndFeature-8    14376645                75.50 ns/op            8 B/op          2 allocs/op
BenchmarkParse-8                           34266             34672 ns/op             472 B/op         46 allocs/op

string から []rune への変換をやめる

ユニコードの文字単位で処理するために string から []rune に変換していたのをやめて 1 文字ずつ処理できるようにしました。
これにより入力として与えられた string の部分文字列を使いまわせるようになりメモリ割り当てを削減できました。

(Windows)
$ go test -bench . -benchmem
BenchmarkGetUnicodeBlockAndFeature-8    37341540                31.43 ns/op            0 B/op          0 allocs/op
BenchmarkGetScore-8                       592852              1929 ns/op               0 B/op          0 allocs/op
BenchmarkParse-8                           30303             39460 ns/op             240 B/op          4 allocs/op

(Linux)
$ go test -bench . -benchmem
BenchmarkGetUnicodeBlockAndFeature-8    36291931                31.52 ns/op            0 B/op          0 allocs/op
BenchmarkGetScore-8                       577533              1902 ns/op               0 B/op          0 allocs/op
BenchmarkParse-8                           29968             39297 ns/op             240 B/op          4 allocs/op

※ BenchmarkParse-8 の性能が悪化しているのは別の修正 ³ で入力文字列の 1 文字目から処理するようになった影響です。

エスケープ分析結果

$ go build -gcflags "-m -l"
# github.com/sg0hsmt/budoux-go
./budoux.go:26:10: leaking param: model
./budoux.go:27:9: &Parser{...} escapes to heap
./parse.go:76:32: leaking param: in to result ~r0 level=0
./parse.go:86:39: func literal does not escape
./parse.go:94:15: model does not escape
./parse.go:94:28: w1 does not escape
./parse.go:94:32: w2 does not escape
./parse.go:94:36: w3 does not escape
./parse.go:94:40: w4 does not escape
./parse.go:94:44: w5 does not escape
./parse.go:94:48: w6 does not escape
./parse.go:94:52: b1 does not escape
./parse.go:94:56: b2 does not escape
./parse.go:94:60: b3 does not escape
./parse.go:94:64: b4 does not escape
./parse.go:94:68: b5 does not escape
./parse.go:94:72: b6 does not escape
./parse.go:94:76: p1 does not escape
./parse.go:94:80: p2 does not escape
./parse.go:94:84: p3 does not escape
./parse.go:98:23: "UP1:" + p1 does not escape
./parse.go:99:23: "UP2:" + p2 does not escape
./parse.go:100:23: "UP3:" + p3 does not escape
./parse.go:103:26: "BP1:" + p1 + p2 does not escape
./parse.go:104:26: "BP2:" + p2 + p3 does not escape
./parse.go:107:23: "UW1:" + w1 does not escape
./parse.go:108:23: "UW2:" + w2 does not escape
./parse.go:109:23: "UW3:" + w3 does not escape
./parse.go:110:23: "UW4:" + w4 does not escape
./parse.go:111:23: "UW5:" + w5 does not escape
./parse.go:112:23: "UW6:" + w6 does not escape
./parse.go:115:26: "BW1:" + w2 + w3 does not escape
./parse.go:116:26: "BW2:" + w3 + w4 does not escape
./parse.go:117:26: "BW3:" + w4 + w5 does not escape
./parse.go:120:29: "TW1:" + w1 + w2 + w3 does not escape
./parse.go:121:29: "TW2:" + w2 + w3 + w4 does not escape
./parse.go:122:29: "TW3:" + w3 + w4 + w5 does not escape
./parse.go:123:29: "TW4:" + w4 + w5 + w6 does not escape
./parse.go:126:23: "UB1:" + b1 does not escape
./parse.go:127:23: "UB2:" + b2 does not escape
./parse.go:128:23: "UB3:" + b3 does not escape
./parse.go:129:23: "UB4:" + b4 does not escape
./parse.go:130:23: "UB5:" + b5 does not escape
./parse.go:131:23: "UB6:" + b6 does not escape
./parse.go:134:26: "BB1:" + b2 + b3 does not escape
./parse.go:135:26: "BB2:" + b3 + b4 does not escape
./parse.go:136:26: "BB3:" + b4 + b5 does not escape
./parse.go:139:29: "TB1:" + b1 + b2 + b3 does not escape
./parse.go:140:29: "TB2:" + b2 + b3 + b4 does not escape
./parse.go:141:29: "TB3:" + b3 + b4 + b5 does not escape
./parse.go:142:29: "TB4:" + b4 + b5 + b6 does not escape
./parse.go:145:26: "UQ1:" + p1 + b1 does not escape
./parse.go:146:26: "UQ2:" + p2 + b2 does not escape
./parse.go:147:26: "UQ3:" + p3 + b3 does not escape
./parse.go:150:29: "BQ1:" + p2 + b2 + b3 does not escape
./parse.go:151:29: "BQ2:" + p2 + b3 + b4 does not escape
./parse.go:152:29: "BQ3:" + p3 + b2 + b3 does not escape
./parse.go:153:29: "BQ4:" + p3 + b3 + b4 does not escape
./parse.go:156:32: "TQ1:" + p2 + b1 + b2 + b3 does not escape
./parse.go:157:32: "TQ2:" + p2 + b2 + b3 + b4 does not escape
./parse.go:158:32: "TQ3:" + p3 + b1 + b2 + b3 does not escape
./parse.go:159:32: "TQ4:" + p3 + b2 + b3 + b4 does not escape
./parse.go:18:25: model does not escape
./parse.go:18:38: leaking param: in
./parse.go:20:18: []string{...} escapes to heap
./parse.go:23:17: []string{} escapes to heap
./budoux.go:34:7: s does not escape
./budoux.go:34:24: leaking param: in
./parse.go:13:12: model does not escape
./parse.go:13:25: leaking param: in

Rust

最適化はベンチマークを何度か実行しながら行いました。
Go と異なり手軽な比較ツールがなかったのでちょっと不便でした。

実際の作業も Go で導入した方法を Rust にも導入する感じで進めました。

String::from() の代わりに format! マクロを使う (失敗)

String::from() の代わりに format! マクロを使ってみました。
結果的にこれは失敗 (効果がなかった) だったため revert しています。
が、改めて試したところ Windows では効果があったようです。

(Windows)
$ cargo +nightly bench
test bench_parse ... bench:     134,112 ns/iter (+/- 36,787)

(Linux)
$ cargo +nightly bench
test bench_parse ... bench:      67,605 ns/iter (+/- 5,214)

計算した特徴量を再利用する

スコア計算に使用する特徴量などを何度も計算しなおしていたので再利用するようにしました。
単純に処理が減っています。

(Windows)
$ cargo +nightly bench
test bench_parse ... bench:     157,845 ns/iter (+/- 19,159)

(Linux)
$ cargo +nightly bench
test bench_parse ... bench:      61,095 ns/iter (+/- 2,758)

文字列の連結に concat を使う

文字列を String::from() で &str から String に変換して + で連結していたのを concat で連結するように変更しました。
Windows での性能も大きく改善しています。

(Windows)
$ cargo +nightly bench
test bench_parse ... bench:      79,379 ns/iter (+/- 18,699)

(Linux)
$ cargo +nightly bench
test bench_parse ... bench:      46,592 ns/iter (+/- 6,197)

※ 性能が悪化しているのは別の修正 ⁴ で入力文字列の 1 文字目から処理するようになった影響です。

文字列リスト生成を省略する

map の key として使う文字列のリストを生成してから対応するスコアを取得していましたがこれをやめて、直接スコアを取得して加算するようにしました。
Go では初期に行っていた変更ですが Rust では効果が見えなかったため保留にしていました。
やや性能が落ちているのですが別の修正で必要になるため入れています。

(Windows)
$ cargo +nightly bench
test bench_parse ... bench:      83,100 ns/iter (+/- 2,107)

(Linux)
$ cargo +nightly bench
test bench_parse ... bench:      48,693 ns/iter (+/- 7,338)

文字列の生成に使うバッファを使いまわす

map の key として使う文字列の生成に使うバッファを使いまわすようにしました。
大きく性能改善しているだけでなく Windows と Linux の差も小さくなっています。

(Windows)
$ cargo +nightly bench
test bench_parse ... bench:      34,970 ns/iter (+/- 6,863)

(Linux)
$ cargo +nightly bench
test bench_parse ... bench:      31,618 ns/iter (+/- 1,425)

&str から Vec<char> への変換をやめる

ユニコードの文字単位で処理するために &str から Vec<char> に変換していたのをやめて 1 文字ずつ処理できるようにしました。
これにより入力として与えられた string の部分文字列を使いまわせるようになりメモリ割り当てを削減できました。

(Windows)
$ cargo +nightly bench
test bench_parse ... bench:      30,121 ns/iter (+/- 2,272)

(Linux)
$ cargo +nightly bench
test bench_parse ... bench:      29,091 ns/iter (+/- 1,726)

結果

最適化によって初期状態から大きく処理速度が向上しました。

	before	after
Go (Windows)	73,301 ns/op	39,460 ns/op
Go (Linux)	69,469 ns/op	39,297 ns/op
Rust (Windows)	174,294 ns/iter	30,121 ns/iter
Rust (Linux)	67,879 ns/iter	29,091 ns/iter

メモリ割り当ても大きく削減されました。
(測定結果があるのは Go のみですが Rust も削減されていると思います)

	before	after
allocate byte	19,182 B/op	240 B/op
allocate count	832 allocs/op	4 allocs/op

感想

最初の段階で Rust で書いたコードの性能が Windows と Linux で大きく異なったのが一番の驚きでした。
Go ではあまり差が出ず Rust で大きく差が出たのは両言語の実装方針の違いの表れと思います。

Windows で作業していたせいか Go に比べると Rust はツールの不足を感じるところがありましたが、
これも Go が独自のランタイムを持っている関係でツールを整備せざるを得なかったためかもしれません。

今回は試していませんが Rust はメモリアロケータをより高性能なものに差し替えることもできます。
メモリアロケータによっては割り当て回数などの統計値を取得できるようです。

最適化はまだ不十分かもしれませんが Go と Rust の性能差がどこから来ているのか気になるところです。
あと Rust の文字列処理は難しいですね・・・

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up