Qiita Engineer Festa20242024年7月17日まで開催中！

Goの正規表現処理をstringsパッケージで代替してBenchmarkを測ってみる

Last updated at 2024-07-09Posted at 2024-07-09

Goの正規表現を標準の regexp パッケージで処理すると遅い、という話をしばしば聞きます。
これはこちらの記事などで紹介されているように、Goの正規表現で採用しているアルゴリズム(Thompson NFA)の都合上、他のアルゴリズムを採用している言語と比較した時、苦手なケースでは遅くなる、というのが実態のようです。

実際のパフォーマンスは、ユースケースに合わせたBenchmarkテストを書いて測りつつ、他の手法と比較してみるのが良さそうです。
今回は、Go標準の strings パッケージを使ってどの程度代替できるのか、速くなるのか、いくつかのパターンでBenchmarkを測ってみました。

実行環境

go version go1.22.5 darwin/arm64

前提

Benchmarkの試行回数は少ないので、計測結果に多少のブレはあります。
また、文字列のパターンや実行環境によって、計測結果は変わる可能性があります。

実装と検証

4パターンほど検証してみました。

1.固定の文字列で開始するか判定

完全な固定文字列の存在チェックを行ってみました。
文字列の例は何でも良かったのですが、試しにURL形式のものをチェックしてゆきます。

コード
※packageとimport周りは省略

var testsStartWithURL = []struct {
	input    string
	expected bool
}{
	{
		input:    "https://example.com",
		expected: true,
	},
	{
		input:    "https://example.com/some/path/to/contents/example.png",
		expected: true,
	},
	{
		input:    "https://pkg.go.dev/regexp",
		expected: false,
	},
	{
		input:    "foo/https://example.com/bar",
		expected: false,
	},
}

func Benchmark_StartWithURL_Regexp(b *testing.B) {
	r := regexp.MustCompile(`^https://example.com`)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, tt := range testsStartWithURL {
			actual := r.MatchString(tt.input)
			if actual != tt.expected {
				b.Errorf("input: %v, expected: %v, actual: %v\n", tt.input, tt.expected, actual)
			}
		}
	}
}

func Benchmark_StartWithURL_Strings(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, tt := range testsStartWithURL {
			actual := strings.HasPrefix(tt.input, "https://example.com")
			if actual != tt.expected {
				b.Errorf("input: %v, expected: %v, actual: %v\n", tt.input, tt.expected, actual)
			}
		}
	}
}

実行結果

Benchmark_StartWithURL_Regexp-8   	 2870473	       402.7 ns/op	       0 B/op	       0 allocs/op
Benchmark_StartWithURL_Strings-8   	75110515	        14.15 ns/op	       0 B/op	       0 allocs/op

軽い素振りのつもりでしたが、思った以上に差が出ました。
stringsパッケージで簡単に置き換え出来る程度の処理なら、代替しない理由はなさそうです。

2.特定のパターンの文字列で開始するか判定

1の亜種です。
パターンは決まっているものの固定ではない、という文字列の存在を判定してみます。
サブドメイン込みのURL文字列をチェックしてゆきます。

コード

var testsStartWithFuzzyURL = []struct {
	input    string
	expected bool
}{
	{
		input:    "https://example.com",
		expected: true,
	},
	{
		input:    "https://subdomain.example.com/some/path/to/contents/example.png",
		expected: true,
	},
	{
		input:    "https://pkg.go.dev/regexp",
		expected: false,
	},
	{
		input:    "foo/https://example.com/bar",
		expected: false,
	},
}

func Benchmark_StartWithFuzzyURL_Regexp(b *testing.B) {
	r := regexp.MustCompile(`^https://([^./]+\.)?example.com`)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, tt := range testsStartWithFuzzyURL {
			actual := r.MatchString(tt.input)
			if actual != tt.expected {
				b.Errorf("input: %v, expected: %v, actual: %v\n", tt.input, tt.expected, actual)
			}
		}
	}
}

func Benchmark_StartWithFuzzyURL_Strings(b *testing.B) {
	f := func(s string) bool {
		before, _, found := strings.Cut(s, "example.com")
		if !found {
			return false
		}
		after, found := strings.CutPrefix(before, "https://")
		if !found {
			return false
		}
		if after == "" {
			return true
		}
		subdomain := strings.TrimSuffix(after, ".")
		return !strings.ContainsAny(subdomain, "./")
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, tt := range testsStartWithFuzzyURL {
			actual := f(tt.input)
			if actual != tt.expected {
				b.Errorf("input: %v, expected: %v, actual: %v\n", tt.input, tt.expected, actual)
			}
		}
	}
}

実行結果

Benchmark_StartWithFuzzyURL_Regexp-8   	 1399262	       781.7 ns/op	       0 B/op	       0 allocs/op
Benchmark_StartWithFuzzyURL_Strings-8   	20995232	        61.61 ns/op	       0 B/op	       0 allocs/op

完全固定ではなくなっただけで、strings利用の方は急にコード量が増えました。
Cut系、Trim系、Contains系のmethodを使い、文字列の切り出しと判定を行っています。
※もっと良い実装があるかもしれませんが、ご容赦ください
コード量は増えたものの、パフォーマンスは依然としてかなり良さそうです。

3.特定の文字列を置き換える

特定の文字列を1つ、他の文字列に置き換えてみます。
methodの使い比べをしてみたかったので、同じ文字列が複数出てきた場合は考慮していません。
path風の文字列を渡して、その一部分を置き換えてゆきます。

コード

var testsReplace = []struct {
	input    string
	expected string
}{
	{
		input:    "foo/bar/baz/qux",
		expected: "foo/replaced/baz/qux",
	},
	{
		input:    "qux/baz/bar/foo",
		expected: "qux/baz/replaced/foo",
	},
	{
		input:    "foo/foo/baz/qux",
		expected: "foo/foo/baz/qux",
	},
}

func Benchmark_Replace_Regexp(b *testing.B) {
	r := regexp.MustCompile(`bar`)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, tt := range testsReplace {
			actual := r.ReplaceAllString(tt.input, "replaced")
			if actual != tt.expected {
				b.Errorf("input: %v, expected: %v, actual: %v\n", tt.input, tt.expected, actual)
			}
		}
	}
}

func Benchmark_Replace_Strings_1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, tt := range testsReplace {
			actual := strings.ReplaceAll(tt.input, "bar", "replaced")
			if actual != tt.expected {
				b.Errorf("input: %v, expected: %v, actual: %v\n", tt.input, tt.expected, actual)
			}
		}
	}
}

func Benchmark_Replace_Strings_2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, tt := range testsReplace {
			actual := strings.Replace(tt.input, "bar", "replaced", 1)
			if actual != tt.expected {
				b.Errorf("input: %v, expected: %v, actual: %v\n", tt.input, tt.expected, actual)
			}
		}
	}
}

func Benchmark_Replace_Strings_3(b *testing.B) {
	f := func(s string) string {
		before, after, found := strings.Cut(s, "bar")
		if !found {
			return s
		}
		return before + "replaced" + after
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, tt := range testsReplace {
			actual := f(tt.input)
			if actual != tt.expected {
				b.Errorf("input: %v, expected: %v, actual: %v\n", tt.input, tt.expected, actual)
			}
		}
	}
}

実行結果

Benchmark_Replace_Regexp-8   	 2524056	       484.7 ns/op	     241 B/op	      13 allocs/op
Benchmark_Replace_Strings_1-8   	 9975390	       116.1 ns/op	      48 B/op	       2 allocs/op
Benchmark_Replace_Strings_2-8   	 9635331	       113.5 ns/op	      48 B/op	       2 allocs/op
Benchmark_Replace_Strings_3-8   	11256547	        92.57 ns/op	      48 B/op	       2 allocs/op

stringsの方が速い上、メモリ消費も少ない結果でした。
Strings_1はReplaceAll、Strings_2はReplace、Strings_3はCutを使っています。
1つ置換するだけであれば、ReplaceAllとReplaceに有意な差はありませんでした。
CutはReplace系より気持ち速そうですが、コード量が微増する上にReplaceAll的な使い方はできないので、あえて使わなくても良さそうです。

4.特定の文字列をキャプチャする

正規表現でキャプチャグループを使って、文字列を抜き出すパターンです。
簡単な例として、path風の文字列から、ゼロパディングなしの数値文字列をキャプチャしてみます。

コード

var testsCapture = []struct {
	input    string
	expected string
}{
	{
		input:    "foo/0/bar",
		expected: "0",
	},
	{
		input:    "foo/123/bar",
		expected: "123",
	},
	{
		input:    "foo/-100/bar",
		expected: "",
	},
	{
		input:    "foo/001/bar",
		expected: "",
	},
	{
		input:    "foo/1a1/bar",
		expected: "",
	},
	{
		input:    "aaa/foo/111/bar/bbb",
		expected: "",
	},
}

func Benchmark_Capture_Regexp(b *testing.B) {
	r := regexp.MustCompile(`^foo/(0|[1-9][0-9]*)/bar`)
	f := func(s string) string {
		matches := r.FindStringSubmatch(s)
		targetPos := 1
		if len(matches) < targetPos+1 {
			return ""
		}
		return matches[1]
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, tt := range testsCapture {
			actual := f(tt.input)
			if actual != tt.expected {
				b.Errorf("input: %v, expected: %v, actual: %v\n", tt.input, tt.expected, actual)
			}
		}
	}
}

func Benchmark_Capture_Strings_1(b *testing.B) {
	f := func(s string) string {
		res := strings.TrimPrefix(s, "foo/")
		res = strings.TrimSuffix(res, "/bar")
		if len(res) >= 2 && strings.HasPrefix(res, "0") {
			return ""
		}
		for _, r := range res {
			if !unicode.IsNumber(r) {
				return ""
			}
		}
		return res
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, tt := range testsCapture {
			actual := f(tt.input)
			if actual != tt.expected {
				b.Errorf("input: %v, expected: %v, actual: %v\n", tt.input, tt.expected, actual)
			}
		}
	}
}

func Benchmark_Capture_Strings_2(b *testing.B) {
	f := func(s string) string {
		ss := strings.Split(s, "/")
		capturePos := 1
		if len(ss) < capturePos+1 {
			return ""
		}
		res := ss[capturePos]
		if len(res) >= 2 && strings.HasPrefix(res, "0") {
			return ""
		}
		for _, r := range res {
			if !unicode.IsNumber(r) {
				return ""
			}
		}
		return res
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, tt := range testsCapture {
			actual := f(tt.input)
			if actual != tt.expected {
				b.Errorf("input: %v, expected: %v, actual: %v\n", tt.input, tt.expected, actual)
			}
		}
	}
}

実行結果

Benchmark_Capture_Regexp-8   	 1801508	       625.3 ns/op	      64 B/op	       2 allocs/op
Benchmark_Capture_Strings_1-8   	20231496	        59.00 ns/op	       0 B/op	       0 allocs/op
Benchmark_Capture_Strings_2-8   	 3548877	       346.3 ns/op	     320 B/op	       6 allocs/op

strings単品だと数値の判定が煩雑になりそうだったので、unicode.IsNumberを併用しています。
愚直にTrimなどして不要なところを切り取ったパターン(Strings_1)の方では、regexpと比べて速度・メモリ消費ともに良さそうです。
今回はpath風の文字列で / を区切り文字にしていたので、strings.Splitを使って配列にしてから処理するパターン(Strings_2)も試してみましたが、愚直にやったパターンと比べると遅い上、メモリ消費はregexpを使うよりも多くなってしまいました。
regexpの代替を検討する際は、Benchmarkの検証をしないと危うそうです。

まとめ

regexp を strings に代替してみましたが、複雑なパターンでなければ、再現した上で処理速度やメモリ消費量を改善できそうです。
ただ、コードの複雑さが増したり、書き方を誤るとパフォーマンス劣化したり、必ずしもすべての処理を代替できるとは言い切れない結果でした。

とにかく、TestとBenchmarkは偉大です。大事です。
標準でパッケージが用意されていて簡単に書けるので、どんどん書いていきましょう。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up