
Counting characters in Japanese (multibyte) strings in Go

Last updated at 2014-12-17 / Posted at 2014-12-16

If you count a string naively with len(), you get the number of bytes, not the number of characters.

Use RuneCountInString() from the unicode/utf8 package instead.
Alternatively, convert the string to []rune, and len() will then count runes.

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	str := "Hello, 世界"
	fmt.Printf("len: %d (byte)\n", len(str))
	fmt.Printf("Rune: %d (rune)\n", utf8.RuneCountInString(str))
	fmt.Printf("Rune: %d (rune)\n", len([]rune(str)))
}

len: 13 (byte)
Rune: 9 (rune)
Rune: 9 (rune)

Go Playground

The string has 9 characters. len() reports 13 (bytes), while RuneCountInString() returns 9, as expected.

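Each character in "Hello, " (including the space) takes 1 byte, which accounts for 7 bytes. Japanese characters generally take 3 bytes each in UTF-8, so the two kanji add 6 more. Some less common kanji take 4 bytes, however, so 3 bytes per character is not guaranteed.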
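To make the byte arithmetic concrete, here is a minimal sketch that lists the UTF-8 width of each rune. The helper name byteWidths and the second sample string are illustrative assumptions, not part of the original article. U+20BB7 is one example of a kanji that needs 4 bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteWidths (hypothetical helper) returns the UTF-8 encoded length of
// each rune in s. It assumes s is valid UTF-8.
func byteWidths(s string) []int {
	widths := make([]int, 0, utf8.RuneCountInString(s))
	for _, r := range s {
		widths = append(widths, utf8.RuneLen(r))
	}
	return widths
}

func main() {
	samples := []string{
		"Hello, 世界",        // widths: 1 1 1 1 1 1 1 3 3
		"\U00020BB7野家", // U+20BB7 needs 4 bytes; widths: 4 3 3
	}
	for _, s := range samples {
		fmt.Printf("%d bytes, %d runes, widths %v\n",
			len(s), utf8.RuneCountInString(s), byteWidths(s))
	}
}
```

The widths for the first sample sum to 13, which matches the len() result above.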

Addendum: benchmark

I ran a simple benchmark.
It compares the two methods above on ASCII and multibyte input at two string lengths (16 and 32 characters).

The benchmark source is available as a gist.

go test -bench . -benchmem

BenchmarkRunCountInString16ascii	20000000	        79.4 ns/op	       0 B/op	       0 allocs/op
BenchmarkRunCountInString16multi	10000000	       184 ns/op	       0 B/op	       0 allocs/op
BenchmarkRunCountInString32ascii	10000000	       204 ns/op	       0 B/op	       0 allocs/op
BenchmarkRunCountInString32multi	 5000000	       356 ns/op	       0 B/op	       0 allocs/op

BenchmarkCastToRuneArray16ascii	    10000000	       210 ns/op	      64 B/op	       1 allocs/op
BenchmarkCastToRuneArray16multi	     5000000	       312 ns/op	      64 B/op	       1 allocs/op
BenchmarkCastToRuneArray32ascii	     5000000	       362 ns/op	     128 B/op	       1 allocs/op
BenchmarkCastToRuneArray32multi	     5000000	       569 ns/op	     128 B/op	       1 allocs/op

RuneCountInString() is faster in every case, by roughly 1.6 to 2.6 times in the results above.
Its source looks like this:

func RuneCountInString(s string) (n int) {
	for range s {
		n++
	}
	return
}

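This loop does not allocate (0 allocs/op in the results above). Converting to []rune allocates a new slice (1 allocs/op), which presumably accounts for the extra time.
Also, as expected, multibyte input is slower: ASCII is about 1.5 to 2.3 times faster.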
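The `for range s` loop in the source above works because ranging over a string decodes one UTF-8 rune per iteration rather than one byte. The following sketch shows the byte offset at which each rune starts. The helper name runeOffsets is an assumption for illustration.

```go
package main

import "fmt"

// runeOffsets (hypothetical helper) returns the byte offset at which
// each rune of s starts, as reported by a range loop over the string.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "Hello, 世界"
	for i, r := range s {
		fmt.Printf("offset %2d: %c\n", i, r)
	}
	// The offsets jump from 7 to 10 because 世 occupies 3 bytes.
	fmt.Println(runeOffsets(s)) // prints [0 1 2 3 4 5 6 7 10]
}
```

The number of iterations equals the rune count, which is all that RuneCountInString() needs to return.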
