More than 1 year has passed since last update.

[Go]string、Rune、コードポイント、[]byteの差異について

Posted at 2023-08-24

各概念の説明は参考ページにわかりやすく記述されていますのでそちらをご参照ください。
ここでは、概念の違いに注目して理解を深めることや端的に表現することを目的としています。

byte型とstring型について


var scentence string

//スライス作成←scentence[index]はbyte型：1byteごとに区切られる

codePointPre := []byte{scentence[index]}　//[]byteにコードポイントが格納される


//バイト配列へ変換←scentence[index:index+1]はstring型

codePointPre := []byte(scentence[index:index+1]) //[]byteにコードポイントが格納される

1.Goが準拠するUTF-8は可変長エンコーディング。
文字列にインデックスでアクセスしたときに得られるbyte値は1byteごとであるので、scentenceは英数字の場合にのみ対応。マルチバイト（かなカタカナ漢字）にはstringをbyteにキャストして扱うのは容易ではない。

stringと[]byteが1対1対応でないので、相互にエンコード＆デコードしたときに結果が一致しない。一方通行の変換となる。

package main

import (
	"fmt"
)

func main() {
	var s1 string = "あ"
	var b1 []byte = []byte(s1) // byte型に変換
	fmt.Println(b1) // [227 129 130]

	s2 := string(b1) // 文字列型に変換
	fmt.Println(s2) // あ
}

そのためutf8.DecodeRuneInString関数を使用して、UTF-8エンコードされた文字を正しくデコードし、そのUnicodeコードポイント（rune）を返します。

r, size := utf8.DecodeRuneInString(sentence[index:])

Runeとコードポイントについて

コードポイント＝Rune＝int32

string型と[]byteについて

Go言語のstring型は、内部的にはUTF-8エンコーディングされたバイト列として格納

[]byteとRune/コードポイントについて

エンコード、デコードによって変換できる

参考

https://qiita.com/seihmd/items/4a878e7fa340d7963fee
https://zenn.dev/nakabonne/articles/cc0cde2bd94639
https://ferret-plus.com/7006
https://pkg.go.dev/unicode/utf8#DecodeRuneInString

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up