I tried writing an encodeURIComponent that supports Japanese

Last updated at 2022-03-24. Posted at 2021-08-23.

Introduction

The following program is a simple encodeURIComponent that just converts the integer returned by charCodeAt to hexadecimal and prefixes it with %.

// Naive version: prefix the hex value of each UTF-16 code unit with %
var my_encodeURIComponent = (str) => str.replaceAll(/./g, (c) => {
	var code_unit_value = c.charCodeAt(0);
	return '%' + code_unit_value.toString(16).toUpperCase();
});

my_encodeURIComponent('?') returns %3F, and encodeURIComponent('?') also returns %3F.

However, my_encodeURIComponent('あ') returns %3042, while encodeURIComponent('あ') returns %E3%81%82.

The ASCII symbol ? matches, but the Japanese character does not.

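How to make Japanese match

Table 62 (Informative): UTF-8 Encodings, in 19.2.6.1.2 Decode of the ECMAScript® 2022 Language Specification, appears to say that the bit representation (Representation) changes depending on the Code Unit Value (probably the value returned by charCodeAt).

I concluded that Japanese does not match because the naive version skips the UTF-8 encoding step.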
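One way to check this guess is to percent-encode the UTF-8 bytes produced by the standard TextEncoder API and compare the result with the built-in function. This is only a minimal sketch: the helper name utf8PercentEncode is made up for illustration and is not part of the article's code.

```javascript
// Hypothetical helper (not from the article): percent-encode every UTF-8 byte of str.
const utf8PercentEncode = (str) =>
  Array.from(new TextEncoder().encode(str), (byte) =>
    '%' + byte.toString(16).toUpperCase().padStart(2, '0')
  ).join('');

console.log(utf8PercentEncode('あ'));  // → %E3%81%82
console.log(encodeURIComponent('あ')); // → %E3%81%82
```

Both lines print the same three octets, which is consistent with the idea that the missing step is UTF-8 encoding.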

So I wrote an encodeURIComponent that supports Japanese by applying the UTF-8 Encodings (the conversion described in Table 62).

var my_encodeURIComponent = (str) => str.replaceAll(/./gs, (c) => {
	// Unreserved characters are returned as-is, as the built-in encodeURIComponent does
	if (/[A-Za-z0-9\-_.!~*'()]/u.test(c)) {
		return c;
	}

	var code_unit_value = c.charCodeAt(0);
	// Format one octet as %XX, zero-padded so that 0x09 becomes %09 rather than %9
	var hex = (octet) => '%' + octet.toString(16).toUpperCase().padStart(2, '0');

	if (code_unit_value <= 0x007F) {
		// 1 octet: 0zzzzzzz
		return hex(code_unit_value);
	} else if (code_unit_value <= 0x07FF) {
		// 2 octets: 110yyyyy 10zzzzzz
		return hex(((code_unit_value & 0x7c0) >> 6) | 0xc0)
			+ hex((code_unit_value & 0x3f) | 0x80);
	} else if (code_unit_value <= 0xD7FF) {
		// 3 octets: 1110xxxx 10yyyyyy 10zzzzzz
		return hex((code_unit_value & 0xf000) >> 12 | 0xe0)
			+ hex((code_unit_value & 0xfc0) >> 6 | 0x80)
			+ hex(code_unit_value & 0x3f | 0x80);
	}

	// 0xD800 and above are not supported (surrogates, and also 0xE000 - 0xFFFF)
	throw new URIError();
});

my_encodeURIComponent('あ') returns %E3%81%82, and encodeURIComponent('あ') also returns %E3%81%82.

Once the bit representation is changed according to the UTF-8 Encodings table, Japanese matches as well.
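To make the bit shuffling concrete, here is a small worked sketch for あ (0x3042), using the xxxx / yyyyyy / zzzzzz fields from the 0x0800 - 0xD7FF row of the table. The helper name toThreeOctets is hypothetical and exists only for this illustration.

```javascript
// Hypothetical helper (not from the article): split a code unit in the
// 0x0800 - 0xD7FF range into the three octets 1110xxxx 10yyyyyy 10zzzzzz.
const toThreeOctets = (codeUnit) => {
  const x = (codeUnit >> 12) & 0x0f; // xxxx: top 4 bits
  const y = (codeUnit >> 6) & 0x3f;  // yyyyyy: middle 6 bits
  const z = codeUnit & 0x3f;         // zzzzzz: low 6 bits
  return [0xe0 | x, 0x80 | y, 0x80 | z];
};

const octets = toThreeOctets('あ'.charCodeAt(0)); // 0x3042 = 0011 0000 0100 0010

console.log(octets.map((o) => o.toString(2).padStart(8, '0')));
// → [ '11100011', '10000001', '10000010' ]
console.log(octets.map((o) => '%' + o.toString(16).toUpperCase()).join(''));
// → %E3%81%82
```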

The UTF-8 Encodings table

Code Unit Value	Representation	1st Octet	2nd Octet	3rd Octet	4th Octet
0x0000 - 0x007F	00000000 0zzzzzzz	0zzzzzzz
0x0080 - 0x07FF	00000yyy yyzzzzzz	110yyyyy	10zzzzzz
0x0800 - 0xD7FF	xxxxyyyy yyzzzzzz	1110xxxx	10yyyyyy	10zzzzzz
0xD800 - 0xDBFF followed by 0xDC00 - 0xDFFF	110110vv vvwwwwxx followed by 110111yy yyzzzzzz	11110uuu	10uuwwww	10xxyyyy	10zzzzzz
0xD800 - 0xDBFF not followed by 0xDC00 - 0xDFFF	causes URIError
0xDC00 - 0xDFFF	causes URIError
0xE000 - 0xFFFF	xxxxyyyy yyzzzzzz	1110xxxx	10yyyyyy	10zzzzzz

The UTF-8 Encodings table had disappeared, so I restored it from the Internet Archive copy of the ECMAScript® 2022 Language Specification.
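The remaining rows of the table (surrogate pairs, lone surrogates, and 0xE000 - 0xFFFF) could be covered along the following lines. This is a sketch under the assumption that the table is read row by row as written; the names hex and encodeWithSurrogates are made up for illustration, and the code is not the article's my_encodeURIComponent.

```javascript
// Hypothetical helpers (not from the article).
const hex = (octet) => '%' + octet.toString(16).toUpperCase().padStart(2, '0');

const encodeWithSurrogates = (str) => {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i);
    if (/[A-Za-z0-9\-_.!~*'()]/.test(str[i])) {
      out += str[i]; // unreserved characters are kept as-is
    } else if (c <= 0x007f) {
      out += hex(c);
    } else if (c <= 0x07ff) {
      out += hex(0xc0 | (c >> 6)) + hex(0x80 | (c & 0x3f));
    } else if (c < 0xd800 || c >= 0xe000) {
      // 0x0800 - 0xD7FF and 0xE000 - 0xFFFF share the three-octet form
      out += hex(0xe0 | (c >> 12)) + hex(0x80 | ((c >> 6) & 0x3f)) + hex(0x80 | (c & 0x3f));
    } else {
      // c is a surrogate: it must be a high surrogate followed by a low one
      const next = str.charCodeAt(i + 1); // NaN when there is no next code unit
      if (c >= 0xdc00 || !(next >= 0xdc00 && next <= 0xdfff)) {
        throw new URIError('lone surrogate');
      }
      // Combine the pair into one code point, then emit four octets
      const cp = ((c - 0xd800) << 10) + (next - 0xdc00) + 0x10000;
      out += hex(0xf0 | (cp >> 18)) + hex(0x80 | ((cp >> 12) & 0x3f))
        + hex(0x80 | ((cp >> 6) & 0x3f)) + hex(0x80 | (cp & 0x3f));
      i++; // skip the low surrogate that was just consumed
    }
  }
  return out;
};

console.log(encodeWithSurrogates('！'));  // → %EF%BC%81  (0xFF01, last row)
console.log(encodeWithSurrogates('𠮷')); // → %F0%A0%AE%B7  (surrogate pair row)
```

Looping over code units by index, rather than using replaceAll with /./g, makes it possible to look ahead one code unit when a high surrogate appears.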

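Does encodeURIComponent discriminate against Japanese?

The charCodeAt value of あ is 0x3042. I expected it to be represented in two bytes as %30%42, but the UTF-8 Encodings expand it to three bytes, %E3%81%82.

The UTF-8 Encodings give English letters a short representation and Japanese a long one.

It seemed to me that encodeURIComponent favors English letters and discriminates against Japanese.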
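The byte counts behind this impression can be checked with the standard TextEncoder API. A minimal sketch, with the hypothetical helper name utf8OctetCount:

```javascript
// Hypothetical helper (not from the article): number of UTF-8 octets in a string.
const utf8OctetCount = (str) => new TextEncoder().encode(str).length;

console.log(utf8OctetCount('a'));  // → 1  (0x0061, first row of the table)
console.log(utf8OctetCount('あ')); // → 3  (0x3042, third row of the table)

// Each octet becomes three characters (%XX) once it is percent-encoded.
console.log(encodeURIComponent('あ').length); // → 9
```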

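Conclusion

I noticed that Japanese did not match in the my_encodeURIComponent that simply returns the charCodeAt value in hexadecimal.

After changing the bit representation according to the UTF-8 Encodings table in the ECMAScript® 2022 Language Specification, Japanese matched as well.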
