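While porting mb_strimwidth to JavaScript, the character widths stopped matching, so I looked into it.
Note: I only mapped the actual behavior from the function's return values in a hurry and have not read through the PHP source, so treat the results with caution.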
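Conclusion
For tens of thousands of characters, mb_strwidth returns a width that differs from the documented spec.
Some characters that are obviously full width are treated as width 1, so relying on mb_strwidth to determine character width may be fairly risky.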
Actual mb_strwidth behavior for U+0000 - U+10FFFF (the "correct" spec as far as mb_strwidth is concerned)
character | width |
---|---|
U+0000 - U+10FF | 1 |
U+1100 - U+115F | 2 |
U+1160 - U+11A2 | 1 |
U+11A3 - U+11A7 | 2 |
U+11A8 - U+11F9 | 1 |
U+11FA - U+11FF | 2 |
U+1200 - U+2328 | 1 |
U+2329 - U+232A | 2 |
U+232B - U+2E7F | 1 |
U+2E80 - U+2E99 | 2 |
U+2E9A - U+2E9A | 1 |
U+2E9B - U+2EF3 | 2 |
U+2EF4 - U+2EFF | 1 |
U+2F00 - U+2FD5 | 2 |
U+2FD6 - U+2FEF | 1 |
U+2FF0 - U+2FFB | 2 |
U+2FFC - U+2FFF | 1 |
U+3000 - U+303E | 2 |
U+303F - U+3040 | 1 |
U+3041 - U+3096 | 2 |
U+3097 - U+3098 | 1 |
U+3099 - U+30FF | 2 |
U+3100 - U+3104 | 1 |
U+3105 - U+312D | 2 |
U+312E - U+3130 | 1 |
U+3131 - U+318E | 2 |
U+318F - U+318F | 1 |
U+3190 - U+31BA | 2 |
U+31BB - U+31BF | 1 |
U+31C0 - U+31E3 | 2 |
U+31E4 - U+31EF | 1 |
U+31F0 - U+321E | 2 |
U+321F - U+321F | 1 |
U+3220 - U+3247 | 2 |
U+3248 - U+324F | 1 |
U+3250 - U+32FE | 2 |
U+32FF - U+32FF | 1 |
U+3300 - U+4DBF | 2 |
U+4DC0 - U+4DFF | 1 |
U+4E00 - U+A48C | 2 |
U+A48D - U+A48F | 1 |
U+A490 - U+A4C6 | 2 |
U+A4C7 - U+A95F | 1 |
U+A960 - U+A97C | 2 |
U+A97D - U+ABFF | 1 |
U+AC00 - U+D7A3 | 2 |
U+D7A4 - U+D7AF | 1 |
U+D7B0 - U+D7C6 | 2 |
U+D7C7 - U+D7CA | 1 |
U+D7CB - U+D7FB | 2 |
U+D7FC - U+F8FF | 1 |
U+F900 - U+FAFF | 2 |
U+FB00 - U+FE0F | 1 |
U+FE10 - U+FE19 | 2 |
U+FE1A - U+FE2F | 1 |
U+FE30 - U+FE52 | 2 |
U+FE53 - U+FE53 | 1 |
U+FE54 - U+FE66 | 2 |
U+FE67 - U+FE67 | 1 |
U+FE68 - U+FE6B | 2 |
U+FE6C - U+FF00 | 1 |
U+FF01 - U+FF60 | 2 |
U+FF61 - U+FFDF | 1 |
U+FFE0 - U+FFE6 | 2 |
U+FFE7 - U+1AFFF | 1 |
U+1B000 - U+1B001 | 2 |
U+1B002 - U+1F1FF | 1 |
U+1F200 - U+1F202 | 2 |
U+1F203 - U+1F20F | 1 |
U+1F210 - U+1F23A | 2 |
U+1F23B - U+1F23F | 1 |
U+1F240 - U+1F248 | 2 |
U+1F249 - U+1F24F | 1 |
U+1F250 - U+1F251 | 2 |
U+1F252 - U+1FFFF | 1 |
U+20000 - U+2FFFD | 2 |
U+2FFFE - U+2FFFF | 1 |
U+30000 - U+3FFFD | 2 |
U+3FFFE - U+10FFFF | 1 |
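
For the JavaScript port, the measured table above can be turned into a lookup. The following is only a minimal sketch: `WIDE_RANGES`, `charWidth` and `strWidth` are names I made up for illustration, the ranges are copied from the width-2 rows of the table (PHP 5.6.15), and `strWidth` assumes that the width of a string is simply the sum of the widths of its code points.

```javascript
// Width-2 ranges copied from the measured table (PHP 5.6.15).
// Every code point outside these ranges was measured as width 1.
const WIDE_RANGES = [
  [0x1100, 0x115F], [0x11A3, 0x11A7], [0x11FA, 0x11FF], [0x2329, 0x232A],
  [0x2E80, 0x2E99], [0x2E9B, 0x2EF3], [0x2F00, 0x2FD5], [0x2FF0, 0x2FFB],
  [0x3000, 0x303E], [0x3041, 0x3096], [0x3099, 0x30FF], [0x3105, 0x312D],
  [0x3131, 0x318E], [0x3190, 0x31BA], [0x31C0, 0x31E3], [0x31F0, 0x321E],
  [0x3220, 0x3247], [0x3250, 0x32FE], [0x3300, 0x4DBF], [0x4E00, 0xA48C],
  [0xA490, 0xA4C6], [0xA960, 0xA97C], [0xAC00, 0xD7A3], [0xD7B0, 0xD7C6],
  [0xD7CB, 0xD7FB], [0xF900, 0xFAFF], [0xFE10, 0xFE19], [0xFE30, 0xFE52],
  [0xFE54, 0xFE66], [0xFE68, 0xFE6B], [0xFF01, 0xFF60], [0xFFE0, 0xFFE6],
  [0x1B000, 0x1B001], [0x1F200, 0x1F202], [0x1F210, 0x1F23A],
  [0x1F240, 0x1F248], [0x1F250, 0x1F251], [0x20000, 0x2FFFD],
  [0x30000, 0x3FFFD],
];

// Binary search over the sorted, non-overlapping ranges.
function charWidth(codePoint) {
  let lo = 0;
  let hi = WIDE_RANGES.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end] = WIDE_RANGES[mid];
    if (codePoint < start) {
      hi = mid - 1;
    } else if (codePoint > end) {
      lo = mid + 1;
    } else {
      return 2;
    }
  }
  return 1;
}

// Assumption: string width = sum of per-code-point widths.
// for...of iterates by code point, so characters above U+FFFF count once.
function strWidth(str) {
  let width = 0;
  for (const ch of str) {
    width += charWidth(ch.codePointAt(0));
  }
  return width;
}

console.log(charWidth(0x203B)); // 1, same as the measured mb_strwidth('※', 'UTF-8')
console.log(strWidth('※あa'));  // 4 (1 + 2 + 1)
```

Storing only the width-2 ranges keeps the table short, since everything else in the measurement came out as width 1.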
Version tested
PHP 5.6.15
Spec according to the PHP Manual
PHP: mb_strwidth - Manual
http://php.net/manual/ja/function.mb-strwidth.php
character | width |
---|---|
U+0000 - U+0019 | 0 |
U+0020 - U+1FFF | 1 |
U+2000 - U+FF60 | 2 |
U+FF61 - U+FF9F | 1 |
U+FFA0 - | 2 |
How it started
The width of **※** differs between the two
In PHP
mb_strwidth('※', 'UTF-8') === 2; // is false
mb_strwidth('※', 'UTF-8') === 1; // is true
In JS
var char_code = '※'.charCodeAt(0);
var width = 0;
if (0x0000 <= char_code && char_code <= 0x0019) {
// width stays 0 for this range
} else if (0x0020 <= char_code && char_code <= 0x1FFF) {
width = 1;
} else if (0x2000 <= char_code && char_code <= 0xFF60) {
width = 2;
} else if (0xFF61 <= char_code && char_code <= 0xFF9F) {
width = 1;
} else if (0xFFA0 <= char_code) {
width = 2;
}
width == 2; // is true
I built it according to the spec in the PHP Manual, and yet the results differ!
Note: '※' is 0x203B, so it falls within the range U+2000 - U+FF60.
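
A side note for the JS port: `charCodeAt(0)` returns a UTF-16 code unit, which is enough for '※' but not for characters above U+FFFF, and the range in question goes up to U+10FFFF. A small sketch of the difference, using U+20000 as an arbitrary example (the variable names are mine):

```javascript
const mark = '※';
console.log(mark.charCodeAt(0).toString(16));  // 203b
console.log(mark.codePointAt(0).toString(16)); // 203b

// U+20000 is stored as a surrogate pair in a JS string.
const astral = '\u{20000}';
console.log(astral.length);                      // 2
console.log(astral.charCodeAt(0).toString(16));  // d840 (high surrogate only)
console.log(astral.codePointAt(0).toString(16)); // 20000
```

So a width lookup keyed on code points should probably read the input with `codePointAt` (or iterate with `for...of`) rather than `charCodeAt`.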
Enumerating the actual values
The following code enumerates the actual character widths.
<html>
<head></head>
<body>
<table border="1">
<tbody>
<tr>
<td>start char</td>
<td>code (dec)</td>
<td>range</td>
<td>mb_strwidth</td>
<td>array</td>
</tr>
<?php
function int2utf8($code_point) {
// Check that the value is within the Unicode code point range
if ($code_point < 0) {
throw new \Exception(sprintf('%1$s is out of the Unicode code point range (0x000000 - 0x10FFFF)', $code_point));
}
if (0x10FFFF < $code_point) {
throw new \Exception(sprintf('0x%1$X is out of the Unicode code point range (0x000000 - 0x10FFFF)', $code_point));
}
// Reject surrogate code points
if (0xD800 <= $code_point && $code_point <= 0xDFFF) {
throw new \Exception(sprintf('0x%X is in the surrogate code point range (0xD800 - 0xDFFF)', $code_point));
}
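// Single-byte case: chr() can be used directly
if ($code_point < 0x80) {
return chr($code_point);
}
// U+0080 - U+009F: a second byte is needed, so build the two-byte UTF-8 sequence by hand
if ($code_point < 0xA0) {
return chr(0xC0 | $code_point >> 6) . chr(0x80 | $code_point & 0x3F);
}
// Everything else: convert from a numeric character reference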
return html_entity_decode('&#'. $code_point .';');
}
$stacker = [];
$prev_width = -1;
$prev_dec_char_code = 0;
$start = hexdec('0x0');
$end = hexdec('0x10FFFF');
for ($i = $start;$i <= $end;$i++) {
if (0xD800 <= $i && $i <= 0xDFFF) {
continue;
}
$char = int2utf8($i);
$current_width = mb_strwidth($char, 'UTF-8');
if ($prev_width !== $current_width) {
$stacker[] = [
'char' => $char,
'dec_char_code' => $i,
'width' => $current_width,
];
$prev_dec_char_code = $i;
$prev_width = $current_width;
}
}
$i = 1;
foreach ($stacker as $stack) {
$next = isset($stacker[$i]) ? $stacker[$i] : ['dec_char_code' => $end + 1];
$i++;
?>
<tr>
<td><?= $stack['char'] ?></td>
<td><?= sprintf('0x%X', $stack['dec_char_code']) ?> (<?= $stack['dec_char_code'] ?>)</td>
<td><?= sprintf("U+%X - U+%X", $stack['dec_char_code'], $next['dec_char_code'] - 1); ?></td>
<td><?= $stack['width'] ?></td>
<td><?= sprintf("[hexdec('0x%X'), hexdec('0x%X'), %s],", $stack['dec_char_code'], $next['dec_char_code'] - 1, $stack['width']); ?></td>
</tr>
<?php
}
?>
</tbody>
</table>
</body>
</html>
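
The core of the script above is a run-length grouping: walk every code point and start a new row whenever the width changes. Here is a rough JS sketch of the same idea. `collectRanges` and `widthOf` are hypothetical names; `widthOf` stands in for the real measurement (mb_strwidth in the PHP version), and unlike the PHP code this sketch stores the end of each range directly instead of deriving it from the next start.

```javascript
// Collapse consecutive code points that share a width into [start, end, width] rows.
// `widthOf` is a hypothetical stand-in for the real measurement function.
function collectRanges(start, end, widthOf) {
  const ranges = [];
  let current = null;
  for (let cp = start; cp <= end; cp++) {
    // Surrogate code points are skipped, as in the PHP loop.
    if (0xD800 <= cp && cp <= 0xDFFF) {
      continue;
    }
    const width = widthOf(cp);
    if (current !== null && current[2] === width) {
      current[1] = cp; // same width: extend the current range
    } else {
      current = [cp, cp, width]; // width changed: open a new range
      ranges.push(current);
    }
  }
  return ranges;
}

// Toy measurement for illustration only: U+1100 - U+115F is wide, the rest narrow.
const toyWidth = (cp) => (0x1100 <= cp && cp <= 0x115F ? 2 : 1);

for (const [from, to, width] of collectRanges(0x10FE, 0x1161, toyWidth)) {
  console.log(`U+${from.toString(16).toUpperCase()} - U+${to.toString(16).toUpperCase()} | ${width}`);
}
```

With the toy function this yields three rows, one on each side of the wide block plus the block itself.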
Verification
The following code checks that the enumerated values match what mb_strwidth actually returns.
If diff count is 0, everything is fine.
<html>
<head></head>
<body>
<?php
function int2utf8($code_point) {
// Check that the value is within the Unicode code point range
if ($code_point < 0) {
throw new \Exception(sprintf('%1$s is out of the Unicode code point range (0x000000 - 0x10FFFF)', $code_point));
}
if (0x10FFFF < $code_point) {
throw new \Exception(sprintf('0x%1$X is out of the Unicode code point range (0x000000 - 0x10FFFF)', $code_point));
}
// Reject surrogate code points
if (0xD800 <= $code_point && $code_point <= 0xDFFF) {
throw new \Exception(sprintf('0x%X is in the surrogate code point range (0xD800 - 0xDFFF)', $code_point));
}
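// Single-byte case: chr() can be used directly
if ($code_point < 0x80) {
return chr($code_point);
}
// U+0080 - U+009F: a second byte is needed, so build the two-byte UTF-8 sequence by hand
if ($code_point < 0xA0) {
return chr(0xC0 | $code_point >> 6) . chr(0x80 | $code_point & 0x3F);
}
// Everything else: convert from a numeric character reference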
return html_entity_decode('&#'. $code_point .';');
}
$spec_list = [
[hexdec('0x0'), hexdec('0x10FF'), 1],
[hexdec('0x1100'), hexdec('0x115F'), 2],
[hexdec('0x1160'), hexdec('0x11A2'), 1],
[hexdec('0x11A3'), hexdec('0x11A7'), 2],
[hexdec('0x11A8'), hexdec('0x11F9'), 1],
[hexdec('0x11FA'), hexdec('0x11FF'), 2],
[hexdec('0x1200'), hexdec('0x2328'), 1],
[hexdec('0x2329'), hexdec('0x232A'), 2],
[hexdec('0x232B'), hexdec('0x2E7F'), 1],
[hexdec('0x2E80'), hexdec('0x2E99'), 2],
[hexdec('0x2E9A'), hexdec('0x2E9A'), 1],
[hexdec('0x2E9B'), hexdec('0x2EF3'), 2],
[hexdec('0x2EF4'), hexdec('0x2EFF'), 1],
[hexdec('0x2F00'), hexdec('0x2FD5'), 2],
[hexdec('0x2FD6'), hexdec('0x2FEF'), 1],
[hexdec('0x2FF0'), hexdec('0x2FFB'), 2],
[hexdec('0x2FFC'), hexdec('0x2FFF'), 1],
[hexdec('0x3000'), hexdec('0x303E'), 2],
[hexdec('0x303F'), hexdec('0x3040'), 1],
[hexdec('0x3041'), hexdec('0x3096'), 2],
[hexdec('0x3097'), hexdec('0x3098'), 1],
[hexdec('0x3099'), hexdec('0x30FF'), 2],
[hexdec('0x3100'), hexdec('0x3104'), 1],
[hexdec('0x3105'), hexdec('0x312D'), 2],
[hexdec('0x312E'), hexdec('0x3130'), 1],
[hexdec('0x3131'), hexdec('0x318E'), 2],
[hexdec('0x318F'), hexdec('0x318F'), 1],
[hexdec('0x3190'), hexdec('0x31BA'), 2],
[hexdec('0x31BB'), hexdec('0x31BF'), 1],
[hexdec('0x31C0'), hexdec('0x31E3'), 2],
[hexdec('0x31E4'), hexdec('0x31EF'), 1],
[hexdec('0x31F0'), hexdec('0x321E'), 2],
[hexdec('0x321F'), hexdec('0x321F'), 1],
[hexdec('0x3220'), hexdec('0x3247'), 2],
[hexdec('0x3248'), hexdec('0x324F'), 1],
[hexdec('0x3250'), hexdec('0x32FE'), 2],
[hexdec('0x32FF'), hexdec('0x32FF'), 1],
[hexdec('0x3300'), hexdec('0x4DBF'), 2],
[hexdec('0x4DC0'), hexdec('0x4DFF'), 1],
[hexdec('0x4E00'), hexdec('0xA48C'), 2],
[hexdec('0xA48D'), hexdec('0xA48F'), 1],
[hexdec('0xA490'), hexdec('0xA4C6'), 2],
[hexdec('0xA4C7'), hexdec('0xA95F'), 1],
[hexdec('0xA960'), hexdec('0xA97C'), 2],
[hexdec('0xA97D'), hexdec('0xABFF'), 1],
[hexdec('0xAC00'), hexdec('0xD7A3'), 2],
[hexdec('0xD7A4'), hexdec('0xD7AF'), 1],
[hexdec('0xD7B0'), hexdec('0xD7C6'), 2],
[hexdec('0xD7C7'), hexdec('0xD7CA'), 1],
[hexdec('0xD7CB'), hexdec('0xD7FB'), 2],
[hexdec('0xD7FC'), hexdec('0xF8FF'), 1],
[hexdec('0xF900'), hexdec('0xFAFF'), 2],
[hexdec('0xFB00'), hexdec('0xFE0F'), 1],
[hexdec('0xFE10'), hexdec('0xFE19'), 2],
[hexdec('0xFE1A'), hexdec('0xFE2F'), 1],
[hexdec('0xFE30'), hexdec('0xFE52'), 2],
[hexdec('0xFE53'), hexdec('0xFE53'), 1],
[hexdec('0xFE54'), hexdec('0xFE66'), 2],
[hexdec('0xFE67'), hexdec('0xFE67'), 1],
[hexdec('0xFE68'), hexdec('0xFE6B'), 2],
[hexdec('0xFE6C'), hexdec('0xFF00'), 1],
[hexdec('0xFF01'), hexdec('0xFF60'), 2],
[hexdec('0xFF61'), hexdec('0xFFDF'), 1],
[hexdec('0xFFE0'), hexdec('0xFFE6'), 2],
[hexdec('0xFFE7'), hexdec('0x1AFFF'), 1],
[hexdec('0x1B000'), hexdec('0x1B001'), 2],
[hexdec('0x1B002'), hexdec('0x1F1FF'), 1],
[hexdec('0x1F200'), hexdec('0x1F202'), 2],
[hexdec('0x1F203'), hexdec('0x1F20F'), 1],
[hexdec('0x1F210'), hexdec('0x1F23A'), 2],
[hexdec('0x1F23B'), hexdec('0x1F23F'), 1],
[hexdec('0x1F240'), hexdec('0x1F248'), 2],
[hexdec('0x1F249'), hexdec('0x1F24F'), 1],
[hexdec('0x1F250'), hexdec('0x1F251'), 2],
[hexdec('0x1F252'), hexdec('0x1FFFF'), 1],
[hexdec('0x20000'), hexdec('0x2FFFD'), 2],
[hexdec('0x2FFFE'), hexdec('0x2FFFF'), 1],
[hexdec('0x30000'), hexdec('0x3FFFD'), 2],
[hexdec('0x3FFFE'), hexdec('0x10FFFF'), 1],
];
$spec = array_shift($spec_list);
$diff_cnt = 0;
?>
<table border="2">
<tr>
<td>char</td>
<td>code (dec)</td>
<td>spec width</td>
<td>mb_strwidth</td>
</tr>
<?php
for ($i = hexdec('0x0000'), $end = hexdec('0x10FFFF');$i <= $end;$i++) {
if (0xD800 <= $i && $i <= 0xDFFF) {
continue;
}
$target_text = int2utf8($i);
if ($spec[1] < $i) {
$spec = array_shift($spec_list);
}
if (mb_strwidth($target_text, 'UTF-8') !== $spec[2]) {
$diff_cnt++;
?>
<tr>
<td><?= $target_text ?></td>
<td><?= sprintf('0x%X', $i) ?> (<?= $i ?>)</td>
<td><?= $spec[2] ?></td>
<td><?= mb_strwidth($target_text, 'UTF-8') ?></td>
</tr>
<?php
}
}
?>
</table>
diff count : <?= $diff_cnt; ?>
</body>
</html>
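
The same kind of check can be run on the JS side once a width function exists there. This is a sketch under assumptions: `countDiffs` and `widthOf` are hypothetical names, and it loops range by range rather than using the single cursor of the PHP version, which should count the same mismatches as long as the ranges do not overlap.

```javascript
// Count code points whose measured width disagrees with the expected range table.
// specList rows are [start, end, expectedWidth]; `widthOf` is a hypothetical
// stand-in for whatever width function is being verified.
function countDiffs(specList, widthOf) {
  let diffCount = 0;
  for (const [start, end, expected] of specList) {
    for (let cp = start; cp <= end; cp++) {
      // Surrogate code points are skipped, as in the PHP loop.
      if (0xD800 <= cp && cp <= 0xDFFF) {
        continue;
      }
      if (widthOf(cp) !== expected) {
        diffCount++;
      }
    }
  }
  return diffCount;
}

// First two rows of the measured table, used here as a tiny spec.
const sampleSpec = [
  [0x0000, 0x10FF, 1],
  [0x1100, 0x115F, 2],
];

// Toy width functions for illustration only.
const matching = (cp) => (0x1100 <= cp && cp <= 0x115F ? 2 : 1);
const alwaysNarrow = () => 1;

console.log(countDiffs(sampleSpec, matching));     // 0
console.log(countDiffs(sampleSpec, alwaysNarrow)); // 96 (all of U+1100 - U+115F)
```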