
The Zero-Width Space Trap That Was Far Too Vicious

Posted at 2022-03-20

Note: the results shown in this article come from extracting the actual code and running it in C# Interactive.
Environment: Microsoft (R) Visual C# Interactive Compiler version 4.1.0-5.22109.6 ()

How it started

While doing some string processing I ran into a bug I could not make sense of.
What on earth is this?

> TimeSpan.Parse("0:00:00")
System.FormatException: String was not recognized as a valid TimeSpan.
  + System.Globalization.TimeSpanParse.TimeSpanResult.SetFailure(System.Globalization.TimeSpanParse.ParseFailureKind, string, object, string)
  + System.Globalization.TimeSpanParse.ProcessTerminal_HM_S_D(ref System.Globalization.TimeSpanParse.TimeSpanRawInfo, System.Globalization.TimeSpanParse.TimeSpanStandardStyles, ref System.Globalization.TimeSpanParse.TimeSpanResult)
  + System.Globalization.TimeSpanParse.ProcessTerminalState(ref System.Globalization.TimeSpanParse.TimeSpanRawInfo, System.Globalization.TimeSpanParse.TimeSpanStandardStyles, ref System.Globalization.TimeSpanParse.TimeSpanResult)
  + System.Globalization.TimeSpanParse.TryParseTimeSpan(string, System.Globalization.TimeSpanParse.TimeSpanStandardStyles, System.IFormatProvider, ref System.Globalization.TimeSpanParse.TimeSpanResult)
  + System.Globalization.TimeSpanParse.Parse(string, System.IFormatProvider)

Investigating

I suspected that 0:00:00 might simply be invalid as a TimeSpan, so I typed it in by hand and tried it:

> TimeSpan.Parse("0:00:00")
[00:00:00]

It parses without any problem. So what is the difference? Comparing the pasted string with the hand-typed one:

> "0:00:00" == "0:00:00"
false

WHY C# Compiler!!
Thinking about it calmly, the : looked suspicious, so I copied it from each string and checked the character codes:

> (int)':'
58
> (int)':'
58

They are the same. In that case, let's suspect every single character:

> "0:00:00".Select(x => (int)x).ToArray()
int[8] { 48, 58, 48, 48, 58, 48, 48, 8203 }
> "0:00:00".Select(x => (int)x).ToArray()
int[7] { 48, 58, 48, 48, 58, 48, 48 }

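What is 8203?? The pasted string has 8 characters, not 7, and the extra one is at the end. Let's convert it to hexadecimal: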

> 8203.ToString("x")
"200b"

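Then search for \u200b.

Zero-width space

\u200b turns out to be a character called the zero-width space (U+200B ZERO WIDTH SPACE). You cannot see it, but it is there.
It is used, for example, in HTML to mark positions where a line is allowed to break. Sure enough, the text being processed this time came from the Web.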
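Dumping raw integers works, but printing the code point and Unicode category next to each character makes an invisible intruder easier to spot. The following is a minimal sketch; `CodePointDump` and `Describe` are hypothetical names introduced here for illustration, not part of the original code.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class CodePointDump
{
    // Formats every UTF-16 code unit of s as "U+XXXX(UnicodeCategory)",
    // separated by single spaces. Hypothetical helper for illustration.
    public static string Describe(string s) =>
        string.Join(" ", s.Select(c =>
            $"U+{(int)c:X4}({CharUnicodeInfo.GetUnicodeCategory(c)})"));
}
```

Calling `CodePointDump.Describe` on the pasted string should report the last character as `U+200B(Format)`, since the zero-width space belongs to the Format (Cf) category rather than to a space separator category.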

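`Trim()` does not remove it

The code already anticipated whitespace before or after the value and called Trim() on the incoming string. So I looked at the documentation for string.Trim() (the Japanese page was the usual sloppy machine translation, so I am quoting the English version):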

Notes to Callers
The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the IsWhiteSpace(Char) method). Because of this change, the Trim() method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim() method in the .NET Framework 4 and later versions does not remove. In addition, the Trim() method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).

In other words, from .NET Framework 4 onward, Trim() does not remove the zero-width space, because IsWhiteSpace(Char) does not treat U+200B as white space.
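The behavior described in the quoted note can be checked directly. This is a small sketch, assuming .NET Framework 4 or later (or any current .NET); the `TrimCheck` class and its members are hypothetical names used only for this illustration.

```csharp
using System;

public static class TrimCheck
{
    // ZERO WIDTH SPACE, the character found at the end of the failing input.
    public const char Zwsp = '\u200b';

    // Parameterless Trim() strips only characters for which
    // char.IsWhiteSpace returns true, so this tells whether it would strip c.
    public static bool IsStrippedByTrim(char c) => char.IsWhiteSpace(c);

    // Passing the character explicitly makes Trim remove it from both ends.
    // Note: this overload strips ONLY the listed characters, not white space.
    public static string TrimZwsp(string s) => s.Trim(Zwsp);
}
```

`TrimCheck.IsStrippedByTrim('\u200b')` evaluates to false, which is why the original `Trim()` call left the string 8 characters long. `TrimCheck.TrimZwsp` removes the character from either end, but because it no longer trims ordinary white space, it would have to be combined with a regular `Trim()` in real input handling.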

Solution

It is hardly worth calling a solution, but: string.Replace("\u200b","") .
~~This is why I dislike modern Unicode~~
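Replacing `\u200b` handles this one character, but Web-sourced text can carry other invisible format characters as well (U+FEFF, for instance, is mentioned in the quoted documentation). One possible generalization, sketched below, is to drop every character in the Unicode Format (Cf) category before parsing. `InvisibleChars`, `StripFormatChars`, and `ParseLoose` are hypothetical names, and this is an assumption about what a broader fix could look like, not what the original code did.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class InvisibleChars
{
    // Removes every UTF-16 code unit whose Unicode category is Format (Cf),
    // which includes U+200B ZERO WIDTH SPACE and U+FEFF.
    // Hypothetical helper; it only inspects individual code units.
    public static string StripFormatChars(string s) =>
        new string(s
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.Format)
            .ToArray());

    // Strips format characters, trims ordinary white space, then parses.
    public static TimeSpan ParseLoose(string s) =>
        TimeSpan.Parse(StripFormatChars(s).Trim(), CultureInfo.InvariantCulture);
}
```

With this sketch, `InvisibleChars.ParseLoose("0:00:00\u200b")` parses to a zero TimeSpan instead of throwing. The Format category also contains characters that matter for display, such as joiners and directional marks, so stripping all of them is probably only appropriate for machine-readable fields like times and numbers, not for text shown to users.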
