
Handling Japanese with json11

Last updated at 2022-12-25, posted at 2020-10-29

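json11, light and handy

「ソ」「表」「能」

If these characters ring a bell, you already know what this is about: the Shift-JIS 5C problem, the so-called "dame-moji" (troublesome characters). In Shift-JIS, the second byte of each of these characters is 0x5C, the same byte as the ASCII backslash.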
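To see why these three characters cause trouble, it helps to look at their bytes. The sketch below spells out the Shift-JIS encodings of ソ, 表 and 能 (0x83 0x5C, 0x95 0x5C, 0x94 0x5C) and counts how many bytes equal the ASCII backslash. The names `count_backslash_bytes` and `kDameSjis` are made up for illustration and are not part of json11.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the bytes equal to 0x5C ('\\') in a byte string.
std::size_t count_backslash_bytes(const std::string &bytes) {
    std::size_t n = 0;
    for (unsigned char b : bytes) {
        if (b == 0x5C) ++n;
    }
    return n;
}

// "ソ表能" written as raw Shift-JIS bytes, so the encoding of this source
// file does not matter.
const std::string kDameSjis = "\x83\x5C\x95\x5C\x94\x5C";
```

Calling `count_backslash_bytes(kDameSjis)` reports three backslash bytes in a string that contains no backslash character at all, which is exactly what trips up byte-oriented escaping code.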

cpp.cpp

	std::cout << 
		static_cast<json11::Json>(json11::Json::object({{ "dame", "ソ表能" }})).dump() << std::endl;

{"dame": "ソ\表\能\"}

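Argh.

Let's tweak dump a little

When a byte in the range 0x81 to 0x9f or 0xe0 to 0xfc comes up, it is the first byte of a two-byte Shift-JIS character, so copy the following byte as-is and advance the index past it. The change is three lines of code added to the final else block.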
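The range check can also be pulled out into a small predicate. This is only a sketch: `is_sjis_lead_byte` is a hypothetical name, not something json11 provides, and it uses the same two ranges as this article.

```cpp
#include <cstdint>

// Hypothetical helper (not part of json11): true if `ch` lies in one of the
// Shift-JIS lead-byte ranges used in this article.
inline bool is_sjis_lead_byte(char ch) {
    const std::uint8_t b = static_cast<std::uint8_t>(ch);
    return (b >= 0x81 && b <= 0x9f) || (b >= 0xe0 && b <= 0xfc);
}
```

The cast to `uint8_t` matters because `char` may be signed, in which case bytes of 0x80 and above compare as negative values.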

json11.cpp

static void dump(const string &value, string &out) {
    out += '"';
    for (size_t i = 0; i < value.length(); i++) {
        const char ch = value[i];
        if (ch == '\\') {
            out += "\\\\";
        } else if (ch == '"') {
            out += "\\\"";
        } else if (ch == '\b') {
            out += "\\b";
        } else if (ch == '\f') {
            out += "\\f";
        } else if (ch == '\n') {
            out += "\\n";
        } else if (ch == '\r') {
            out += "\\r";
        } else if (ch == '\t') {
            out += "\\t";
        } else if (static_cast<uint8_t>(ch) <= 0x1f) {
            char buf[8];
            snprintf(buf, sizeof buf, "\\u%04x", ch);
            out += buf;
        } else if (static_cast<uint8_t>(ch) == 0xe2 && static_cast<uint8_t>(value[i+1]) == 0x80
                   && static_cast<uint8_t>(value[i+2]) == 0xa8) {
            out += "\\u2028";
            i += 2;
        } else if (static_cast<uint8_t>(ch) == 0xe2 && static_cast<uint8_t>(value[i+1]) == 0x80
                   && static_cast<uint8_t>(value[i+2]) == 0xa9) {
            out += "\\u2029";
            i += 2;
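        } else {
            out += ch;
            // Shift-JIS lead byte: copy the trail byte untouched so a 0x5C trail byte is not escaped.
            // The length check guards against a truncated character at the end of the string.
            if (i + 1 < value.length() && (((static_cast<uint8_t>(ch) >= 0x81) && (static_cast<uint8_t>(ch) <= 0x9f)) ||
                ((static_cast<uint8_t>(ch) >= 0xe0) && (static_cast<uint8_t>(ch) <= 0xfc))))
                out += value[++i];
        }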
    }
    out += '"';
}

{"dame": "ソ表能"}

Much better.
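For checking the idea outside of json11, here is a cut-down sketch of the same approach. `escape_sjis` is a hypothetical stand-in, not json11's `dump`: it escapes only backslash and double quote, and assumes the input is Shift-JIS.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Simplified stand-in for json11's dump(): escapes only '\\' and '"', and
// copies a Shift-JIS lead byte together with its trail byte untouched.
std::string escape_sjis(const std::string &value) {
    std::string out = "\"";
    for (std::size_t i = 0; i < value.size(); i++) {
        const std::uint8_t b = static_cast<std::uint8_t>(value[i]);
        const bool lead = (b >= 0x81 && b <= 0x9f) || (b >= 0xe0 && b <= 0xfc);
        if (lead && i + 1 < value.size()) {
            out += value[i];
            out += value[++i];  // trail byte: copied as-is, even when it is 0x5C
        } else if (value[i] == '\\') {
            out += "\\\\";
        } else if (value[i] == '"') {
            out += "\\\"";
        } else {
            out += value[i];
        }
    }
    out += '"';
    return out;
}
```

Feeding it the Shift-JIS bytes of ソ表能 gives back the same bytes wrapped in quotes, while a genuine ASCII backslash is still doubled.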

The same problem showed up in parse_string

I got a tip on Twitter: apparently parsing has a similar problem.
Let's have a look.

main.cpp

	std::string err;
	std::cout << json11::Json::parse("{\"dame\": \"ソ表能\"}", err).dump() << std::endl;
	std::cout << err << std::endl;

invalid escape character (-107)

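Whoa.
It looks like making the same check inside parse_string and advancing the position should do it.
Somewhere around where non-escaped characters are handled... well, about here.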
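As for the -107 in the error message: the parser treats the 0x5C trail byte of ソ as the start of an escape sequence, and the byte after it is 0x95, the lead byte of 表. Read as a signed 8-bit value, 0x95 is -107. The sketch below reproduces just that arithmetic. `byte_after_first_backslash` is a hypothetical helper, and the result assumes the usual two's-complement conversion.

```cpp
#include <string>

// Hypothetical helper: returns the byte following the first 0x5C in `bytes`,
// read as a signed 8-bit value, or 0 if there is no such byte.
int byte_after_first_backslash(const std::string &bytes) {
    const std::string::size_type pos = bytes.find('\\');
    if (pos == std::string::npos || pos + 1 >= bytes.size()) return 0;
    return static_cast<signed char>(bytes[pos + 1]);
}
```

Called on the Shift-JIS bytes "\x83\x5C\x95\x5C\x94\x5C", it returns the same value that appears in the error message.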

The following is an excerpt. If you have applied the fix above, it is around line 505 or so.

json11.cpp

    string parse_string() {
        string out;
        long last_escaped_codepoint = -1;
        while (true) {
            if (i == str.size())
                return fail("unexpected end of input in string", "");

            char ch = str[i++];

            if (ch == '"') {
                encode_utf8(last_escaped_codepoint, out);
                return out;
            }

            if (in_range(ch, 0, 0x1f))
                return fail("unescaped " + esc(ch) + " in string", "");

            // The usual case: non-escaped characters
            if (ch != '\\') {
                encode_utf8(last_escaped_codepoint, out);
                last_escaped_codepoint = -1;
                out += ch;
                // added from here ----------------------------------->
                // Shift-JIS lead byte: consume the trail byte too, so a 0x5C trail byte is not read as an escape.
                if (i < str.size() &&
                    (((static_cast<uint8_t>(ch) >= 0x81) && (static_cast<uint8_t>(ch) <= 0x9f)) ||
                     ((static_cast<uint8_t>(ch) >= 0xe0) && (static_cast<uint8_t>(ch) <= 0xfc)))) {
                    ch = str[i++];
                    out += ch;
                }
                // added up to here ----------------------------------<

                continue;
            }

I have only looked at the parts I use at work, so there may be other spots.
I think this covers most cases, but who knows.
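The parsing side can be exercised in isolation too. The sketch below is not json11 code: `find_closing_quote` is a hypothetical function that scans a string body for its closing quote, skipping both backslash escapes and Shift-JIS trail bytes, under the same lead-byte ranges as above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical scanner: returns the index of the closing '"' of a string body
// that starts at `start`, or std::string::npos if none is found.
std::size_t find_closing_quote(const std::string &str, std::size_t start) {
    for (std::size_t i = start; i < str.size(); i++) {
        const std::uint8_t b = static_cast<std::uint8_t>(str[i]);
        const bool lead = (b >= 0x81 && b <= 0x9f) || (b >= 0xe0 && b <= 0xfc);
        if (lead) {
            i++;  // skip the trail byte, even when it is 0x5C
        } else if (str[i] == '\\') {
            i++;  // skip the escaped character
        } else if (str[i] == '"') {
            return i;
        }
    }
    return std::string::npos;
}
```

Without the lead-byte branch, the 0x5C trail byte of ソ would be taken as an escape and swallow the closing quote.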

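Aside: the 5C problem

As an aside (apologies for opening like a Ryotaro Shiba novel), this is an aside.
There was a time when single-line comments, the "//" kind, were banned.
The reason was this very 5C problem: when a dame-moji came at the end of a comment, a compiler from overseas would decide, **"there's a \ (backslash), so the next line is a comment too,"** and the build would fail with a compile error.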
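The mechanism can be reproduced without any Japanese: a backslash immediately before the newline splices the next line onto the current one, even inside a `//` comment. In the sketch below (the function name `count_after_splice` is made up), a plain ASCII backslash plays the part of the dame-moji trail byte. Here the swallowed line disappears silently rather than causing an error, although compilers such as GCC typically warn about a multi-line comment.

```cpp
// Demonstrates line splicing inside a single-line comment.
int count_after_splice() {
    int n = 0;
    n += 1;    // this comment ends in a backslash, like a dame-moji would \
    n += 10;   // spliced into the comment above, so this statement never runs
    n += 100;
    return n;  // 1 + 100, not 111
}
```

If the swallowed line had been a declaration that later code depended on, the result would be a compile error like the one described above.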

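Outside C/C++ you rarely have to think about it, and even in C++ these days it's out with Shift-JIS, good riddance to MS932, UTF-8 is the one true god (personal opinion), so there are fewer occasions to worry about dame-moji or the differences between Shift-JIS and Windows-31J. Still, for those of us on an island nation in the Far East wielding this odd language called Japanese, it may be worth keeping the existence of the 5C problem tucked somewhere between the cerebrum and the corpus callosum. End of aside.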

Forked
