
[Python3] Going back and forth between Shift_JIS, UTF-8, and ASCII

Posted at 2017-02-02

Introduction

This is an attempt, in Python 3, to take a string that came out garbled, such as '\udc82Ђ\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a', and somehow get it to display correctly.

References

When Shift_JIS bytes are decoded as UTF-8

By default: `UnicodeDecodeError`

Trying to decode Shift_JIS bytes with decode() (which defaults to UTF-8) raises UnicodeDecodeError.

>>> bytes_sjis = "ひらカタ漢字".encode("shift_jis")
>>> bytes_sjis
b'\x82\xd0\x82\xe7\x83J\x83^\x8a\xbf\x8e\x9a'
>>> bytes_sjis.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte
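
The exception object itself carries the details shown in the traceback, which can help when locating the offending bytes in a larger input. A minimal sketch, using only the standard attributes of UnicodeDecodeError (the variable names `bad` and `info` are illustrative):

```python
bytes_sjis = "ひらカタ漢字".encode("shift_jis")

try:
    bytes_sjis.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.object is the input bytes; exc.start and exc.end delimit the
    # slice that the codec could not decode.
    bad = exc.object[exc.start:exc.end]
    info = (exc.encoding, exc.start, exc.end, bad, exc.reason)
    print(info)
```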

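Decoding results with an error handler specified

The error in the previous section occurs because the optional errors argument of decode() defaults to "strict". Passing a different value for errors suppresses the exception and returns a different string instead.

The errors argument specifies the response when the input string can't be converted according to the encoding's rules.
Legal values for this argument are
'strict' (raise a UnicodeDecodeError exception),
'replace' (use U+FFFD, REPLACEMENT CHARACTER),
'ignore' (just leave the character out of the Unicode result), or
'backslashreplace' (inserts a \xNN escape sequence).
Unicode HOWTO

In addition, 'surrogateescape' can be specified:

'surrogateescape' - Replace each byte with an individual surrogate code ranging from U+DC80 to U+DCFF.
7.2. codecs - Codec registry and base classes

Let's look at the output.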

>>> bytes_sjis.decode("utf-8", errors="replace")
'�Ђ�J�^����'
>>> bytes_sjis.decode("utf-8", errors="ignore")
'ЂJ^'
>>> bytes_sjis.decode("utf-8", errors="backslashreplace")
'\\x82Ђ\\xe7\\x83J\\x83^\\x8a\\xbf\\x8e\\x9a'
>>> bytes_sjis.decode("utf-8", errors="surrogateescape")
'\udc82Ђ\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a'

For reference, in an environment that is not UTF-8, such as Windows (CP932, roughly Shift_JIS), the same results are displayed as follows.

>>> bytes_sjis.decode("utf-8", errors="replace")
'\ufffd\u0402\ufffdJ\ufffd^\ufffd\ufffd\ufffd\ufffd'
>>> bytes_sjis.decode("utf-8", errors="ignore")
'\u0402J^'
>>> bytes_sjis.decode("utf-8", errors="backslashreplace")
'\\x82\u0402\\xe7\\x83J\\x83^\\x8a\\xbf\\x8e\\x9a'
>>> bytes_sjis.decode("utf-8", errors="surrogateescape")
'\udc82\u0402\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a'
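
The two listings appear to show the same strings; only the way the REPL displays them differs, because characters the console encoding cannot represent are shown as escapes. As a small sketch of that point, the built-in `ascii()` produces the escaped form on any platform (the variable names here are illustrative):

```python
bytes_sjis = "ひらカタ漢字".encode("shift_jis")
replaced = bytes_sjis.decode("utf-8", errors="replace")

# The string holds the same code points on every platform.
code_points = [f"U+{ord(ch):04X}" for ch in replaced]
print(code_points)

# ascii() escapes every non-ASCII character, whatever the console encoding is.
escaped = ascii(replaced)
print(escaped)
```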

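Recovering the original string after decoding Shift_JIS bytes as UTF-8

With 'replace' or 'ignore', information has been discarded, so the original cannot be recovered. With the other handlers, the original string can be restored as shown below.

Why does this work in the backslashreplace case?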

>>> bytes_sjis = "ひらカタ漢字".encode("shift_jis")
>>> backslash_str = bytes_sjis.decode("utf-8", errors="backslashreplace")
>>> backslash_str.encode().decode("unicode_escape").encode("raw_unicode_escape").decode("shift_jis")
'ひらカタ漢字'
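
One way to see why the backslashreplace chain works is to run it one step at a time. The sketch below only splits the one-liner into named steps (`step1` to `step3` are illustrative names); the comments describe how the `unicode_escape` and `raw_unicode_escape` codecs treat each kind of input.

```python
bytes_sjis = "ひらカタ漢字".encode("shift_jis")
backslash_str = bytes_sjis.decode("utf-8", errors="backslashreplace")

# Step 1: back to bytes. "Ђ" (U+0402) becomes d0 82 again, while the escapes
# stay as literal ASCII text such as b"\\x82".
step1 = backslash_str.encode("utf-8")

# Step 2: unicode_escape turns each "\xNN" escape into the code point U+00NN
# and reads every other byte as Latin-1, so d0 82 becomes U+00D0 U+0082.
step2 = step1.decode("unicode_escape")

# Step 3: every code point is now below U+0100, and raw_unicode_escape writes
# such code points as single bytes, which reproduces the Shift_JIS bytes.
step3 = step2.encode("raw_unicode_escape")

print(step3 == bytes_sjis)
```

Note that this relies on the text containing no literal backslash. Shift_JIS characters whose second byte is 0x5C, such as ソ (0x83 0x5C) or 表 (0x95 0x5C), leave a bare backslash in the backslashreplace output, and the unicode_escape step can then misread it or raise an error. The surrogateescape route does not have this ambiguity.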

>>> surrogate_str = bytes_sjis.decode("utf-8", errors="surrogateescape")
>>> surrogate_str.encode("utf-8", errors="surrogateescape").decode("shift_jis")
'ひらカタ漢字'
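
The surrogateescape round trip generalises to any pair of codecs, since encoding with the same codec and the same handler restores the original bytes. A minimal sketch, assuming the garbled string was produced with errors="surrogateescape" (`redecode` is a hypothetical helper name, not a standard function):

```python
def redecode(text: str, wrong: str, right: str) -> str:
    """Undo a decode made with the wrong codec and errors="surrogateescape"."""
    # Encoding with the same codec and handler restores the original bytes.
    raw = text.encode(wrong, errors="surrogateescape")
    return raw.decode(right)


original = "ひらカタ漢字"
garbled = original.encode("shift_jis").decode("utf-8", errors="surrogateescape")
restored = redecode(garbled, wrong="utf-8", right="shift_jis")
print(restored == original)
```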

When UTF-8 bytes are decoded as ASCII

Just like the Shift_JIS -> UTF-8 -> Shift_JIS round trip, let's try UTF-8 -> ASCII -> UTF-8.

By default: `UnicodeDecodeError`

>>> bytes_utf8 = "ひらカタ漢字".encode("utf-8")
>>> bytes_utf8
b'\xe3\x81\xb2\xe3\x82\x89\xe3\x82\xab\xe3\x82\xbf\xe6\xbc\xa2\xe5\xad\x97'
>>> bytes_utf8.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

Decoding results with an error handler specified

>>> bytes_utf8.decode("ascii", errors="ignore")
''
>>> bytes_utf8.decode("ascii", errors="replace")
'������������������'
>>> bytes_utf8.decode("ascii", errors="backslashreplace")
'\\xe3\\x81\\xb2\\xe3\\x82\\x89\\xe3\\x82\\xab\\xe3\\x82\\xbf\\xe6\\xbc\\xa2\\xe5\\xad\\x97'
>>> bytes_utf8.decode("ascii", errors="surrogateescape")
'\udce3\udc81\udcb2\udce3\udc82\udc89\udce3\udc82\udcab\udce3\udc82\udcbf\udce6\udcbc\udca2\udce5\udcad\udc97'

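Recovering the original string after decoding UTF-8 bytes as ASCII

The same round trips work for UTF-8 and ASCII, as long as the garbled strings really come from decode("ascii", ...). A bare .encode().decode() does not restore them: it returns the backslash-escaped string unchanged and raises UnicodeEncodeError on the surrogate-escaped one.

>>> backslash_str = bytes_utf8.decode("ascii", errors="backslashreplace")
>>> backslash_str.encode().decode("unicode_escape").encode("raw_unicode_escape").decode("utf-8")
'ひらカタ漢字'
>>> surrogate_str = bytes_utf8.decode("ascii", errors="surrogateescape")
>>> surrogate_str.encode("ascii", errors="surrogateescape").decode("utf-8")
'ひらカタ漢字'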

Common examples

Finally, here are some examples of forcing garbled text back into shape after forgetting a setting.

When you forget `ensure_ascii=False` in `json` output

>>> import json
>>> ascii_json = json.dumps({"キー":"値"})
>>> ascii_json
'{"\\u30ad\\u30fc": "\\u5024"}'
>>> ascii_json.encode().decode("unicode_escape")
'{"キー": "値"}'
>>> ascii_json.encode().decode("raw_unicode_escape")
'{"キー": "値"}'
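
The escape-codec trick treats the JSON as plain text, so it can misread JSON escapes such as `\"` and turns escaped surrogate pairs (non-BMP characters such as emoji) into lone surrogates. When the string is valid JSON, a likely more robust route is to parse it and serialise it again. A minimal sketch, with extra keys added purely for illustration:

```python
import json

data = {"キー": "値", "quote": 'a "b"', "emoji": "😀"}
ascii_json = json.dumps(data)  # ensure_ascii=True by default
print(ascii_json)

# Parse the escaped JSON back into Python objects, then serialise it again
# without forcing ASCII. The JSON parser handles \" and surrogate pairs itself.
readable_json = json.dumps(json.loads(ascii_json), ensure_ascii=False)
print(readable_json == json.dumps(data, ensure_ascii=False))
```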

When you have not changed the `encoding` of a result fetched with `requests`

>>> import requests
>>> r = requests.get('http://www.mof.go.jp/')
>>> r.text
'...
<meta property="og:title" content="\x8dà\x96±\x8fÈ\x83z\x81[\x83\x80\x83y\x81[\x83W" />
...'
>>> r.text.encode("raw_unicode_escape").decode("shift_jis")
'...
<meta property="og:title" content="財務省ホームページ" />
...'
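
The garbled `r.text` here looks like Shift_JIS bytes decoded as ISO-8859-1, the fallback requests uses for text responses whose headers carry no charset. The sketch below reproduces that situation without any network access; `body` and `garbled` are stand-ins for `r.content` and `r.text`, and the page content is an assumption taken from the output shown above.

```python
# Stand-in for r.content: the Shift_JIS bytes a server might send.
title = '<meta property="og:title" content="財務省ホームページ" />'
body = title.encode("shift_jis")

# Stand-in for r.text when the response was decoded as ISO-8859-1.
garbled = body.decode("iso-8859-1")

# ISO-8859-1 maps each byte to exactly one code point, so encoding with it
# restores the original bytes, which can then be decoded correctly.
restored = garbled.encode("iso-8859-1").decode("shift_jis")
print(garbled.encode("iso-8859-1") == body)
```

With requests itself, setting `r.encoding = "shift_jis"` (or `r.encoding = r.apparent_encoding`) before reading `r.text`, or decoding `r.content` directly, should avoid the garbling in the first place.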
