More than 5 years have passed since last update.

文字コードの話

Last updated at 2020-10-03Posted at 2016-09-19

Page 1 of 26

文字コード(文字符号化方式)いろいろ
文字化けについて

この話のゴール:
テキストデータが含む情報、含まない情報を知る

Introduction

"abc" という文字列をファイルに書き込む

% echo -n abc > abc.txt

このときファイルには何が記録されてる？

ファイルの中身

"abc" に決まってるじゃないか

% less abc.txt
abc

本当でしょうか？（ファイルとは何か？）

ファイルの中身 as binary

"abc" と書き込んだファイルの中身をダンプしてみると

% od -v -t x1 abc.txt
0000000 61 62 63
0000003

0x61, 0x62, 0x63 という長さ3のバイト列が書き込まれている。

文字符号化方式

'a' という文字を 0x61 という1バイトで表現
'b' という文字を 0x62 という1バイトで表現
'c' という文字を 0x63 という1バイトで表現

文字列を何らかのルールに沿ってバイト列で表現したものがファイルに保存されている。

文字の集合からそのバイト列表現の集合への写像
= 文字符号化方式(character encoding scheme)

ASCII

0x00〜0x7f の1byteで文字を表現

% python -c "f = open('ascii.txt', 'w'); \
f.write(''.join(chr(c) for c in range(0x80))); \
f.close()"
% od -v -t x1 -a ascii.txt
0000000  00  01  02  03  04  05  06  07  08  09  0a  0b  0c  0d  0e  0f
        nul soh stx etx eot enq ack bel  bs  ht  nl  vt  ff  cr  so  si
0000020  10  11  12  13  14  15  16  17  18  19  1a  1b  1c  1d  1e  1f
        dle dc1 dc2 dc3 dc4 nak syn etb can  em sub esc  fs  gs  rs  us
0000040  20  21  22  23  24  25  26  27  28  29  2a  2b  2c  2d  2e  2f
         sp   !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
0000060  30  31  32  33  34  35  36  37  38  39  3a  3b  3c  3d  3e  3f
          0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
0000100  40  41  42  43  44  45  46  47  48  49  4a  4b  4c  4d  4e  4f
          @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
0000120  50  51  52  53  54  55  56  57  58  59  5a  5b  5c  5d  5e  5f
          P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
0000140  60  61  62  63  64  65  66  67  68  69  6a  6b  6c  6d  6e  6f
          `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
0000160  70  71  72  73  74  75  76  77  78  79  7a  7b  7c  7d  7e  7f
          p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~ del
0000200

ASCII バイト列の特徴

バイト数 = 文字列長
0x20 から 0x7e が通常文字
0x30 から 0x39 は数字 (0-9)
0x30 を引けば対応する数に
0x41 から 0x5a は大文字 (A-Z)
0x20 を足すと小文字に
0x61 から 0x7a は小文字 (a-z)
0x20 を引くと大文字に

かなのバイト列表現

平仮名はどのように表現される？

% python
Python 3.8.2 (default, Mar 13 2020, 10:14:16)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "あいう".encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

あれ？

UnicodeEncodeError

Unicode

全ての文字を単一の集合で表す符号化文字集合
コードポイント U+000000〜U+10FFFF

UnicodeEncodeError

ユニコード文字列をバイト列にエンコードする時のエラー
指定した文字符号化方式ではバイト列表現が定まらない ("あ", "い", "う" は 'ascii' の定義域に含まれていなかった)

かなを許容する文字符号化方式

>>> tobytestr = lambda s: " ".join(
... "{:02x}".format(c) for c in s)
>>> tobytestr("あいう".encode("sjis"))
'82 a0 82 a2 82 a4'
>>> tobytestr("あいう".encode("euc-jp"))
'a4 a2 a4 a4 a4 a6'
>>> tobytestr("あいう".encode("iso-2022-jp"))
'1b 24 42 24 22 24 24 24 26 1b 28 42'
>>> tobytestr("あいう".encode("utf-8"))
'e3 81 82 e3 81 84 e3 81 86'
>>> tobytestr("あいう".encode("utf-16"))
'ff fe 42 30 44 30 46 30'
>>> tobytestr("あいう".encode("utf-16be"))
'30 42 30 44 30 46'
>>> tobytestr("あいう".encode("utf-16le"))
'42 30 44 30 46 30'
>>> tobytestr("あいう".encode("utf-32"))
'ff fe 00 00 42 30 00 00 44 30 00 00 46 30 00 00'
>>> tobytestr("あいう".encode("utf-32be"))
'00 00 30 42 00 00 30 44 00 00 30 46'
>>> tobytestr("あいう".encode("utf-32le"))
'42 30 00 00 44 30 00 00 46 30 00 00'

ASCIIとの違い

>>> tobytestr("abあい".encode("sjis"))
'61 62 82 a0 82 a2'

a -> [0x61]
b -> [0x62]
あ -> [0x82, 0xA0]
い -> [0x82, 0xA2]

文字によっては複数バイトで表現する

sjis, euc-jp, iso-2022-jp

JIS X 0208 と呼ばれる文字集合を表現

文字	JIS X 0208	Shift_JIS	EUC-JP	ISO-2022-JP
あ	4区2点	82 a0	a4 a2	24 22
い	4区4点	82 a2	a4 a4	24 24

>>> tobytestr("abあい".encode("sjis"))
'61 62 82 a0 82 a2'
>>> tobytestr("abあい".encode("euc-jp"))
'61 62 a4 a2 a4 a4'
>>> tobytestr("abあい".encode("iso-2022-jp"))
'61 62 1b 24 42 24 22 24 24 1b 28 42'

Shift_JIS: 2バイト文字の1バイト目 >= 0x80
EUC-JP: 2バイト文字の1、2バイト目 >= 0x80
ISO-2022-JP: 7bit (< 0x80) で表現
エスケープシーケンスで文字集合を切り替え
'1b 24 42': JIS X 0208に
'1b 28 42': US-ASCIIに

utf-8

>>> tobytestr("abあい".encode("utf-8"))
'61 62 e3 81 82 e3 81 84'

Unicode のコードポイントを1〜4バイトで表現

U+0000〜U+007f: 1バイト (0.......)
'a' (U+0061) → 61
U+0080〜U+07ff: 2バイト (110..... 10......)
'α' (U+03b1) → ce b1
U+0800〜U+ffff: 3バイト (1110.... 10...... 10......)
'あ' (U+3042) → e3 81 82
U+10000〜U+10ffff: 4バイト (11110... 10...... 10...... 10......)
'🍶' (U+1f376) → f0 9f 8d b6

utf-16

>>> tobytestr("abあい".encode("utf-16"))
'ff fe 61 00 62 00 42 30 44 30'
>>> tobytestr("abあい".encode("utf-16be"))
'00 61 00 62 30 42 30 44'
>>> tobytestr("abあい".encode("utf-16le"))
'61 00 62 00 42 30 44 30'

BOM (Byte Order Mark): 0xfeff
ff fe: Little Endian
fe ff: Big Endian
U+0000〜U+D7FF, U+E000〜U+FFFFは2バイトで表現
U+10000〜U+10FFFFはサロゲートペア(4バイト)で表現
110110.......... 110111..........

>>> tobytestr("\U0001f376".encode("utf-16"))
'ff fe 3c d8 76 df'

utf-32

>>> tobytestr("あいう".encode("utf-32"))
'ff fe 00 00 42 30 00 00 44 30 00 00 46 30 00 00'
>>> tobytestr("あいう".encode("utf-32be"))
'00 00 30 42 00 00 30 44 00 00 30 46'
>>> tobytestr("あいう".encode("utf-32le"))
'42 30 00 00 44 30 00 00 46 30 00 00'

BOM: 0x0000feff
1文字を4バイトで表現
バイト数 = 文字列長 x 4

バイト列→文字列

ここまでの話:

文字列をどうバイト列で表現するか
いろいろな文字符号化方式

逆に、バイト列が与えられたときに

それはどういう文字列の表現か？
あるいは文字列の表現になっていない？

デコード

テキストを解釈するプログラム

テキストエディタ (e.g. Emacs)
言語処理系 (e.g. Python)
Webブラウザ (e.g. Chrome)

→バイト列を文字列として解釈（デコード）

文字化け

エンコードに用いたのと異なる文字符号化方式でデコードすると意図しない文字列に

>>> print('ほげ'.encode('iso-2022-jp').decode('utf-8'))
$[$2
>>> print('ほげ'.encode('euc-jp').decode('sjis'))
､ﾛ､ｲ

文字化けの原因

テキストデータ自体は文字符号化方式を明示的に含まない。

文字符号化方式の自動判別に失敗した
ファイルから文字符号化方式は読み取れない
バイト列の傾向から自動判別するしか
間違った文字符号化方式を指定された

→ 誤動作の原因にもなり得る

テキストエディタに文字コードを伝える

vim

# vim: set fileencoding=<encoding name> :

Emacs

# -*- coding: <encoding name> -*-

Pythonにソースの文字コードを伝える

ソースの1行目か2行目に
正規表現 ^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)
例えば coding: <encoding name>

Pythonにテキストデータの文字コードを伝える

Python で文字列を符号化したバイト列が与えられたとき、 bytes.decode() でデコードして str に変換できる。

decode(encoding='utf-8', errors='strict') method of builtins.bytes instance
    Decode the bytes using the codec registered for encoding.

    encoding
      The encoding with which to decode the bytes.
    errors
      The error handling scheme to use for the handling of decoding errors.
      The default is 'strict' meaning that decoding errors raise a
      UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
      as well as any other name registered with codecs.register_error that
      can handle UnicodeDecodeErrors.

Webブラウザに文字コードを伝える

HTTPヘッダのContent-Typeのcharsetパラメータ
meta宣言のContent-Typeのcharsetパラメータ
各要素のcharset属性

まとめ

いろいろな文字符号化方式
ASCII
Shift_JIS, EUC-JP, iso-2022-jp
UTF-8, UTF-16, UTF-32
テキストデータ自体は文字符号化方式の情報を持たない
プロトコル/プログラムそれぞれの伝え方
バイト列を文字列として解釈する際のミス→文字化け、誤動作

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

文字コードの話

Contents

Introduction

ファイルの中身

ファイルの中身 as binary

文字符号化方式

ASCII

ASCII バイト列の特徴

かなのバイト列表現

UnicodeEncodeError

かなを許容する文字符号化方式

ASCIIとの違い

sjis, euc-jp, iso-2022-jp

utf-8

utf-16

utf-32

バイト列→文字列

デコード

文字化け

文字化けの原因

テキストエディタに文字コードを伝える

Pythonにソースの文字コードを伝える

Pythonにテキストデータの文字コードを伝える

Webブラウザに文字コードを伝える

まとめ