
Reinventing the Wheel: UTF-8 Encoding

Posted at 2019-08-31

Overview

I wondered how UTF-8 encoding actually works.
I looked it up. Huh, so that's how it's done.
Could I reproduce it in Python?

So that is what this post is about.

How UTF-8 encoding works

First, get the character's Unicode code point.
Next, build a byte sequence (a series of 8-digit binary numbers) according to the patterns below.

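Code point value	How to build the byte sequence
Up to `7f` (127)	Write the code point as a 7-digit binary number and fit it into `0xxxxxxx`.
Up to `7ff` (2047)	Write the code point as an 11-digit binary number and fit it into `110xxxxx 10xxxxxx`.
Up to `ffff` (65535)	Write the code point as a 16-digit binary number and fit it into `1110xxxx 10xxxxxx 10xxxxxx`.
Up to `10ffff` (1114111)	Write the code point as a 21-digit binary number and fit it into `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`.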

Finally, write the completed byte sequence in hexadecimal, and the encoding is done.
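As a worked example of the table, the sketch below encodes the single character あ (U+3042, chosen here only for illustration) by following the 3-byte row. The variable names are assumptions made for this example.

```python
# Worked example: encode "あ" (U+3042) using the 3-byte row of the table.
char = "あ"
codepoint = ord(char)             # 0x3042 = 12354, which is above 0x7ff and at most 0xffff

bits = format(codepoint, "016b")  # the code point as a 16-digit binary string

# Fit the digits into 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 digits).
byte_strings = [
    "1110" + bits[:4],
    "10" + bits[4:10],
    "10" + bits[10:],
]

# Turn each 8-digit binary string into one byte.
encoded = bytes(int(b, 2) for b in byte_strings)

print(bits)           # 0011000001000010
print(byte_strings)   # ['11100011', '10000001', '10000010']
print(encoded.hex())  # e38182
```

The result matches what the built-in `"あ".encode("UTF-8")` returns, namely `b'\xe3\x81\x82'`.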

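Reproducing it in Python

What follows is something that

str.encode('UTF-8')

already handles; I built it purely out of curiosity. Apparently this kind of thing is called reinventing the wheel and is considered a very foolish thing to do. (smile)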

def manual_encode_UTF8(char):
    """Try to reproduce by hand what str.encode('UTF-8') does.
    The argument must be a single character."""

    # Get the character's code point.
    codepoint = ord(char)

    # How the UTF-8 byte sequence is built depends on the code point's value.
    def calc(i):
        if i <= 0x7f:
            return _get_bytes1
        if i <= 0x7ff:
            return _get_bytes2
        if i <= 0xffff:
            return _get_bytes3
        if i <= 0x10ffff:
            return _get_bytes4
    get_bytes_function = calc(codepoint)

    # Build and return the UTF-8 byte sequence.
    return get_bytes_function(codepoint)


def _get_bytes1(codepoint):
    """Build the UTF-8 byte sequence for the 1-byte case.
    Fits the binary digits into 0xxxxxxx.
    """

    # Convert the code point to a binary string with as many digits as there are x's.
    codepoint_bin = '{:07b}'.format(codepoint)

    # Fit the binary digits into the x positions.
    bytes_bin = [
        '0' + codepoint_bin,
    ]

    # Convert each binary string to hexadecimal.
    bytes_hex = map( lambda b: '{:02x}'.format(int(b, 2)), bytes_bin )

    # Join them into a bytes object. This is the UTF-8 byte sequence.
    return bytes.fromhex(''.join(bytes_hex))


def _get_bytes2(codepoint):
    """Build the UTF-8 byte sequence for the 2-byte case.
    Fits the binary digits into 110xxxxx 10xxxxxx.
    """

    codepoint_bin = '{:011b}'.format(codepoint)
    bytes_bin = [
        '110' + codepoint_bin[ :5],
        '10'  + codepoint_bin[5: ],
    ]
    return bytes.fromhex( ''.join( map( lambda b: '{:02x}'.format(int(b, 2)), bytes_bin ) ) )


def _get_bytes3(codepoint):
    """Build the UTF-8 byte sequence for the 3-byte case.
    Fits the binary digits into 1110xxxx 10xxxxxx 10xxxxxx.
    """

    codepoint_bin = '{:016b}'.format(codepoint)
    bytes_bin = [
        '1110' + codepoint_bin[  : 4],
        '10'   + codepoint_bin[ 4:10],
        '10'   + codepoint_bin[10:  ],
    ]
    return bytes.fromhex( ''.join( map( lambda b: '{:02x}'.format(int(b, 2)), bytes_bin ) ) )


def _get_bytes4(codepoint):
    """Build the UTF-8 byte sequence for the 4-byte case.
    Fits the binary digits into 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    """

    codepoint_bin = '{:021b}'.format(codepoint)
    bytes_bin = [
        '11110' + codepoint_bin[  : 3],
        '10'    + codepoint_bin[ 3: 9],
        '10'    + codepoint_bin[ 9:15],
        '10'    + codepoint_bin[15:  ],
    ]
    return bytes.fromhex( ''.join( map( lambda b: '{:02x}'.format(int(b, 2)), bytes_bin ) ) )
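
For comparison, the same four patterns can also be written with shifts and masks instead of string slicing. This is only a minimal sketch of that alternative, not the approach taken in this post. `encode_utf8_bitwise` is a hypothetical name introduced here; like the functions above, it assumes a single-character argument and does not reject surrogate code points.

```python
def encode_utf8_bitwise(char):
    """Hypothetical helper: encode one character using shifts and masks."""
    cp = ord(char)
    if cp <= 0x7f:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7ff:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xffff:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


# One sample character for each of the 1-, 2-, 3- and 4-byte cases.
for sample in ("A", "é", "あ", "😀"):
    assert encode_utf8_bitwise(sample) == sample.encode("UTF-8")
```

The string-based version keeps the `x` templates from the table visible in the code, whereas the bitwise version avoids the round trip through binary and hexadecimal strings.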

Let's test manual_encode_UTF8.

# Test with every character that can be UTF-8 encoded.
# Surrogates (U+D800 to U+DFFF) raise UnicodeEncodeError and are skipped.
unicode_chars = ( chr(i) for i in range(int('10ffff', 16)+1) )
for char in unicode_chars:
    try:
        assert manual_encode_UTF8(char) == char.encode('UTF-8'), f'Failed: {char}'
    except UnicodeEncodeError:
        pass

No assertion fired, so the manual encoder produced the same bytes as str.encode('UTF-8') for every character that Python can encode. Hooray.
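
The `except UnicodeEncodeError` branch in the test matters because of the surrogate code points U+D800 to U+DFFF, which Python's built-in encoder refuses to encode. The sketch below checks this; the variable names are made up for the example.

```python
# A surrogate code point cannot be encoded with the default (strict) error handling.
surrogate = chr(0xD800)
try:
    surrogate.encode("UTF-8")
    raised = False
except UnicodeEncodeError:
    raised = True
print(raised)  # True

# Count how many code points the built-in encoder rejects in total.
rejected = 0
for i in range(0x110000):
    try:
        chr(i).encode("UTF-8")
    except UnicodeEncodeError:
        rejected += 1
print(rejected)  # 2048, the size of the range U+D800 to U+DFFF
```

So the test compares the two encoders on all 1,114,112 code points except those 2,048 surrogates.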
