
Generating characters with a Cartesian product to test string functions

Python

Last updated at 2017-09-22, posted at 2017-09-21

Overview

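When testing string functions, you sometimes need to generate every character from the ranges of the bytes that make it up. With a Cartesian product you can pick only the combinations of ranges you want to test closely and generate characters from those, which reduces the amount of work.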
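To give a rough sense of the saving, the sketch below compares the number of combinations in an unrestricted three-byte search with the narrowed ranges used later in this article. The helper name `count_combinations` is made up for illustration.

```python
import itertools

def count_combinations(ranges):
    """Return the number of tuples itertools.product(*ranges) would yield."""
    total = 1
    for r in ranges:
        total *= len(r)
    return total

# Every possible three-byte sequence.
full = [range(0x100)] * 3
# Only the bytes around the hiragana 'あいう' in UTF-8.
targeted = [[0xe3], [0x81], range(0x81, 0x87)]

print(count_combinations(full))      # 16777216
print(count_combinations(targeted))  # 6

# The count agrees with what itertools.product actually yields.
assert count_combinations(targeted) == len(list(itertools.product(*targeted)))
```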

itertools.product

Using itertools.product is the easiest approach. As an example, let's look at the UTF-8 bytes of 'あいう'.

>>> 'あいう'.encode()
b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'

The first byte is 0xe3, the second byte is 0x81, and the third byte falls in the range 0x82 to 0x86.
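The per-position ranges can also be computed instead of read off by eye. The following is a minimal sketch; it assumes every character in the sample encodes to the same number of bytes, because `zip` stops at the shortest sequence. The variable names are illustrative.

```python
text = 'あいう'

# Encode each character separately so the bytes line up by position.
encoded = [ch.encode('utf-8') for ch in text]

# zip(*encoded) groups the first bytes together, then the second bytes, and so on.
# Iterating over a bytes object yields integers, so min() and max() work directly.
byte_ranges = [(min(col), max(col)) for col in zip(*encoded)]

print([(hex(lo), hex(hi)) for lo, hi in byte_ranges])
# [('0xe3', '0xe3'), ('0x81', '0x81'), ('0x82', '0x86')]
```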

Let's look at the combinations of bytes. Each combination is represented as a tuple.

>>> import itertools
>>> [tuple(map(hex, t)) for t in itertools.product([0xe3], [0x81], range(0x81, 0x87))]
[('0xe3', '0x81', '0x81'), ('0xe3', '0x81', '0x82'), ('0xe3', '0x81', '0x83'), ('0xe3', '0x81', '0x84'), ('0xe3', '0x81', '0x85'), ('0xe3', '0x81', '0x86')]

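Let's convert the combinations of numbers into characters. To build the byte string, convert each tuple to a list and pass it to bytes. (bytes also accepts the tuple directly, so the list() call is optional.)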

>>> [bytes(list(e)).decode() for e in itertools.product([0xe3], [0x81], range(0x81, 0x87))]
['ぁ', 'あ', 'ぃ', 'い', 'ぅ', 'う']
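With ranges wider than this example, some combinations are not valid UTF-8 and `decode()` raises `UnicodeDecodeError`. The sketch below wraps the conversion in a generator that skips such combinations; `generate_chars` is a hypothetical helper name, not something from the standard library.

```python
import itertools

def generate_chars(ranges, encoding='utf-8'):
    """Yield each single character whose encoded bytes come from `ranges`."""
    for combo in itertools.product(*ranges):
        try:
            text = bytes(combo).decode(encoding)
        except UnicodeDecodeError:
            continue  # this byte combination is not valid in the encoding
        if len(text) == 1:
            yield text

print(list(generate_chars([[0xe3], [0x81], range(0x81, 0x87)])))
# ['ぁ', 'あ', 'ぃ', 'い', 'ぅ', 'う']

# 0x7f is not a valid UTF-8 continuation byte, so that combination is skipped.
print(list(generate_chars([[0xe3], [0x81], [0x7f, 0x81]])))
# ['ぁ']
```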

Sequence unpacking can be used instead of list():

>>> [bytes([*e]).decode() for e in itertools.product([0xe3], [0x81], range(0x81, 0x87))]

Since Python 3.5, the % operator also works on bytes (PEP 461). A bytes.format method was proposed as well (PEP 460), but that proposal was withdrawn.

>>> [(b'%c' * len(e) % e).decode() for e in itertools.product([0xe3], [0x81], range(0x81, 0x87))]
['ぁ', 'あ', 'ぃ', 'い', 'ぅ', 'う']

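If you prefer to convert from a hexadecimal string, use bytes.fromhex. The format spec is {:02x} rather than {:x} so that byte values below 0x10 are still written as two hex digits.

>>> [bytes.fromhex(('{:02x}'*len(t)).format(*t)).decode() for t in itertools.product([0xe3], [0x81], range(0x81, 0x87))]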
['ぁ', 'あ', 'ぃ', 'い', 'ぅ', 'う']

You can also use bytes.join.

>>> [b''.join(map(lambda a: a.to_bytes(1, byteorder='big'), t)).decode() for t in itertools.product([0xe3], [0x81], range(0x81, 0x87))]
['ぁ', 'あ', 'ぃ', 'い', 'ぅ', 'う']
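As a sanity check, the sketch below builds the bytes with each of the approaches in this section and asserts that they agree for every combination. The variable names are illustrative, and the `%` variant needs Python 3.5 or later.

```python
import itertools

ranges = [[0xe3], [0x81], range(0x81, 0x87)]
results = []

for t in itertools.product(*ranges):
    via_list = bytes(list(t))
    via_unpack = bytes([*t])
    via_percent = b'%c' * len(t) % t
    via_fromhex = bytes.fromhex(('{:02x}' * len(t)).format(*t))
    via_join = b''.join(a.to_bytes(1, byteorder='big') for a in t)

    # All five constructions should produce the same byte string.
    assert via_list == via_unpack == via_percent == via_fromhex == via_join
    results.append(via_list.decode())

print(results)
# ['ぁ', 'あ', 'ぃ', 'い', 'ぅ', 'う']
```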

List comprehensions

You can also build a Cartesian product by nesting the for clauses of a list comprehension.

>>> lists = [[0xe3], [0x81], range(0x81, 0x87)]
>>> [e for e in [(a, b, c) for a in lists[0] for b in lists[1] for c in lists[2]]]
[(227, 129, 129), (227, 129, 130), (227, 129, 131), (227, 129, 132), (227, 129, 133), (227, 129, 134)]

Let's generate the characters.

>>> [bytes(list(e)).decode() for e in [(a, b, c) for a in lists[0] for b in lists[1] for c in lists[2]]]
['ぁ', 'あ', 'ぃ', 'い', 'ぅ', 'う']
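The nested comprehension and itertools.product visit the combinations in the same order, with the rightmost sequence varying fastest. A small sketch to confirm this for the example lists:

```python
import itertools

lists = [[0xe3], [0x81], range(0x81, 0x87)]

# Hand-written product: one for clause per input sequence.
nested = [(a, b, c) for a in lists[0] for b in lists[1] for c in lists[2]]
# Library version: works for any number of input sequences.
product = list(itertools.product(*lists))

print(nested == product)  # True
```

The comprehension needs one for clause per sequence, so itertools.product is the more convenient choice when the number of byte positions varies.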

Finding characters that are not guaranteed to round-trip

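Let's print the cp932 characters that change in a round-trip conversion to Unicode (such as utf-8) and back. Microsoft publishes a list of these characters on its site, which you can use as a reference for test data.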

import itertools
import unicodedata

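def check(l, enc):
    """Print each character from the byte ranges l that does not round-trip through enc."""
    count = 0
    for t in itertools.product(*l):
        c = bytes(list(t))
        # Undecodable bytes become U+FFFD instead of raising an error.
        u = c.decode(enc, 'replace')

        # Skip pairs that decoded to two characters and bytes that could not be decoded.
        if len(u) == 2 or ord(u) == 0xfffd:
            continue

        ret = u.encode(enc, 'replace')

        # Report characters whose re-encoded bytes differ from the original bytes.
        if c != ret:
            print(
                c.hex(), '->',
                'U+'+format(ord(u), 'x'),
                '->', ret.hex(),
                # The '' default avoids a ValueError for characters without a name.
                u, unicodedata.name(u, '')
            )
            count += 1

    print(enc, 'unsafe:', count)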


# First bytes 0x87 to 0xfa, second bytes 0x40 to 0xfe.
l = [range(0x87, 0xfb), range(0x40, 0xff)]
enc = 'cp932'

check(l, enc)
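For use inside automated tests, it can be handier to return the results than to print them. The following is a sketch of such a variant, not the article's original method: `find_unsafe` is a hypothetical name, and it decodes strictly and skips invalid byte sequences by catching `UnicodeDecodeError` instead of looking for U+FFFD.

```python
import itertools

def find_unsafe(ranges, enc):
    """Return (original bytes, character, re-encoded bytes) for each single
    character that does not survive a decode/encode round trip."""
    unsafe = []
    for t in itertools.product(*ranges):
        original = bytes(t)
        try:
            text = original.decode(enc)
        except UnicodeDecodeError:
            continue  # not a valid byte sequence in this encoding
        if len(text) != 1:
            continue  # for example, two single-byte characters
        encoded = text.encode(enc, 'replace')
        if encoded != original:
            unsafe.append((original, text, encoded))
    return unsafe

unsafe = find_unsafe([range(0x87, 0xfb), range(0x40, 0xff)], 'cp932')
print(len(unsafe))

# Show the first few entries as: original bytes -> character -> re-encoded bytes.
for original, ch, encoded in unsafe[:5]:
    print(original.hex(), '->', ch, '->', encoded.hex())
```

The returned list can then be compared against expected test data, or fed to the string function under test.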
