
Reproducing Quoted-Printable encoding in Python 3

Posted at 2022-07-16

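What is Quoted-Printable?

The following is quoted from Wikipedia.

Quoted-printable (also called QP encoding) is an encoding scheme that uses printable characters (for example, alphanumerics and the equals sign "=") to transfer 8-bit data over a 7-bit data path. It is defined as a Content-Transfer-Encoding so that it can be used in Internet e-mail.

My interpretation is as follows: ASCII was designed as a 7-bit code in the early days of computing, when resources were scarce and every bit was precious, in order to keep data as small as possible. Quoted-Printable, which converts between 8-bit and 7-bit data, is sometimes used in situations where 7-bit ASCII must stay compatible with other character encodings such as UTF-8.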
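
As a quick check of the "8-bit to 7-bit" idea, the sketch below uses the standard `quopri` module to encode the UTF-8 bytes of a single character and confirms that every output byte fits in 7 bits. The sample character and the variable names are chosen only for illustration.

```python
import quopri

raw = "は".encode("utf-8")           # 3 bytes, each with the high bit set
encoded = quopri.encodestring(raw)   # Quoted-Printable representation

print(raw)      # b'\xe3\x81\xaf'
print(encoded)  # b'=E3=81=AF'

# Every input byte is >= 0x80, while every output byte is < 0x80 (7-bit safe).
all_input_high = all(b >= 0x80 for b in raw)
all_output_7bit = all(b < 0x80 for b in encoded)
print(all_input_high, all_output_7bit)
```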

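Characteristics of the algorithm

When the original data consists mostly of ASCII characters, the encoded string can be read almost as-is, and the encoded size stays close to the original size.
When the data contains few ASCII characters, as with Japanese text, the encoded output becomes a run of "=" followed by two hexadecimal digits, such as "=E3=81=AF", and the size grows because each such byte becomes three characters.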

e.g.
こんにちは世界 hello world!
⇅
=E3=81=93=E3=82=93=E3=81=AB=E3=81=A1=E3=81=AF=E4=B8=96=E7=95=8C hello world!
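
The size difference described above can be measured directly. The following is a minimal sketch using the standard `quopri` module; the helper name `qp_size_ratio` is hypothetical and exists only for this illustration.

```python
import quopri

def qp_size_ratio(text: str) -> float:
    """Return the encoded size divided by the original UTF-8 size."""
    raw = text.encode("utf-8")
    return len(quopri.encodestring(raw)) / len(raw)

ascii_ratio = qp_size_ratio("hello world!")      # ASCII only
japanese_ratio = qp_size_ratio("こんにちは世界")  # no ASCII at all

print(ascii_ratio)     # 1.0: nothing needs escaping
print(japanese_ratio)  # 3.0: every byte becomes "=XX"
```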

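Implementation in Python

With the quopri library in the standard library, you can use Quoted-Printable right away.
This time, to deepen my understanding of the algorithm, I tried reproducing its behavior in my own way.

Encoding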
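
For reference, a round trip with `quopri` might look like the sketch below. `quopri.encodestring` and `quopri.decodestring` operate on bytes, so the text is converted to UTF-8 first; the variable names are only illustrative.

```python
import quopri

text = "こんにちは世界 hello world!"

# quopri works on bytes, so encode the str to UTF-8 first.
encoded = quopri.encodestring(text.encode("utf-8"))
decoded = quopri.decodestring(encoded).decode("utf-8")

print(encoded.decode("ascii"))
print(decoded == text)
```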

def QPeoncode(text):
    bytes_text = text.encode()
    result = []

    for i in range(len(bytes_text)):
        octet = bytes_text[i]
        is_converting = False

        if octet == 0x09 or octet == 0x20: # tab or space
            # Encode if a line break follows or this is the last byte; otherwise output as-is
            if i < len(bytes_text)-2 and bytes_text[i+1] == 0x0D and bytes_text[i+2] == 0x0A:
                is_converting = True
            elif i < len(bytes_text)-1 and bytes_text[i+1] == 0x0A:
                is_converting = True
            elif i == len(bytes_text)-1:
                is_converting = True
            else:
                is_converting = False

        elif octet == 0x0A or octet == 0x0D: # LF or CR
            # Line-break characters are not encoded
            is_converting = False
        else:
            # 0x21 ("!") to 0x3C ("<") and 0x3E (">") to 0x7E ("~") are output as-is; 0x3D ("=") is encoded
            is_converting = octet <= 0x20 or 0x7F <= octet or octet == 0x3D

        if is_converting:
            # Two uppercase hex digits, zero-padded (e.g. 0x0C -> "=0C")
            result.append("=%02X" % (octet))
        else:
            result.append(chr(octet))

    return "".join(result)

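The given plain text is first converted to bytes, and the is_converting flag is then used to decide, byte by byte, whether each byte must be converted. (Printable ASCII characters other than "=" do not need to be converted.)

When conversion is needed, the byte is converted to the form "=" + hexadecimal digits.
(This is the result.append(...) call in the "if is_converting:" branch.)
When conversion is not needed, the character is appended to result as-is.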
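
One detail of the "=" + hexadecimal conversion is worth spelling out: Quoted-Printable expects exactly two hex digits per octet, so octets below 0x10 need zero-padding. The sketch below compares the two format specifiers; the helper name `escape_octet` is hypothetical and used only for illustration.

```python
def escape_octet(octet: int) -> str:
    """Format one octet as "=" followed by exactly two uppercase hex digits."""
    return "=%02X" % octet

unpadded = "=%X" % 0x0C     # "%X" alone drops the leading zero

print(unpadded)             # =C   (only one digit)
print(escape_octet(0x0C))   # =0C
print(escape_octet(0xE3))   # =E3
```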

Decoding

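import re

def QPdecode(text):
    decoded = bytearray()
    # Remove soft line breaks ("=" immediately followed by a line break)
    text = re.sub(r'=\r?\n', '', text)
    # Strip trailing line breaks
    text = re.sub(r'[\r\n]+$', '', text)

    idx = 0
    while idx < len(text):
        c = text[idx]
        if c == "=":
            # "=" is followed by two hex digits that represent one octet
            octetStr = text[idx+1: idx+3]
            decoded += bytes.fromhex(octetStr)
            idx += 2
        else:
            # Unencoded ASCII character: copy it as-is
            decoded += c.encode()

        idx += 1

    # Decode only after all bytes are collected, since one character may span several bytes
    result = decoded.decode()
    return result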
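
To make the decoding step concrete, the following sketch decodes the three escaped octets of "=E3=81=AF" by hand. It is a simplified illustration that assumes the input contains only "=XX" escapes, not a replacement for the function above.

```python
# Decode the three escaped octets of "=E3=81=AF" by hand.
encoded = "=E3=81=AF"

hex_digits = encoded.replace("=", "")   # "E381AF"
raw = bytes.fromhex(hex_digits)         # b'\xe3\x81\xaf'
print(raw.decode("utf-8"))              # は
```

The bytes are collected first and decoded as UTF-8 only at the end, because a single byte such as 0xE3 is not a complete UTF-8 character on its own.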

Checking the behavior

import quopri

text = "こんにちは世界 hello world!"
encoded = QPeoncode(text)
decoded = QPdecode(encoded)

print(text)
print(encoded)
print(decoded)

# Check whether the result matches quopri
print(encoded == quopri.encodestring(text.encode("utf-8")).decode())
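
For longer inputs the comparison with quopri may not hold, because quopri also wraps long lines with soft line breaks ("=" followed by a newline) to keep each encoded line within 76 characters, which the hand-written encoder above does not do. The sketch below only demonstrates that behavior of quopri; the 100-character sample string is an arbitrary assumption.

```python
import quopri

long_text = "a" * 100  # longer than the 76-character line limit
encoded = quopri.encodestring(long_text.encode("ascii")).decode("ascii")

# quopri wraps the output with soft line breaks ("=" followed by a newline).
for line in encoded.split("\n"):
    print(len(line), line)

# Decoding removes the soft line breaks and restores the original text.
restored = quopri.decodestring(encoded.encode("ascii")).decode("ascii")
print(restored == long_text)
```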

Reference sites

Algoful's explanation is very easy to follow, and I recommend it.

Thank you for reading to the end. If you find any mistakes, please feel free to leave a comment.
