
Extracting the body from an email file in Python

Python

Posted at 2024-01-30

While looking into how to extract the body from an email file in Python, nearly every example I found used the Message class. I wondered why nobody uses the EmailMessage class instead, so I tried it.

from email.message import EmailMessage


def parse_message(message: EmailMessage) -> tuple[str, str, list[tuple[str, str, str | bytes]]]:
    """Extract the plain text, the HTML, and the attachments from an EmailMessage."""
    text, html, attachments = "", "", []

    for part in message.walk():
        if part.is_multipart():
            continue
        content_type = part.get_content_type().lower()
        filename = part.get_filename()
        content = part.get_content()
        if filename:
            attachments.append((content_type, filename, content))
        elif content_type == "text/plain":
            text += content
        elif content_type == "text/html":
            html += content
        else:
            # Error is not a built-in name; ValueError is used for an unexpected Content-Type.
            raise ValueError(f"Unknown Content-Type: {content_type}")

    return text, html, attachments
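
The function above relies on get_content() returning a str for text parts and bytes for binary parts. The following is a minimal sketch that checks this on a message built in memory; the addresses, the body text, and the file name data.bin are made up for illustration and are not from the original article.

```python
from email.message import EmailMessage

# Build a small sample message (hypothetical addresses and contents).
msg = EmailMessage()
msg["Subject"] = "Sample"
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg.set_content("Hello in plain text.")
msg.add_alternative("<p>Hello in HTML.</p>", subtype="html")
msg.add_attachment(
    b"\x00\x01\x02",
    maintype="application",
    subtype="octet-stream",
    filename="data.bin",
)

# For every leaf part, record its content type, its file name,
# and the Python type that get_content() returns.
parts = [
    (part.get_content_type(), part.get_filename(), type(part.get_content()).__name__)
    for part in msg.walk()
    if not part.is_multipart()
]

for entry in parts:
    print(entry)
```

Only the part with a file name would end up in attachments; the two text parts would be appended to text and html respectively.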

Here is an example that reads an email file stored in Amazon S3 and extracts the body.

from email import message_from_binary_file
from email.policy import EmailPolicy
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="...", Key="...")
message = message_from_binary_file(obj["Body"], policy=EmailPolicy())
headers = message.items()
text, html, attachments = parse_message(message)
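
message_from_binary_file accepts any binary file object, so the same call can be tried without S3. The following is a rough sketch that uses io.BytesIO as a stand-in for the S3 response body; the raw message bytes are a made-up sample.

```python
import io
from email import message_from_binary_file
from email.policy import EmailPolicy

# A tiny hand-written message (hypothetical sample data).
raw = (
    b"From: sender@example.com\n"
    b"To: recipient@example.com\n"
    b"Subject: Sample\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"\n"
    b"Hello\n"
)

# BytesIO plays the role of obj["Body"] in the S3 example.
with io.BytesIO(raw) as fp:
    message = message_from_binary_file(fp, policy=EmailPolicy())

headers = dict(message.items())  # header name -> header value
body = message.get_content()     # decoded text of this single-part message

print(type(message).__name__)
print(headers["Subject"])
print(body.strip())
```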

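The policy= argument of message_from_binary_file defaults to Compat32, which keeps backward compatibility with Python 3.2, and parsing with that policy returns a Message object. Passing EmailPolicy explicitly returns an EmailMessage object instead, which is a little more convenient when parsing (although the difference is not large).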
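
The difference between the two policies can be seen with a short sketch like the one below. It uses message_from_bytes rather than message_from_binary_file only to keep the example small, and the raw message is a made-up sample.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import EmailPolicy

# A tiny hand-written message (hypothetical sample data).
raw = (
    b"From: sender@example.com\n"
    b"Subject: Sample\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"\n"
    b"Hello\n"
)

legacy = message_from_bytes(raw)                        # default policy (compat32)
modern = message_from_bytes(raw, policy=EmailPolicy())  # explicit EmailPolicy

# Only the EmailPolicy result is an EmailMessage with get_content().
legacy_is_email_message = isinstance(legacy, EmailMessage)
modern_is_email_message = isinstance(modern, EmailMessage)
legacy_has_get_content = hasattr(legacy, "get_content")
modern_has_get_content = hasattr(modern, "get_content")

print(type(legacy).__name__, legacy_has_get_content)
print(type(modern).__name__, modern_has_get_content)
```

With the legacy Message object the body has to be read through get_payload(), whereas EmailMessage also offers get_content(), which parse_message depends on.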

