Converting documents (PDF, PowerPoint, Word, Excel) sent to Discord into images and posting them in a thread

Posted at 2023-12-01, last updated at 2023-12-01

This article improves and extends the following article.

An evolved version of my notes on something I built for my own circle

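Background

When a PDF or other document (PowerPoint, Word, Excel) is sent, it is inconvenient that you cannot see its contents without downloading it.
This is especially inconvenient when viewing from a smartphone.

So I will build something that converts a document to images whenever one is sent.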

Specification

When a PDF, Word, PowerPoint, or Excel file is sent to a text channel in a server, the bot creates a thread in that text channel and sends the document converted to images.

Environment

Docker

python:3.9.13-bullseye
discord.py[voice]==2.3.2
pdf2image==1.16.2

On Windows you could use Office 365, but I wanted this to run on Linux, so this time I use LibreOffice, a free office suite.

Installation

requirements.txt

discord.py[voice]==2.3.2
pdf2image==1.16.2

Dockerfile

FROM python:3.9.13-bullseye

# Time zone
RUN apt update && apt -y install tzdata && \
    cp /usr/share/zoneinfo/Asia/Tokyo /etc/localtime

RUN apt update
RUN apt -yV upgrade

# poppler
RUN apt install -y poppler-utils poppler-data

# LibreOffice
RUN apt install -y libgl1-mesa-dev
RUN apt install -y libreoffice libreoffice-l10n-ja libreoffice-dmaths libreoffice-ogltrans libreoffice-writer2xhtml libreoffice-help-ja

# Japanese fonts
RUN wget https://moji.or.jp/wp-content/ipafont/IPAexfont/IPAexfont00301.zip
RUN unzip IPAexfont00301.zip
RUN mkdir -p /usr/share/fonts/ipa
RUN cp IPAexfont00301/*.ttf /usr/share/fonts/ipa

# Refresh the font cache
RUN fc-cache -fv

RUN pip install -U pip==23.0.1

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

・・・

# Run the bot (Dockerfile comments must be on their own line)
CMD ["python", "main.py"]

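Processing flow

Check the format of each attachment
Create a thread
If it is an Office file
- Convert it to PDF, then convert the PDF to images
If it is a PDF
- Convert it to images
Send the images to the thread in batches of 10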
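
The batching step can be isolated into a small helper. The following is a minimal sketch, not the code used by the bot: `batch_pages` and `BATCH_SIZE` are hypothetical names, and the size of 10 reflects Discord's limit of 10 attachments per message. Each batch carries its 1-based first and last page number so that a caption can be built for it.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Assumed batch size: Discord accepts at most 10 attachments per message.
BATCH_SIZE = 10


def batch_pages(
    pages: Sequence[T], size: int = BATCH_SIZE
) -> List[Tuple[int, int, List[T]]]:
    """Split pages into batches and attach 1-based first/last page numbers."""
    batches = []
    for start in range(0, len(pages), size):
        chunk = list(pages[start : start + size])
        batches.append((start + 1, start + len(chunk), chunk))
    return batches


if __name__ == "__main__":
    # 23 dummy "pages" stand in for the converted images.
    for first, last, chunk in batch_pages(list(range(23))):
        print(f"pages {first}-{last}: {len(chunk)} images")
```

Returning the page range together with each batch avoids keeping a separate running counter while sending.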

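As far as I could find, LibreOffice can only convert local files, so Office files are downloaded to disk, converted to PDF, and then deleted.
PDF files are downloaded as binary data and converted in memory.
The previous version ran the conversion synchronously; this version runs it asynchronously, so the conversion does not block the bot.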
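
To illustrate why offloading matters, here is a minimal sketch using only the standard library. `blocking_convert` is a hypothetical stand-in for a blocking conversion call, and `heartbeat` is a hypothetical coroutine that keeps running while the conversion is in progress, which shows that the event loop stays responsive.

```python
import asyncio
import time


def blocking_convert(n_pages: int) -> list:
    """Hypothetical stand-in for a blocking call such as a PDF-to-image conversion."""
    time.sleep(0.2)  # simulate slow work
    return [f"page-{i + 1}" for i in range(n_pages)]


async def heartbeat(ticks: list) -> None:
    """Record timestamps while the conversion runs in a worker thread."""
    for _ in range(3):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)


async def main() -> tuple:
    loop = asyncio.get_running_loop()
    ticks = []
    # Passing None selects the loop's default thread pool executor.
    pages, _ = await asyncio.gather(
        loop.run_in_executor(None, blocking_convert, 3),
        heartbeat(ticks),
    )
    return pages, ticks


if __name__ == "__main__":
    pages, ticks = asyncio.run(main())
    print(pages, len(ticks))
```

If `blocking_convert` were called directly inside a coroutine, no other coroutine could run until it returned.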

It is implemented as a Cog.

Code

file_viewer.py

# Standard library and third-party modules
import asyncio
import discord
from discord.ext import commands
import io
import os
import pdf2image

# Internal modules (import the class from the module, not the module itself)
from mylib.PDFConverter import PDFConverter


class FileViewer(commands.Cog):
    def __init__(self, bot):
        self.bot = bot
        self.supported_extensions = [
            # .pdf
            "application/pdf",
            # .xls
            "application/vnd.ms-excel",
            # .xlsx
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            # .doc
            "application/msword",
            # .docx
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            # .ppt
            "application/vnd.ms-powerpoint",
            # .pptx
            "application/vnd.openxmlformats-officedocument.presentationml.presentation",
        ]

    @commands.Cog.listener()
    async def on_message(self, message):
        if len(message.attachments) == 0:
            return
        if message.channel.type != discord.ChannelType.text:
            return
        # Ignore the message if none of the attachments has a supported content type
        attachments = [
            attachment
            for attachment in message.attachments
            if attachment.content_type in self.supported_extensions
        ]
        if len(attachments) == 0:
            return
        thread = await message.create_thread(name=attachments[0].filename)
        for attachment in attachments:
            loop = asyncio.get_running_loop()
            images = []
            # pdf -> jpeg
            if attachment.content_type == "application/pdf":
                pdf_io = io.BytesIO()
                await attachment.save(pdf_io)
                images = await loop.run_in_executor(
                    None, pdf2image.convert_from_bytes, pdf_io.read()
                )
            elif attachment.content_type in self.supported_extensions:
                # Office file -> PDF (via LibreOffice) -> jpeg
                await attachment.save(attachment.filename)
                # Swap only the final extension; str.replace would also rewrite
                # earlier occurrences of the extension text in the file name.
                pdf_path = os.path.splitext(attachment.filename)[0] + ".pdf"
                converter = PDFConverter(attachment.filename, ".")
                await loop.run_in_executor(None, converter.start)
                images = await loop.run_in_executor(
                    None, pdf2image.convert_from_path, pdf_path
                )
                os.remove(attachment.filename)
                os.remove(pdf_path)

            await thread.send(
                embed=discord.Embed(
                    title=attachment.filename, color=discord.Color.blue()
                )
            )
            # Split into a list of batches with at most 10 images each
            images = [images[idx : idx + 10] for idx in range(0, len(images), 10)]
            count = 1
            for image_container in images:
                files = []
                for image in image_container:
                    fileio = io.BytesIO()
                    image.save(fileio, format="jpeg")
                    fileio.seek(0)
                    files.append(discord.File(fileio, filename="image.jpg"))
                    count += 1
                await thread.send(
                    content=f"Pages {count-len(files)}-{count-1}", files=files
                )


async def setup(bot):
    await bot.add_cog(FileViewer(bot))
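
When deriving the PDF path from the uploaded file name, only the final extension should change. The sketch below contrasts a plain `str.replace` with `pathlib`. `pdf_path_for` is a hypothetical helper, and it assumes that LibreOffice writes the PDF into the output directory under the input's base name.

```python
from pathlib import Path


def pdf_path_for(filename: str, outdir: str = ".") -> Path:
    """Return the path where the converted PDF is expected (hypothetical helper)."""
    # with_suffix swaps only the last extension, unlike str.replace, which
    # rewrites every occurrence of the extension text in the name.
    return Path(outdir) / Path(filename).with_suffix(".pdf").name


if __name__ == "__main__":
    name = "docs_v2.doc"
    print(name.replace(name.split(".")[-1], "pdf"))  # naive approach
    print(pdf_path_for(name))
```

For a name like `docs_v2.doc`, the naive approach also rewrites the `doc` at the start of the name, so the computed path no longer matches the file on disk.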

mylib/PDFConverter.py

import glob
import logging
import os
import subprocess
import shutil


default_user_profile = os.environ["HOME"] + "/.config/libreoffice/4/user"


class PDFConverter:
    def __init__(
        self,
        file_in: str,
        file_out: str,
        timeout_sec: int = 30,
        user_profile: str = None,
    ):
        self.file_in = file_in  # Office document to convert
        self.file_out = file_out  # Directory where the converted PDF is stored
        self.timeout_sec = timeout_sec  # Timeout limit for the conversion
        # Create a new user profile from the default user profile
        self.user_profile = user_profile
        if self.user_profile:
            if not os.path.exists(self.user_profile):
                shutil.copytree(default_user_profile, self.user_profile)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.stop()

    def start(self):
        args = [
            "libreoffice",
            "--headless",
            "--language=ja",
            '--infilter=",,64"',
            "--convert-to",
            "pdf",
            self.file_in,
            "--outdir",
            self.file_out,
        ]
        if self.user_profile:
            args.append("-env:UserInstallation=file://%s" % self.user_profile)
        stdout_str = ""
        stderr_str = ""
        rc = 0
        try:
            # Run the PDF conversion; terminate the soffice process on timeout
            ret = subprocess.run(
                args,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                timeout=self.timeout_sec,
                check=True,
                text=True,
            )
            rc = ret.returncode
            stdout_str = ret.stdout
            stderr_str = ret.stderr
        except subprocess.CalledProcessError as cpe:
            rc = -1
            stdout_str = cpe.stdout
            stderr_str = cpe.stderr
        except subprocess.TimeoutExpired as te:
            rc = -2
            stdout_str = te.stdout
            stderr_str = te.stderr
        finally:
            # Always clean up temporary files; unexpected errors still propagate
            self.stop()
        if stdout_str:
            logging.info(stdout_str)
        if stderr_str:
            logging.info(stderr_str)
        return rc

    def stop(self):
        # Delete temporary files left behind when a timeout occurs
        tmp_files = self.file_out + "/*.tmp"
        for f in glob.glob(tmp_files):
            os.remove(f)
        logging.info("soffice finished")
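
The return-code convention used by `start` (0 on success, -1 on a non-zero exit, -2 on timeout) can be tried without LibreOffice. This is a minimal sketch: `run_with_timeout` is a hypothetical function, and the Python interpreter itself is used as a stand-in child process.

```python
import subprocess
import sys


def run_with_timeout(args: list, timeout_sec: float) -> int:
    """Run a command; return 0 on success, -1 on non-zero exit, -2 on timeout."""
    try:
        subprocess.run(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            timeout=timeout_sec,
            check=True,  # raise CalledProcessError on a non-zero exit code
            text=True,
        )
    except subprocess.CalledProcessError:
        return -1
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return -2
    return 0


if __name__ == "__main__":
    py = sys.executable  # stand-in child process instead of LibreOffice
    print(run_with_timeout([py, "-c", "print('ok')"], 10))
    print(run_with_timeout([py, "-c", "import sys; sys.exit(3)"], 10))
    print(run_with_timeout([py, "-c", "import time; time.sleep(5)"], 0.5))
```

A caller can check this value before looking for the generated PDF, since a failed or timed-out conversion produces no output file.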

In action

PC

Smartphone

Now you no longer have to bother downloading files♪

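Remaining issues

Because MS Office files are converted with LibreOffice, the layout may break in some cases
Because the images are sent to a thread, it only works in server channels
Because the conversion is resource-intensive, running it as a public bot would be difficult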
