Splitting a PDF into PNGs with PyMuPDF

Posted at 2024-02-27

Background

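This is for when you want to split a PDF into pages in Python and convert each page to a PNG file.
There are several Python libraries for working with PDFs, such as pypdf and pdfminer, but I decided to try PyMuPDF, which had momentum on Google Trends.
Its documentation also looks the most solid compared with the other libraries.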

Implementation

import fitz
from io import BytesIO

class PDF:

    # Determine whether the PDF is text-based or image-based.
    # Only the first page is checked for extractable text.
    def is_text_base(self, file: BytesIO) -> bool:
        doc = fitz.open(stream=file.getvalue(), filetype="pdf")
        return len(doc[0].get_text()) > 0

    # Save each page of the PDF locally as a PNG file
    # (page.number is zero-based, so the first file is page_0.png)
    def convert_to_png_files(self, file: BytesIO) -> None:
        doc = fitz.open(stream=file.getvalue(), filetype="pdf")
        for page in doc:
            pix = page.get_pixmap()
            file_path = f"./page_{page.number}.png"
            pix.save(file_path)

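if __name__ == "__main__":
    pdf = PDF()
    with open('path/to/hoge.pdf', 'rb') as f:
        # Wrap the bytes in BytesIO: the methods call getvalue(),
        # which a plain file object does not have
        data = BytesIO(f.read())
    if pdf.is_text_base(data):
        pdf.convert_to_png_files(data)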
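
One small wrinkle in the code above: page.number is zero-based, so the output starts at page_0.png, and unpadded numbers sort oddly once a PDF has ten or more pages (page_10.png lands before page_2.png). Below is a minimal sketch of one way to build one-based, zero-padded file names. The helper name png_name and its parameters are hypothetical, not part of PyMuPDF.

```python
def png_name(page_index: int, page_count: int, prefix: str = "page") -> str:
    """Build a one-based, zero-padded PNG file name.

    page_index is the zero-based page index (e.g. PyMuPDF's page.number),
    page_count is the total number of pages in the document.
    This helper is a hypothetical illustration, not a PyMuPDF API.
    """
    # Pad to the number of digits in the highest page number
    width = len(str(page_count))
    return f"{prefix}_{page_index + 1:0{width}d}.png"


if __name__ == "__main__":
    # Names for a 12-page document sort in page order
    for i in (0, 1, 11):
        print(png_name(i, 12))
```

Inside the loop in convert_to_png_files, this could presumably be used as pix.save(png_name(page.number, doc.page_count)), assuming the PyMuPDF version in use exposes doc.page_count.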
