
Extracting and Printing Text from a PDF File with Python

Last updated at 2023-06-07, posted at 2023-05-25

Prerequisites

Windows 10
Python 3
PyMuPDF library
Tkinter library

The sample PDF file is the Japanese Wikipedia main page below, saved locally as a PDF.
https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8

The official documentation for the PyMuPDF library is here:
https://pymupdf.readthedocs.io/en/latest/

The official documentation for the Tkinter library is here:
https://docs.python.org/ja/3/library/tkinter.html

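Goal

Use Python and the PyMuPDF library to extract text from a PDF stored on the local machine.
Before printing, the extracted text is round-tripped through Shift_JIS as a countermeasure against garbled output on Windows. Characters that Shift_JIS cannot represent are discarded, and the result is written to standard output.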
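
The Shift_JIS round trip described above can be sketched on its own, without a PDF. This is a minimal illustration only: the helper name `drop_unencodable` and the sample string are assumptions made for this example, not part of the original scripts.

```python
def drop_unencodable(text: str, encoding: str = "shift_jis") -> str:
    """Return text with every character the codec cannot represent removed."""
    # errors="ignore" silently discards characters that have no mapping
    # in the target encoding instead of raising UnicodeEncodeError.
    encoded = text.encode(encoding, errors="ignore")
    return encoded.decode(encoding)


# Hypothetical sample: Japanese text plus an emoji that Shift_JIS lacks.
sample = "日本語のテキスト 😀 and ASCII"
print(drop_unencodable(sample))
```

Note that this approach loses information silently. If that is a concern, passing `errors="replace"` to `encode` substitutes a `?` for each unencodable character, so the gaps stay visible.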

Environment Setup

Install the PyMuPDF library.

pip install pymupdf
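
After installing, one way to confirm that the package is visible to the interpreter is to look the module up without importing it. The sketch below uses only the standard library. The helper name `is_importable` is an assumption made for this example, and it checks for the module name `fitz` because that is the name the scripts in this article import.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the named top-level module can be found."""
    # find_spec returns None when no such top-level module is installed.
    return importlib.util.find_spec(module_name) is not None


# The scripts in this article import PyMuPDF under the name "fitz".
print("fitz importable:", is_importable("fitz"))
```

If this prints `False`, the `pip` that ran the install likely belongs to a different Python environment than the one running the script.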

Sample

pdfread.py

# -*- coding: utf-8 -*-

import fitz  # PyMuPDF library

pdf = 'C:\\Wikipedia.pdf'
doc = fitz.open(pdf)

for page in doc:
    # Extract the page's text as a Unicode string.
    tmp = page.get_text()
    # Round-trip through Shift_JIS, dropping characters it cannot represent.
    tmp2 = tmp.encode('shift_jis', 'ignore')
    text = tmp2.decode('shift_jis')
    print(text)

doc.close()
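
The same flow can also be arranged as reusable functions. The following is a sketch under assumptions rather than the article's own method: the names `iter_page_text` and `print_pdf_text` are hypothetical, and it relies on a PyMuPDF document being iterable page by page and usable in a `with` statement.

```python
from typing import Iterable, Iterator


def iter_page_text(pages: Iterable, encoding: str = "shift_jis") -> Iterator[str]:
    """Yield each page's text, minus characters the codec cannot encode."""
    for page in pages:
        raw = page.get_text()  # any object with a get_text() method works
        yield raw.encode(encoding, errors="ignore").decode(encoding)


def print_pdf_text(path: str) -> None:
    """Open the PDF at path and print the text of every page (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so iter_page_text has no dependency on it

    # The with statement closes the document even if an error occurs.
    with fitz.open(path) as doc:
        for text in iter_page_text(doc):
            print(text)
```

Calling `print_pdf_text('C:\\Wikipedia.pdf')` would then behave like the script above. Keeping the text cleanup in `iter_page_text` means it can be exercised with simple stand-in page objects, without a real PDF.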

Next, use the Tkinter library so that the PDF can be selected through a GUI file dialog.

pdfread2.py

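# -*- coding: utf-8 -*-

import fitz  # PyMuPDF library
from tkinter import filedialog  # Tkinter library

typ = [('PDF files', '*.pdf')]  # only show PDF files in the dialog
initial_dir = './'              # start in the current directory
pdf = filedialog.askopenfilename(filetypes=typ, initialdir=initial_dir)

doc = fitz.open(pdf)

for page in doc:
    # Extract the page's text as a Unicode string.
    tmp = page.get_text()
    # Round-trip through Shift_JIS, dropping characters it cannot represent.
    tmp2 = tmp.encode('shift_jis', 'ignore')
    text = tmp2.decode('shift_jis')
    print(text)

doc.close()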
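
Two practical details of the file dialog are worth a sketch: `askopenfilename` returns an empty value when the user cancels, and Tkinter creates an empty main window unless one is created and hidden explicitly. The helpers below are hypothetical names introduced for illustration, not part of the original script.

```python
from typing import Callable, Optional


def pick_pdf(ask: Callable[[], str]) -> Optional[str]:
    """Call the dialog function; return the chosen path, or None if cancelled."""
    path = ask()
    # A cancelled dialog yields an empty value, which is falsy.
    return path if path else None


def ask_with_tkinter() -> str:
    """Show a PDF file dialog without leaving a blank root window on screen."""
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window that Tk creates
    try:
        return filedialog.askopenfilename(
            filetypes=[("PDF files", "*.pdf")], initialdir="./"
        )
    finally:
        root.destroy()  # release the hidden window once the dialog closes
```

A script could call `pdf = pick_pdf(ask_with_tkinter)` and exit early when the result is `None`, rather than passing an empty path to `fitz.open`.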

References
