
Extracting text from a PDF with Python and saving it to a text file

Last updated at 2023-06-06. Posted at 2023-05-25.

Prerequisites

Windows 10
Python 3
PyMuPDF library
Tkinter library

The sample PDF is the Wikipedia main page below, saved locally as a PDF.
https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8

The official documentation for the PyMuPDF library is here:
https://pymupdf.readthedocs.io/en/latest/

The official documentation for the Tkinter library is here:
https://docs.python.org/ja/3/library/tkinter.html

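Purpose

Use the PyMuPDF library to extract text from a local PDF and save it to a text file. To avoid garbled characters (mojibake) on Windows, the character encoding is handled uniformly as Shift_JIS. Characters that Shift_JIS cannot represent are silently dropped, because the scripts encode with errors='ignore'.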
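To make the effect of errors='ignore' concrete, here is a minimal stdlib-only sketch of the same encode/decode round trip applied to short strings. The helper name sjis_roundtrip and the sample strings are illustrative assumptions and are not part of the scripts in this article. The sketch also compares Python's shift_jis codec with cp932, the Windows variant, which covers some extra characters such as circled digits.

```python
def sjis_roundtrip(text: str, encoding: str = 'shift_jis') -> str:
    """Return `text` with every character that `encoding` cannot represent removed."""
    # 'ignore' skips unencodable characters instead of raising UnicodeEncodeError
    return text.encode(encoding, 'ignore').decode(encoding)


if __name__ == '__main__':
    # The emoji has no Shift_JIS representation, so it is dropped
    print(sjis_roundtrip('日本語😀abc'))
    # The circled digit is dropped by 'shift_jis' but kept by 'cp932'
    print(sjis_roundtrip('①日本語'))
    print(sjis_roundtrip('①日本語', 'cp932'))
```

If losing characters silently is a concern, passing 'replace' instead of 'ignore' to encode() leaves a '?' marker where a character was dropped.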

Environment setup

Installing the PyMuPDF library

pip install pymupdf
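After installation, a quick check that the module can be found may save some debugging later. The following is a small sketch using only the standard library. The helper name is_installed is an illustrative assumption, and 'fitz' is simply the import name used by the scripts in this article.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module named `module_name` is on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    # 'fitz' is the import name used by the scripts in this article
    print('PyMuPDF importable:', is_installed('fitz'))
```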

Sample

pdf2text.py

# -*- coding: utf-8 -*-

import fitz  # PyMuPDF library

pdf = 'C:\\Wikipedia.pdf'
out_path = f'{pdf}_to_text.txt'  # i.e. C:\Wikipedia.pdf_to_text.txt

# errors='ignore' drops any character that Shift_JIS cannot represent,
# so no separate encode/decode round trip is needed before writing.
with fitz.open(pdf) as doc, \
        open(out_path, 'w', encoding='shift_jis', errors='ignore') as f:
    for page in doc:
        f.write(page.get_text())

Use the Tkinter library so that the PDF can be selected through a GUI.

pdf2text2.py

# -*- coding: utf-8 -*-
import os
import sys
from tkinter import filedialog  # Tkinter library

import fitz  # PyMuPDF library

filetypes = [('PDF files', '*.pdf')]
initial_dir = './'

# askopenfilename returns an empty value if the dialog is cancelled
pdf = filedialog.askopenfilename(filetypes=filetypes, initialdir=initial_dir)
if not pdf:
    sys.exit('No PDF selected.')

# Replace the .pdf extension with the _to_text.txt suffix
out_path = os.path.splitext(pdf)[0] + '_to_text.txt'

# errors='ignore' drops any character that Shift_JIS cannot represent
with fitz.open(pdf) as doc, \
        open(out_path, 'w', encoding='shift_jis', errors='ignore') as f:
    for page in doc:
        f.write(page.get_text())
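The GUI script builds the output path by stripping the extension with os.path.splitext before appending _to_text.txt. As a sketch of one alternative, the same derivation can be written with pathlib. The function name text_output_path and its suffix parameter are illustrative assumptions, not part of the original scripts.

```python
from pathlib import PurePath


def text_output_path(pdf_path: str, suffix: str = '_to_text.txt') -> str:
    """Return the output path: same directory, extension replaced by `suffix`."""
    p = PurePath(pdf_path)
    # p.stem is the file name without its final extension
    return str(p.with_name(p.stem + suffix))


if __name__ == '__main__':
    print(text_output_path('Wikipedia.pdf'))
```

Only the final extension is removed, so a name such as report.v2.pdf keeps its inner dot.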

References

Motivation

I wanted to first save the text extracted in the article below to a text file, so I wrote this program.
