Introducing a Google Colab Script for Easily Extracting Text from PDFs

Python

Posted at 2025-05-19

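PDF has become a widely used format for digital documents in recent years. Yet there are many situations where you want to pull the text out of a PDF and reuse it. In research, business, and education in particular, there is a growing need to copy, edit, or translate PDF content as plain text.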

This is where Google Colab and PyMuPDF come in

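With Google Colab you can run Python scripts in the browser and get started without installing any software on your own machine. On top of that, PyMuPDF (the fitz module) lets you extract text from PDF files easily and quickly.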

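Key features

Page separators
Each page starts with a "--- Page n ---" line, so page boundaries are visible at a glance.
Automatic whitespace trimming
Leading and trailing whitespace is stripped from each page, giving clean text output.
Easy to copy and paste
The whole output can be copied in one go and pasted into another document or editor.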
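
The first two features can be tried out without a PDF at all. The sketch below assumes the page texts have already been extracted as plain strings; `format_pages` and `sample_pages` are hypothetical names used for illustration only and are not part of PyMuPDF.

```python
def format_pages(page_texts):
    """Join per-page texts into one block with '--- Page n ---' separators."""
    blocks = []
    for page_number, text in enumerate(page_texts, start=1):
        # strip() removes leading and trailing whitespace, including blank lines
        blocks.append(f"\n--- Page {page_number} ---\n{text.strip()}")
    return "\n".join(blocks)


# Made-up sample data standing in for text extracted from two pages
sample_pages = ["  First page text.\n\n", "\n\tSecond page text.  "]
print(format_pages(sample_pages))
```

Keeping the formatting in a small pure function like this makes it easy to check the separator and trimming behavior on its own, independent of the upload and extraction steps.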

How to run it

Run the script in Google Colab.
Upload a PDF file.
Text is extracted automatically from every page of the PDF.
Copy the text shown on screen and use it.

# Program Name: pdf_text_extractor_colab.py
# Creation Date: 20250520
# Overview: A script to extract and display plain text from uploaded PDF files in Google Colab.
# Usage: Run each cell in order on Google Colab to upload and process PDF files.

# --- Install required libraries ---
!pip install PyMuPDF --quiet

# --- Import required modules ---
import fitz  # PyMuPDF
from google.colab import files

# --- File upload ---
print("Please upload your PDF file.")
uploaded = files.upload()

# --- Extract and display text ---
for file_name in uploaded:
    print(f"\nProcessing file: {file_name}\n")
    with fitz.open(file_name) as pdf_document:
        extracted_texts = []
        # Iterating over the document yields its pages in order
        for page_num, page in enumerate(pdf_document, start=1):
            text = page.get_text()
            page_header = f"\n--- Page {page_num} ---\n"
            # strip() removes leading/trailing whitespace on each page
            extracted_texts.append(page_header + text.strip())

        # Combine all pages' text and display it as one plain block
        final_output = "\n".join(extracted_texts)
        print(final_output)
