Python code that uses the ChatGPT API to summarize PDF slides and output them to Word

Last updated at 2024-08-23, posted at 2024-08-22

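Overview

This article presents Python code that reads slides in PDF format, uses an OpenAI GPT-4-family model (gpt-4o-mini in the code) to generate a summary of each slide, builds a list of the important terms on each slide with explanations, and writes the results to a Word document. It walks step by step through installing the required libraries in an Anaconda environment, installing poppler, and the details of the code. The text of this article was written with GPT-4o.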

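Environment

OS: Windows 10
Python: 3.8 or later (Anaconda environment)
Required libraries: pdfplumber, openai (a release before 1.0, since the code calls openai.ChatCompletion), python-docx, pdf2image, reportlab (listed here, but not imported by the code below)
Required tool: poppler (version 24.07.0)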

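Output example

Source: Ministry of Economy, Trade and Industry (METI), Semiconductor and Digital Industry Strategy
https://www.meti.go.jp/policy/mono_info_service/joho/conference/semicon_digital/0011/0011-2.html

Output: a Summary section is added below each slide.

Output: the important terms and their explanations are listed at the end of the document.

Full code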

import pdfplumber
import openai  # requires openai<1.0: the script uses the legacy openai.ChatCompletion interface
from docx import Document
from docx.shared import Inches
from pdf2image import convert_from_path
import os
import time
import gc

# Set the OpenAI API key
openai.api_key = 'YOUR_OPENAI_API_KEY'

# Set the input and output file paths here
pdf_path = 'path_to_source_pdf.pdf'
output_word_path = 'output_file_name.docx'
temp_image_dir = 'temp_image_storage_path/temp_images'

# Create the directory for temporary images (no error if it already exists)
os.makedirs(temp_image_dir, exist_ok=True)

def summarize_text(text):
    while True:
        try:
            print("Sending text to OpenAI for summarization...")
            response = openai.ChatCompletion.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
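                    # Prompt (Japanese): "Summarize the following text in Japanese as bullet points,
                    # but do not put [・-] at the start of each line."
                    {"role": "user", "content": f"以下のテキストを日本語で箇条書きで要約してください。ただし、文頭に[・-]は不要です。:\n{text}"}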
                ],
                max_tokens=500
            )
            summary = response['choices'][0]['message']['content'].strip()
            print("Received summary from OpenAI.")
            return summary
        except openai.error.RateLimitError:
            print("Rate limit exceeded. Waiting for 60 seconds...")
            time.sleep(60)

def extract_keywords_and_definitions(text):
    while True:
        try:
            print("Extracting keywords and definitions...")
            response = openai.ChatCompletion.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
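                    # Prompt (Japanese): "Extract the important terms from the following text and write each as
                    # 'term(English name of the term)：explanation'. Do not include introductory or closing
                    # phrases, do not number the list, and always follow this format."
                    {"role": "user", "content": f"以下のテキストから重要な用語を抽出し、それぞれ「用語(英語での用語の名前)：説明文」で書いてください。フレーズ「以上が、テキストから抽出した重要な用語とその説明です。」および「以下は、テキストから抽出した重要な用語とその説明です。」も含めないでください。リストに番号も不要です。必ずこの形式を守ってください。:\n{text}"}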
                ],
                max_tokens=500
            )
            keywords_definitions = response['choices'][0]['message']['content'].strip()
            print("Received keywords and definitions from OpenAI.")
            return keywords_definitions
        except openai.error.RateLimitError:
            print("Rate limit exceeded. Waiting for 60 seconds...")
            time.sleep(60)

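def format_keywords_definitions(doc, keywords_definitions):
    seen_terms = set()  # terms already written, used to skip duplicates
    for raw_line in keywords_definitions.split("\n"):
        line = raw_line.strip()
        if not line:  # skip blank lines
            continue
        # The prompt asks for a full-width colon, so split on whichever colon comes first
        colon_positions = [p for p in (line.find(":"), line.find("：")) if p != -1]
        if colon_positions:
            cut = min(colon_positions)
            term, definition = line[:cut], line[cut + 1:]
            # Use only the term itself (without the English name in parentheses) for the duplicate check
            term_name = term.replace("（", "(").split("(")[0].strip()
            if term_name not in seen_terms:
                seen_terms.add(term_name)
                # Write each entry as "・term(English name)：explanation"
                doc.add_paragraph(f"・{term.strip()}：{definition.strip()}")
        elif line not in seen_terms:
            # Lines without a colon are written as-is, once
            seen_terms.add(line)
            doc.add_paragraph(f"・{line}")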

def main():
    print("Starting PDF processing...")
    
    # Open the PDF
    with pdfplumber.open(pdf_path) as pdf:
        doc = Document()
        total_pages = len(pdf.pages)
        
        # Convert the PDF pages to low-resolution (150 dpi) JPEG images
        print(f"Converting {total_pages} PDF pages to images...")
        images = convert_from_path(pdf_path, 150, output_folder=temp_image_dir, fmt='jpeg')
        print("Conversion to images complete.")
        
        all_keywords_definitions = ""

        for i, page in enumerate(pdf.pages, start=1):
            print(f"Processing slide {i}/{total_pages}...")
            text = page.extract_text()
            image_path = os.path.join(temp_image_dir, f"slide_{i}.jpg")
            images[i-1].save(image_path, 'JPEG')  # save the page image as JPEG
            images[i-1].close()  # close the image file
            print(f"Slide {i} saved as image.")
            
            if text:
                summary = summarize_text(text)
                keywords_definitions = extract_keywords_and_definitions(text)
                all_keywords_definitions += keywords_definitions + "\n"
                
                # Add the slide image and its summary to the Word document
                print(f"Adding slide {i} image and summary to Word document...")
                doc.add_heading(f'Slide {i}', level=1)
                doc.add_picture(image_path, width=Inches(6))
                doc.add_heading('Summary', level=2)
                for bullet_point in summary.split("\n"):
                    if bullet_point.strip():
                        doc.add_paragraph(bullet_point.strip(), style='List Bullet')  # the prompt asks the model to omit leading [・-]
                doc.add_page_break()

        # Add the term list to the Word file
        if all_keywords_definitions:
            print("Adding keywords and definitions to Word document...")
            doc.add_heading('Keywords and Definitions', level=1)
            format_keywords_definitions(doc, all_keywords_definitions)
        
        # Save as a Word file
        print("Saving Word document...")
        doc.save(output_word_path)
        print("Processing complete. Summary saved to Word file.")

        # Run garbage collection to release any resources that are still held
        gc.collect()

        # Delete the temporary images
        print("Cleaning up temporary image files...")
        for image_file in os.listdir(temp_image_dir):
            os.remove(os.path.join(temp_image_dir, image_file))
        print("Cleanup complete.")

# Run
if __name__ == "__main__":
    main()

Installing the libraries and tools

1. Installing `poppler`

The pdf2image library uses a tool called poppler to convert PDFs to images. poppler is an open-source library for processing PDF files.

Steps

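Downloading poppler:
- poppler can be downloaded from "poppler for Windows".
- Download the poppler-24.07.0 binary.
Installing poppler:
- Extract the downloaded ZIP file to any location, for example C:\poppler-24.07.0.
Setting the environment variable:
- Add the poppler path to the Windows environment variables.
- Open "Environment Variables" from "Edit the system environment variables" and add C:\poppler-24.07.0\bin to "Path" under the system variables. This should be the folder that contains pdftoppm.exe; if your extracted tree keeps its bin folder elsewhere, add that folder instead.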
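After editing Path, it can help to confirm from Python that the poppler executables are actually visible. The following is a minimal sketch, not part of the original script: it looks up pdftoppm and pdfinfo, the two poppler command-line tools that pdf2image invokes. The helper name find_poppler_tools is hypothetical.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, path in find_poppler_tools().items():
        status = path if path else "NOT FOUND (check the Path setting)"
        print(f"{name}: {status}")
```

Run it in a newly opened terminal, because a terminal that was already open before the Path change will not see the new value.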

2. Installing the libraries in the Anaconda environment

Next, install the required Python libraries into the Anaconda environment.

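# Install the required libraries
conda install -c conda-forge pdfplumber
conda install -c conda-forge python-docx
conda install -c conda-forge pdf2image
pip install "openai<1.0"

Note that openai is installed with pip here. The version is pinned below 1.0 because the code calls openai.ChatCompletion and openai.error.RateLimitError, which were removed in openai 1.0.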
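The script in this article calls openai.ChatCompletion, which exists only in openai releases before 1.0, so it is worth checking which release ended up in the environment. The sketch below is one possible check using only the standard library; the function names are hypothetical.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '0.28.1'."""
    return int(version_string.split(".")[0])


def check_openai_version():
    """Describe whether the installed openai package still offers the legacy interface."""
    try:
        installed = metadata.version("openai")
    except metadata.PackageNotFoundError:
        return "openai is not installed"
    if major_version(installed) >= 1:
        return f"openai {installed}: openai.ChatCompletion is not available"
    return f"openai {installed}: legacy interface available"


print(check_openai_version())
```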

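Code overview

The code consists of the following steps.

Reading the PDF: pdfplumber is used to read the PDF file.
Image conversion: pdf2image is used to convert each slide to a JPEG image.
Summary generation: an OpenAI model (gpt-4o-mini in the code) is used to summarize the text of each slide.
Term extraction: the same model is used to extract the important terms and their explanations from each slide.
Word output: python-docx is used to insert the images, summaries, and term list into a Word document.
Temporary file cleanup: after processing finishes, the temporary image files that were created are deleted.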

Step-by-step explanation

1. Importing the libraries

The script imports the libraries needed for extracting text from the PDF, converting pages to images, and generating the Word file.
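Before running the full script, a quick check that every import will succeed can save a failed run. The following sketch uses importlib to look for the module names the script imports; note that python-docx is imported under the name docx. The helper missing_modules is hypothetical.

```python
from importlib.util import find_spec

# Import names used by the script; python-docx is imported as "docx".
REQUIRED_MODULES = ("pdfplumber", "openai", "docx", "pdf2image")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```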

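2. Setting the OpenAI API key and file paths

The OpenAI API key, the path of the PDF file, and the output path of the Word file are set by editing the variables at the top of the script. The script does not prompt for them at run time.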
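Hardcoding the key in the script makes it easy to leak when the file is shared. As an alternative, the sketch below reads the key from an environment variable; the variable name OPENAI_API_KEY and the helper load_api_key are assumptions for illustration, not part of the original code.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from an environment variable and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key

# In the script, this would replace the hardcoded string:
#     openai.api_key = load_api_key()
```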

3. Creating the directory for temporary images

A directory is created to hold the PDF pages saved as images.
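The script creates a fixed directory and deletes its contents by hand at the end. A possible alternative, sketched below under the assumption that nothing else needs the images afterwards, is the standard library's tempfile.TemporaryDirectory, which removes the directory automatically when the with block exits. The file written here is only a placeholder standing in for a slide image.

```python
import os
import tempfile


def make_temp_image_dir(prefix="slides_"):
    """Create a temporary directory that is deleted when its context exits."""
    return tempfile.TemporaryDirectory(prefix=prefix)


with make_temp_image_dir() as temp_image_dir:
    marker = os.path.join(temp_image_dir, "slide_1.jpg")
    with open(marker, "wb") as f:
        f.write(b"placeholder")  # stands in for a real slide image
    created = os.path.exists(marker)

removed = not os.path.exists(temp_image_dir)
print(created, removed)
```

On Windows the cleanup can fail if an image file is still open, which is presumably why the original script closes each image and calls gc.collect() before deleting.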

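4. Summarizing the PDF text

The OpenAI API is used to summarize the text extracted from the PDF. If the API reports a rate limit, the function waits 60 seconds and tries again.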
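The summarization function retries forever with a fixed 60-second wait when the rate limit is hit. A bounded retry with exponential backoff is a common variation; the sketch below shows the idea without calling the API. RateLimited is a stand-in exception and call_with_retry is a hypothetical helper; in the script the caught exception would be openai.error.RateLimitError.

```python
import time


class RateLimited(Exception):
    """Stand-in for the API's rate-limit exception (for illustration only)."""


def call_with_retry(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on RateLimited with delays of base_delay * 1, 2, 4, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimited:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing sleep as a parameter keeps the helper easy to test, since a test can record the delays instead of actually waiting.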

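5. Extracting the important terms

The important terms and their explanations are extracted from the PDF text. The prompt asks the model to return each entry in the form "term(English name)：explanation".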
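Because the model is asked for a fixed line format, each returned line can be split into its parts. The sketch below is one way to do that, assuming the model may use either full-width or half-width colons and parentheses; parse_term_line is a hypothetical helper and the sample lines in the usage note are made up.

```python
def parse_term_line(line):
    """Split 'term(English name)：explanation' into (term, english, explanation).

    Returns None when the line has no colon or no term before it.
    """
    text = line.strip()
    # Accept both the half-width and the full-width colon; use whichever comes first.
    positions = [p for p in (text.find(":"), text.find("：")) if p != -1]
    if not positions:
        return None
    cut = min(positions)
    head, explanation = text[:cut], text[cut + 1:].strip()
    head = head.replace("（", "(").replace("）", ")")
    name, _, rest = head.partition("(")
    english = rest.strip().rstrip(")").strip()
    if not name.strip():
        return None
    return name.strip(), english, explanation
```

For example, parse_term_line("半導体(Semiconductor)：made-up explanation") yields the term, its English name, and the explanation as three separate strings, while a line without any colon yields None.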

6. Formatting the term list

The extracted term list is added to the Word file while skipping duplicate terms.
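The duplicate check can be tried on its own, without python-docx. The following sketch keeps the first line seen for each term and compares terms case-insensitively after normalizing full-width punctuation; that normalization is an assumption that goes slightly beyond the original code, and dedupe_terms is a hypothetical name.

```python
def dedupe_terms(lines):
    """Keep the first line for each term; later lines with the same term are dropped."""
    seen = set()
    unique = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        normalized = text.replace("：", ":").replace("（", "(")
        # The key is the term itself: everything before the colon and before "(".
        key = normalized.split(":", 1)[0].split("(", 1)[0].strip().casefold()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

Since every slide is sent to the model separately, the same term often comes back several times, which is why this step matters for the final list.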

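7. Main processing

Each page of the PDF is processed in turn: the page is saved as an image, its text is summarized, and its terms are extracted. The image and summary are written to the Word file under a heading for that slide, and the combined term list is appended at the end. Pages with no extractable text are skipped.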
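The prompt asks the model not to start lines with a bullet character, but models do not always comply, and the Word style 'List Bullet' would then show a doubled bullet. As a hedged safeguard, the sketch below strips common leading bullet characters before the lines are added; summary_to_bullets is a hypothetical helper that is not in the original script.

```python
def summary_to_bullets(summary):
    """Split a summary into non-empty lines with leading bullet characters removed."""
    bullets = []
    for line in summary.splitlines():
        # Remove leading bullets, hyphens, and (full-width) spaces.
        text = line.strip().lstrip("・-*• \u3000").strip()
        if text:
            bullets.append(text)
    return bullets
```

One limitation of this simple approach is that a line that legitimately begins with a hyphen, such as a negative number, would lose it.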

8. Running the script

Finally, run the script to generate the Word file.
