Extracting and Splitting Text from PDF Papers for Translation

Posted at 2025-04-19, last updated at 2025-04-19

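Introduction

Background

Translating documents at work comes with various restrictions, so I get by reading papers with the free version of DeepL and an in-house generative AI that has a small token limit. Copying text from a PDF into the input field in pieces that stay under the character limit was tedious, so I wrote this script to make that a little easier.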

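What it does

Extracts the text from a PDF,
splits it into chunks within a specified character range,
and writes the chunks to a single text file.

Intended audience

People who want to translate PDF papers into Japanese,
but cannot translate them in one go because of the character limits of tools such as DeepL or ChatGPT,
and are manually copying suitably sized pieces of text and pasting them into the tool.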

Main content

Example

Paper used
https://www.mdpi.com/1996-1073/16/4/2053

Character range
1000-1500 characters

Text file after splitting (each header shows the chunk number and its length in characters)

--- Chunk 1 (長さ: 1381 文字) ---
Citation: Guo, Y. ; Ba, X. ; Liu, L. ; Lu, H. ; Lei, G. ; Yin, W. ; Zhu, J. A Review of Electric Motors with Soft Magnetic Composite Cores for Electric Drives. Energies 2023, 16, 2053. https:// doi. org/10. 3390/en16042053 Academic Editors: Antonio Morandi, João Filipe Pereira Fernandes, Jordi-Roger Riba Ruiz, Paulo Jose Da Costa Branco and Silvio Vaschetto Received: 25 January 2023 Revised: 9 February 2023 Accepted: 15 February 2023 Published: 19 February 2023 Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons. org/licenses/by/ 4. 0/). energies Review A Review of Electric Motors with Soft Magnetic Composite Cores for Electric Drives Youguang Guo 1, * , Xin Ba 1, * , Lin Liu 1, * , Haiyan Lu 1, Gang Lei 1 , Wenliang Yin2 and Jianguo Zhu 2 1 Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW 2007, Australia 2 School of Electrical and Information Engineering, The University of Sydney, Camperdown, NSW 2006, Australia * Correspondence: youguang. guo-1@uts. edu. au (Y. G. ); xin. ba@student. uts. edu. au (X. B. ); lin. liu@student. uts. edu. au (L. L. ) Abstract: Electric motors play a crucial role in modern industrial and domestic applications.
-----------------------------

--- Chunk 2 (長さ: 1411 文字) ---
With the trend of more and more electric drives, such as electric vehicles (EVs), the requirements for electric motors become higher and higher, e. g. , high power density with good thermal dissipation and high reliability in harsh environments. Many efforts have been made to develop high performance electric motors, such as the application of advanced novel electromagnetic materials, modern control algorithms, advanced mathematical modeling, numerical computation, and artiﬁcial intelligence based optimization design techniques. Among many advanced magnetic materials, soft magnetic composite (SMC) appears very promising for developing novel electric motors, thanks to its many unique properties, such as magnetic and thermal isotropies, very low eddy current loss, and the prospect of low-cost mass production. This paper aims to present a comprehensive review about the application of SMC for developing various electric motors for electric drives, with emphasis on those with three-dimensional (3D) magnetic ﬂux paths. The major techniques developed for designing the 3D ﬂux SMC motors are also summarized, such as vectorial magnetic property characterization and system-level multi-discipline robust design optimization. Major challenges and possible future work in this area are also discussed. Keywords: soft magnetic composite; electric motor; magnetic isotropy; three-dimensional magnetic flux 1.
-----------------------------
(continues)

Library

pypdf is used to work with the PDF.
Install pypdf first.

pip install pypdf
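
Before running the script, it can help to confirm that the installation is visible to the Python interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of pypdf or the original script.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    # find_spec() looks the module up without importing it
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("pypdf"):
        print("pypdf is available")
    else:
        print("pypdf was not found; run: pip install pypdf")
```

If the check fails even after installation, `pip` and `python` may belong to different environments; `python -m pip install pypdf` installs into the interpreter that is actually being run.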

Source code

import pypdf
import sys
import os
import re

# List and Optional are used for type hints
from typing import List, Optional

def extract_text_from_pdf(pdf_path: str) -> Optional[str]:
    """
    Extract text from the given PDF file.
    Args:
        pdf_path: Path to the PDF file to process.
    Returns:
        The extracted text, or None if an error occurred.
    """
    text = ""
    try:
        with open(pdf_path, 'rb') as f:
            reader = pypdf.PdfReader(f)
            # Treat a PDF with zero pages as an error
            if len(reader.pages) == 0:
                print(f"エラー: PDFファイルにページが含まれていません: {pdf_path}", file=sys.stderr)
                return None

            for page_num, page in enumerate(reader.pages, start=1):
                # extract_text() may yield no text (e.g. for image-only pages)
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"  # add a newline at the end of each page
                else:
                    # Warning only; keep processing the remaining pages
                    print(f"警告: ページ {page_num} からテキストを抽出できませんでした。", file=sys.stderr)

    except pypdf.errors.PdfReadError:
        print(f"エラー: PDFファイルを読み込めません。ファイルが破損しているか、パスが間違っています: {pdf_path}", file=sys.stderr)
        return None
    except FileNotFoundError:
        print(f"エラー: 指定されたファイルが見つかりません: {pdf_path}", file=sys.stderr)
        return None
    except Exception as e:
        print(f"PDF読み取り中に予期せぬエラーが発生しました: {e}", file=sys.stderr)
        return None
    return text

def split_text_by_sentences(text: str) -> List[str]:
    """
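    Split text on periods (.), question marks (?), and exclamation marks (!),
    re-attaching the delimiter to each sentence.
    Because the split is naive, text may be split unintentionally at abbreviations
    (Dr., Mr.), numbers with a decimal point (3.14), ellipses (...), and so on.
    Args:
        text: Text to split.
    Returns:
        A list of sentences.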
    """
    # Collapse runs of newlines and whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()

    if not text:
        return []

    sentences = []
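    # Split on periods, question marks, and exclamation marks, keeping the delimiter
    # (the capture group makes re.split return each delimiter as its own item).
    # Splitting on the whitespace after a delimiter with a lookbehind,
    # e.g. re.split(r'(?<=[.!?])\s+', text), may make whitespace handling simpler,
    # but the original logic is kept here.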
    parts = re.split(r'([.!?])\s*', text)

    current_sentence = ""
    for part in parts:
        # Delimiter
        if part in ['.', '!', '?']:
            # If there is a pending sentence, append the delimiter and store it
            if current_sentence.strip():
                sentences.append((current_sentence.strip() + part).strip())
                current_sentence = ""  # reset
            # A delimiter with no preceding text (e.g. consecutive delimiters) is ignored

        # Text
        elif part.strip():
            # Append to the pending sentence with a space (unless this is the first part)
            if current_sentence:
                current_sentence += " " + part.strip()
            else:
                current_sentence = part.strip()  # start a new sentence

    # Add any remaining text (when the input does not end with a delimiter)
    if current_sentence.strip():
        sentences.append(current_sentence.strip())

    # Drop empty items, just in case
    sentences = [s for s in sentences if s]

    return sentences


def chunk_text_by_length(sentences: List[str], min_chars: int, max_chars: int) -> List[str]:
    """
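    Group a list of sentences into chunks within the given character range.
    Chunks are split based on max_chars, taking min_chars into account where possible.
    The range is not guaranteed when a single sentence exceeds max_chars or when the
    remaining sentences add up to fewer than min_chars.
    Args:
        sentences: List of sentences.
        min_chars: Minimum chunk length in characters (target).
        max_chars: Maximum chunk length in characters (upper limit).
    Returns:
        A list of text chunks.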
    """
    chunks = []
    current_chunk_sentences = []
    current_chunk_length = 0

    for sentence in sentences:
        # Length if the next sentence were added (plus one space when the chunk is non-empty).
        # Equivalent to len(" ".join(current_chunk_sentences + [sentence])).
        test_length = current_chunk_length + len(sentence) + (1 if current_chunk_sentences else 0)

        # Split decision: close the current chunk and start a new one only when
        # 1. the current chunk is non-empty, and
        # 2. adding the next sentence would exceed max_chars (test_length > max_chars).
        # min_chars is not used directly in this decision, but packing as many
        # sentences as possible into each chunk makes min_chars easier to reach.

        if current_chunk_sentences and test_length > max_chars:
            # Adding the sentence would exceed max_chars, so close the current chunk
            chunks.append(" ".join(current_chunk_sentences))

            # Start a new chunk with the current sentence
            current_chunk_sentences = [sentence]
            current_chunk_length = len(sentence)  # only the sentence itself
        else:
            # Still within max_chars, so add the sentence to the current chunk
            current_chunk_sentences.append(sentence)
            current_chunk_length = test_length  # update the length

    # Add the last chunk, if any
    if current_chunk_sentences:
        chunks.append(" ".join(current_chunk_sentences))

    # Notes:
    # - A single sentence longer than max_chars becomes its own chunk and exceeds max_chars.
    # - The last chunk may be shorter than min_chars even with all remaining sentences joined.

    return chunks


def write_chunks_to_file(chunks: List[str], output_path: str, min_chars: int, max_chars: int) -> bool:
    """
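    Write a list of text chunks to the given file.
    Args:
        chunks: List of text chunks.
        output_path: Path of the file to write.
        min_chars: Minimum chunk length (for the header information).
        max_chars: Maximum chunk length (for the header information).
    Returns:
        True if writing succeeded, False otherwise.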
    """
    try:
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write("--- PDF抽出結果 ---")
            f.write(f"\n最小文字数: {min_chars}, 最大文字数: {max_chars}\n")
            f.write("=" * 40 + "\n\n")

            if not chunks:
                f.write("抽出されたチャンクはありませんでした。\n")
                print("抽出されたチャンクはありませんでした。", file=sys.stderr)
                return False

            for i, chunk in enumerate(chunks):
                # Header with the chunk number and its length in characters
                header = f"--- Chunk {i+1} (長さ: {len(chunk)} 文字) ---\n"
                f.write(header)
                # Chunk body
                f.write(chunk)
                f.write("\n")  # newline after the chunk
                # Separator line as long as the header
                f.write("-" * len(header.strip()) + "\n\n")

        print(f"\n抽出結果をファイルに保存しました: {output_path}")
        return True

    except IOError as e:
        print(f"エラー: ファイルへの書き込み中にエラーが発生しました: {e}", file=sys.stderr)
        return False
    except Exception as e:
        print(f"ファイル書き込み中に予期せぬエラーが発生しました: {e}", file=sys.stderr)
        return False


if __name__ == "__main__":
    pdf_file = input("処理するPDFファイルのパスを入力してください: ")

    if not os.path.exists(pdf_file):
        print(f"エラー: ファイルが見つかりません: {pdf_file}", file=sys.stderr)
        sys.exit(1)

    try:
        min_chars_input = input("チャンクの最小文字数を入力してください: ")
        max_chars_input = input("チャンクの最大文字数を入力してください: ")

        min_chars = int(min_chars_input)
        max_chars = int(max_chars_input)

        if min_chars <= 0 or max_chars <= 0 or min_chars > max_chars:
            print("エラー: 文字数は正の整数で、最小文字数は最大文字数以下である必要があります。", file=sys.stderr)
            sys.exit(1)

    except ValueError:
        print("エラー: 文字数は整数で入力してください。", file=sys.stderr)
        sys.exit(1)

    output_file = input("結果を保存するファイルパスを入力してください (例: output.txt): ")
    if not output_file:
        print("エラー: 出力ファイルパスが指定されていません。", file=sys.stderr)
        sys.exit(1)

    print("\nPDFからテキストを抽出中...")
    text = extract_text_from_pdf(pdf_file)

    if text:
        print("テキストをセンテンスに分割中...")
        sentences = split_text_by_sentences(text)

        if sentences:
            print(f"テキストを約 {min_chars}〜{max_chars} 文字のチャンクに分割中...")
            chunks = chunk_text_by_length(sentences, min_chars, max_chars)

            print(f"\n抽出されたチャンク数: {len(chunks)}")
            print(f"結果を '{output_file}' に書き込みます...")

            # Pass min_chars and max_chars so they appear in the output file header
            if write_chunks_to_file(chunks, output_file, min_chars, max_chars):
                print("処理が正常に完了しました。")
            else:
                print("処理中にエラーが発生しました。", file=sys.stderr)
                sys.exit(1)  # exit with failure

        else:
            print("抽出されたテキストから有効なセンテンスを分割できませんでした。", file=sys.stderr)
            sys.exit(1)  # exit with failure

    else:
        print("テキスト抽出に失敗したため、処理を中断します。", file=sys.stderr)
        sys.exit(1)  # exit with failure
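
The greedy packing done by chunk_text_by_length can be tried without a PDF. The sketch below is a condensed restatement of that logic for illustration only; the name `pack_sentences` and the demo sentences are hypothetical. It shows two properties of the approach: a chunk is filled right up to max_chars, and a single sentence longer than max_chars still ends up as its own oversized chunk.

```python
from typing import List


def pack_sentences(sentences: List[str], max_chars: int) -> List[str]:
    """Greedily join sentences with spaces without exceeding max_chars."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        # Length of the chunk if this sentence were appended
        candidate_len = current_len + len(sentence) + (1 if current else 0)
        if current and candidate_len > max_chars:
            # Close the current chunk and start a new one with this sentence
            chunks.append(" ".join(current))
            current, current_len = [sentence], len(sentence)
        else:
            current.append(sentence)
            current_len = candidate_len
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    demo = [
        "One two three.",
        "Four five.",
        "Six.",
        "A sentence that is longer than the limit by itself.",
    ]
    for chunk in pack_sentences(demo, max_chars=30):
        print(len(chunk), repr(chunk))
```

With max_chars=30 the first three sentences fit exactly into one 30-character chunk, while the last sentence (51 characters) forms a chunk of its own that exceeds the limit. If the translation tool enforces a hard limit, such chunks need to be shortened by hand.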

Usage

Save the code above under any file name (for example, pdf_split.py) and run it as follows.

python pdf_split.py

When the script runs, it prompts for the PDF file path, the minimum and maximum chunk lengths in characters, and the output file path. Enter each value as prompted.
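
If typing the same values at every run becomes tedious, the prompts could be replaced with command-line arguments. The following is only a sketch of that idea using the standard argparse module; the option names (`--min-chars`, `--max-chars`, `-o/--output`) and their defaults are assumptions made for this example and are not part of the script above. The validation mirrors the checks the script already performs on the interactive input.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse and validate command-line arguments (hypothetical interface)."""
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF and split it into chunks."
    )
    parser.add_argument("pdf_file", help="path to the input PDF")
    parser.add_argument("--min-chars", type=int, default=1000,
                        help="target minimum chunk length in characters")
    parser.add_argument("--max-chars", type=int, default=1500,
                        help="maximum chunk length in characters")
    parser.add_argument("-o", "--output", default="output.txt",
                        help="path of the text file to write")
    args = parser.parse_args(argv)
    # Same rule as the interactive version: positive values, min <= max
    if args.min_chars <= 0 or args.max_chars <= 0 or args.min_chars > args.max_chars:
        parser.error("character counts must be positive and min-chars <= max-chars")
    return args


if __name__ == "__main__":
    args = parse_args()
    print(args.pdf_file, args.min_chars, args.max_chars, args.output)
```

The parsed values could then be passed to extract_text_from_pdf, chunk_text_by_length, and write_chunks_to_file in place of the values read with input().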

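Notes

・This code was created with Gemini, Google's generative AI.

・Sentence splitting simply splits on punctuation (., ?, !), so text may be split unintentionally at abbreviations (e.g. Dr., Mr.), numbers with a decimal point (e.g. 3.14), or ellipses (...), and in some cases may not be split where it should be.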
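
One way to reduce these unintended splits is to split only at whitespace that follows a delimiter, and to re-join pieces that end with a known abbreviation. The sketch below is an alternative to split_text_by_sentences, not the method used in the script; the function name `split_sentences_conservative` and the short abbreviation list are hypothetical and would need extending for real papers. It is still a heuristic and will miss cases such as abbreviations that are not in the list.

```python
import re
from typing import List

# Hypothetical, deliberately short list; "al." covers "et al."
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Fig.", "Eq.", "al.", "e.g.", "i.e."}


def split_sentences_conservative(text: str) -> List[str]:
    """Split on ., ! or ? only when the delimiter is followed by whitespace."""
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return []
    # Periods inside tokens such as "3.14" or "doi.org" are not followed by
    # whitespace, so they are left alone.
    pieces = re.split(r"(?<=[.!?])\s+", text)
    sentences: List[str] = []
    for piece in pieces:
        # Re-join when the previous piece ended with a known abbreviation
        if sentences and sentences[-1].split(" ")[-1] in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


if __name__ == "__main__":
    sample = ("Dr. Smith measured 3.14 V. See https://doi.org/10.3390/en16042053 "
              "for details! Is it valid?")
    for s in split_sentences_conservative(sample):
        print(repr(s))
```

Because the delimiter stays attached to its sentence, the resulting list can be passed to chunk_text_by_length in the same way as the output of split_text_by_sentences.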
