
Converting PDF to Text

Posted at 2023-06-18

1. What is this?

This article introduces a simple Python script that converts PDF files to text.

2. Preparation

2.1. Environment

  Environment: Python 3.11.3
               llama-index==0.6.27
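
Since the script is tied to the versions listed above, it can help to confirm what is actually installed before running it. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name introduced here for illustration, not something llama-index provides.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not present in this Python environment.
        return None
```

For example, `installed_version("llama-index")` can be compared against `"0.6.27"`; a different result means the environment differs from the one this article was written for.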

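2.2. llama-index

  llama-index is a library for feeding external data to ChatGPT, which has been getting a lot of attention lately.
  Here, only one part of it is used: the loader that extracts text from PDF documents.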

3. Running it

Directory layout

pdf2txt.py
    └─ document
          ├─ pdf   # put the PDFs here
          └─ txt   # converted text files are written here
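
The layout above can also be created from Python instead of by hand. This is a minimal sketch, assuming the `document` folder should live under a base directory of your choice; `ensure_layout` is a hypothetical helper name, not part of the original script.

```python
from pathlib import Path
from typing import Tuple


def ensure_layout(base: Path) -> Tuple[Path, Path]:
    """Create document/pdf and document/txt under base and return both paths."""
    pdf_dir = base / "document" / "pdf"
    txt_dir = base / "document" / "txt"
    for directory in (pdf_dir, txt_dir):
        # parents=True also creates "document"; exist_ok=True makes reruns harmless.
        directory.mkdir(parents=True, exist_ok=True)
    return pdf_dir, txt_dir
```

Calling `ensure_layout(Path.cwd())` from the folder that holds pdf2txt.py would produce the two folders shown above.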

pdf2txt.py

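import glob
from pathlib import Path
from llama_index import download_loader

# XXX is a placeholder for the parent folder of "document" (Windows-style paths).
READ_PATH = "XXX\\document\\pdf\\*"
WRITE_PATH = "XXX\\document\\txt\\"
EXTENSION_TEXT = ".txt"

# Fetch the PDF loader for CJK documents and create an instance of it.
CJKPDFReader = download_loader("CJKPDFReader")
loader = CJKPDFReader()

# Every file in the pdf folder is treated as an input PDF.
files = glob.glob(READ_PATH)
for file in files:
    org_file = Path(file)
    # Output name: same stem as the PDF, with a .txt extension.
    out_file = Path(WRITE_PATH + org_file.stem + EXTENSION_TEXT)
    # Skip PDFs that have already been converted.
    if not out_file.exists():
        documents = loader.load_data(file=org_file)
        with open(out_file, mode='w', encoding='UTF-8') as f:
            f.write(documents[0].text)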

The first run takes a while because the required libraries have to be downloaded.
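
The script decides what to convert by listing the pdf folder and skipping any file whose .txt already exists. Below is a minimal sketch of that selection step as a standalone function, which makes it easy to check what would be converted before the loader is invoked. It assumes that only files ending in .pdf should be picked up (the script itself matches every file in the folder), and `pending_conversions` is a hypothetical name introduced here.

```python
from pathlib import Path
from typing import List, Tuple


def pending_conversions(pdf_dir: Path, txt_dir: Path) -> List[Tuple[Path, Path]]:
    """Return (source PDF, target text file) pairs that have no output yet."""
    pairs = []
    # Assumption: only *.pdf files are inputs; sorted() gives a stable order.
    for src in sorted(pdf_dir.glob("*.pdf")):
        dst = txt_dir / (src.stem + ".txt")
        # Same rule as the script: an existing .txt means "already converted".
        if not dst.exists():
            pairs.append((src, dst))
    return pairs
```

Each returned pair could then be passed to the loader in the same way the loop in pdf2txt.py does, writing the extracted text to the second path.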
