
Converting PDF to Text


1. What is this?

This article introduces a simple Python script that converts PDF files to plain text.

2. Preparation

2.1. Environment

 Environment: Python 3.11.3
    llama-index==0.6.27

2.2. llama-index

 llama-index is a recently popular library for feeding external data to ChatGPT.
 This article uses one part of it, the PDF loader, to extract text from PDF documents.

3. Running it

Layout
pdf2txt.py
  └─ document
     ├─ pdf   # put the PDF files here
     └─ txt   # the converted text files are written here
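The folder layout above can be prepared ahead of time with a few lines of standard-library Python. This is only a sketch: the helper name `ensure_layout` and the `base` argument are hypothetical and do not appear in the original script, which assumes the folders already exist.

```python
from pathlib import Path


def ensure_layout(base: Path) -> tuple[Path, Path]:
    """Create document/pdf and document/txt under base if they are missing."""
    pdf_dir = base / "document" / "pdf"   # input folder for PDF files
    txt_dir = base / "document" / "txt"   # output folder for converted text
    for directory in (pdf_dir, txt_dir):
        # parents=True also creates "document"; exist_ok=True makes reruns harmless
        directory.mkdir(parents=True, exist_ok=True)
    return pdf_dir, txt_dir
```

Calling `ensure_layout(Path("."))` from the folder that holds the script would create both subfolders if they are not there yet.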
pdf2txt.py
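import glob
from pathlib import Path
from llama_index import download_loader

# Replace XXX with the folder that contains "document" (Windows-style paths).
READ_PATH = "XXX\\document\\pdf\\*"     # matches every file in the pdf folder
WRITE_PATH = "XXX\\document\\txt\\"
EXTENSION_TEXT = ".txt"

# Fetch the CJK-aware PDF loader at runtime and create an instance of it.
CJKPDFReader = download_loader("CJKPDFReader")
loader = CJKPDFReader()

files = glob.glob(READ_PATH)
for file in files:
    org_file = Path(file)
    # Output name: same stem as the input file, with a .txt extension.
    out_file = Path(WRITE_PATH + org_file.stem + EXTENSION_TEXT)
    # Skip files that were already converted on a previous run.
    if not out_file.exists():
        documents = loader.load_data(file=org_file)
        with open(out_file, mode='w', encoding='UTF-8') as f:
            # Write the text of the first returned document.
            f.write(documents[0].text)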

The first run takes a while because the required libraries have to be downloaded.
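The script above picks up every file in the pdf folder and skips any file whose output already exists. The same loop can be sketched as a reusable function that only looks at `*.pdf` files. This is a variant under stated assumptions, not the author's code: `convert_all` and its `extract_text` parameter are hypothetical names, and the extraction step is passed in as a callable so the function itself does not depend on llama-index.

```python
from pathlib import Path
from typing import Callable


def convert_all(
    pdf_dir: Path,
    txt_dir: Path,
    extract_text: Callable[[Path], str],
) -> list[Path]:
    """Convert each *.pdf in pdf_dir to a .txt file in txt_dir.

    Files whose .txt output already exists are skipped, as in the script above.
    Returns the list of text files written during this call.
    """
    written = []
    for pdf_file in sorted(pdf_dir.glob("*.pdf")):
        out_file = txt_dir / (pdf_file.stem + ".txt")
        if out_file.exists():
            continue  # already converted on a previous run
        out_file.write_text(extract_text(pdf_file), encoding="utf-8")
        written.append(out_file)
    return written
```

With the loader from the script, the callable could be something like `lambda p: loader.load_data(file=p)[0].text`, which mirrors what the original loop writes for each file.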
