
[Python] Extracting text from a PDF and reading it aloud with Open JTalk

Posted at 2020-08-23

Extracting text from a PDF
Reference: "How to extract text from a PDF with Python's pdfminer, explained by a working engineer [for beginners]"

$ pip install pdfminer.six

.py

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

input_path = 'path/to/input.pdf'  # replace with the path of the PDF to extract
output_path = 'result.txt'

# Shared store for fonts and other resources reused across pages
manager = PDFResourceManager()

# The output is opened in binary mode; the converter encodes the text as UTF-8
with open(output_path, 'wb') as output, open(input_path, 'rb') as pdf_file:
    with TextConverter(manager, output, codec='utf-8', laparams=LAParams()) as conv:
        interpreter = PDFPageInterpreter(manager, conv)
        # Feed each page to the converter, which appends its text to the output
        for page in PDFPage.get_pages(pdf_file):
            interpreter.process_page(page)
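Text written by the script above keeps the PDF's layout line breaks, and the converter also emits a form feed at the end of each page. Passing that straight to a speech synthesizer tends to produce pauses in the middle of sentences. The following is a minimal sketch of one way to tidy the text first, assuming the content is Japanese prose with no spaces between words. The function name `split_sentences` is hypothetical and not part of the original article.

```python
import re


def split_sentences(raw_text):
    """Join layout line breaks, then split Japanese text into sentences."""
    # Remove form feeds (page breaks) and line breaks left over from the layout.
    # Assumption: Japanese text, so lines can be joined without inserting spaces.
    joined = re.sub(r"[\f\r\n]+", "", raw_text)
    # Split right after each sentence-ending mark, keeping the mark attached.
    parts = re.split(r"(?<=[。！？])", joined)
    # Discard empty or whitespace-only fragments.
    return [part.strip() for part in parts if part.strip()]
```

Each returned sentence can then be synthesized separately, which also keeps the input to the synthesizer short. For text that mixes in English, joining lines without a space would glue words together, so this sketch would need adjusting.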

Installing Open JTalk

"How to work with audio in Python"
"How to read text aloud in Python"
I relied on the two sites above (in fact, I followed them almost verbatim...). Thank you.

I changed the Open JTalk version to 1.11.
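Once Open JTalk is installed, the extracted `result.txt` can be passed to the `open_jtalk` command from Python. The following is a minimal sketch rather than the code from the referenced articles. The dictionary and voice paths are assumptions (typical locations for Ubuntu packages) and will likely differ on your system, the function names are hypothetical, and playback assumes a Linux machine with `aplay` available.

```python
import subprocess

# Assumed install locations; adjust both paths to match your environment.
DIC_DIR = "/var/lib/mecab/dic/open-jtalk/naist-jdic"
VOICE_PATH = "/usr/share/hts-voice/nitech-jp-atr503-m001/nitech_jp_atr503_m001.htsvoice"


def build_open_jtalk_command(wav_path, dic_dir=DIC_DIR, voice_path=VOICE_PATH, speed=1.0):
    """Build the open_jtalk argument list without running it."""
    return [
        "open_jtalk",
        "-x", dic_dir,      # dictionary directory
        "-m", voice_path,   # HTS voice file
        "-r", str(speed),   # speech speed rate
        "-ow", wav_path,    # output WAV file
    ]


def synthesize(text, wav_path="open_jtalk.wav"):
    """Send text to open_jtalk on stdin and return the path of the WAV file."""
    command = build_open_jtalk_command(wav_path)
    # Assumption: the dictionary was built for UTF-8 input.
    subprocess.run(command, input=text.encode("utf-8"), check=True)
    return wav_path


def speak_file(txt_path="result.txt"):
    """Read a text file aloud: synthesize it, then play the WAV with aplay."""
    with open(txt_path, encoding="utf-8") as text_file:
        wav_path = synthesize(text_file.read())
    subprocess.run(["aplay", "-q", wav_path], check=True)
```

Calling `speak_file('result.txt')` would read the extracted text aloud. Keeping the command construction in its own function makes it easy to check the arguments or change the speed without running the synthesizer.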

To make the reading sound more human, an article like the following looks like a useful reference:
"The read-aloud bot has gained emotions"
