Translating PDFs in bulk

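When you want to translate a PDF, Google Translate's built-in document feature is the first thing that comes to mind, but it has a file-size limit, which rules out some PDFs.
In my previous article I called the Google Translate API to translate English text into Japanese. When the text you want to read is in a PDF, you could extract it with Google Docs or Adobe Acrobat, but that takes many manual steps. With a library called PDFMiner, it looks like Python can handle this part of the work as well.
The following article was a great help:
【PDFMiner】PDFからテキストの抽出 ([PDFMiner] Extracting text from PDFs)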

The scripts I wrote are available here:
https://github.com/KanikaniYou/translate_pdf

For every PDF in a given folder, they extract the text, translate it, and write the result out as a text file.

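PDFMiner also picks up strings you don't need (for example, runs of the same symbol such as "......", which often appear in tables of contents). The extracted text is therefore saved as intermediate files, so that you can remove the unwanted parts by hand.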
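Part of that cleanup could probably be automated. The following is a minimal sketch, not part of the repository: it assumes a "leader" is a run of four or more dots (optionally separated by spaces), and the names LEADER_RE and strip_leaders are made up for this illustration.

```python
import re

# Assumption: a "leader" is a run of four or more dots, optionally
# separated by spaces, as typically found in a table of contents.
LEADER_RE = re.compile(r"(?:\.\s*){4,}")


def strip_leaders(text):
    """Replace each run of leader dots with a single space."""
    return LEADER_RE.sub(" ", text)


if __name__ == "__main__":
    sample = "1 Introduction ........ 3\n2 Methods . . . . . . 7"
    print(strip_leaders(sample))
```

A three-dot ellipsis is left alone because it falls under the threshold of four. The threshold is a guess, so a manual check of the intermediate files is still worthwhile.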

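The overall translation workflow is:
0. Obtain a Google Translate API key, following "Quickstart | Google Cloud Translation API Documentation | Google Cloud Platform"
1. Extract the text from each PDF and save it as a text file (pdf_to_txt.py)
2. Clean up the text and save it as a new single-line text file (let_translatable.py)
3. Translate the English into Japanese and save it as a text file (translate_en_jp.py)

As mentioned above, looking over the text files by hand after step 1 lets you check what PDFMiner extracted and keep only the parts you need, so that you don't call the Google Translate API needlessly.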
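The key from step 0 ends up in a variable inside translate_en_jp.py. If you would rather keep it out of the source file, one option is to read it from an environment variable. This is only a sketch: the variable name GOOGLE_TRANSLATE_API_KEY and the function load_api_key are assumptions for illustration, not something the repository defines.

```python
import os


def load_api_key(env_name="GOOGLE_TRANSLATE_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_name, "").strip()
    if not key:
        raise RuntimeError("Set the {} environment variable first".format(env_name))
    return key
```

You would then run `export GOOGLE_TRANSLATE_API_KEY=...` in the shell once and assign `API_key = load_api_key()` in the script.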

Environment

This should work on any Linux with Python 3 installed. My own environment is Ubuntu 18.04 on Cloud9.

pip install pdfminer.six

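Incidentally, PDFMiner is very useful, but it seems prone to garbled characters when extracting Japanese and similar text. Since we are extracting English here, I don't expect that to be a frequent problem.

Known issue with extracting Japanese in PDFMiner: Still have issues with CID Characters #39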
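When PDFMiner cannot map a glyph to Unicode, it writes a placeholder of the form "(cid:123)" into the text. A rough way to spot a badly garbled file is to measure how much of it consists of such placeholders. The sketch below is my own illustration (cid_ratio is a made-up name), and any threshold you pick for "too garbled" would be a judgment call.

```python
import re

# pdfminer writes "(cid:<number>)" in place of a glyph it cannot map to Unicode.
CID_RE = re.compile(r"\(cid:\d+\)")


def cid_ratio(text):
    """Return the fraction of the text occupied by "(cid:N)" placeholders."""
    if not text:
        return 0.0
    cid_chars = sum(len(m.group(0)) for m in CID_RE.finditer(text))
    return cid_chars / len(text)


if __name__ == "__main__":
    print(cid_ratio("plain English text"))
    print(cid_ratio("(cid:12)(cid:345) mostly garbled"))
```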

git clone https://github.com/KanikaniYou/translate_pdf
cd translate_pdf

Here is the file layout. (For the sake of explanation, the 10 PDFs to translate are already in place.)

.
├── eng_txt
├── eng_txt_split
├── jpn_txt
├── let_translatable.py
├── pdf_source
│   ├── report_1.pdf
│   ├── report_10.pdf
│   ├── report_2.pdf
│   ├── report_3.pdf
│   ├── report_4.pdf
│   ├── report_5.pdf
│   ├── report_6.pdf
│   ├── report_7.pdf
│   ├── report_8.pdf
│   └── report_9.pdf
├── pdf_to_txt.py
└── translate_en_jp.py

1. Extracting the text

pdf_to_txt.py

import os
import re

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

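def find_textboxes_recursively(layout_obj):
    # A text box is what we are looking for: return it as a one-element list
    if isinstance(layout_obj, LTTextBox):
        return [layout_obj]

    # A container (page, figure, ...) may hold text boxes: search its children
    if isinstance(layout_obj, LTContainer):
        boxes = []
        for child in layout_obj:
            boxes.extend(find_textboxes_recursively(child))
        return boxes
    # Anything else (lines, images, ...) carries no text
    return []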

def pdf_read_controller(filepath):
    try:
        text_in_pdf = ""

        with open(filepath, 'rb') as f:

            for page in PDFPage.get_pages(f):
                try:

                    interpreter.process_page(page)
                    layout = device.get_result()

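                    boxes = find_textboxes_recursively(layout)
                    # Sort top to bottom, then left to right
                    boxes.sort(key=lambda b:(-b.y1, b.x0))

                    text_in_page = ""
                    for box in boxes:
                        # strip() already removes surrounding spaces and newlines
                        text_in_box = box.get_text().strip()
                        # Collapse each run of spaces into a single space
                        text_in_box = re.sub(r' {2,}', " ", text_in_box)
                        # End each box with a newline so adjacent boxes do not run together
                        text_in_page += text_in_box + "\n"
                    text_in_pdf += text_in_page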
                except Exception as e:
                    print(e)

        return text_in_pdf

    except Exception as e:
        print(e)
        print("error: " + filepath)
        # "no-text" tells make_txtfile to skip this file
        return "no-text"


def make_txtfile(folder_path,file_name,text='error'):
    if text != "no-text":
        with open(folder_path+"/"+file_name, mode='w') as f:
            f.write(text)


laparams = LAParams(detect_vertical=True)
resource_manager = PDFResourceManager()
device = PDFPageAggregator(resource_manager, laparams=laparams)
interpreter = PDFPageInterpreter(resource_manager, device)

if __name__ == '__main__':
    for file_name in os.listdir("pdf_source"):
        if file_name.endswith(".pdf"):
            print(file_name)
            text_in_pdf = pdf_read_controller("pdf_source/" + file_name)
            # Swap the .pdf extension for .txt
            txt_name = os.path.splitext(file_name)[0] + ".txt"
            make_txtfile("eng_txt_split", txt_name, text_in_pdf)

The script reads every PDF under the pdf_source folder and writes one text file per PDF to eng_txt_split.
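The sort key `(-b.y1, b.x0)` in the script is what puts the text boxes into reading order. PDF coordinates start at the bottom-left corner of the page, so a larger `y1` (the top edge of a box) means the box sits higher up. The sketch below shows the same ordering on stand-in boxes; the Box tuple and the sample coordinates are invented for illustration and do not come from PDFMiner.

```python
from collections import namedtuple

# Stand-in for a pdfminer text box: x0 is the left edge, y1 the top edge.
# PDF coordinates start at the bottom-left corner, so a larger y1 is higher up.
Box = namedtuple("Box", ["name", "x0", "y1"])


def reading_order(boxes):
    """Sort top to bottom, then left to right, with the same key as pdf_to_txt.py."""
    return sorted(boxes, key=lambda b: (-b.y1, b.x0))


boxes = [
    Box("footer", 50, 40),
    Box("right column", 300, 700),
    Box("title", 50, 780),
    Box("left column", 50, 700),
]
print([b.name for b in reading_order(boxes)])
# ['title', 'left column', 'right column', 'footer']
```

Note that this ordering only approximates reading order. On a two-column page whose boxes start at different heights, boxes from the two columns can end up interleaved.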

$ python pdf_to_txt.py
report_3.pdf
report_7.pdf
report_2.pdf
report_1.pdf
unpack requires a buffer of 10 bytes
unpack requires a buffer of 8 bytes
report_5.pdf
report_9.pdf
report_8.pdf
unpack requires a buffer of 6 bytes
unpack requires a buffer of 6 bytes
unpack requires a buffer of 4 bytes
report_4.pdf
report_6.pdf
report_10.pdf

A few errors show up, but the script catches them page by page, prints the message, and still writes the text files. PDFs are complicated.
A similar error: struct.error: unpack requires a string argument of length 16
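Because only the exception message is printed, the log does not say which page was skipped. If that matters, the per-page loop could record page numbers alongside the errors. The following is a sketch of that pattern only: extract_pages and its extract_one callback are hypothetical names, and the callback stands in for the PDFMiner calls made in pdf_to_txt.py.

```python
def extract_pages(pages, extract_one):
    """Run extract_one on every page, collecting text and failed page numbers."""
    texts, failed = [], []
    for number, page in enumerate(pages, start=1):
        try:
            texts.append(extract_one(page))
        except Exception as e:
            # Remember which page failed and why, then carry on
            failed.append((number, str(e)))
    return "".join(texts), failed


if __name__ == "__main__":
    def fake_extract(page):
        # Stand-in for the PDFMiner calls; fails on one page on purpose
        if page == "broken":
            raise ValueError("unpack requires a buffer of 10 bytes")
        return page.upper() + "\n"

    text, failed = extract_pages(["a", "broken", "b"], fake_extract)
    print(failed)
```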

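1-2. Checking the text by eye (optional)

For example, part of the extracted text contained unneeded passages, such as the table-of-contents dots mentioned earlier. To avoid calling the Google Translate API needlessly, delete the parts you don't need.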
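To see how much text each file would send to the API, and therefore where trimming pays off most, you could count the characters per file first. A minimal sketch, assuming the intermediate files live in a folder such as eng_txt_split; count_characters is a name I made up for this example.

```python
import os


def count_characters(folder):
    """Return {file name: number of characters} for each .txt file in folder."""
    counts = {}
    for file_name in sorted(os.listdir(folder)):
        if file_name.endswith(".txt"):
            with open(os.path.join(folder, file_name)) as f:
                counts[file_name] = len(f.read())
    return counts


if __name__ == "__main__":
    if os.path.isdir("eng_txt_split"):
        for name, n in count_characters("eng_txt_split").items():
            print(name, n)
```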

2. Cleaning up the text

let_translatable.py

import os

if __name__ == '__main__':
    for file_name in os.listdir("eng_txt_split"):
        if file_name.endswith(".txt"):
            print(file_name)
            with open("eng_txt_split/"+file_name) as f:
                # Join the lines with single spaces so that the words on
                # either side of a line break are not fused together
                text = " ".join(line.strip() for line in f if line.strip())

            path_w = "eng_txt/" + file_name
            with open(path_w, mode='w') as f:
                f.write(text)

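The text that comes out of PDFMiner is full of line breaks, and Google Translate is unlikely to translate it well as it is. The script therefore creates a new text file with the line breaks removed and writes it to the eng_txt folder.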
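Removing line breaks raises two details: the words on either side of a break need a space between them, and a word hyphenated across a break should be joined back together. The sketch below handles both. It is my own illustration (join_lines is not in the repository), and it will wrongly merge genuine hyphenated compounds that happen to break at the hyphen, such as "well-" followed by "known".

```python
import re


def join_lines(text):
    """Turn multi-line extracted text into a single line."""
    # Re-join words hyphenated across a line break: "transla-\ntion" -> "translation"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace each remaining line break (and surrounding whitespace) with one space
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(join_lines("machine transla-\ntion is\nuseful.\n"))
```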

$ python let_translatable.py
report_4.txt
report_10.txt
report_2.txt
report_6.txt
report_9.txt
report_5.txt
report_8.txt
report_7.txt
report_3.txt
report_1.txt

3. Translating English to Japanese!

Now we finally translate the resulting text. For details of the code, please see my previous article.

translate_en_jp.py

import requests
import json
import os
import time

API_key = '<put your API key here>'

def post_text(text):
    # Google Cloud Translation API v2 endpoint
    url_items = 'https://www.googleapis.com/language/translate/v2'
    item_data = {
        'target': 'ja',
        'source': 'en',
        'q': text
    }
    response = requests.post('{}?key={}'.format(url_items, API_key), data=item_data)
    return response.text

def jsonConversion(jsonStr):
    data = json.loads(jsonStr)
    return data["data"]["translations"][0]["translatedText"]

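def split_text(text):
    # Split on periods and translate in chunks of roughly 1,000 characters
    sen_list = text.split('.')

    to_google_sen = ""
    from_google = ""

    for sen in sen_list:
        # Skip empty fragments (e.g. the piece after the final period)
        if not sen.strip():
            continue
        to_google_sen += sen + '. '
        if len(to_google_sen) > 1000:
            from_google += jsonConversion(post_text(to_google_sen)) + '\n'
            time.sleep(1)
            to_google_sen = ""

    # Translate whatever is left over; skip the request if nothing remains
    if to_google_sen:
        from_google += jsonConversion(post_text(to_google_sen))
        time.sleep(1)
    return from_google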


if __name__ == '__main__':
    for file_name in os.listdir("eng_txt"):
        print("source: " + file_name)
        # Read the English text, then close the file before translating
        with open("eng_txt/"+file_name) as f:
            s = f.read()
        new_text = split_text(s)
        path_w = "jpn_txt/" + file_name
        with open(path_w, mode='w') as f:
            f.write(new_text)

$ python translate_en_jp.py
source: report_4.txt
source: report_10.txt
source: report_2.txt
source: report_6.txt
source: report_9.txt
source: report_5.txt
source: report_8.txt
source: report_7.txt
source: report_3.txt
source: report_1.txt
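The script assumes every response contains a translation. When a request fails (a wrong key, for instance), Google APIs normally return a JSON body with an "error" object instead of "data", and the translated text itself can contain HTML entities such as `&#39;`. A more defensive variant of jsonConversion might look like the sketch below. The name parse_translation is made up, and the error shape is my understanding of the usual Google API format rather than something the article confirms.

```python
import html
import json


def parse_translation(json_str):
    """Extract the translated text from a Translation API v2 response body."""
    data = json.loads(json_str)
    if "error" in data:
        # Error responses carry a message instead of a "data" object
        raise RuntimeError(data["error"].get("message", "unknown API error"))
    translated = data["data"]["translations"][0]["translatedText"]
    # The API may return HTML entities such as &#39; for an apostrophe
    return html.unescape(translated)
```

Raising on an error response makes a failed request visible immediately, rather than surfacing later as a KeyError on "data".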

Long texts take a while, since the script pauses for one second after each request (roughly every 1,000 characters).
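To estimate the wait before you start, you can split a text into chunks without calling the API and count them. The sketch below mirrors the chunking idea used in translate_en_jp.py (split on periods, flush once a chunk exceeds a character limit), but it is a standalone illustration: chunk_sentences and its limit parameter are names I chose here.

```python
def chunk_sentences(text, limit=1000):
    """Group period-delimited sentences into chunks of roughly `limit` characters."""
    chunks = []
    current = ""
    for sentence in text.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        current += sentence + ". "
        # Flush once the chunk has grown past the limit
        if len(current) > limit:
            chunks.append(current.strip())
            current = ""
    # Keep whatever is left after the last flush
    if current:
        chunks.append(current.strip())
    return chunks


if __name__ == "__main__":
    sample = "First sentence. Second sentence. Third one"
    print(len(chunk_sentences(sample, limit=20)))
```

With a one-second pause per request, `len(chunk_sentences(text))` is roughly the number of seconds spent sleeping for that file, on top of the network time. Splitting on every period is crude: it also cuts at decimal points and abbreviations.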

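Results

The translated text files are written to the jpn_txt folder.

With this, English PDFs should be much less of a headache!
That said, the output text has lost any notion of layout, and the translation may go wrong in places such as page boundaries. Ideally the scripts would handle those cases too, but that looks quite hard. Treat this as a tool for when you have many PDFs and simply want to skim them in Japanese.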
