
Roughly translating HTML with the DeepL API

Last updated at 2023-07-16, posted at 2023-07-16

Overview

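I had been wishing I could translate an entire web page with DeepL from a single command.¹

Then I found that DeepL officially provides a Python library, deepl-python, so I wrote a script around it.

It is about the quality of a sample script, but it turned out to be reasonably usable, so I am sharing it.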

Script

import deepl
import requests
import sys
from bs4 import BeautifulSoup
from urllib.parse import urljoin


auth_key = 'XXXXXX'


# Convert relative links to absolute ones so the translated page can still reference the originals
def convert_relative_to_absolute(html, base_url):
    soup = BeautifulSoup(html, 'html.parser')

    for atag in soup.find_all('a'):
        href = atag.get('href')
        if href and not href.startswith(('http://', 'https://', '#')):
            absolute_url = urljoin(base_url, href)
            atag['href'] = absolute_url

    for link in soup.find_all('link'):
        href = link.get('href')
        if href and not href.startswith(('http://', 'https://')):
            absolute_url = urljoin(base_url, href)
            link['href'] = absolute_url

    for elem in soup.find_all('img'):
        href = elem.get('src')
        if href and not href.startswith(('http://', 'https://')):
            absolute_url = urljoin(base_url, href)
            elem['src'] = absolute_url

    return str(soup)


def translate(html):
    translator = deepl.Translator(auth_key)
    target_lang = 'JA'
    return translator.translate_text(
        html, target_lang=target_lang, tag_handling='html',
        splitting_tags=['pre'],  # for translating code blocks
    )


url = sys.argv[1]

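raw_html = requests.get(url).content.decode()
# Resolve relative links against the page URL itself; stripping the last
# path segment first would make urljoin drop one directory level
base_url = url
html = convert_relative_to_absolute(raw_html, base_url)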

result = translate(html)

if isinstance(result, list):
    # The result may come back in several pieces, so join them into one string
    post_html = ''.join(r.text for r in result)
else:
    post_html = result.text

print(post_html)
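The link rewriting in the script depends on how `urljoin` treats its base argument, so a small illustration may help. This is only a sketch: `page_url` and the helper `absolutize` are hypothetical names used for the demonstration, and `example.com` is a placeholder.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration
page_url = "https://example.com/docs/page.html"


def absolutize(href, base=page_url):
    """Resolve href against base the same way a browser would."""
    return urljoin(base, href)


# A relative path replaces the last segment of the base
print(absolutize("img/a.png"))
# A root-relative path keeps only the scheme and host of the base
print(absolutize("/style.css"))
# A base without a trailing slash loses its last segment as well
print(absolutize("img/a.png", base="https://example.com/docs"))
```

Because `urljoin` already discards the final path segment of the base, passing the page URL unchanged is usually enough.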

For how to obtain an auth_key, see DeepL's official documentation.
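Rather than writing the key into the script, one option is to read it from an environment variable. The sketch below assumes a variable named `DEEPL_AUTH_KEY`; that name and the helper `load_auth_key` are hypothetical and not part of deepl-python.

```python
import os


def load_auth_key(env=os.environ, name="DEEPL_AUTH_KEY"):
    """Return the API key stored in the given environment mapping.

    `name` is an assumed variable name chosen for this example.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable to your DeepL API key")
    return key
```

The script could then use `auth_key = load_auth_key()` in place of the hard-coded string.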

What it can do

With a single command, you can translate an entire web page and write the result out as HTML.

python html_translator.py https://en.wikipedia.org/wiki/Zen_of_Python > zen_of_python.html


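What it cannot do

Translating HTML that is too long
- The request fails with 413 Request Entity Too Large and nothing is translated
Reproducing line breaks
- Layout collapses where line breaks are expressed with newline characters, as in RFCs in particular
Complete archiving
- CSS and images still point to the external originals
Conserving the free-tier quota
- The entire HTML is translated, so the free tier's 500,000 characters per month gets used up very quickly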
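For the 413 limitation above, one possible workaround is to split the page into fragments and send them in several smaller requests. The following is a minimal sketch of the grouping step only. The helper `batch_fragments` is hypothetical, and the `max_bytes` default is an assumed budget, not a value taken from DeepL's documentation, so check the current request size limit before relying on it.

```python
def batch_fragments(fragments, max_bytes=100_000):
    """Greedily group HTML fragments into batches under max_bytes of UTF-8.

    max_bytes is an assumed budget for illustration. A single fragment
    larger than the budget still ends up alone in its own batch.
    """
    batches, current, size = [], [], 0
    for frag in fragments:
        n = len(frag.encode("utf-8"))
        # Start a new batch when adding this fragment would exceed the budget
        if current and size + n > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(frag)
        size += n
    if current:
        batches.append(current)
    return batches
```

The fragments could come from something like `[str(child) for child in soup.body.children]`, and each batch could be passed to `translate_text`, which also accepts a list of strings. The translated pieces would then need to be reassembled into a full page, which this sketch does not attempt.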

Closing

If you want to do something more advanced, please feel free to extend it!
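As one example of such an extension, tied to the free-tier concern mentioned earlier, the script could estimate whether a page fits into the remaining quota before sending it. This sketch assumes that every character of the input counts toward the quota, which is a rough estimate rather than DeepL's exact billing rule. Both helper names are hypothetical.

```python
def quota_after_translation(html, used, limit):
    """Estimate the quota left after translating html.

    Assumes every input character is billed (a rough estimate only).
    A negative result means the page would not fit.
    """
    return limit - used - len(html)


def should_translate(html, used, limit):
    # Refuse when the estimate would exceed the limit
    return quota_after_translation(html, used, limit) >= 0
```

With deepl-python, `used` and `limit` could be taken from `translator.get_usage()`, via `usage.character.count` and `usage.character.limit`.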

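¹ I think the Chrome extension is also good enough. The only reasons I made this a command are that it seemed easier to automate and that I recently could not log in properly from the Chrome extension.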
