How to Merge HTML Files, Translate Them with DeepL, and Split Them Back into the Original Directory Structure

Posted at 2024-07-14

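This article describes a way to work around DeepL's limit on the number of file uploads when translating a large number of HTML files.
With this approach, many HTML files spread across a directory hierarchy are merged into a single file, translated with DeepL, and then restored to their original structure. Because both the merge and the split are scripted, manual work is kept to a minimum.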

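Procedure

1. Merging the HTML files

All HTML files are merged into one large file while the directory hierarchy is recorded. In the merged file, each file's content is preceded by a special marker (an HTML comment) holding that file's original path and followed by a marker that closes the file.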
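As a rough illustration of what the merged file looks like, the sketch below wraps one file's content in the same comment markers that the merge script writes. The helper name `wrap_with_markers` and the sample path are hypothetical and used only for illustration.

```python
def wrap_with_markers(filepath: str, html: str) -> str:
    """Return html wrapped in the FILE PATH / END FILE comment markers."""
    return (
        f"\n<!-- ***** FILE PATH: {filepath} ***** -->\n"
        f"{html}"
        "\n<!-- ***** END FILE ***** -->\n"
    )


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the article
    print(wrap_with_markers("docs/guide/index.html", "<html><body>Hello</body></html>"))
```

Because the markers are ordinary HTML comments, they do not show up when the merged file is rendered, and the path inside them is all the split step needs to rebuild the tree.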

Merge script

Use the following Python script to merge the HTML files. It takes the source directory and the output file as arguments.

# combine_html.py
import os
import argparse

def combine_html_files(directory, outfile):
    """Append every .html file under `directory` to `outfile`, wrapped in path markers."""
    # Absolute path of the output file, so it is never merged into itself
    # when it is written somewhere inside `directory`
    out_path = os.path.abspath(outfile.name)
    for root, _, files in os.walk(directory):
        for file in files:
            if not file.endswith('.html'):
                continue
            filepath = os.path.join(root, file)
            if os.path.abspath(filepath) == out_path:
                continue
            with open(filepath, 'r', encoding='utf-8') as infile:
                # The marker records the original path so the split script can restore it
                outfile.write(f"\n<!-- ***** FILE PATH: {filepath} ***** -->\n")
                outfile.write(infile.read())
                outfile.write("\n<!-- ***** END FILE ***** -->\n")

def main():
    parser = argparse.ArgumentParser(description='Combine HTML files into a single file.')
    parser.add_argument('input_dir', type=str, help='The root directory of HTML files to combine.')
    parser.add_argument('output_file', type=str, help='The output file to save the combined HTML content.')

    args = parser.parse_args()

    with open(args.output_file, 'w', encoding='utf-8') as outfile:
        combine_html_files(args.input_dir, outfile)

    print(f"All HTML files have been combined into {args.output_file}")

if __name__ == '__main__':
    main()

2. Translating with DeepL

Upload the merged file to DeepL and translate it. Once the translation is complete, download the translated file.
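The split step depends on the path markers coming back from the translation unchanged. The article does not say whether DeepL ever alters HTML comments, so a cheap safeguard before splitting is to compare the markers in the merged file with those in the translated file. The following is a minimal sketch under that assumption; the function names `marker_paths` and `markers_preserved` are hypothetical.

```python
import re
from typing import List

# Same FILE PATH marker format that the merge script writes
FILE_PATH_RE = re.compile(r'<!-- \*\*\*\*\* FILE PATH: (.+?) \*\*\*\*\* -->')


def marker_paths(text: str) -> List[str]:
    """Return the paths found in FILE PATH markers, in document order."""
    return FILE_PATH_RE.findall(text)


def markers_preserved(original: str, translated: str) -> bool:
    """True when both texts contain the same marker paths in the same order."""
    return marker_paths(original) == marker_paths(translated)


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the merged file and its translation
    original = "\n<!-- ***** FILE PATH: a/b.html ***** -->\n<p>Hello</p>\n<!-- ***** END FILE ***** -->\n"
    translated = original.replace("Hello", "こんにちは")
    print(markers_preserved(original, translated))
```

In practice the two strings would be read from the merged file and the downloaded translation. A mismatch means some paths were changed or lost, and splitting would write files to the wrong places or skip them.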

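3. Splitting the translated file

The translated file is split back into individual files, preserving the original directory structure. The markers inserted during the merge identify where each original file begins and which path it belongs to.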
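The core of the split is `re.split` with a capture group: when the pattern contains a group, the captured text (here the path) is kept in the result list between the pieces of content. The sketch below shows the idea on an in-memory string. It is a simplified variant for illustration, not the article's script, and the name `split_sections` is hypothetical.

```python
import re

FILE_PATH_RE = re.compile(r'<!-- \*\*\*\*\* FILE PATH: (.+?) \*\*\*\*\* -->')
END_MARKER = '<!-- ***** END FILE ***** -->'


def split_sections(combined: str) -> dict:
    """Map each recorded path to the content that followed its marker."""
    # With one capture group the result is [preamble, path1, body1, path2, body2, ...]
    parts = FILE_PATH_RE.split(combined)
    sections = {}
    for path, body in zip(parts[1::2], parts[2::2]):
        # Drop the closing marker and the newlines added around the content at merge time
        sections[path] = body.replace(END_MARKER, '').strip('\n')
    return sections


if __name__ == "__main__":
    sample = (
        "\n<!-- ***** FILE PATH: a/one.html ***** -->\n<p>1</p>\n<!-- ***** END FILE ***** -->\n"
        "\n<!-- ***** FILE PATH: b/two.html ***** -->\n<p>2</p>\n<!-- ***** END FILE ***** -->\n"
    )
    for path, body in split_sections(sample).items():
        print(path, "->", body)
```

Writing each dictionary entry to `os.path.join(output_dir, path)` after creating the parent directories is then all that remains to restore the tree.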

Split script

Use the following Python script to split the translated file. It takes the translated file and the output directory as arguments.

# split_html.py
import os
import re
import argparse

# Matches both marker forms written by combine_html.py:
#   <!-- ***** FILE PATH: some/dir/page.html ***** -->
#   <!-- ***** END FILE ***** -->
MARKER_RE = re.compile(r'<!-- \*\*\*\*\* (FILE PATH|END FILE)(?:: (.+?))? \*\*\*\*\* -->')

def split_html_file(input_file, output_root_dir):
    # Read the whole translated file
    with open(input_file, 'r', encoding='utf-8') as infile:
        content = infile.read()

    # With two capture groups, re.split yields:
    # [preamble, marker, path, text, marker, path, text, ...]
    sections = MARKER_RE.split(content)

    # Save each section under its original file name and path
    for i in range(1, len(sections), 3):
        marker = sections[i]
        filepath = sections[i + 1]  # None for END FILE markers
        html_content = sections[i + 2]

        # Text after an END FILE marker is only padding between files, so skip it
        if marker != "FILE PATH" or not filepath:
            continue

        # Drop any drive letter and leading separators so that an absolute path
        # recorded at merge time stays under output_root_dir
        relative_path = os.path.splitdrive(filepath.strip())[1].lstrip('/\\')
        output_path = os.path.join(output_root_dir, relative_path)
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        with open(output_path, 'w', encoding='utf-8') as outfile:
            # Remove the newlines the merge script added around the content
            outfile.write(html_content.strip('\n') + '\n')

def main():
    parser = argparse.ArgumentParser(description='Split translated HTML file into individual files with original directory structure.')
    parser.add_argument('input_file', type=str, help='The translated HTML file to split.')
    parser.add_argument('output_dir', type=str, help='The root directory to save the split HTML files.')

    args = parser.parse_args()

    split_html_file(args.input_file, args.output_dir)

    print("Translated HTML files have been split into individual files with original directory structure.")

if __name__ == '__main__':
    main()
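After splitting, it can be worth confirming that every source file has a translated counterpart. The following is a minimal sketch that compares the relative `.html` paths under two directory trees; the function names `html_relpaths` and `missing_after_split` are hypothetical and not part of the article's scripts.

```python
import os


def html_relpaths(root: str) -> set:
    """Collect the paths of all .html files under root, relative to root."""
    found = set()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.html'):
                full = os.path.join(dirpath, name)
                found.add(os.path.relpath(full, root))
    return found


def missing_after_split(source_dir: str, output_dir: str) -> list:
    """Return source files that have no counterpart under output_dir, sorted."""
    return sorted(html_relpaths(source_dir) - html_relpaths(output_dir))
```

Note that the markers record each path as the merge script saw it, including the input directory name, so the tree to compare against is typically `output_dir/input_dir` rather than `output_dir` itself, for example `missing_after_split("ProxmoxVEManpages", "ja/ProxmoxVEManpages")`. An empty list means nothing was lost.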

Running the scripts

Running the merge script

python combine_html.py /path/to/html/files combined.html

Running the split script

python split_html.py translated_combined.html /path/to/output/files
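One thing to watch with commands like these: the merge script records each path exactly as `os.path.join(root, file)` produces it, so an absolute input directory yields absolute paths in the markers. `os.path.join` discards its earlier components when a later one is absolute, so a recorded absolute path would land outside the output directory unless it is made relative first. A minimal sketch of that normalization, with the hypothetical helper name `safe_output_path`:

```python
import os


def safe_output_path(output_root: str, recorded_path: str) -> str:
    """Join recorded_path under output_root even if it was recorded as absolute."""
    # Remove a Windows drive letter if present, then any leading separators
    _, tail = os.path.splitdrive(recorded_path)
    relative = tail.lstrip('/\\')
    return os.path.join(output_root, relative)


if __name__ == "__main__":
    # Hypothetical paths for illustration
    print(safe_output_path("out", "/srv/docs/a.html"))
    print(safe_output_path("out", "docs/a.html"))
```

Running the merge from the parent of the HTML directory with a relative path, as in the test run later in the article, sidesteps the issue altogether. This sketch does not guard against `..` components in a recorded path.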

Trying it out

I tested the scripts by translating the Proxmox VE manual pages (manpages), published at the following URL, into Japanese.
https://pve.proxmox.com/pve-docs/

Merge

Merge the files under the ProxmoxVEManpages directory into combined.html. In this run the scripts were saved as combine.py and extract.py.

$ python3 combine.py ProxmoxVEManpages combined.html
All HTML files have been combined into combined.html
$ ls -laF
total 3456
drwxrwxrwx 1 flathill flathill     512 Jul 15 03:08 ./
drwxrwxrwx 1 flathill flathill     512 Feb 29 15:35 ../
drwxrwxrwx 1 flathill flathill     512 Jul 15 02:54 ProxmoxVEManpages/
-rwxrwxrwx 1 flathill flathill    1117 Jul 15 03:12 combine.py*
-rwxrwxrwx 1 flathill flathill 3156197 Jul 15 03:22 combined.html*
-rwxrwxrwx 1 flathill flathill    1796 Jul 15 03:12 extract.py*

Split

Extract combined_ja.html, the file translated into Japanese by DeepL, under the ja directory.

$ python3 extract.py combined_ja.html ja
Translated HTML files have been split into individual files with original directory structure.
$ ls -alF ja/
total 0
drwxrwxrwx 1 flathill flathill 512 Jul 15 03:46 ./
drwxrwxrwx 1 flathill flathill 512 Jul 15 03:08 ../
drwxrwxrwx 1 flathill flathill 512 Jul 15 03:46 ProxmoxVEManpages/
$ ls -alF ja/ProxmoxVEManpages/
total 5760
drwxrwxrwx 1 flathill flathill    512 Jul 15 03:46 ./
drwxrwxrwx 1 flathill flathill    512 Jul 15 03:46 ../
-rwxrwxrwx 1 flathill flathill  27301 Jul 15 03:46 cpu-models.conf.5.html*
-rwxrwxrwx 1 flathill flathill  47538 Jul 15 03:46 datacenter.cfg.5.html*
-rwxrwxrwx 1 flathill flathill 164430 Jul 15 03:46 ha-manager.1.html*
-rwxrwxrwx 1 flathill flathill 342467 Jul 15 03:46 pct.1.html*
-rwxrwxrwx 1 flathill flathill  66063 Jul 15 03:46 pct.conf.5.html*
-rwxrwxrwx 1 flathill flathill  56129 Jul 15 03:46 pmxcfs.8.html*
-rwxrwxrwx 1 flathill flathill 206042 Jul 15 03:46 pve-firewall.8.html*
-rwxrwxrwx 1 flathill flathill  23480 Jul 15 03:46 pve-ha-crm.8.html*
-rwxrwxrwx 1 flathill flathill  23542 Jul 15 03:46 pve-ha-lrm.8.html*
-rwxrwxrwx 1 flathill flathill  24946 Jul 15 03:46 pveam.1.html*
-rwxrwxrwx 1 flathill flathill 236472 Jul 15 03:46 pveceph.1.html*
-rwxrwxrwx 1 flathill flathill 224603 Jul 15 03:46 pvecm.1.html*
-rwxrwxrwx 1 flathill flathill  27545 Jul 15 03:46 pvedaemon.8.html*
-rwxrwxrwx 1 flathill flathill  63474 Jul 15 03:46 pvenode.1.html*
-rwxrwxrwx 1 flathill flathill  22286 Jul 15 03:46 pveperf.1.html*
-rwxrwxrwx 1 flathill flathill  54220 Jul 15 03:46 pveproxy.8.html*
-rwxrwxrwx 1 flathill flathill  24013 Jul 15 03:46 pvescheduler.8.html*
-rwxrwxrwx 1 flathill flathill  33883 Jul 15 03:46 pvesh.1.html*
-rwxrwxrwx 1 flathill flathill 319191 Jul 15 03:46 pvesm.1.html*
-rwxrwxrwx 1 flathill flathill  61446 Jul 15 03:46 pvesr.1.html*
-rwxrwxrwx 1 flathill flathill  23686 Jul 15 03:46 pvestatd.8.html*
-rwxrwxrwx 1 flathill flathill  24707 Jul 15 03:46 pvesubscription.1.html*
-rwxrwxrwx 1 flathill flathill 236024 Jul 15 03:46 pveum.1.html*
-rwxrwxrwx 1 flathill flathill 664907 Jul 15 03:46 qm.1.html*
-rwxrwxrwx 1 flathill flathill 189875 Jul 15 03:46 qm.conf.5.html*
-rwxrwxrwx 1 flathill flathill  23257 Jul 15 03:46 qmeventd.8.html*
-rwxrwxrwx 1 flathill flathill  24239 Jul 15 03:46 qmrestore.1.html*
-rwxrwxrwx 1 flathill flathill  24924 Jul 15 03:46 spiceproxy.8.html*
-rwxrwxrwx 1 flathill flathill 140618 Jul 15 03:46 vzdump.1.html*

GitHub repository

See the GitHub repository for details and the latest versions of the merge and split scripts.
