Running YomiToku on Modal, a GPU Cloud Service

Posted at 2025-04-17

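YomiToku is an AI OCR engine trained specifically for Japanese. It is published as a Python package, so once installed you can run OCR from the CLI. It also provides an API, so it can be driven from Python code as well.

This article walks through using YomiToku on Modal, a GPU cloud service. Even without a GPU on hand, Modal lets you run a capable AI OCR engine.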

Note

YomiToku ships with a CLI command, yomitoku, so programming is not strictly required. Treat this article as a reference for how to implement it in combination with Modal.

What is Modal

Modal is a GPU cloud service. It takes Python code written locally and runs it in the cloud. Because the code is launched with the modal command, GPU code can be run as if it were being processed locally.

Installation

Modal is installed with the pip command.

pip install modal

After installation, run the setup.

python -m modal setup

Running this command opens a browser and links your machine to your Modal account. The account itself can be created with a GitHub or Google account.

The basic form of a Modal app

This is the most basic Modal code.

import sys

import modal

app = modal.App("example-hello-world")

@app.function()
def f(i):
    if i % 2 == 0:
        print("hello", i)
    else:
        print("world", i, file=sys.stderr)

    return i * i

@app.local_entrypoint()
def main():
    # Run locally
    print(f.local(1000))

    # Run remotely
    print(f.remote(1000))

    # Run remotely and in parallel
    total = 0
    for ret in f.map(range(200)):
        total += ret

    print(total)

A function decorated with @app.function() becomes a function that runs on Modal, inside a container. A function decorated with @app.local_entrypoint() is the one that runs first on the local machine. Appending .local() or .remote() to the function name selects whether the call runs locally or remotely.
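As a sanity check on what the example should print, the arithmetic can be reproduced without Modal. The sketch below is a plain-Python stand-in, not Modal code: `square` and `expected_outputs` are hypothetical names, `square` mirrors the return value of `f`, and the built-in `map` stands in for `f.map`.

```python
def square(i: int) -> int:
    """Mirror the return value of f in the Modal example (i * i)."""
    return i * i


def expected_outputs() -> tuple[int, int]:
    """Return the value of f(1000) and the total accumulated over range(200)."""
    single = square(1000)
    # The built-in map is a sequential stand-in for f.map, which sends the
    # same inputs to remote containers instead.
    total = sum(map(square, range(200)))
    return single, total


if __name__ == "__main__":
    single, total = expected_outputs()
    print(single)  # the Modal example prints this for both .local() and .remote()
    print(total)
```

If the values printed by `modal run` differ from these, the problem is in the setup rather than in the function body.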

One thing to watch out for: when there are multiple functions, each one has a different file system. Files are therefore passed between them using a mechanism called Volumes.

Running YomiToku on Modal

Now for the steps to run YomiToku on Modal. The files to analyze are those in a local invoices directory.

import modal
import os
cwd = os.getcwd()
app = modal.App("yomitoku-modal")

@app.local_entrypoint()
def main():
    # Get the files in the invoices directory
    entries = os.listdir(f"{cwd}/invoices")
    for entry in entries:
        # Inside the container, the directory is mounted at `/invoices`
        path_image = f"/invoices/{entry}"
        yomitoku_function.remote(path_image)
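The loop above sends every entry returned by `os.listdir` to the remote function, which includes hidden files such as `.DS_Store` and any subdirectories, if present. Below is a minimal sketch of a filter, assuming only PDF files should be processed. `remote_pdf_paths` and its parameters are hypothetical names, not part of Modal or YomiToku.

```python
import os


def remote_pdf_paths(local_dir: str, remote_root: str = "/invoices") -> list[str]:
    """Map the PDF files in local_dir to their paths inside the container."""
    paths = []
    for entry in sorted(os.listdir(local_dir)):
        full = os.path.join(local_dir, entry)
        # Skip directories and anything that is not a PDF (e.g. .DS_Store).
        if os.path.isfile(full) and entry.lower().endswith(".pdf"):
            paths.append(f"{remote_root}/{entry}")
    return paths
```

In `main`, the loop could then iterate over `remote_pdf_paths(f"{cwd}/invoices")` and pass each path to `yomitoku_function.remote`.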

Next, create an image that installs YomiToku and the packages it uses, such as OpenCV. For details on images, see Images | Modal Docs.

open_cv_image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("python3-opencv")
    .pip_install(
        "yomitoku",
        "opencv-python~=4.10.0",
    )
    .add_local_dir(
        f'{cwd}/invoices',
        remote_path="/invoices",
    )
)

In addition, prepare a volume to store the analysis results.

volume = modal.Volume.from_name("yomitoku", create_if_missing=True)

Then pass this image and volume to the function decorator.

@app.function(gpu="any", image=open_cv_image, volumes={"/results": volume})
def yomitoku_function(image_path: str):
    # The processing goes here
    ...

Implementing yomitoku_function

The function body runs in a separate container, so the libraries it needs are imported inside the function.

import json
import cv2
import io
import csv
from yomitoku import DocumentAnalyzer
from yomitoku.data.functions import load_pdf

Next, prepare the required variables.

# Get the output file name
output_file_name = image_path.split("/")[-1].split(".")[0]
# Create the output buffer (for CSV)
output = io.StringIO()
writer = csv.writer(output, quoting=csv.QUOTE_MINIMAL)
# Initialize DocumentAnalyzer
analyzer = DocumentAnalyzer(visualize=True, device="cuda")
# Load the document
imgs = load_pdf(image_path)
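The output file name above is everything before the first dot, so a file such as `invoice.2025.pdf` would yield `invoice`. Below is a minimal sketch of an alternative that strips only the final extension, using the standard `pathlib` module. `output_stem` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def output_stem(image_path: str) -> str:
    """Return the file name without its directory and final extension."""
    # PurePosixPath is used because the paths inside the container are POSIX-style.
    return PurePosixPath(image_path).stem
```

For file names with a single extension, both approaches give the same result.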

load_pdf returns one image per page, so each page is analyzed in turn.

for i, img in enumerate(imgs):
    # The processing goes here
    ...

The AI OCR itself takes a single call.

# Run AI OCR
results, ocr_vis, layout_vis = analyzer(img)

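The results are then exported in HTML, JSON, Markdown, and CSV formats. Each conversion method returns its result, which is then written to a file. The JSON result is a dict, so it is converted to a JSON string. The CSV result is a list, so it is converted to a CSV string with the csv module.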

# Export the analysis results as HTML
with open(f"/results/{output_file_name}_{i}.html", "w") as f:
    html = results.to_html("", img=img)
    f.write(html)
print(f"HTML形式で解析結果をエクスポート: /results/{output_file_name}_{i}.html")
# Export the analysis results as JSON
with open(f"/results/{output_file_name}_{i}.json", "w") as f:
    f.write(json.dumps(results.to_json("", img=img)))
print(f"JSON形式で解析結果をエクスポート: /results/{output_file_name}_{i}.json")
# Export the analysis results as Markdown
with open(f"/results/{output_file_name}_{i}.md", "w") as f:
    f.write(results.to_markdown("", img=img))
print(f"Markdown形式で解析結果をエクスポート: /results/{output_file_name}_{i}.md")
# Export the analysis results as CSV
with open(f"/results/{output_file_name}_{i}.csv", "w", newline="", encoding="utf-8", errors="ignore") as f:
    # Reset the shared buffer so rows from earlier pages are not written again
    output.seek(0)
    output.truncate(0)
    elements = results.to_csv("", img=img)
    for element in elements:
        if element["type"] == "table":
            writer.writerows(element["element"])
        else:
            writer.writerow([element["element"]])

        # Blank row between elements
        writer.writerow([""])
    f.write(output.getvalue())
print(f"CSV形式で解析結果をエクスポート: /results/{output_file_name}_{i}.csv")
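The CSV step can be tried without a GPU by feeding it hand-written data. The sketch below assumes, as the code above does, that `to_csv` yields a list of dicts with a `type` key and an `element` key, where a table element holds a list of rows. The sample `elements` and the helper name `elements_to_csv` are made up for illustration, and the real YomiToku output may carry more fields.

```python
import csv
import io


def elements_to_csv(elements: list[dict]) -> str:
    """Serialize one page of OCR elements into a CSV string."""
    # A new buffer per call keeps each page's rows separate.
    output = io.StringIO(newline="")
    writer = csv.writer(output, quoting=csv.QUOTE_MINIMAL)
    for element in elements:
        if element["type"] == "table":
            # A table is a list of rows, each row a list of cell strings.
            writer.writerows(element["element"])
        else:
            # Any other element is written as a single-cell row.
            writer.writerow([element["element"]])
        # Blank row as a separator between elements.
        writer.writerow([""])
    return output.getvalue()


if __name__ == "__main__":
    sample = [
        {"type": "paragraph", "element": "請求書"},
        {"type": "table", "element": [["品目", "金額"], ["商品A", "1,000"]]},
    ]
    print(elements_to_csv(sample))
```

With `QUOTE_MINIMAL`, only cells that contain a delimiter, quote character, or line break are quoted, so a cell such as `1,000` is quoted while plain text is not.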

The visualization images produced by the analysis, such as the layout result, are saved as well.

# Save the visualization images
print(f"OCR結果の可視化画像を保存: /results/{output_file_name}_ocr_{i}.jpg")
cv2.imwrite(f"/results/{output_file_name}_ocr_{i}.jpg", ocr_vis)
print(f"レイアウト解析結果の可視化画像を保存: /results/{output_file_name}_layout_{i}.jpg")
cv2.imwrite(f"/results/{output_file_name}_layout_{i}.jpg", layout_vis)

That completes the processing.

The complete code

The final code looks like this.

import modal
import os
cwd = os.getcwd()
app = modal.App("yomitoku-modal")

volume = modal.Volume.from_name("yomitoku", create_if_missing=True)

open_cv_image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("python3-opencv")
    .pip_install(
        "yomitoku",
        "opencv-python~=4.10.0",
    )
    .add_local_dir(
        f'{cwd}/invoices',
        remote_path="/invoices",
    )
)

@app.function(gpu="any", image=open_cv_image, volumes={"/results": volume })
def yomitoku_function(image_path: str):
    import json
    import cv2
    import io
    import csv
    from yomitoku import DocumentAnalyzer
    from yomitoku.data.functions import load_pdf
    # Get the output file name
    output_file_name = image_path.split("/")[-1].split(".")[0]
    # Create the output buffer (for CSV)
    output = io.StringIO()
    writer = csv.writer(output, quoting=csv.QUOTE_MINIMAL)
    # Initialize DocumentAnalyzer
    analyzer = DocumentAnalyzer(visualize=True, device="cuda")
    # Load the document
    imgs = load_pdf(image_path)
    for i, img in enumerate(imgs):
        # Run AI OCR
        results, ocr_vis, layout_vis = analyzer(img)
        # Export the analysis results as HTML
        with open(f"/results/{output_file_name}_{i}.html", "w") as f:
            html = results.to_html("", img=img)
            f.write(html)
        print(f"HTML形式で解析結果をエクスポート: /results/{output_file_name}_{i}.html")
        # Export the analysis results as JSON
        with open(f"/results/{output_file_name}_{i}.json", "w") as f:
            f.write(json.dumps(results.to_json("", img=img)))
        print(f"JSON形式で解析結果をエクスポート: /results/{output_file_name}_{i}.json")
        # Export the analysis results as Markdown
        with open(f"/results/{output_file_name}_{i}.md", "w") as f:
            f.write(results.to_markdown("", img=img))
        print(f"Markdown形式で解析結果をエクスポート: /results/{output_file_name}_{i}.md")
        # Export the analysis results as CSV
        with open(f"/results/{output_file_name}_{i}.csv", "w", newline="", encoding="utf-8", errors="ignore") as f:
            # Reset the shared buffer so rows from earlier pages are not written again
            output.seek(0)
            output.truncate(0)
            elements = results.to_csv("", img=img)
            for element in elements:
                if element["type"] == "table":
                    writer.writerows(element["element"])
                else:
                    writer.writerow([element["element"]])

                # Blank row between elements
                writer.writerow([""])
            f.write(output.getvalue())
        print(f"CSV形式で解析結果をエクスポート: /results/{output_file_name}_{i}.csv")
        # Save the visualization images
        print(f"OCR結果の可視化画像を保存: /results/{output_file_name}_ocr_{i}.jpg")
        cv2.imwrite(f"/results/{output_file_name}_ocr_{i}.jpg", ocr_vis)
        print(f"レイアウト解析結果の可視化画像を保存: /results/{output_file_name}_layout_{i}.jpg")
        cv2.imwrite(f"/results/{output_file_name}_layout_{i}.jpg", layout_vis)
    output.close()
    print("処理が完了しました。")

@app.local_entrypoint()
def main():
    # Get the files in the invoices directory
    entries = os.listdir(f"{cwd}/invoices")
    for entry in entries:
        # Inside the container, the directory is mounted at `/invoices`
        path_image = f"/invoices/{entry}"
        yomitoku_function.remote(path_image)

Running it

Save the code as runner.py and run it with the modal command.

% modal run runner.py
✓ Initialized. View run at https://modal.com/apps/goofmint/main/ap-K1wZJQ9AlDxbsQJySIuhEG
✓ Created objects.
├── 🔨 Created mount /path/to/runner.py
├── 🔨 Created mount /path/to/invoices
└── 🔨 Created function yomitoku_function.
2025-03-12 05:39:10,245 - yomitoku.base - INFO - Initialize TextDetector
2025-03-12 05:39:11,889 - yomitoku.base - INFO - Initialize TextRecognizer
2025-03-12 05:39:13,780 - yomitoku.base - INFO - Initialize LayoutParser
2025-03-12 05:39:15,228 - yomitoku.base - INFO - Initialize TableStructureRecognizer
2025-03-12 05:39:18,431 - yomitoku.base - INFO - TextDetector __call__ elapsed_time: 1.8317437171936035
2025-03-12 05:39:18,731 - yomitoku.base - INFO - LayoutParser __call__ elapsed_time: 2.130420684814453
2025-03-12 05:39:18,892 - yomitoku.base - INFO - TableStructureRecognizer __call__ elapsed_time: 0.16080307960510254
2025-03-12 05:39:22,152 - yomitoku.base - INFO - TextRecognizer __call__ elapsed_time: 3.2309658527374268
HTML形式で解析結果をエクスポート: /results/invoice_0.html
JSON形式で解析結果をエクスポート: /results/invoice_0.json
Markdown形式で解析結果をエクスポート: /results/invoice_0.md
CSV形式で解析結果をエクスポート: /results/invoice_0.csv
OCR結果の可視化画像を保存: /results/invoice_ocr_0.jpg
レイアウト解析結果の可視化画像を保存: /results/invoice_layout_0.jpg
処理が完了しました。
2025-03-12 05:39:22,635 - yomitoku.base - INFO - Initialize TextDetector
2025-03-12 05:39:23,312 - yomitoku.base - INFO - Initialize TextRecognizer
2025-03-12 05:39:24,112 - yomitoku.base - INFO - Initialize LayoutParser
2025-03-12 05:39:24,585 - yomitoku.base - INFO - Initialize TableStructureRecognizer
2025-03-12 05:39:25,345 - yomitoku.base - INFO - LayoutParser __call__ elapsed_time: 0.13031840324401855
2025-03-12 05:39:25,345 - yomitoku.base - INFO - LayoutParser wrapper elapsed_time: 0.13048863410949707
2025-03-12 05:39:25,568 - yomitoku.base - INFO - TextDetector __call__ elapsed_time: 0.3536806106567383
2025-03-12 05:39:25,568 - yomitoku.base - INFO - TextDetector wrapper elapsed_time: 0.3538353443145752
2025-03-12 05:39:25,604 - yomitoku.base - INFO - TableStructureRecognizer __call__ elapsed_time: 0.2582571506500244
2025-03-12 05:39:25,604 - yomitoku.base - INFO - TableStructureRecognizer wrapper elapsed_time: 0.2584211826324463
2025-03-12 05:39:28,731 - yomitoku.base - INFO - TextRecognizer __call__ elapsed_time: 3.09946608543396
2025-03-12 05:39:28,731 - yomitoku.base - INFO - TextRecognizer wrapper elapsed_time: 3.0996358394622803
HTML形式で解析結果をエクスポート: /results/template_02_0.html
JSON形式で解析結果をエクスポート: /results/template_02_0.json
Markdown形式で解析結果をエクスポート: /results/template_02_0.md
CSV形式で解析結果をエクスポート: /results/template_02_0.csv
OCR結果の可視化画像を保存: /results/template_02_ocr_0.jpg
レイアウト解析結果の可視化画像を保存: /results/template_02_layout_0.jpg
処理が完了しました。
Stopping app - local entrypoint completed.
Runner terminated.

If the result files have been saved to Modal's storage, the run is complete.

Downloading

The result files can also be downloaded with the modal command.

% modal volume get yomitoku / results
⠙ Downloading file(s) to local...
Downloading file(s) to local... 0:00:12 ━━ (12 out of 19 files completed)
  output_ocr_0.jpg ━━━ 0.0% • 0.0/704.6 kB • ? • -:--:--
  output_layout_0.jpg ━━━ 0.0% • 0.0/959.8 kB • ? • -:--:--
  output_0.md ━━━ 0.0% • 0.0/1.5 kB   • ? • -:--:--
  output_0.json ━━━ 0.0% • 0.0/38.0 kB  • ? • -:--:--
  output_0.html ━━━ 0.0% • 0.0/7.3 kB   • ? • -:--:--
  output_0.csv ━━━ 0.0% • 0.0/1.4 kB   • ? • -:--:--
figures/output_0_figure_0.png ━━━ 0.0% • 0.0/18.9 kB  • ? • -:--:--

If the result files have been downloaded to your local machine, it worked.
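One quick way to confirm the download is to check that each page has its six expected files. This is a minimal sketch, assuming the files were downloaded into a local `results` directory and follow the naming used in `yomitoku_function`. `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path


def missing_outputs(results_dir: str, stem: str, pages: int) -> list[str]:
    """List the expected result files that are absent from results_dir."""
    root = Path(results_dir)
    missing = []
    for i in range(pages):
        # File names follow the patterns written by yomitoku_function.
        expected = [
            f"{stem}_{i}.html",
            f"{stem}_{i}.json",
            f"{stem}_{i}.md",
            f"{stem}_{i}.csv",
            f"{stem}_ocr_{i}.jpg",
            f"{stem}_layout_{i}.jpg",
        ]
        missing.extend(name for name in expected if not (root / name).is_file())
    return missing
```

For example, `missing_outputs("results", "invoice", 1)` returns an empty list when all six files for a single-page `invoice` document are present.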

(Image: the PDF file processed by OCR in this example.)

(Image: the OCR result in Markdown.) The Japanese text is read correctly, as are the header and line-item sections.

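License

YomiToku is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). It can be used freely for non-commercial personal use and for research purposes. Commercial use requires a commercial license.

Summary

YomiToku does not require a high-end GPU by any means, but running it on a CPU takes time. It is also slow in environments that do not support CUDA, such as macOS.

In those cases, Modal makes it possible to run GPU-backed AI OCR as if it were running locally. Give AI OCR a try to make your work more efficient.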

kotaro-kinoshita/yomitoku: Yomitoku is an AI-powered document image analysis package designed specifically for the Japanese language.
