
Merging the characters of a box file generated by jTessBoxEditor, one line at a time [Tesseract]

Posted at 2020-04-15

Overview

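Merging characters by hand in a box file generated by jTessBoxEditor was very tedious, so I wrote a script to do it.
I could not tell whether merging actually improves accuracy, but I am posting it anyway so that the work does not go to waste.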

Prerequisites

Create the following file. Paste its contents into jTessBoxEditor and generate the TIFF file, the box file, and the other output files.

text.txt

文字
テスト
プログラミング
mgs

Program

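This program reads the strings defined in "text.txt" one line at a time and, for each line, merges the corresponding per-character entries in the box file into a single entry, then regenerates the box file. The regenerated file takes the place of the box file that was generated together with the TIFF file (the script overwrites it in place after saving a backup with a "copy_" prefix); specify it when training with jTessBoxEditor.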
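
To make the merge concrete, here is a minimal sketch on in-memory data rather than on real files. It assumes the usual Tesseract box line layout, "<char> <left> <bottom> <right> <top> <page>". The coordinates are made up for illustration, and the helper name merge_entries is hypothetical and does not appear in the script below.

```python
def merge_entries(word, entries):
    """Merge per-character box entries into one entry for `word`.

    `entries` holds one (char, left, bottom, right, top, page) tuple per character.
    """
    left = entries[0][1]                  # left edge of the first character
    bottom = min(e[2] for e in entries)   # lowest bottom edge
    right = max(e[3] for e in entries)    # rightmost edge
    top = max(e[4] for e in entries)      # highest top edge
    page = entries[0][5]                  # page number of the first character
    return (word, left, bottom, right, top, page)


# Hypothetical coordinates for the three characters of "テスト".
entries = [
    ("テ", 10, 20, 30, 42, 0),
    ("ス", 32, 21, 52, 41, 0),
    ("ト", 54, 19, 70, 43, 0),
]
merged = merge_entries("テスト", entries)
print(" ".join(str(v) for v in merged))  # テスト 10 19 70 43 0
```

The three character boxes collapse into one box that spans from the first character's left edge to the largest right edge, and from the lowest bottom to the highest top.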

sample.py

import shutil

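def read_words(input_file):
    # Assumes text.txt is saved as UTF-8.
    with open(input_file, "r", encoding="utf-8") as f:
        data = [t.rstrip("\n") for t in f]
    # Return the non-empty lines, one word per line.
    return [t for t in data if t]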

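def marge_box(input_file, words):
    # Each box line has the form "<char> <left> <bottom> <right> <top> <page>".
    with open(input_file, "r", encoding="utf-8") as f:
        data = [t.rstrip("\n").split(" ") for t in f if t.strip()]

    res = []
    start = 0
    for word in words:
        length = len(word)
        if length == 0:
            continue  # nothing to merge for an empty line
        end = start + length
        tmp = data[start:end]  # box entries for the characters of this word
        if len(tmp) != length:
            raise ValueError(f"not enough box entries for {word!r}")
        print("word", word)
        print("start", start)
        print("end", end)
        print("length", length)

        # Convert to int before comparing; as strings, "9" would sort after "10".
        left = int(tmp[0][1])                 # left edge of the first character
        bottom = min(int(t[2]) for t in tmp)  # lowest bottom edge
        right = max(int(t[3]) for t in tmp)   # rightmost edge
        top = max(int(t[4]) for t in tmp)     # highest top edge
        page = tmp[0][5]                      # page number of the first character
        merged = [word, str(left), str(bottom), str(right), str(top), page]
        print("merged", merged)
        res.append(merged)
        start = end

    shutil.copy(input_file, "copy_" + input_file)  # keep a backup copy of the original file

    with open(input_file, "w", encoding="utf-8") as f:
        for t in res:
            print(" ".join(t), file=f)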

if __name__ == "__main__":
    words = read_words("text.txt")
    marge_box("~~.box", words)  # specify the box file generated by jTessBoxEditor
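
Because the script pairs words with box entries purely by position, it can help to confirm beforehand that the box file contains exactly one entry per character. The following is a minimal sketch of such a check; the helper name check_counts and the placeholder box lines are hypothetical and are not part of the original script.

```python
def check_counts(words, box_lines):
    """Return (expected, actual): characters in `words` vs. non-empty box lines."""
    expected = sum(len(w) for w in words)
    actual = sum(1 for line in box_lines if line.strip())
    return expected, actual


# Placeholder data for illustration; in practice, pass the lines of text.txt
# and the lines of the box file generated by jTessBoxEditor.
words = ["ab", "cde"]
box_lines = ["a 0 0 1 1 0", "b 1 0 2 1 0", "c 2 0 3 1 0", "d 3 0 4 1 0", "e 4 0 5 1 0"]

expected, actual = check_counts(words, box_lines)
if expected != actual:
    print(f"mismatch: {expected} characters but {actual} box entries")
else:
    print("counts match")
```

If the two numbers differ, every word after the first mismatch would be merged from the wrong entries, so it is safer to fix the input before overwriting the box file.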
