
How to install and use Tesseract-OCR

Posted at 2020-05-12, last updated at 2020-05-13

Installing Tesseract-OCR

・https://gammasoft.jp/blog/tesseract-ocr-install-on-windows/
・Run tesseract-ocr-w64-setup-v5.0.0-alpha.20200223.exe
・Additional script data (download): check Japanese script and Japanese vertical script
・Additional language data (download): check Javanese, Japanese, and Japanese (vertical)
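
Once the installer has finished, it can be useful to confirm that the Japanese data was actually installed. The following is a minimal sketch, not part of the original article: it calls the `tesseract --list-langs` command and parses its output. The function names `parse_language_list` and `installed_languages` are made up for this example, and the header line format is an assumption based on typical Tesseract output.

```python
import shutil
import subprocess


def parse_language_list(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header
        # (assumed header format; adjust if your build prints something else).
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_languages(exe="tesseract"):
    """Return the installed language codes, or None if `exe` cannot be found."""
    path = shutil.which(exe)
    if path is None:
        return None
    result = subprocess.run([path, "--list-langs"], capture_output=True, text=True)
    # Fall back to stderr in case a build does not print the list on stdout.
    return parse_language_list(result.stdout or result.stderr)
```

If `installed_languages()` returns a list that contains `jpn`, the `lang='jpn'` setting used later should be available. It returns `None` while Tesseract is not yet on PATH; in that case the full path to tesseract.exe can be passed as `exe`.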

・https://poppler.freedesktop.org/
・Download the poppler folder

Setting environment variables

・Tesseract-OCR
・poppler-0.67.0\bin
　Add the two folders above to PATH
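
A quick way to check the PATH setting is to ask Python whether it can find the executables. This is a small sketch under the assumption that `tesseract` (from Tesseract-OCR) and `pdftoppm` (from the poppler `bin` folder, used by pdf2image) are the commands that matter; `missing_commands` is a hypothetical helper name.

```python
import shutil


def missing_commands(commands=("tesseract", "pdftoppm")):
    """Return the command names that cannot be found on PATH."""
    return [name for name in commands if shutil.which(name) is None]


# An empty list means both tools are reachable from Python.
print(missing_commands())
```

If a name is reported as missing, open a new terminal after editing PATH and run the check again, since already running processes keep the old value.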

Writing the code (OCR tool and PDF conversion)

import os
from PIL import Image
from matplotlib import pyplot as plt
import cv2
from pdf2image import convert_from_path
import pyocr
import pyocr.builders
import sys
import pandas as pd
import time
import numpy as np
import glob
import shutil
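# The OCR routine itself
def OCR_read(PIL_data):
    # Use the first OCR tool that pyocr detects (Tesseract must be on PATH)
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)

    tool = tools[0]

    txt = tool.image_to_string(  # specify the OCR target, language, and options here
        PIL_data,
        lang='jpn',
        builder=pyocr.builders.TextBuilder(tesseract_layout=6)
    )

    # Strip spaces, newlines, and stray '|' characters from the recognized text
    txt1 = txt.replace(' ', '').replace('\n', '').replace('|', '')
    return txt1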
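# Convert a PDF file to images
def pdftoimage(work_directory, path1):
    images = convert_from_path(path1)
    output_dir = work_directory + "/Output_folder/"
    os.makedirs(output_dir, exist_ok=True)  # create the output folder if it is missing
    i = 0
    for image in images:
        print("Making work{}.png ...".format(i))
        image.save(output_dir + "work{}.png".format(i))
        i += 1
    imax = i  # number of page images written
    return imax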
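
The two functions above can be combined to OCR every page of a PDF. The following is only a sketch of one possible way to do that, not the author's own continuation: `page_image_paths` and `ocr_pages` are hypothetical helpers, and they assume the `Output_folder/work{i}.png` naming used by `pdftoimage`.

```python
import os


def page_image_paths(work_directory, imax):
    """Build the paths of the page images written by pdftoimage, in page order."""
    return [
        os.path.join(work_directory, "Output_folder", "work{}.png".format(i))
        for i in range(imax)
    ]


def ocr_pages(work_directory, imax, ocr):
    """Open each page image and pass it to `ocr` (for example OCR_read)."""
    # Imported here so that the path helper above works without Pillow installed.
    from PIL import Image

    texts = []
    for path in page_image_paths(work_directory, imax):
        with Image.open(path) as img:
            texts.append(ocr(img))
    return texts
```

With the definitions from this article in scope, a call such as `ocr_pages(work_directory, pdftoimage(work_directory, path1), OCR_read)` would return one string per page. Building the paths from `imax` rather than listing the folder keeps the pages in numeric order, whereas a plain sorted file listing would put `work10.png` before `work2.png`.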
