How to Install and Use Tesseract-OCR

Posted at 2020-05-12

Installing Tesseract-OCR

https://gammasoft.jp/blog/tesseract-ocr-install-on-windows/
・Run tesseract-ocr-w64-setup-v5.0.0-alpha.20200223.exe
・Additional script data (download): check Japanese script and Japanese vertical script
・Additional language data (download): check Japanese and Japanese (vertical); Javanese is a separate language and is not needed for Japanese OCR
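
To confirm that the Japanese data selected in the installer was actually installed, one option is to ask Tesseract for its language list. The sketch below is a minimal check, assuming the `tesseract` command is already reachable from the shell; the helper names `parse_language_list` and `installed_languages` are hypothetical and not part of the original article.

```python
import subprocess


def parse_language_list(output):
    """Extract language codes from `tesseract --list-langs` output.

    The header line ("List of available languages ...") is skipped.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages():
    """Run `tesseract --list-langs` and return the reported language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Look at both streams in case the list is printed to stderr
    return parse_language_list(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    try:
        langs = installed_languages()
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("Could not run tesseract:", exc)
    else:
        print("jpn installed:", "jpn" in langs)
        print("jpn_vert installed:", "jpn_vert" in langs)
```

If `jpn` is missing from the result, rerunning the installer and ticking the Japanese entries should add it.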

https://poppler.freedesktop.org/
・Download the poppler folder

Setting the environment variables

・Tesseract-OCR
・poppler-0.67.0\bin
 Add both of the folders above to PATH.
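
A quick way to verify the PATH change is to look up the executables from Python. This is a minimal sketch: `tesseract` is the Tesseract command, `pdftoppm` is one of the poppler tools that pdf2image calls, and the function name `find_missing_tools` is hypothetical.

```python
import shutil


def find_missing_tools(commands=("tesseract", "pdftoppm")):
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm are both on PATH")
```

Open a new terminal before running the check, since terminals that were already open keep the old PATH.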

Writing the code (OCR tool and PDF-to-image conversion)

import os
from PIL import Image
from matplotlib import pyplot as plt
import cv2
from pdf2image import convert_from_path
import pyocr
import pyocr.builders
import sys
import pandas as pd
import time
import numpy as np
import glob
import shutil
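# The OCR routine itself
def OCR_read(PIL_data):
    """Run OCR on a PIL image and return the recognized text as one compact string."""
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)

    # Use the first available tool
    tool = tools[0]

    # Specify the OCR target, language, and options here
    txt = tool.image_to_string(
        PIL_data,
        lang='jpn',
        # Layout 6 treats the image as a single uniform block of text
        builder=pyocr.builders.TextBuilder(tesseract_layout=6)
    )

    # Strip spaces, line breaks, and stray vertical bars from the result
    txt1 = txt.replace(' ', '').replace('\n', '').replace('|', '')
    return txt1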
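# Convert a PDF file to images, one PNG per page
def pdftoimage(work_directory, path1):
    images = convert_from_path(path1)
    output_dir = os.path.join(work_directory, "Output_folder")
    os.makedirs(output_dir, exist_ok=True)  # create the output folder if it is missing
    for i, image in enumerate(images):
        print("Making work{}.png ...".format(i))
        image.save(os.path.join(output_dir, "work{}.png".format(i)))
    # Return the number of page images written
    return len(images)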
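
The two functions above can be chained: convert the PDF to page images, then run OCR on each page. The sketch below shows one possible way to wire them together. The helper names `page_image_paths` and `ocr_pages` and the file name `input.pdf` are hypothetical; the `Output_folder/work{}.png` naming follows `pdftoimage`. The OCR step is passed in as a callable so the helpers themselves do not depend on Tesseract.

```python
import os


def page_image_paths(work_directory, page_count):
    """Rebuild the file names that pdftoimage() writes: work0.png, work1.png, ..."""
    return [
        os.path.join(work_directory, "Output_folder", "work{}.png".format(i))
        for i in range(page_count)
    ]


def ocr_pages(paths, read_image):
    """Apply read_image (any callable: path -> text) to every page, in order."""
    return [read_image(path) for path in paths]


# Hypothetical wiring with the article's functions
# (needs Tesseract, poppler, pyocr, pdf2image, and Pillow):
#
#   from PIL import Image
#   page_count = pdftoimage(work_directory, "input.pdf")
#   texts = ocr_pages(
#       page_image_paths(work_directory, page_count),
#       lambda path: OCR_read(Image.open(path)),
#   )
```

Each entry of `texts` would then hold the recognized text of one page, in page order.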