
Extracting Text from Images

Python

Posted at 2020-01-11

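I wrote a program that uses OCR to extract text from PDF and image files.
The idea came up while talking with a friend about automating assignments that arrive as PDFs, and while typing another friend's report into Word: using OCR seemed easier, so I decided to try writing a program with it.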

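How it works

1. Select PDF or image files
   tkinter.filedialog.askopenfilenames is used to select PDF, JPEG, and PNG files.
   Multiple files can be selected at once.
2. Convert PDFs to images
   OCR cannot be run directly on a PDF, so the PDF is first converted to images.
   This uses poppler and pdf2image.
3. Extract text from the images
   Text is extracted from the images converted from PDFs and from the images selected in step 1.
   This uses tesseract and PyOCR.
4. Write the result to <selected file name>.txt
   If the selected file is hoge.jpg, the output is written to hoge.txt.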
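
The naming rule in step 4 can be sketched as a small helper. The function name `output_path` and its `output_dir` parameter are illustrative assumptions and do not appear in the original program, which builds the name inline.

```python
import os


def output_path(source_path, output_dir='./output'):
    """Return the .txt path for a selected file (hypothetical helper)."""
    # 'hoge.jpg' -> 'hoge'; the directory part of source_path is dropped.
    stem, _ = os.path.splitext(os.path.basename(source_path))
    return os.path.join(output_dir, stem + '.txt')


print(output_path('/Users/Username/Desktop/hoge.jpg'))
```

Only the last extension is replaced, so a name such as report.v2.pdf would become report.v2.txt.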

Code

main.py

import os
import pyocr
import tkinter
from tkinter import filedialog
from pdf2image import convert_from_path
from PIL import Image


class UseOCR:

    def __init__(self):
        pyocr.tesseract.TESSERACT_CMD = '/usr/local/bin/tesseract'
        self.poppler_executable_path = '/usr/local/bin/'
        self.initialdir = '~/'
        self.extract_lang = 'jpn+eng'
        self.extension = [('pdf files', '*.pdf'),
                          ('jpeg file', '*.jpeg'),
                          ('jpg file', '*.jpg'),
                          ('png file', '*.png')]

    def askfilenames(self):
        root = tkinter.Tk()
        root.withdraw()
        path = filedialog.askopenfilenames(filetypes=self.extension, initialdir=self.initialdir)
        return path

    @staticmethod
    def get_fileinfo(path):
        basename = tuple(map(os.path.basename, path))
        fileinfo = dict(zip(basename, path))
        return fileinfo

    def pdf_to_image(self, pdf):
        image = convert_from_path(pdf, poppler_path=self.poppler_executable_path)
        return image

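    def image_to_text(self, image):
        # Use the first OCR tool PyOCR finds (tesseract in this setup).
        tool = pyocr.get_available_tools()[0]
        txt = tool.image_to_string(
            image,
            # Use the languages set in the constructor instead of hard-coding 'jpn'.
            lang=self.extract_lang,
            builder=pyocr.builders.TextBuilder()
        )
        return txt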


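if __name__ == '__main__':
    OCR = UseOCR()
    paths = OCR.askfilenames()
    fileinfo = OCR.get_fileinfo(paths)
    # Create the output directory up front so open() below does not fail.
    os.makedirs('./output', exist_ok=True)
    for basename, path in fileinfo.items():
        filename, extension = os.path.splitext(basename)
        if extension.lower() == '.pdf':
            # Note: only the first page of the PDF is processed.
            image = OCR.pdf_to_image(path)[0]
        else:
            image = Image.open(path)
        txt = OCR.image_to_text(image)
        # Write as UTF-8 so Japanese text is saved correctly on any platform.
        with open('./output/{}.txt'.format(filename), mode='w', encoding='utf-8') as f:
            f.write(txt)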

Explanation

The explanation below uses a PDF that contains the following text.
(The image shown is the PDF exported to JPG and cropped.)

Constructor

# Point pyocr's TESSERACT_CMD at the tesseract executable. Find it with: which tesseract
pyocr.tesseract.TESSERACT_CMD = '/usr/local/bin/tesseract'

# Directory containing the poppler tools, passed to convert_from_path(). Find it with: which pdfinfo
self.poppler_executable_path = '/usr/local/bin/'

# Directory shown when the tkinter file dialog opens
self.initialdir = '~/'

# Languages to recognize with OCR
self.extract_lang = 'jpn+eng'

# File extensions selectable in the tkinter file dialog
self.extension = [('pdf files', '*.pdf'),
                  ('jpeg file', '*.jpeg'),
                  ('jpg file', '*.jpg'),
                  ('png file', '*.png')]
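
The paths above are hard-coded for one machine. As a minimal sketch, assuming the tools are on PATH, the same values could be looked up at run time with the standard library's `shutil.which`, which mirrors the `which` command. The helper name `find_executable_dir` is hypothetical and is not part of the original program.

```python
import os
import shutil


def find_executable_dir(name):
    """Return the directory containing `name` on PATH, or None if not found."""
    found = shutil.which(name)
    return os.path.dirname(found) if found else None


# Full path to the executable, like `which tesseract` (None if not installed).
tesseract_cmd = shutil.which('tesseract')
# Directory holding the poppler tools, like the directory part of `which pdfinfo`.
poppler_dir = find_executable_dir('pdfinfo')
print(tesseract_cmd, poppler_dir)
```

Either value is None when the tool is not installed, so the constructor would still need a fallback or a clear error message in that case.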

askfilenames

Returns a tuple of the full paths of the files selected in the tkinter dialog.

>>> path = OCR.askfilenames()
>>> path
('/Users/Username/Desktop/hoge.pdf',)

get_fileinfo

Given a tuple of full paths, returns a dictionary that maps each file name to its full path.

>>> fileinfo = OCR.get_fileinfo(path)
>>> fileinfo
{'hoge.pdf': '/Users/Username/Desktop/hoge.pdf'}
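
Because the dictionary is keyed by file name alone, two selected files with the same name in different folders collapse into a single entry. The following is a minimal sketch of one way around this, keying by full path instead. `get_fileinfo_by_path` is a hypothetical variant, not the original method.

```python
import os


def get_fileinfo_by_path(paths):
    """Map each full path to its base name, so duplicate names stay distinct."""
    return {path: os.path.basename(path) for path in paths}


paths = ('/a/hoge.pdf', '/b/hoge.pdf')
# Keyed by base name (as in get_fileinfo): the second path overwrites the first.
print(dict(zip(map(os.path.basename, paths), paths)))
# Keyed by full path: both selections are kept.
print(get_fileinfo_by_path(paths))
```

Both files would still map to the output name hoge.txt, so a complete fix would also need distinct output names.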

pdf_to_image

Given the path to a PDF file, returns a list of PIL Image objects, one per page.
Both pdf2image and PyOCR depend on Pillow, so returning Image objects is easier to work with than writing image files to disk.

>>> for k,v in fileinfo.items():
...     image = OCR.pdf_to_image(v)
>>> image
[<PIL.PpmImagePlugin.PpmImageFile image mode=RGB size=1654x2339 at 0x10E1749E8>]
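
main.py uses only the first element of this list, so only the first page of a multi-page PDF is read. The following is a minimal sketch of running OCR on every page. `ocr_all_pages` and its `separator` parameter are hypothetical, and `ocr_page` stands in for a function such as `OCR.image_to_text`.

```python
def ocr_all_pages(pages, ocr_page, separator='\n\n'):
    """Run ocr_page on every page image and join the results in page order."""
    return separator.join(ocr_page(page) for page in pages)


# Stand-in demonstration that uses strings in place of Image objects.
fake_pages = ['page-1', 'page-2']
print(ocr_all_pages(fake_pages, lambda page: page.upper()))
```

With the class from main.py, this would be called as `ocr_all_pages(OCR.pdf_to_image(path), OCR.image_to_text)`.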

image_to_text

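This is the core OCR step.
Given a PIL Image object, it runs OCR and returns the recognized text. An image file has to be opened with Image.open first, as main.py does.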

>>> txt = OCR.image_to_text(image[0])
>>> txt
'テストtest文字0123'
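
`pyocr.get_available_tools()` returns an empty list when no OCR engine is found, in which case indexing it with `[0]` raises an IndexError. The following is a minimal sketch of a guard that fails with a clearer message. `pick_tool` is a hypothetical helper, and it takes the tool list as an argument so that the example runs without PyOCR installed.

```python
def pick_tool(tools):
    """Return the first available OCR tool, or raise a clear error if none."""
    if not tools:
        raise RuntimeError('No OCR tool found. Is tesseract installed and on PATH?')
    return tools[0]


# Intended use inside image_to_text (requires pyocr):
#     tool = pick_tool(pyocr.get_available_tools())
try:
    pick_tool([])
except RuntimeError as err:
    print(err)
```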
