
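Since attending a seminar on RPA (Robotic Process Automation), I am often asked for advice on using it.

One of the keys is OCR (Optical Character Reader/Recognition).

How well does it recognize documents that mix Japanese kanji, katakana, hiragana, and the Latin alphabet?

Compared with alphabet-only text such as English, I expect the recognition rate to be far lower.

Even compared with Chinese, which uses only kanji, Japanese may be at a disadvantage because it also has katakana and hiragana.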

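Take the iroha order: of its first four characters (イ, ロ, ハ, ニ), three are katakana whose shapes closely resemble a kanji.

口 (kanji, kuchi, "mouth") and ロ (katakana ro)
八 (kanji, hachi, "eight") and ハ (katakana ha)
二 (kanji, "two") and ニ (katakana ni)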
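
To make the three pairs concrete, the sketch below prints the Unicode code point and official character name of each one. The pairs are written as escape sequences because the glyphs are hard to tell apart on screen; PAIRS and describe are names introduced here for illustration only.

```python
import unicodedata

# (kanji, katakana) lookalike pairs, written as escapes to avoid mix-ups.
PAIRS = [
    ("\u53e3", "\u30ed"),  # kanji "mouth" / katakana RO
    ("\u516b", "\u30cf"),  # kanji "eight" / katakana HA
    ("\u4e8c", "\u30cb"),  # kanji "two"   / katakana NI
]


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for kanji, kana in PAIRS:
    print(kanji, describe(kanji), "|", kana, describe(kana))
```

Each pair looks almost identical but sits in a different Unicode block, so an OCR engine has to choose between two unrelated code points from nearly the same shape.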

Hiragana could be pinned down by grammar, and katakana inferred from runs of consecutive characters; machine learning might be able to do the same.
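
The "runs of consecutive katakana" idea can be sketched as a simple rule, with no machine learning involved: if one of the lookalike kanji sits directly next to a katakana character, treat it as the katakana. This is only an illustration of the idea, not something Tesseract or pyocr does; KANJI_TO_KATAKANA, is_katakana, and fix_lookalikes are hypothetical names.

```python
# Lookalike kanji mapped to the katakana they resemble (escapes avoid mix-ups).
KANJI_TO_KATAKANA = {
    "\u53e3": "\u30ed",  # kanji "mouth" -> katakana RO
    "\u516b": "\u30cf",  # kanji "eight" -> katakana HA
    "\u4e8c": "\u30cb",  # kanji "two"   -> katakana NI
}


def is_katakana(ch):
    """Return True for characters in the katakana block (U+30A0..U+30FF)."""
    return "\u30a0" <= ch <= "\u30ff"


def fix_lookalikes(text):
    """Replace a lookalike kanji with katakana when a neighbor is katakana."""
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch not in KANJI_TO_KATAKANA:
            continue
        # Neighbors are read from the original text, so fixes do not cascade.
        prev_kata = i > 0 and is_katakana(text[i - 1])
        next_kata = i + 1 < len(text) and is_katakana(text[i + 1])
        if prev_kata or next_kata:
            chars[i] = KANJI_TO_KATAKANA[ch]
    return "".join(chars)


# A misread "program": the second character is the kanji, not the katakana.
print(fix_lookalikes("プ\u53e3グラム"))
# The kanji between a kanji and a hiragana is left alone.
print(fix_lookalikes("人\u53e3が多い"))
```

The rule is deliberately crude. The long-vowel mark ー is also in the katakana block, so a phrase such as コーヒー二杯 ("two cups of coffee") would have its 二 wrongly turned into ニ; a real system would need vocabulary or grammar to decide.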

Tesseract OCR
https://github.com/tesseract-ocr/tesseract

open-ocr
https://github.com/tleyden/open-ocr

Colaboratory

1. Select "Code".
2. Enter a command.
3. Press the right-pointing triangle ▷.

apt install tesseract-ocr

  File "<ipython-input-1-075867a5190f>", line 1
    apt install tesseract-ocr
              ^
SyntaxError: invalid syntax

I had forgotten the leading !, which a shell command needs in a Colaboratory cell.
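
Without the !, the cell is parsed as Python, which is why apt install was reported as a SyntaxError. For comparison, here is a rough plain-Python equivalent of handing a line to the shell, using the standard subprocess module. run_shell is a name made up for this sketch, and echo stands in for a real command.

```python
import subprocess


def run_shell(command):
    """Run a command line through the shell and return its standard output."""
    result = subprocess.run(
        command,
        shell=True,           # let the shell parse the line, as ! does
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


print(run_shell("echo hello from the shell"))
```

In a notebook the ! form is shorter and streams the output as it arrives, so it stays the practical choice for the installs that follow.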

!apt install tesseract-ocr
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 13 not upgraded.
Need to get 4,795 kB of archives.
After this operation, 15.8 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr-eng all 4.00~git24-0e00fe6-1.2 [1,588 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr-osd all 4.00~git24-0e00fe6-1.2 [2,989 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr amd64 4.00~git2288-10f4998a-2 [218 kB]
Fetched 4,795 kB in 1s (4,294 kB/s)
Selecting previously unselected package tesseract-ocr-eng.
(Reading database ... 110851 files and directories currently installed.)
Preparing to unpack .../tesseract-ocr-eng_4.00~git24-0e00fe6-1.2_all.deb ...
Unpacking tesseract-ocr-eng (4.00~git24-0e00fe6-1.2) ...
Selecting previously unselected package tesseract-ocr-osd.
Preparing to unpack .../tesseract-ocr-osd_4.00~git24-0e00fe6-1.2_all.deb ...
Unpacking tesseract-ocr-osd (4.00~git24-0e00fe6-1.2) ...
Selecting previously unselected package tesseract-ocr.
Preparing to unpack .../tesseract-ocr_4.00~git2288-10f4998a-2_amd64.deb ...
Unpacking tesseract-ocr (4.00~git2288-10f4998a-2) ...
Setting up tesseract-ocr-osd (4.00~git24-0e00fe6-1.2) ...
Setting up tesseract-ocr-eng (4.00~git24-0e00fe6-1.2) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Setting up tesseract-ocr (4.00~git2288-10f4998a-2) ...
!apt install libtesseract-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libleptonica-dev
The following NEW packages will be installed:
  libleptonica-dev libtesseract-dev
0 upgraded, 2 newly installed, 0 to remove and 13 not upgraded.
Need to get 2,755 kB of archives.
After this operation, 13.8 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libleptonica-dev amd64 1.75.3-3 [1,308 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libtesseract-dev amd64 4.00~git2288-10f4998a-2 [1,447 kB]
Fetched 2,755 kB in 1s (2,905 kB/s)
Selecting previously unselected package libleptonica-dev.
(Reading database ... 110898 files and directories currently installed.)
Preparing to unpack .../libleptonica-dev_1.75.3-3_amd64.deb ...
Unpacking libleptonica-dev (1.75.3-3) ...
Selecting previously unselected package libtesseract-dev.
Preparing to unpack .../libtesseract-dev_4.00~git2288-10f4998a-2_amd64.deb ...
Unpacking libtesseract-dev (4.00~git2288-10f4998a-2) ...
Setting up libleptonica-dev (1.75.3-3) ...
Setting up libtesseract-dev (4.00~git2288-10f4998a-2) ...
!apt-get update
Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64  InRelease
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:4 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease [21.3 kB]
Ign:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64  InRelease
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64  Release
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64  Release
Get:9 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:12 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic/main amd64 Packages [27.3 kB]
Get:13 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [310 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [910 kB]
Get:15 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [3,451 B]
Get:16 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [140 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [638 kB]
Get:18 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [6,955 B]
Get:19 http://archive.ubuntu.com/ubuntu bionic-backports/universe amd64 Packages [3,666 B]
Fetched 2,312 kB in 2s (1,155 kB/s)
Reading package lists... Done

!pip install pyocr
Collecting pyocr
  Downloading https://files.pythonhosted.org/packages/37/54/2d169a102a3727f3ebe535da9263babb88a5862516ae9a798a7e458399a6/pyocr-0.5.3.tar.gz
Requirement already satisfied: Pillow in /usr/local/lib/python3.6/dist-packages (from pyocr) (4.0.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from pyocr) (1.11.0)
Requirement already satisfied: olefile in /usr/local/lib/python3.6/dist-packages (from Pillow->pyocr) (0.46)
Building wheels for collected packages: pyocr
  Running setup.py bdist_wheel for pyocr ... done
  Stored in directory: /root/.cache/pip/wheels/ff/94/8e/dccadc6bce17c41a9dbb0c7ccd44acdb9dcc0edd9efa42eaf6
Successfully built pyocr
Installing collected packages: pyocr
Successfully installed pyocr-0.5.3

!curl -L https://github.com/tesseract-ocr/tessdata/raw/master/jpn.traineddata > jpn.traineddata
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   145  100   145    0     0   1132      0 --:--:-- --:--:-- --:--:--  1141
100 34.0M  100 34.0M    0     0  20.1M      0  0:00:01  0:00:01 --:--:-- 35.4M
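
The curl command saves jpn.traineddata in the current directory, but Tesseract looks for language data in its own tessdata directory. The sketch below copies the file there. The directory path is an assumption about where the Ubuntu tesseract-ocr 4.00 package keeps its data and should be checked on the actual machine; install_traineddata is a hypothetical helper name.

```python
import shutil
from pathlib import Path

# Assumption: language data location of the Ubuntu tesseract-ocr 4.00 package.
ASSUMED_TESSDATA_DIR = Path("/usr/share/tesseract-ocr/4.00/tessdata")


def install_traineddata(src, tessdata_dir=ASSUMED_TESSDATA_DIR):
    """Copy a .traineddata file into tessdata_dir and return the new path."""
    src = Path(src)
    tessdata_dir = Path(tessdata_dir)
    if src.suffix != ".traineddata":
        raise ValueError(f"not a traineddata file: {src}")
    if not tessdata_dir.is_dir():
        raise FileNotFoundError(f"tessdata directory not found: {tessdata_dir}")
    # shutil.copy returns the destination path as a string.
    return Path(shutil.copy(src, tessdata_dir / src.name))
```

In a Colaboratory cell, calling install_traineddata("jpn.traineddata") and then running !tesseract --list-langs should show whether jpn is now listed.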
# Import modules
from PIL import Image
import sys
import pyocr
import pyocr.builders
# Check whether an OCR tool is available
tools = pyocr.get_available_tools()
if len(tools) == 0:
  File "<ipython-input-14-e5795584e032>", line 1
    if len(tools) == 0:
                       ^
SyntaxError: unexpected EOF while parsing
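
The cell above stops with a SyntaxError because the if statement has no body. A completed version might look like the sketch below. It assumes pyocr and Pillow are installed, that a Tesseract binary with the jpn language data is available, and that the reader supplies an image path (the original does not name one). pick_tool and recognize are names introduced here for illustration.

```python
def pick_tool(tools):
    """Return the first available OCR tool, or raise if the list is empty."""
    if len(tools) == 0:
        raise RuntimeError("No OCR tool found")
    return tools[0]


def recognize(image_path, lang="jpn"):
    """Run OCR on image_path and return the recognized text."""
    # Imported here so that pick_tool can be used without these packages.
    from PIL import Image
    import pyocr
    import pyocr.builders

    tool = pick_tool(pyocr.get_available_tools())
    return tool.image_to_string(
        Image.open(image_path),
        lang=lang,
        builder=pyocr.builders.TextBuilder(),
    )
```

A call such as print(recognize("sample.png")) would then print the recognized text, where sample.png is a placeholder for an image uploaded to the notebook.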

References

A little-known feature: I tested transcription with the Google Drive OCR function, and it looked quite usable, so here is an introduction
https://webkikaku.co.jp/blog/webservices/googledrive-ocr/

References @ Qiita

Running Tesseract-OCR on Colaboratory
https://qiita.com/m-hayashi/items/e2acc640fb436d09f128

[PyOCR] Extracting Japanese text data from images
https://qiita.com/mczkzk/items/393abc70836b9bde2f60

Building a simple OCR server with Go and tesseract
https://qiita.com/fumizp/items/63243cf418d27898f208

Document history

ver. 0.01 First draft
ver. 0.02 Added Google Drive reference, 20190807

Thank you very much for reading to the last sentence.

Please press the like icon 💚 and follow me for your happy life.
