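Since attending a seminar on RPA (Robotic Process Automation), I have often been asked for advice on using it.
One key component is OCR (Optical Character Reader/Recognition).
How well does it recognize documents that mix Japanese kanji, katakana, hiragana, and the Latin alphabet?
Compared with text written only in an alphabet, such as English, the recognition rate is very different, and presumably lower.
Even compared with Chinese, which uses only Chinese characters, Japanese may be at a disadvantage because it adds katakana and hiragana.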
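Take the iroha ordering (イロハニ...): of the first four katakana, three have a kanji of nearly the same shape.
口 (kuchi, "mouth") and ロ (katakana ro)
八 (hachi, "eight") and ハ (katakana ha)
二 (ni, "two") and ニ (katakana ni)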
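To make the problem concrete, here is a minimal sketch using only the Python standard library. It prints the Unicode code point and name of each character in the three pairs above. The shapes are nearly identical, but each pair consists of two distinct characters, so an OCR engine has to choose between them. The names PAIRS and describe are my own, for illustration only.

```python
import unicodedata

# Look-alike pairs from the list above: (kanji, katakana).
PAIRS = [("口", "ロ"), ("八", "ハ"), ("二", "ニ")]


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for kanji, kana in PAIRS:
    print(f"{kanji} {describe(kanji)}  |  {kana} {describe(kana)}")
```

The kanji fall in the CJK Unified Ideographs block and the katakana in the Katakana block, so a plain string comparison never treats them as equal.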
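Hiragana could perhaps be pinned down by grammar, and katakana guessed from the fact that it tends to appear in consecutive runs; this may be feasible with machine learning as well.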
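The "consecutive run" idea can be sketched without any machine learning, as a simple post-processing rule: if a look-alike kanji sits next to a katakana character, assume it was meant to be katakana. This is only a rough heuristic of my own, not something Tesseract does; the table KANJI_TO_KATAKANA and the function names are hypothetical.

```python
# Hypothetical lookup table: kanji that OCR may confuse with a katakana.
KANJI_TO_KATAKANA = {
    "\u53e3": "\u30ed",  # kanji kuchi -> katakana ro
    "\u516b": "\u30cf",  # kanji hachi -> katakana ha
    "\u4e8c": "\u30cb",  # kanji ni    -> katakana ni
}


def is_katakana(ch: str) -> bool:
    """True for characters in the Unicode Katakana block (U+30A0 to U+30FF)."""
    return "\u30a0" <= ch <= "\u30ff"


def fix_lookalikes(text: str) -> str:
    """Swap a look-alike kanji for katakana when a neighbour is katakana."""
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch not in KANJI_TO_KATAKANA:
            continue
        # Neighbours are read from the original text, not the partly fixed one.
        prev_kana = i > 0 and is_katakana(text[i - 1])
        next_kana = i + 1 < len(text) and is_katakana(text[i + 1])
        if prev_kana or next_kana:
            chars[i] = KANJI_TO_KATAKANA[ch]
    return "".join(chars)
```

The rule produces false positives when a real kanji word is directly followed by a katakana word, so it is a starting point at best. A language model or dictionary lookup would be needed for anything reliable.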
Tesseract OCR
https://github.com/tesseract-ocr/tesseract
open-ocr
https://github.com/tleyden/open-ocr
Colaboratory
1 Select "Code"
2 Enter the command
3 Press the right-pointing triangle ▷
apt install tesseract-ocr
File "<ipython-input-1-075867a5190f>", line 1
apt install tesseract-ocr
^
SyntaxError: invalid syntax
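I forgot the leading !. In a Colaboratory code cell, a shell command must be prefixed with !; otherwise the line is parsed as Python.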
!apt install tesseract-ocr
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 13 not upgraded.
Need to get 4,795 kB of archives.
After this operation, 15.8 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr-eng all 4.00~git24-0e00fe6-1.2 [1,588 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr-osd all 4.00~git24-0e00fe6-1.2 [2,989 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr amd64 4.00~git2288-10f4998a-2 [218 kB]
Fetched 4,795 kB in 1s (4,294 kB/s)
Selecting previously unselected package tesseract-ocr-eng.
(Reading database ... 110851 files and directories currently installed.)
Preparing to unpack .../tesseract-ocr-eng_4.00~git24-0e00fe6-1.2_all.deb ...
Unpacking tesseract-ocr-eng (4.00~git24-0e00fe6-1.2) ...
Selecting previously unselected package tesseract-ocr-osd.
Preparing to unpack .../tesseract-ocr-osd_4.00~git24-0e00fe6-1.2_all.deb ...
Unpacking tesseract-ocr-osd (4.00~git24-0e00fe6-1.2) ...
Selecting previously unselected package tesseract-ocr.
Preparing to unpack .../tesseract-ocr_4.00~git2288-10f4998a-2_amd64.deb ...
Unpacking tesseract-ocr (4.00~git2288-10f4998a-2) ...
Setting up tesseract-ocr-osd (4.00~git24-0e00fe6-1.2) ...
Setting up tesseract-ocr-eng (4.00~git24-0e00fe6-1.2) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Setting up tesseract-ocr (4.00~git2288-10f4998a-2) ...
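After the install it may be worth confirming from Python that the tesseract executable is actually on the PATH and which languages it reports. A minimal sketch using only the standard library follows; the function names are my own. It relies on the --list-langs option of the tesseract command, and merges stderr into stdout because older versions print the list there.

```python
import shutil
import subprocess
from typing import List, Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if absent."""
    return shutil.which("tesseract")


def installed_languages() -> List[str]:
    """List the language codes reported by `tesseract --list-langs`."""
    exe = find_tesseract()
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some versions write the list to stderr
        universal_newlines=True,
        check=False,
    )
    # Skip blank lines and the header line, which ends with a colon.
    return [
        line.strip()
        for line in result.stdout.splitlines()
        if line.strip() and not line.rstrip().endswith(":")
    ]


print("tesseract:", find_tesseract() or "not found")
print("languages:", installed_languages())
```

Right after the apt install above, I would expect only eng and osd in the list, since those are the two data packages the log shows being installed.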
!apt install libtesseract-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
libleptonica-dev
The following NEW packages will be installed:
libleptonica-dev libtesseract-dev
0 upgraded, 2 newly installed, 0 to remove and 13 not upgraded.
Need to get 2,755 kB of archives.
After this operation, 13.8 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libleptonica-dev amd64 1.75.3-3 [1,308 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libtesseract-dev amd64 4.00~git2288-10f4998a-2 [1,447 kB]
Fetched 2,755 kB in 1s (2,905 kB/s)
Selecting previously unselected package libleptonica-dev.
(Reading database ... 110898 files and directories currently installed.)
Preparing to unpack .../libleptonica-dev_1.75.3-3_amd64.deb ...
Unpacking libleptonica-dev (1.75.3-3) ...
Selecting previously unselected package libtesseract-dev.
Preparing to unpack .../libtesseract-dev_4.00~git2288-10f4998a-2_amd64.deb ...
Unpacking libtesseract-dev (4.00~git2288-10f4998a-2) ...
Setting up libleptonica-dev (1.75.3-3) ...
Setting up libtesseract-dev (4.00~git2288-10f4998a-2) ...
!apt-get update
Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64 InRelease
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:4 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease [21.3 kB]
Ign:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 InRelease
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64 Release
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 Release
Get:9 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:12 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic/main amd64 Packages [27.3 kB]
Get:13 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [310 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [910 kB]
Get:15 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [3,451 B]
Get:16 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [140 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [638 kB]
Get:18 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [6,955 B]
Get:19 http://archive.ubuntu.com/ubuntu bionic-backports/universe amd64 Packages [3,666 B]
Fetched 2,312 kB in 2s (1,155 kB/s)
Reading package lists... Done
!pip install pyocr
Collecting pyocr
Downloading https://files.pythonhosted.org/packages/37/54/2d169a102a3727f3ebe535da9263babb88a5862516ae9a798a7e458399a6/pyocr-0.5.3.tar.gz
Requirement already satisfied: Pillow in /usr/local/lib/python3.6/dist-packages (from pyocr) (4.0.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from pyocr) (1.11.0)
Requirement already satisfied: olefile in /usr/local/lib/python3.6/dist-packages (from Pillow->pyocr) (0.46)
Building wheels for collected packages: pyocr
Running setup.py bdist_wheel for pyocr ... done
Stored in directory: /root/.cache/pip/wheels/ff/94/8e/dccadc6bce17c41a9dbb0c7ccd44acdb9dcc0edd9efa42eaf6
Successfully built pyocr
Installing collected packages: pyocr
Successfully installed pyocr-0.5.3
!curl -L https://github.com/tesseract-ocr/tessdata/raw/master/jpn.traineddata > jpn.traineddata
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 145 100 145 0 0 1132 0 --:--:-- --:--:-- --:--:-- 1141
100 34.0M 100 34.0M 0 0 20.1M 0 0:00:01 0:00:01 --:--:-- 35.4M
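The curl command above saves jpn.traineddata into the current directory, but Tesseract looks for language data in its tessdata directory. The sketch below copies the file there. The default path is an assumption on my part (the location I believe the Ubuntu 18.04 tesseract-ocr 4.00 package uses), so check it on your own machine; the function name install_traineddata is hypothetical.

```python
import os
import shutil

# Assumption: tessdata directory of the Ubuntu 18.04 tesseract-ocr 4.00 package.
DEFAULT_TESSDATA_DIR = "/usr/share/tesseract-ocr/4.00/tessdata"


def install_traineddata(src: str, tessdata_dir: str = DEFAULT_TESSDATA_DIR) -> str:
    """Copy a .traineddata file into tessdata_dir and return the new path."""
    if not os.path.isfile(src):
        raise FileNotFoundError(src)
    os.makedirs(tessdata_dir, exist_ok=True)
    dest = os.path.join(tessdata_dir, os.path.basename(src))
    shutil.copyfile(src, dest)
    return dest
```

On Colaboratory, where cells run as root, the call would be install_traineddata("jpn.traineddata"). Pointing the TESSDATA_PREFIX environment variable at a directory of your own should be an alternative to copying into the system directory.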
# Import modules
from PIL import Image
import sys
import pyocr
import pyocr.builders
# Check whether an OCR tool is available
tools = pyocr.get_available_tools()
if len(tools) == 0:
File "<ipython-input-14-e5795584e032>", line 1
if len(tools) == 0:
^
SyntaxError: unexpected EOF while parsing
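The error above occurs because the if line was run on its own: an if statement needs an indented body, so Python reaches the end of the cell while still expecting one. Below is a minimal sketch of how the check and the OCR call could be completed in a single cell. The helper names pick_tool and ocr_japanese and the image file name are hypothetical; the pyocr calls (get_available_tools, image_to_string, TextBuilder) are the library's own.

```python
def pick_tool(tools):
    """Return the first available OCR tool, or raise if the list is empty."""
    if len(tools) == 0:
        # This indented body is what the failing cell was missing.
        raise RuntimeError("No OCR tool found")
    return tools[0]


def ocr_japanese(image_path):
    """Read Japanese text from an image file with the first available tool."""
    # Imported here so that pick_tool can be tried without pyocr installed.
    from PIL import Image
    import pyocr
    import pyocr.builders

    tool = pick_tool(pyocr.get_available_tools())
    return tool.image_to_string(
        Image.open(image_path),
        lang="jpn",  # requires jpn.traineddata to be visible to Tesseract
        builder=pyocr.builders.TextBuilder(),
    )
```

With an image uploaded to the notebook, the call would look like print(ocr_japanese("sample.png")), where sample.png stands for whatever file you want to read.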
References
The surprisingly little-known OCR feature of "Google Drive": tested it for transcription and it looks quite usable
https://webkikaku.co.jp/blog/webservices/googledrive-ocr/
References @ Qiita
Running Tesseract-OCR on Colaboratory
https://qiita.com/m-hayashi/items/e2acc640fb436d09f128
[PyOCR] Extracting Japanese text data from images
https://qiita.com/mczkzk/items/393abc70836b9bde2f60
Building a simple OCR server with Go and tesseract
https://qiita.com/fumizp/items/63243cf418d27898f208
Document history
ver. 0.01 first draft
ver. 0.02 added Google Drive reference 20190807
Thank you very much for reading to the last sentence.
Please press the like icon 💚 and follow me for your happy life.