
Extracting text from the judgment-criteria image in Hyogo Prefecture's COVID-19 information

Posted at 2020-12-31

Extract the text from the judgment-criteria image in Hyogo Prefecture's COVID-19 information

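The "現在は感染拡大特別期です" ("We are currently in the special period of infection spread") part looked like text, but it turns out to be an image.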

[Screenshot: Hyogo Prefecture top page for emergencies, 2020-12-31]

Scraping

import requests
from bs4 import BeautifulSoup

from urllib.parse import urljoin

url = "https://web.pref.hyogo.lg.jp/index.html"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
}

r = requests.get(url, headers=headers)
r.raise_for_status()

soup = BeautifulSoup(r.content, "html.parser")

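# First <img> directly inside a <p> in div#tmp_contents (the alert image)
tag = soup.select_one("div#tmp_contents > p > img")

# select_one returns None when nothing matches (e.g. the page layout changed)
if tag is None:
    raise RuntimeError("Alert image not found on the page")

link = urljoin(url, tag.get("src"))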

r = requests.get(link, headers=headers)
r.raise_for_status()

with open("alert.png", mode="wb") as fw:
    fw.write(r.content)
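
The `urljoin` call matters because the `src` attribute of an `img` tag is often a relative path rather than a full URL. The following is a minimal sketch of how different forms of `src` resolve against the page URL. The helper name `resolve_src` and the image paths are hypothetical and are not taken from the actual page.

```python
from urllib.parse import urljoin

# Page URL used in the scraping step
PAGE_URL = "https://web.pref.hyogo.lg.jp/index.html"


def resolve_src(src, page_url=PAGE_URL):
    """Resolve an <img> src attribute against the URL of the page it came from."""
    return urljoin(page_url, src)


# Hypothetical src values, for illustration only
print(resolve_src("images/alert.png"))           # relative to the page's directory
print(resolve_src("/images/alert.png"))          # relative to the site root
print(resolve_src("https://example.com/a.png"))  # already absolute: returned unchanged
```

Because an absolute `src` passes through unchanged, the same call works whichever form the page happens to use.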

OCR

Install tesseract-ocr (the commands below are run from a notebook, hence the `!` prefix)

!add-apt-repository ppa:alex-p/tesseract-ocr -y
!apt update
!apt install tesseract-ocr
!apt install libtesseract-dev
!tesseract -v

!apt install tesseract-ocr-jpn tesseract-ocr-jpn-vert
!apt install tesseract-ocr-script-jpan tesseract-ocr-script-jpan-vert
!tesseract --list-langs
!pip install pytesseract
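
Before running the OCR, it can help to confirm from Python that the `jpn` language data was actually installed. The following is a minimal sketch that parses the output of `tesseract --list-langs`. The helper names `parse_langs` and `installed_langs` are hypothetical, and the parser assumes the output is one header line followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(output):
    """Parse `tesseract --list-langs` output into a set of language codes.

    Assumes the first non-empty line is a header ("List of available
    languages ...") and each following non-empty line is one language code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return set(lines[1:])


def installed_langs():
    """Return installed language codes, or an empty set if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return set()
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Fall back to stderr in case the list was printed there
    return parse_langs(result.stdout or result.stderr)


if "jpn" not in installed_langs():
    print("Japanese language data (jpn) not found; install tesseract-ocr-jpn")
```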

Extract text from the image

import pytesseract

import cv2
import numpy as np

from google.colab.patches import cv2_imshow

# A black border remains around the edge, so crop 10 px from each side
img_bgr = cv2.imread("alert.png")[10:-10, 10:-10]

# Convert to grayscale
img_gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)

# Check the color (BGR) of one pixel
img_bgr[10, 10]

# Check the image
cv2_imshow(img_gray)

# Count dark (<= 150) and light (> 150) pixels
black = np.sum(img_gray < 151)
white = np.sum(img_gray > 150)

# Check whether white or black is in the majority; if black, invert
if white < black:
    ret, thresh = cv2.threshold(img_gray, 150, 255, cv2.THRESH_BINARY_INV)

else:
    ret, thresh = cv2.threshold(img_gray, 150, 255, cv2.THRESH_BINARY)

# Check the image
cv2_imshow(thresh)

# OCR with the Japanese model; --psm 6 treats the image as a single uniform block of text
txt = pytesseract.image_to_string(thresh, lang="jpn", config="--psm 6").strip()

txt
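
The white/black count in the code above picks the threshold polarity so that, assuming the text covers fewer pixels than the background, the result has dark text on a light background whichever way the source image is colored. A NumPy-only sketch of the same decision follows. `binarize_dark_on_light` is a hypothetical helper that mirrors `cv2.THRESH_BINARY` and `cv2.THRESH_BINARY_INV` with a maximum value of 255; it illustrates the logic and is not a replacement for the OpenCV calls.

```python
import numpy as np


def binarize_dark_on_light(gray, threshold=150):
    """Binarize a grayscale image so that the majority (background) becomes white.

    Pixels above `threshold` count as light, the rest as dark. If dark pixels
    are in the majority, the output is inverted.
    """
    gray = np.asarray(gray)
    light = gray > threshold
    n_light = int(light.sum())
    n_dark = light.size - n_light
    if n_light < n_dark:
        # Dark background with light text: invert (like THRESH_BINARY_INV)
        out = np.where(light, 0, 255)
    else:
        # Light background with dark text: keep as is (like THRESH_BINARY)
        out = np.where(light, 255, 0)
    return out.astype(np.uint8)


# Synthetic 4x4 example: dark background (30) with one light "text" pixel (220)
sample = np.full((4, 4), 30, dtype=np.uint8)
sample[1, 2] = 220
print(binarize_dark_on_light(sample))
```

In the synthetic example the single light pixel ends up black and the dark background ends up white, which is the polarity passed to the OCR step.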