
Extracting text from the decision-criteria image in Hyogo Prefecture's COVID-19 information

Last updated at 2021-01-17, posted at 2020-12-31

This article extracts the text from the decision-criteria image shown in Hyogo Prefecture's COVID-19 information.

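I assumed the part that reads 「現在は感染拡大特別期です」 ("We are currently in the special period of infection spread") was plain text, but it turns out to be an image.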

Scraping

import requests
from bs4 import BeautifulSoup

from urllib.parse import urljoin

url = "https://web.pref.hyogo.lg.jp/index.html"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
}

r = requests.get(url, headers=headers)
r.raise_for_status()

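soup = BeautifulSoup(r.content, "html.parser")

# The status banner is the first image inside a paragraph directly under the main content div
tag = soup.select_one("div#tmp_contents > p > img")
if tag is None:
    raise RuntimeError("status image not found; the page layout may have changed")

# Resolve the (possibly relative) src attribute against the page URL
link = urljoin(url, tag.get("src"))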

r = requests.get(link, headers=headers)
r.raise_for_status()

with open("alert.png", mode="wb") as fw:
    fw.write(r.content)
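
As a small aside on the urljoin step above: the src attribute of an img tag may be root-relative, page-relative, or already absolute, and urljoin handles all three cases. The sketch below illustrates this with hypothetical image paths; the "images/alert.png" and "example.com" values are assumptions for illustration, not the real paths used on the prefecture's site.

```python
from urllib.parse import urljoin

# Page URL from the scraping code above
page_url = "https://web.pref.hyogo.lg.jp/index.html"

# Hypothetical src values; the real attribute on the page may differ
candidates = [
    "/images/alert.png",              # root-relative
    "images/alert.png",               # relative to the page's directory
    "https://example.com/alert.png",  # already absolute
]

# urljoin resolves each src against the page URL
resolved = [urljoin(page_url, src) for src in candidates]

for src, full in zip(candidates, resolved):
    print(f"{src!r} -> {full}")
```

Because the page sits at the site root, both relative forms resolve to the same URL here, while the absolute URL is returned unchanged.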

OCR

Install tesseract-ocr

!add-apt-repository ppa:alex-p/tesseract-ocr -y
!apt update
!apt install tesseract-ocr
!apt install libtesseract-dev
!tesseract -v

!apt install tesseract-ocr-jpn  tesseract-ocr-jpn-vert
!apt install tesseract-ocr-script-jpan tesseract-ocr-script-jpan-vert
!tesseract --list-langs
!pip install pytesseract

Extract text from the image

import pytesseract

import cv2
import numpy as np

from google.colab.patches import cv2_imshow

# A black border remains around the edges, so crop 10 px from each side
img_bgr = cv2.imread("alert.png")[10:-10, 10:-10]

# Convert to grayscale
img_gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)

# Check the color (BGR value of the pixel at row 10, column 10)
img_bgr[10, 10]

# Inspect the image
cv2_imshow(img_gray)

# Count dark and light pixels
black = np.sum(img_gray < 151)
white = np.sum(img_gray > 150)

# Check whether white or black pixels dominate; invert if black dominates
if white < black:
    ret, thresh = cv2.threshold(img_gray, 150, 255, cv2.THRESH_BINARY_INV)
else:
    ret, thresh = cv2.threshold(img_gray, 150, 255, cv2.THRESH_BINARY)

# Inspect the image
cv2_imshow(thresh)

txt = pytesseract.image_to_string(thresh, lang="jpn", config="--psm 6").strip()

txt
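
The thresholding step above picks between THRESH_BINARY and THRESH_BINARY_INV so that the dominant (background) color always ends up white and the text ends up black. The following is a minimal NumPy-only sketch of that same decision, run on made-up 4x4 arrays rather than the real alert.png. The function name binarize_dark_on_light is hypothetical, and the threshold of 150 is simply carried over from the code above.

```python
import numpy as np


def binarize_dark_on_light(gray: np.ndarray, thresh: int = 150) -> np.ndarray:
    """Binarize a grayscale image so the majority (background) color becomes white."""
    light = gray > thresh  # True where the pixel is brighter than the threshold
    if light.sum() < (~light).sum():
        # Mostly dark image: flip the mask so the dark background maps to white
        light = ~light
    return np.where(light, 255, 0).astype(np.uint8)


# Synthetic examples (made-up data, not the real image)
dark_bg = np.full((4, 4), 30, dtype=np.uint8)
dark_bg[1, 1] = 220  # one light "text" pixel on a dark background

light_bg = np.full((4, 4), 220, dtype=np.uint8)
light_bg[1, 1] = 30  # one dark "text" pixel on a light background

out_dark = binarize_dark_on_light(dark_bg)
out_light = binarize_dark_on_light(light_bg)

print(out_dark)
print(out_light)
```

Both inputs produce the same result, a white background with a single black pixel, which is the polarity the OCR call above receives regardless of the banner's color scheme.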
