
Download the Google logo → convert it to text with OCR → display it in HTML

Posted at 2020-04-02, last updated at 2020-04-02

Overview

This article converts the logo on the Google Search home page into text and displays that text in an HTML page.

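Application

One application: take English-language books that are published online as images, compile them into HTML with this method, and read them in Japanese using Chrome's page translation feature.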
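
As a rough illustration of that application, the sketch below joins the OCR text of several page images into a single HTML document. The function name `pages_to_html` and its parameters are hypothetical, and it uses only the standard library rather than the Jinja2 template shown later in this article. It assumes each page's text is already available as a string.

```python
from html import escape


def pages_to_html(pages, title="OCR result", lang="en"):
    """Join OCR text from several page images into one HTML document.

    `pages` is a list of strings, one per page image (hypothetical input).
    """
    sections = []
    for number, text in enumerate(pages, start=1):
        # Blank lines in the OCR output are treated as paragraph breaks.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
        sections.append(f'  <section id="page-{number}">\n{body}\n  </section>')
    return (
        "<!DOCTYPE html>\n"
        f'<html lang="{escape(lang)}">\n'
        f'<head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        "<body>\n" + "\n".join(sections) + "\n</body>\n</html>\n"
    )
```

Setting the `lang` attribute to the language of the source text is a design choice here, made on the assumption that it helps the browser recognize what language the page is in.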

Steps

1. Scrape the Google Search home page to get the URL of the Google logo image, then download the image.
2. Run OCR on the logo image to convert it to text.
3. Display that text in an HTML page.

Install the libraries beforehand

bash

# For step 1
pip install requests beautifulsoup4

# For step 2
brew install tesseract
pip install pillow pyocr

# For step 3
pip install jinja2
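
Before running the steps, it can help to confirm that everything installed correctly. The following is a minimal sketch using only the standard library. The function name `missing_prerequisites` is hypothetical. The module names are the import names used later in this article, and `tesseract` is assumed to be the command name that the Homebrew package puts on the `PATH`.

```python
import importlib.util
import shutil


def missing_prerequisites(
    modules=("requests", "bs4", "PIL", "pyocr", "jinja2"),
    commands=("tesseract",),
):
    """Return the names of Python modules and commands that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when a command is not on the PATH.
    missing += [c for c in commands if shutil.which(c) is None]
    return missing
```

Calling `missing_prerequisites()` returns an empty list when every module and command was found.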

Run

Step 1: Download the logo image

python

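import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Fetch the HTML of the Google Search home page
url = 'https://www.google.com'
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, 'html.parser')

# Find the logo image (the id 'hplogo' may change if Google updates its markup)
img = soup.find('img', {'id': 'hplogo'})
if img is None:
    raise RuntimeError('logo <img id="hplogo"> not found')

# Build the absolute URL of the image (src may be a relative path)
img_url = urljoin(url, img['src'])

# Download the image
r = requests.get(img_url)
r.raise_for_status()

# Save the image
with open('hplogo.jpg', 'wb') as file:
    file.write(r.content)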
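
Step 1 always saves the download as `hplogo.jpg`, whatever format the server actually returns. As a hedged sketch, the helpers below resolve the `src` attribute against the page URL and pick a file extension from the response's `Content-Type` header. The names `resolve_image_url` and `filename_for`, the extension table, and the `/images/logo.png` path in the usage note are all illustrative assumptions, not part of the original script.

```python
from urllib.parse import urljoin

# Small, explicit mapping of common image MIME types (an illustrative subset).
EXTENSIONS = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "image/webp": ".webp",
    "image/svg+xml": ".svg",
}


def resolve_image_url(page_url, src):
    """Turn a relative, protocol-relative, or absolute src into an absolute URL."""
    return urljoin(page_url, src)


def filename_for(basename, content_type, default_ext=".img"):
    """Choose a file name from a Content-Type header value such as 'image/png'."""
    # Drop parameters like '; charset=...' and normalize case.
    mime = content_type.split(";")[0].strip().lower()
    return basename + EXTENSIONS.get(mime, default_ext)
```

For example, `resolve_image_url('https://www.google.com', '/images/logo.png')` gives `https://www.google.com/images/logo.png`, and `filename_for('hplogo', r.headers.get('Content-Type', ''))` could replace the hard-coded file name. Step 2 would then need to open the same name.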

Step 2: Convert the logo image to text with OCR

python

from PIL import Image
import pyocr
import pyocr.builders

# Setup 1: use the first available OCR tool
tools = pyocr.get_available_tools()
if not tools:
    raise RuntimeError('No OCR tool found')
tool = tools[0]

# Setup 2: builder that returns plain text
builder = pyocr.builders.TextBuilder()

# Load the image
img = Image.open('hplogo.jpg')

# Run OCR
result = tool.image_to_string(img, builder=builder)
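
OCR output often carries stray spaces and blank lines, especially on longer inputs such as book pages. The sketch below shows one possible cleanup pass that could be applied to `result` before rendering. The function name `clean_ocr_text` and its rules are assumptions for illustration, not part of pyocr.

```python
import re


def clean_ocr_text(raw):
    """Trim each line, collapse runs of spaces/tabs, and squeeze blank lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines()]
    cleaned = []
    for line in lines:
        # Keep a blank line only if the previous kept line was not blank.
        if line or (cleaned and cleaned[-1]):
            cleaned.append(line)
    return "\n".join(cleaned).strip()
```

For a single word like the logo text, this mostly just strips surrounding whitespace.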

Step 3: Display the text in HTML

python

from jinja2 import Template

# Generate the view
html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <title>The Farther Reaches Of Human Nature</title>
</head>
<body>

    <h1>{{ result }}</h1>

</body>
</html>
'''
template = Template(html)
data = { 'result': result }
view = template.render(data)

# Save
with open('hplogo.html', 'w', encoding='utf-8') as f:
    f.write(view)
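
One caveat: a Jinja2 `Template` created this way does not escape HTML by default, so OCR output containing characters such as `<` or `&` would be written into the page as-is. The sketch below illustrates the escaping idea with the standard library only (`html.escape` plus `string.Template`, swapped in for Jinja2 here). The function name `render_page` and the shortened page markup are hypothetical.

```python
import string
from html import escape

# A shortened stand-in for the page template, using $result as the placeholder.
PAGE = string.Template("""<!DOCTYPE html>
<html lang="en">
<body>
    <h1>$result</h1>
</body>
</html>
""")


def render_page(result):
    """Render the page with the OCR text escaped for safe display."""
    return PAGE.substitute(result=escape(result))
```

With Jinja2 itself, passing `autoescape=True` when constructing the `Template` should achieve a similar effect.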

Opening the generated hplogo.html in a browser should display the text "Google".

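References

Understanding Beautiful Soup in 10 minutes (10分で理解する Beautiful Soup) - Qiita
Let's scrape images with Python (Pythonで画像スクレイピングをしよう) - Qiita
How to run OCR in Python (PythonでOCRを実行する方法) | ガンマソフト株式会社
Looking into templates after wanting to output HTML from Python for the first time in a while (Pythonで久しぶりにHTMLを出力したくなったのでテンプレートについて調べる) - Qiita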
