{作成}Ollamaとllava-phi3(vision)の精度検証とtesseract-ocrの構築手順と実行

Last updated at 2024-12-26Posted at 2024-12-21

tesseract-ocrの構築手順

環境
WSL2

手順

sudo apt -y install tesseract-ocr tesseract-ocr-jpn libtesseract-dev libleptonica-dev tesseract-ocr-script-jpan tesseract-ocr-script-jpan-vert

https://www.kkaneko.jp/ai/ubuntu/tesseract.html
or

sudo apt install tesseract-ocr libtesseract-dev tesseract-ocr-jpn

tesseract a.png outbase -l jpn

cat outbase.txt

目次
Ollama Visionって何?

Ollama Visionの使い方
Pythonライブラリから使う方法
JavaScriptライブラリから使う方法

この時点でも実行できるがPythonで実行する場合は

pip install pytesseract

が必要。

コード１

from PIL import Image
import pytesseract

# Tesseractの実行ファイルパスを設定（必要な場合）
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

# 画像からテキストを抽出
image = Image.open('a.png')
text = pytesseract.image_to_string(image, lang='jpn')
print(text)

コード２

from PIL import Image
import pytesseract
import pyautogui

# Tesseractの実行ファイルパスを設定
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

# 画像を開く
image = Image.open('a.png')

# テキストと座標情報を取得
data = pytesseract.image_to_data(image, lang='jpn', output_type=pytesseract.Output.DICT)

# 結果を出力
for i, text in enumerate(data['text']):
    if text.strip():
        x = data['left'][i]
        y = data['top'][i]
        print(f"テキスト: {text}, 座標: ({x}, {y})")
def click_text(target_text):
    for i, text in enumerate(data['text']):
        if text.strip() == target_text:
            x = data['left'][i]
            y = data['top'][i]
            print(f"{target_text}を座標({x}, {y})でクリックします")
            pyautogui.click(x, y)
            return
    print(f"{target_text}が見つかりませんでした")

# 使用例
click_text("目次")

ollama + llava-phi3

手順

WSLで以下を実行しCtrl+Dで抜け出す

ollama run llava-phi3

コード３

import ollama
prompt="""
Output the results of the analysis of the browser screenshot under the following conditions.
Condition 1: Read all text information.
Condition 2: Output text and information readable from the image other than text.
"""
res = ollama.chat(
    model="llava-phi3",
    messages=[{
        'role': 'user',
        'content': prompt,
        'images': ['./a.png']
    }]
)

print(res['message']['content'])

結果：日本語訳済み
読み取らせた画像1

画像は紺色の画面に白い文字。画面に表示されている主な内容は、「HTTPB Writeup 」と書かれている。この見出しの下には、黒いテキストの空行が2行ある。画像の右側には、3つのグレーのアイコンがある小さなツールバーがある。これらのアイコンの位置はそれぞれ異なり、1つは右上隅、もう1つはその少し下、そして最後の1つは画面の下端に寄っている。これらのアイコンの正確な名前や機能は、この画像では見えない。

読み取らせた画像2

画像の左上の文字が切れていて、完全に読むことができない。画像の残りの部分には、「BACH 」や「X-Tunnel 」など様々な情報が含まれている。右下にはウェブサイトのアドレスも見える： https://www.xtunnel.com/。

その他の視覚的な情報としては、画面上にいくつかのボタンがあり、ページをナビゲートしたり、対話したりするのに使われているようだ。さらに、「BACH 2016 」を示すテキストと、「2015」、「4.3%」などのデータに関連しそうな数字がいくつかあるように見えるが、文脈がなければその目的を判断するのは難しい。

結果

少し複雑になると読み込めないようだ。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up