Trying Out Receipt Expense Extraction with Vision AI and the Gemini API

Posted at 2025-03-01

Introduction

This article explains how to use Google Cloud Vision AI and Gemini to automatically extract expense information from a photo of a receipt, and reports the results of trying it.

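Technologies Used

Google Cloud Vision AI

Vision AI is Google Cloud's image recognition service. This system uses its OCR (optical character recognition) feature to extract text from images. Its main features are:

High-accuracy text recognition
Support for multiple languages
Position information for the recognized text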
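As a rough illustration of the position information mentioned above, the sketch below turns the annotations returned by text detection into simple bounding rectangles. It relies on the first annotation holding the full text and the remaining ones holding individual words, each with `description` and `bounding_poly.vertices`. The helper name `word_boxes` is hypothetical and not part of the original program.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def word_boxes(annotations) -> List[Tuple[str, Box]]:
    """Return (text, box) pairs for each word-level annotation.

    `annotations` is expected to look like `response.text_annotations`:
    index 0 is the full text, the remaining entries are individual words.
    """
    boxes = []
    for ann in list(annotations)[1:]:  # skip the full-text entry
        vertices = list(ann.bounding_poly.vertices)
        if not vertices:
            continue  # nothing to measure for this annotation
        xs = [v.x for v in vertices]
        ys = [v.y for v in vertices]
        boxes.append((ann.description, (min(xs), min(ys), max(xs), max(ys))))
    return boxes
```

Such boxes could be used, for example, to find the number printed on the same row as a "total" label, although the program in this article only uses the full text.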

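Google Gemini

Gemini is Google's latest family of generative AI models. This system uses the following two models:

Gemini Pro: a high-performance general-purpose model
Gemini Flash: a model optimized for faster processing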

Prerequisites

To run the code, you need the following:

Create a Google Cloud project
Enable the Cloud Vision API
Create a service account and download its credentials (JSON)
Obtain a Gemini API key
Install Python 3.10
Install Pipenv

Environment Setup

1. Create the required files

First, create a project with the following layout:

project/
├── .env
├── Pipfile
└── main.py

2. Set the environment variables

Put the credentials in the .env file:

GOOGLE_APPLICATION_CREDENTIALS = "path/to/your-credentials.json"
GEMINI_API_KEY = "your-gemini-api-key"
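Pipenv loads `.env` automatically when a command is started with `pipenv run`, so the two variables above reach the program through `os.environ`. If either one is missing, the failure tends to surface later as an opaque authentication error. A minimal sketch of an early check follows. The helper name `require_env` is hypothetical and not part of the original program.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file"
        )
    return value
```

In `main.py` this could replace the bare `os.getenv("GEMINI_API_KEY")` call, for example `genai.configure(api_key=require_env("GEMINI_API_KEY"))`.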

3. Manage dependencies

List the required packages in the Pipfile:

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
google-cloud-vision = "*"
google-generativeai = "*"
google-genai = "*"

[requires]
python_version = "3.10"

4. Create `main.py`

The source code of main.py is shown below.

import os
from google.cloud import vision
import google.generativeai as genai

def detect_text(path):
    """Detects text in the file."""
    client = vision.ImageAnnotatorClient()

    with open(path, 'rb') as image_file:
        content = image_file.read()

    image = vision.Image(content=content)

    response = client.text_detection(image=image)
    texts = response.text_annotations
    return texts[0].description  # the first annotation holds the full text

def generate_text(model, text):
    response = model.generate_content(text)
    return response.text

if __name__ == '__main__':
    GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
    genai.configure(api_key=GEMINI_API_KEY)
    gemini_flash = genai.GenerativeModel("gemini-1.5-flash")
    gemini_pro = genai.GenerativeModel("gemini-pro")
    text = detect_text('path/to/your-image.jpg')
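    # The prompt is written in Japanese. In English it says: "Here is text
    # transcribed from a receipt. For an expense claim, output each expense
    # category and its amount in yen. If there are several categories, output
    # all of them, one per line, in the form '<category> : <amount>円'."
    prompt = f"""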
領収書を文字起こししたテキストを提供します。
経費申請のために何費で何円か出力してください。
費用の種類が複数あれば複数出力してください。
## 出力フォーマット
フォーマットは
何費 : 何円
という形式にしてください。
例としては以下です。
食費 : 120円
交際費 : 120円
光熱費 : 13000円
## テキスト
{text}
    """
    print("gemini pro")
    print(generate_text(gemini_pro, prompt))
    print("gemini flash")
    print(generate_text(gemini_flash, prompt))

Code Walkthrough

This section explains the main functions in main.py.

1. Text detection

def detect_text(path):
    client = vision.ImageAnnotatorClient()
    with open(path, 'rb') as image_file:
        content = image_file.read()
    image = vision.Image(content=content)
    response = client.text_detection(image=image)
    texts = response.text_annotations
    return texts[0].description

This function reads the image file, sends it to Vision AI, and returns the full text detected in the image.

2. Text generation

def generate_text(model, text):
    response = model.generate_content(text)
    return response.text

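This function sends the prompt, which embeds the extracted text, to a Gemini model and returns the model's text response listing the expense items.

Usage

Set up the environment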

pipenv install

Run the program

pipenv run python main.py
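The image path is hard-coded in main.py, so the file has to be edited for every new receipt. One possible variation, sketched below with the standard `argparse` module, is to take the path from the command line. The function name `parse_args` and the argument name `image` are assumptions for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; `argv=None` means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Extract expense items from a receipt image"
    )
    parser.add_argument("image", help="path to the receipt image file")
    return parser.parse_args(argv)
```

With this in place, the main block could call `detect_text(parse_args().image)` and the program would be started as `pipenv run python main.py path/to/your-image.jpg`.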

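Results

I prepared a receipt image for the test. The shop name is blacked out here for publication; the image passed to the program was not redacted.

The output was as follows (食費 means food expenses, and 800円 is 800 yen):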

食費 : 800円
食費 : 800円

The expense was extracted correctly.
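Because the prompt asks for lines in the form `category : amount円`, the response can be turned into structured data with a small parser. The following is a minimal sketch under that assumption. The name `parse_expenses` is hypothetical, and a model may not always follow the requested format, so lines that do not match are skipped.

```python
import re
from typing import List, Tuple

# Matches lines such as "食費 : 800円" or "交際費：1,200円".
LINE_PATTERN = re.compile(
    r"^\s*(?P<category>[^:：]+?)\s*[:：]\s*(?P<amount>\d[\d,，]*)\s*円\s*$"
)


def parse_expenses(output: str) -> List[Tuple[str, int]]:
    """Convert model output into (category, amount in yen) pairs."""
    expenses = []
    for line in output.splitlines():
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # ignore headings, blank lines, or free-form remarks
        digits = match.group("amount").replace(",", "").replace("，", "")
        expenses.append((match.group("category"), int(digits)))
    return expenses
```

For the output above, `parse_expenses("食費 : 800円")` yields `[("食費", 800)]`.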

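Summary

In this article, combining Vision AI's OCR with Gemini's natural language processing was enough to implement expense extraction from receipts.

Notes

Recognition accuracy can vary with image quality
Using the APIs may incur charges
Adding error handling is recommended for production use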
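As one example of the error handling recommended above, note that `detect_text` indexes `texts[0]` directly, which raises `IndexError` when no text is found, and it does not look at the `error` field of the response. The sketch below handles both cases. The helper name `full_text_or_raise` is hypothetical, and returning an empty string for an image without text is a design choice made for this example.

```python
def full_text_or_raise(response) -> str:
    """Return the full detected text from a Vision text detection response."""
    # Per-image failures are reported in response.error instead of an exception.
    if response.error.message:
        raise RuntimeError(f"Vision API error: {response.error.message}")
    annotations = response.text_annotations
    if not annotations:
        return ""  # the image contained no recognizable text
    return annotations[0].description  # the first annotation holds the full text
```

Inside `detect_text`, the last two lines could then become `return full_text_or_raise(response)`.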
