
[GCP] I built a tool that OCRs PDF files, structures the content, and creates a CSV file (Part 1)

Posted at 2024-05-11

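1. Motivation

The system my employer provides only accepts data through integration with other systems or through manual entry. I wondered whether the documents we collect could be converted to PDF and registered as data from there. While looking into it, a combination of GCP Vision AI and GPT-4 looked workable, so I decided to start by building a simple tool to try it out.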

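2. Rough requirements

①　Upload a local PDF file to Google Cloud Storage
②　Run OCR on the uploaded PDF file with GCP Vision AI and write the result to a JSON file

↑ Up to here is the scope of Part 1

③　Analyze the JSON file created in ② with the GPT-4 API, based on a "structuring model" (a grand name for it) configuration file prepared in advance
④　Write the analysis result from ③ to a CSV file

↑ Up to here is the scope of Part 2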

3. Overview diagram of the tool

4. Prerequisites

・Create a GCP account
　Reference: Google Cloudの始め方（アカウント作成編）
・Create a GCP project
　Reference: Google Cloud Platform（GCP）に新しいプロジェクトを作成する方法
・Create a storage location in Google Cloud Storage
　Reference: Google Cloud Storage(GCS)を使ってみよう
・Enable the GCP Vision AI API
　Reference: [Python] Vision AIをAPI経由で使えるようにするまで
・Create a GCP service account
　Reference: GCPサービスアカウントを作成する方法 – 権限・鍵管理も解説

5. Program

(1) Uploading a file to Google Cloud Storage

GitHub:File_Upload.py

① Settings for connecting to GCS

・To connect to the GCS bucket you created, specify the GCP service account key and the destination bucket.

File_Upload.py

import os
from google.cloud import storage

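# Connect to Cloud Storage (the bucket)
# Path to the key file created together with the GCP service account
credential_path = '【path to the GCP service account key file】'

# The Google Cloud client libraries read the key file path from this environment variable
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

# Bucket name
bucket_name = "【GCS bucket name】"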

client = storage.Client()
bucket = client.get_bucket(bucket_name)
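As a small safeguard around the connection settings above, the key file path can be checked before it is exported. The following is only a minimal sketch: the helper name `set_credentials` is hypothetical and does not appear in the original script. It assumes the key file is a JSON file on the local disk.

```python
import os
from pathlib import Path


def set_credentials(key_file: str) -> str:
    """Export the key file path for the Google Cloud client libraries.

    Hypothetical helper (not in the original script): it fails early when
    the key file is missing instead of failing later inside the client.
    """
    path = Path(key_file).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Service account key not found: {path}")
    # The client libraries look up this variable to locate the key file.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path)
    return str(path)
```

Calling `set_credentials(...)` before `storage.Client()` would replace the two assignment lines in the listing. The storage client also provides `storage.Client.from_service_account_json(path)`, which avoids the environment variable altogether.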

・Upload a file from the local machine to the specified bucket

File_Upload.py

# Upload the file
# The two prompts ask for the source directory and the source file name (PDF)
input_dir_path = input('保存元のパスを入力してください。>>')
input_file_name = input('保存元のファイル名を入力してください。（PDF）>>')

read_pdf_file = os.path.join(input_dir_path, input_file_name)

blob = bucket.blob(input_file_name)
blob.upload_from_filename(filename=read_pdf_file)
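Because the later OCR step expects a PDF, it may be worth checking the input before uploading it. The sketch below is an assumption on my part rather than part of the original tool: the helper `is_pdf` is hypothetical, and it relies on the fact that PDF files normally begin with the bytes `%PDF-`.

```python
from pathlib import Path

# Header bytes that PDF files normally start with.
PDF_MAGIC = b"%PDF-"


def is_pdf(path: str) -> bool:
    """Return True if the file exists, ends in .pdf, and starts with %PDF-."""
    p = Path(path)
    if not p.is_file() or p.suffix.lower() != ".pdf":
        return False
    with p.open("rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC
```

In the upload script this could guard the call, for example `if is_pdf(read_pdf_file): blob.upload_from_filename(filename=read_pdf_file)`.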

・Run the script, enter the local source folder and file name, and then confirm that the file has been saved to Google Cloud Storage

～\ocr_csv_create_tool>python File_Upload.py
保存元のパスを入力してください。>>・・・・・
保存元のファイル名を入力してください。（PDF）>>・・・・・

(2) OCR with Vision AI

GitHub:PDF_Read.py
・Run OCR with Vision AI on the file stored in the GCS bucket

PDF_Read.py

import os
import json
import re
from google.cloud import vision
from google.cloud import storage
from google.protobuf import json_format

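# The GCP connection settings are omitted here

# Source PDF and output location in GCS
# (placeholders: these two variables are used below but not defined in this excerpt)
gcs_source_uri = 'gs://【GCS bucket name】/【PDF file name】'
gcs_destination_uri = 'gs://【GCS bucket name】/【output folder name】'

# Run OCR on the specified file with Vision AI
# Treat the input file as a PDF
mime_type = 'application/pdf'
# Number of pages written to each output JSON file
batch_size = 2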
client = vision.ImageAnnotatorClient()

feature = vision.Feature(
    type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

# Set the path of the target PDF file
gcs_source = vision.GcsSource(uri=gcs_source_uri)

input_config = vision.InputConfig(
    gcs_source=gcs_source, mime_type=mime_type)

gcs_destination = vision.GcsDestination(uri=f"{gcs_destination_uri}/")
output_config = vision.OutputConfig(
    gcs_destination=gcs_destination, batch_size=batch_size)

async_request = vision.AsyncAnnotateFileRequest(
    features=[feature], input_config=input_config,
    output_config=output_config)

operation = client.async_batch_annotate_files(
    requests=[async_request])

print('Waiting for the operation to finish.')
operation.result(timeout=180)
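The request above depends on two `gs://` URIs and on `batch_size`, which sets how many pages Vision AI puts into each output JSON file. The following sketch shows one way to derive those values. Both helper names and the `ocr_output` folder are hypothetical choices for illustration, not something taken from the original script.

```python
import math


def build_gcs_uris(bucket_name: str, file_name: str,
                   output_prefix: str = "ocr_output"):
    """Return (source_uri, destination_uri) for one PDF in a bucket.

    The output folder name is an assumed convention, not a Vision AI rule.
    """
    source_uri = f"gs://{bucket_name}/{file_name}"
    destination_uri = f"gs://{bucket_name}/{output_prefix}/{file_name}"
    return source_uri, destination_uri


def expected_output_files(page_count: int, batch_size: int) -> int:
    """Number of JSON files when each file holds up to batch_size pages."""
    if page_count < 0 or batch_size < 1:
        raise ValueError("page_count must be >= 0 and batch_size >= 1")
    return math.ceil(page_count / batch_size)
```

With `batch_size = 2` as in the listing, a five-page PDF would be split across three output files, which is why the next step lists every blob under the destination prefix.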

・Print the OCR result on the command line

PDF_Read.py

storage_client = storage.Client()

match = re.match(r"gs://([^/]+)/(.+)", gcs_destination_uri)
bucket_name = match.group(1)
prefix = match.group(2)

bucket = storage_client.get_bucket(bucket_name)

# List objects with the given prefix, filtering out folders.
blob_list = [
    blob
    for blob in list(bucket.list_blobs(prefix=prefix))
    if not blob.name.endswith("/")
]
print("Output files:")
for blob in blob_list:
    print(blob.name)

output = blob_list[0]

# Print the text recognized by OCR to the screen
json_string = output.download_as_bytes().decode("utf-8")
response = json.loads(json_string)

# The actual response for the first page of the input file.
first_page_response = response["responses"][0]
annotation = first_page_response["fullTextAnnotation"]

# Here we print the full text from the first page.
# The response contains more information:
# annotation/pages/blocks/paragraphs/words/symbols
# including confidence scores and bounding boxes
print("Full text:\n")
print(annotation["text"])
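The listing above prints only the first page of the first output file. For a multi-page PDF, the text of every page could be gathered as sketched below. This is a hedged example, not the author's code: `collect_full_text` is a hypothetical helper, and it assumes each response carries its page number under `context.pageNumber`, as the Vision AI JSON output for files normally does.

```python
from typing import Dict, List


def collect_full_text(output_files: List[Dict]) -> str:
    """Join the OCR text of every page across all parsed output files.

    Each element of output_files is one output JSON file already parsed
    with json.loads(). Pages without text contribute an empty string.
    """
    pages = []
    for parsed in output_files:
        for response in parsed.get("responses", []):
            page_number = response.get("context", {}).get("pageNumber", 0)
            text = response.get("fullTextAnnotation", {}).get("text", "")
            pages.append((page_number, text))
    # Sort by page number so the result does not depend on file-name order.
    pages.sort(key=lambda item: item[0])
    return "\n".join(text for _, text in pages)
```

In the script this could be fed with `[json.loads(b.download_as_bytes().decode("utf-8")) for b in blob_list]`.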

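・Running the file above prints the OCR result to the command line.
・The OCR result is also written to the GCS bucket as a JSON file.
※The generated JSON file is hard to read by eye, so the script parses it and prints the text to the command line to confirm that OCR worked correctly.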
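Since the raw JSON is hard to inspect by eye, one further check is to list only the words that Vision AI was unsure about. The sketch below walks the pages/blocks/paragraphs/words/symbols hierarchy mentioned in the code comments. The helper `low_confidence_words` and the 0.8 threshold are illustrative assumptions, not values from the original tool.

```python
def low_confidence_words(annotation: dict, threshold: float = 0.8):
    """Return (word, confidence) pairs whose confidence is below threshold.

    annotation is the "fullTextAnnotation" dict of one page response.
    A word with no "confidence" key is treated as confidence 0.0.
    """
    flagged = []
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    confidence = word.get("confidence", 0.0)
                    if confidence < threshold:
                        # A word is stored as a list of single-character symbols.
                        text = "".join(
                            s.get("text", "") for s in word.get("symbols", [])
                        )
                        flagged.append((text, confidence))
    return flagged
```

Applied to `annotation` from the listing, this gives a short list of words to compare against the source PDF instead of reading the whole JSON file.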

In Part 2, I plan to post the flow of actually structuring the data with GPT.

6. Rough cost up to this point

・Created a project and stored a few PDF files in Cloud Storage
・Tested Vision AI several times (free up to 1,000 units)
→ About 150 yen so far, which fits comfortably within the free trial credit
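To estimate Vision AI usage before a larger test, the free allowance mentioned above can be turned into a small calculation. This is only a sketch under stated assumptions: it treats each processed PDF page as one unit, takes the 1,000 free units from this article, and leaves the unit price as a required argument because it has to be looked up on the current pricing page. Both function names are hypothetical.

```python
def billable_units(pages_processed: int, free_units: int = 1000) -> int:
    """Units charged after subtracting the free allowance (never negative)."""
    return max(0, pages_processed - free_units)


def estimated_cost(pages_processed: int, price_per_1000_units: float,
                   free_units: int = 1000) -> float:
    """Estimated charge in the currency of price_per_1000_units.

    price_per_1000_units is deliberately not given a default value:
    take it from the current pricing page.
    """
    units = billable_units(pages_processed, free_units)
    return units / 1000 * price_per_1000_units
```

For example, 800 processed pages give `billable_units(800) == 0`, so a handful of test runs like the ones described here stays inside the free allowance.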

7. References

・Reference for the tool as a whole
　Reference: GPTが人知れず既存の名刺管理アプリを抹殺していた話
　→ This article is what led me to build this tool. Many thanks.
・Reference for uploading data to GCS
　Reference: PythonでGoogleCloudStorageへファイルをアップロード・ダウンロードする方法
