[AWS] Amazon Textract Notes

Posted at 2023-06-02

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents.

Its main features are as follows.

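Text extraction:
Textract extracts text from documents supplied as images or PDF files. It recognizes several kinds of text, including printed text, handwriting, and text inside tables.
Table extraction:
Textract recognizes tables in images and PDFs and extracts their cells, rows, and columns. This information lets you retrieve the table data in structured form.
Form extraction:
Textract automatically detects and extracts fields and key-value pairs in forms. This covers forms containing many kinds of data, such as names, addresses, dates, and amounts.
Key information extraction:
Textract can also detect information that contains specific keywords or patterns. For example, it can extract items that have a particular keyword or value.
Confidence scores:
Textract also returns a confidence score (from 0 to 100) for each piece of extracted text or data. You can use it to judge how reliable an extraction result is.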
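To make the form extraction feature more concrete, here is a minimal sketch of how key-value pairs could be read out of a Textract response. It assumes the blocks come from an `analyze_document` call with `FeatureTypes=['FORMS']`, where each key is a `KEY_VALUE_SET` block linked to its value through a `VALUE` relationship and to its words through `CHILD` relationships. The helper names `block_text` and `extract_key_values` and the sample data are hypothetical, written by hand for illustration, and the sketch ignores checkboxes (`SELECTION_ELEMENT` blocks).

```python
def block_text(block, blocks_by_id):
    """Join the text of all WORD blocks that are children of the given block."""
    words = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_key_values(blocks):
    """Map each form key's text to the text of its linked value block(s)."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_parts = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                for value_id in relationship["Ids"]:
                    value_parts.append(
                        block_text(blocks_by_id[value_id], blocks_by_id))
        pairs[block_text(block, blocks_by_id)] = " ".join(value_parts)
    return pairs


# Hand-written stand-in for response["Blocks"]; not real Textract output.
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]

print(extract_key_values(sample_blocks))
```

With a real document, `sample_blocks` would be replaced by the `Blocks` list of the service response. Table extraction follows a similar pattern, with `TABLE` and `CELL` blocks linked by `CHILD` relationships.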

Usage

import os
import sys
import logging
import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

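# Credentials are read from environment variables rather than hard-coded.
# Any value left unset is None, so boto3 falls back to its default credential chain.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.environ.get("AWS_SESSION_TOKEN")
# Region used by the Textract client; "us-east-1" is an assumed default, change as needed.
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")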


class TextractWrapper:
    """Encapsulates Textract functions."""
    def __init__(self, textract_client, s3_resource, sqs_resource):
        """
        :param textract_client: A Boto3 Textract client.
        :param s3_resource: A Boto3 Amazon S3 resource.
        :param sqs_resource: A Boto3 Amazon SQS resource.
        """
        self.textract_client = textract_client
        self.s3_resource = s3_resource
        self.sqs_resource = sqs_resource

    def detect_file_text(self, *, document_file_name=None, document_bytes=None):
        """
        Detects text elements in a local image file or from in-memory byte data.
        The image must be in PNG or JPG format.

        :param document_file_name: The name of a document image file.
        :param document_bytes: In-memory byte data of a document image.
        :return: The response from Amazon Textract, including a list of blocks
                 that describe elements detected in the image.
        """
        if document_file_name is not None:
            with open(document_file_name, 'rb') as document_file:
                document_bytes = document_file.read()
        try:
            response = self.textract_client.detect_document_text(
                Document={'Bytes': document_bytes})
            logger.info(
                "Detected %s blocks.", len(response['Blocks']))
        except ClientError:
            logger.exception("Couldn't detect text.")
            raise
        else:
            return response

# Document
documentName = r"img/sample.png"
textract_client = boto3.client('textract',
                        aws_access_key_id=AWS_ACCESS_KEY_ID,
                        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                        region_name=AWS_REGION,
                        aws_session_token=AWS_SESSION_TOKEN)

# The S3 and SQS resources are not used by detect_file_text, so None is enough here.
s3_resource = None
sqs_resource = None
Textract = TextractWrapper(textract_client, s3_resource, sqs_resource)
# Pass only the file name; the wrapper reads the file into bytes itself.
response = Textract.detect_file_text(document_file_name=documentName)
# Print every detected line of text in blue (ANSI escape codes).
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')
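The confidence scores mentioned in the feature list can be used to discard doubtful lines before printing them. The following is a minimal sketch, assuming a response shaped like the one returned by `detect_document_text`, where each `LINE` block carries a `Confidence` value between 0 and 100. The function name `confident_lines`, the threshold of 90, and the sample response are hypothetical and written by hand for illustration.

```python
def confident_lines(response, min_confidence=90.0):
    """Return (text, confidence) pairs for LINE blocks at or above the threshold."""
    results = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence >= min_confidence:
            results.append((block["Text"], confidence))
    return results


# Hand-written stand-in for a detect_document_text response; not real output.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice 2023", "Confidence": 99.1},
        {"BlockType": "LINE", "Text": "T0tal: 1O0", "Confidence": 61.4},
        {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.5},
    ]
}

for text, confidence in confident_lines(sample_response):
    print(f"{confidence:5.1f}  {text}")
```

A suitable threshold depends on the documents involved; lines that fall below it could be routed to manual review instead of being dropped.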
