Building a Simple RAG over PDF Files Containing Figures and Tables with Amazon S3 Vectors

Last updated at 2025-09-20, posted at 2025-08-21

This article is the day 4 Tech blog entry of "#ヌーラボブログリレー2025夏" (Nulab Blog Relay, Summer 2025).

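Introduction

This article shows how to build a simple RAG with Amazon S3 Vectors over PDF files that contain figures and tables. In the previous article, "Amazon S3 Vectors を使ってシンプルな RAG を構築" (Building a simple RAG with Amazon S3 Vectors), I extracted text from a PDF and built a simple RAG on Amazon S3 Vectors. This time the RAG is built from two sources: the text extracted from the PDF, and the Markdown that an LLM produces after analyzing an image of each page. Both are embedded as vectors.

When you build a RAG with Amazon Bedrock Knowledge Bases, Bedrock parses the figures and tables in PDFs and similar files according to the parsing strategy you select. The parsing happens without writing any code, so it is very simple.

With Amazon S3 Vectors, on the other hand, you have to implement embedding generation yourself. This takes more effort than the managed Amazon Bedrock Knowledge Bases, but it lets you customize how the data is parsed. This article shows how to parse a PDF file containing figures and tables, store the result in Amazon S3 Vectors, and build a RAG on top of it.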

What is Amazon S3 Vectors?

Amazon S3 Vectors is a service that provides native vector search in Amazon S3. It is billed as the first cloud object store with native support for sub-second query latency. Because it uses vector buckets and vector indexes on Amazon S3, you get native vector search on top of the simplicity, durability, availability, and cost efficiency of Amazon S3.

It is described as complementary to Amazon OpenSearch Service: you can use S3 Vectors as the engine of an OpenSearch Service managed cluster, or export an S3 Vectors index and use it in OpenSearch Serverless. It can also be used as a vector data store for Amazon Bedrock.

S3 Vectors also works with embedding models outside Amazon Bedrock. At the time of writing, Amazon Bedrock Knowledge Bases offers only two Amazon Titan and two Cohere embedding models, whereas with S3 Vectors you can use embedding models from OpenAI and other providers.

References

https://aws.amazon.com/jp/blogs/machine-learning/building-cost-effective-rag-applications-with-amazon-bedrock-knowledge-bases-and-amazon-s3-vectors/

https://aws.amazon.com/jp/blogs/aws/introducing-amazon-s3-vectors-first-cloud-storage-with-native-vector-support-at-scale/

https://aws.amazon.com/jp/blogs/news/introducing-amazon-s3-vectors-first-cloud-storage-with-native-vector-support-at-scale/

https://aws.amazon.com/jp/blogs/news/optimizing-vector-search-using-amazon-s3-vectors-and-amazon-opensearch-service/

https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors.html

https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html

https://docs.aws.amazon.com/cli/latest/reference/s3vectors/

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3vectors.html

https://docs.aws.amazon.com/ja_jp/bedrock/latest/userguide/kb-advanced-parsing.html

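Processing overview

The implementation performs the following processing.

Text processing

Extract text from the PDF and split it into chunks of the configured chunk size. Create a Titan embedding for each chunk. Store the embedding vectors in S3 Vectors.

Image processing

Render each PDF page as an image. Have an LLM analyze each image and convert its content to Markdown. Split the Markdown into chunks of the configured chunk size. Create a Titan embedding for each chunk. Store the embedding vectors in S3 Vectors.

Vector search

Create a Titan embedding of the user's question. Search S3 Vectors with that embedding vector. Retrieve source_text from the metadata in the search results. Add source_text to the prompt as context for the LLM.

Answer generation

Add the user's question to the prompt and have the LLM generate the answer.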

Flow diagrams

Text processing and image processing

Vector search and answer generation

Environment setup

All operations here are performed in the us-west-2 Region. If you use a different Region, replace the Region specified in this article with the one you use.

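Creating a vector bucket

Create a vector bucket from the Amazon S3 console in the AWS Management Console. Select Vector buckets, then click Create vector bucket at the top right of the screen.

Enter a vector bucket name. Unlike general-purpose S3 bucket names, vector bucket names are not globally unique; they only need to be unique within each Region of your AWS account. For the other naming rules, see Vector bucket naming requirements. An excerpt follows.

Vector bucket names must be 3 to 63 characters long
Vector bucket names can contain only lowercase letters (a-z), numbers (0-9), and hyphens (-)
Vector bucket names must begin and end with a letter or number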
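
The excerpted rules can be checked locally before calling AWS. The following is a minimal sketch that covers only the three rules listed above, not the full requirements in the official documentation; the function name is hypothetical.

```python
import re

# Rules from the excerpt: 3-63 characters, only a-z / 0-9 / hyphen,
# and the first and last characters must be a letter or a number.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_vector_bucket_name(name: str) -> bool:
    """Return True if `name` satisfies the three excerpted naming rules."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


for candidate in ["my-vector-bucket-01", "My_Bucket", "ab", "-starts-with-hyphen"]:
    print(candidate, is_valid_vector_bucket_name(candidate))
```

A name that passes this check can still be rejected by the service, for example when it is already in use in the same Region of your account.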

Leave the encryption type at its default. By default, server-side encryption with Amazon S3 managed keys (SSE-S3) is used. Click Create vector bucket to create the vector bucket.

Deleting a vector bucket

Vector buckets cannot be deleted from the AWS Management Console (as of August 14, 2025). To delete a vector bucket, use the AWS CLI. If the vector bucket contains vector indexes, you must delete all of them first.

aws s3vectors delete-vector-bucket --vector-bucket-name "<vector-bucket-name>" --region "<region-name>"
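
The same cleanup can be scripted with boto3. The following is a sketch, not a tested production script: it assumes the `list_indexes`, `delete_index`, and `delete_vector_bucket` operations of the boto3 `s3vectors` client and a response shape with `indexes` and `nextToken` keys. The function name is hypothetical, and the client is passed in so the logic can be exercised without AWS credentials.

```python
from typing import Any


def delete_vector_bucket_with_indexes(client: Any, bucket_name: str) -> list[str]:
    """Delete every vector index in the bucket, then the bucket itself.

    `client` is expected to be boto3.client("s3vectors", region_name=...).
    Returns the names of the deleted indexes.
    """
    # Collect all index names first (following nextToken), then delete them,
    # so that pagination is not affected by the deletions.
    index_names: list[str] = []
    kwargs: dict[str, Any] = {"vectorBucketName": bucket_name}
    while True:
        page = client.list_indexes(**kwargs)
        index_names.extend(index["indexName"] for index in page.get("indexes", []))
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token

    for index_name in index_names:
        client.delete_index(vectorBucketName=bucket_name, indexName=index_name)

    # The bucket can be deleted only after it no longer contains indexes.
    client.delete_vector_bucket(vectorBucketName=bucket_name)
    return index_names
```

Usage would look like `delete_vector_bucket_with_indexes(boto3.client("s3vectors", region_name="us-west-2"), "<vector-bucket-name>")`. Deleting an index also discards the vectors stored in it, so run this only against buckets you no longer need.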

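Creating a vector index

Next, create a vector index. Click the vector bucket you created and open the vector index creation screen.

Enter a vector index name. Vector index names must be unique within the vector bucket. For the other naming rules, see Vector index naming requirements. The rules are the same as those for vector bucket names.

Dimension: specify the number of dimensions of the embedding model you use. Specify 1024 to match Amazon Titan Text Embeddings V2, described later.
Distance metric: specify the distance metric used for vector search. Select Cosine here.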
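
For intuition about the Cosine setting, the sketch below computes cosine distance as 1 minus cosine similarity in plain Python. That this definition matches the distance values S3 Vectors returns is an assumption; check the official documentation for the exact definition.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Cosine distance ignores vector length and compares direction only, which is why it is a common choice for text embeddings.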

Open Additional settings and configure Non-filterable metadata. Keys listed under Non-filterable metadata cannot be used for metadata filtering at query time. Filterable metadata has a low size limit, so non-filterable metadata is used to store the original text that was embedded. Passing this metadata to the LLM after the vector search lets the LLM use it to generate an answer. Here we add a key named source_text. Click Add key and enter source_text. Finally, click Create vector index.

The vector index name, Dimension, Distance metric, and Non-filterable metadata keys cannot be changed after the vector index is created. To change them, you have to create a new vector index.
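
The same index settings can also be expressed in code. The sketch below only builds the keyword arguments for the boto3 `s3vectors` `create_index` call; the helper name is hypothetical, and the parameter names reflect my understanding of the boto3 API, so verify them against the boto3 reference before use.

```python
from typing import Any, Sequence


def build_create_index_params(
    bucket_name: str,
    index_name: str,
    dimension: int = 1024,
    non_filterable_keys: Sequence[str] = ("source_text",),
) -> dict[str, Any]:
    """Build keyword arguments equivalent to the console settings above."""
    return {
        "vectorBucketName": bucket_name,
        "indexName": index_name,
        "dataType": "float32",
        "dimension": dimension,        # must match the embedding model
        "distanceMetric": "cosine",
        "metadataConfiguration": {
            "nonFilterableMetadataKeys": list(non_filterable_keys),
        },
    }


params = build_create_index_params("<vector-bucket-name>", "<vector-index-name>")
print(params)

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("s3vectors", region_name="us-west-2")
#   client.create_index(**params)
```

Because these settings are immutable after creation, keeping them in code makes it easier to recreate an index with different values later.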

The filterable metadata of a single vector is limited to 2 KB. Storing the original text as filterable metadata therefore leaves room for very little data. The total metadata of a single vector (filterable plus non-filterable) is limited to 40 KB. Using non-filterable metadata therefore makes it possible to store part or all of the original text.
For the constraints, see Limitations and restrictions.
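
To see why source_text has to be non-filterable, the sketch below estimates metadata sizes for one chunk of 1,024 Japanese characters. Measuring size as the UTF-8 length of the JSON-serialized metadata, and treating 2 KB and 40 KB as 2,048 and 40,960 bytes, are assumptions; how S3 Vectors measures size exactly may differ.

```python
import json

FILTERABLE_LIMIT_BYTES = 2 * 1024   # assumed byte value of the 2 KB limit
TOTAL_LIMIT_BYTES = 40 * 1024       # assumed byte value of the 40 KB limit


def metadata_sizes(metadata: dict, non_filterable_keys: set[str]) -> tuple[int, int]:
    """Return (filterable_bytes, total_bytes) for one vector's metadata."""
    def size(d: dict) -> int:
        return len(json.dumps(d, ensure_ascii=False).encode("utf-8"))

    filterable = {k: v for k, v in metadata.items() if k not in non_filterable_keys}
    return size(filterable), size(metadata)


chunk = "あ" * 1024  # 1,024 Japanese characters are 3,072 bytes in UTF-8
meta = {"source_file": "n1120000.pdf", "chunk_index": 0, "source_text": chunk}

# source_text registered as non-filterable: both limits are satisfied.
filterable_size, total_size = metadata_sizes(meta, {"source_text"})
print(filterable_size <= FILTERABLE_LIMIT_BYTES, total_size <= TOTAL_LIMIT_BYTES)

# source_text left filterable: the filterable limit is exceeded.
filterable_size, _ = metadata_sizes(meta, set())
print(filterable_size <= FILTERABLE_LIMIT_BYTES)
```

A Japanese character usually takes three bytes in UTF-8, so a 1,024-character chunk alone is about 3 KB and already exceeds the filterable limit.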

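Vectorizing the PDF file

Here we use Amazon Titan Text Embeddings V2 on Amazon Bedrock to vectorize the PDF file. The vectorization flow is as follows.

Main processing flow

Initialization
- Load settings (chunk size 1024, AWS connection information, etc.)
- Get the list of PDF files to process
Two processing approaches
- Text processing
  - PDF → pypdf → text extraction → split into 1024-character chunks → Titan embedding
- Image processing
  - PDF → pdf2image → render each page as an image → content analysis with Claude Sonnet 4 → convert to Markdown → split into 1024-character chunks → Titan embedding
Integration and storage
- Merge the results of both approaches
- Attach metadata (source_type, page_number, etc.)
- Store everything in S3 Vectors in one batch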

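Preparing the PDF file

Here we use Section 2, "AIの爆発的な進展の動向" (Trends in the explosive progress of AI), from the PDF edition of the 2025 (Reiwa 7) White Paper on Information and Communications published by the Ministry of Internal Affairs and Communications ( https://www.soumu.go.jp/johotsusintokei/whitepaper/ja/r07/pdf/n1120000.pdf ).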

Preparing the Python environment

The code was run in the following environment. Installing uv and creating the venv are omitted.

Python 3.12.3
uv 0.7.4

The code and files are laid out as follows.

.
├── files
│   └── n1120000.pdf
├── requirements.txt
├── embedding.py
└── query.py

requirements.txt

boto3
# PDF text extraction library
pypdf
# PDF-to-image conversion library (system dependency: requires poppler-utils)
pdf2image
# Python imaging library, used for image format conversion and Base64 encoding
Pillow

1. Python packages

Install the Python packages.

uv pip install -r requirements.txt

2. System dependencies

poppler is required to convert PDF pages to images.

Ubuntu/Debian

sudo apt-get update
sudo apt-get install poppler-utils

macOS

On macOS, install it with Homebrew.

brew install poppler
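Vectorizing text and figures and storing them in S3 Vectors

The code that vectorizes the text and figures and stores them in S3 Vectors follows. Enable the Titan Text Embeddings V2 and Claude Sonnet 4 models in your Region beforehand.
For VECTOR_BUCKET_NAME and INDEX_NAME, specify the names of the vector bucket and index you created earlier.

The chunk size and overlap are set in the code. Here the chunk size is 1024 characters and the overlap is 10% of the chunk size.

# Chunk size and overlap
CHUNK_SIZE = 1024
OVERLAP = int(CHUNK_SIZE * 0.1)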
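
As a rough check on these settings, the sketch below estimates how many chunks a sliding window produces. It assumes a fixed stride of chunk size minus overlap, so it gives a lower bound: the article's chunk_text() backs off to the last space, which shortens some chunks and adds a few more. The function name is hypothetical, and 25629 is the extracted text length reported in this article's run log.

```python
import math


def min_chunk_count(text_length: int, chunk_size: int, overlap: int) -> int:
    """Lower bound on the number of chunks for a fixed-stride sliding window."""
    if text_length <= chunk_size:
        return 1
    stride = chunk_size - overlap  # characters gained by each additional chunk
    return math.ceil((text_length - chunk_size) / stride) + 1


CHUNK_SIZE = 1024
OVERLAP = int(CHUNK_SIZE * 0.1)  # int(102.4) = 102

print(OVERLAP, min_chunk_count(25629, CHUNK_SIZE, OVERLAP))  # prints "102 28"
```

The run in this article produced 31 chunks for the PDF text, slightly above this bound of 28, which is consistent with the space back-off producing shorter chunks.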

The text embedding model and the image parsing model are also set in the code, as follows.

# Text embedding model
TEXT_EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"

# Image parsing model
IMAGE_PARSING_MODEL_ID = "us.anthropic.claude-sonnet-4-20250514-v1:0"

Sample code for embedding and storing in S3 Vectors

embedding.py

import boto3
import json
import os
import base64
import io
import time
from pypdf import PdfReader
from pdf2image import convert_from_path
from PIL import Image
from typing import List, Dict, Any, Optional
from dataclasses import dataclass

# Create the Bedrock and S3 Vectors clients
bedrock_client = boto3.client('bedrock-runtime', region_name='us-west-2')
s3vectors_client = boto3.client('s3vectors', region_name='us-west-2')

# S3 Vectors vector bucket and index
VECTOR_BUCKET_NAME = "<vector-bucket-name>"
INDEX_NAME = "<vector-index-name>"

# Text embedding model
TEXT_EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"

# Image parsing model
IMAGE_PARSING_MODEL_ID = "us.anthropic.claude-sonnet-4-20250514-v1:0"

# Chunk size and overlap
CHUNK_SIZE = 1024
OVERLAP = int(CHUNK_SIZE * 0.1)

# Performance measurement setting
MEASURE_PERFORMANCE = True

def instruction_for_image_parsing() -> str:
    return """
    Extract the content from an image page and output in Markdown syntax. Enclose the content in the <markdown></markdown> tag and do not use code blocks. If the image is empty then output a <markdown></markdown> without anything in it.

    Follow these steps:

    1. Examine the provided page carefully.

    2. Identify all elements present in the page, including headers, body text, footnotes, tables, images, captions, and page numbers, etc.

    3. Use markdown syntax to format your output:
        - Headings: # for main, ## for sections, ### for subsections, etc.
        - Lists: * or - for bulleted, 1. 2. 3. for numbered
        - Do not repeat yourself

    4. If the element is an image (not table)
        - If the information in the image can be represented by a table, generate the table containing the information of the image
        - Otherwise provide a detailed description about the information in image
        - Classify the element as one of: Chart, Diagram, Logo, Icon, Natural Image, Screenshot, Other. Enclose the class in <figure_type></figure_type>
        - Enclose <figure_type></figure_type>, the table or description, and the figure title or caption (if available), in <figure></figure> tags
        - Do not transcribe text in the image after providing the table or description

    5. If the element is a table
        - Create a markdown table, ensuring every row has the same number of columns
        - Maintain cell alignment as closely as possible
        - Do not split a table into multiple tables
        - If a merged cell spans multiple rows or columns, place the text in the top-left cell and output ' ' for other
        - Use | for column separators, |-|-| for header row separators
        - If a cell has multiple items, list them in separate rows
        - If the table contains sub-headers, separate the sub-headers from the headers in another row

    6. If the element is a paragraph
        - Transcribe each text element precisely as it appears

    7. If the element is a header, footer, footnote, page number
        - Transcribe each text element precisely as it appears

    Output Example:
    <markdown>
    <figure>
    <figure_type>Chart</figure_type>
    Figure 3: This chart shows annual sales in millions. The year 2020 was significantly down due to the COVID-19 pandemic.
    A bar chart showing annual sales figures, with the y-axis labeled "Sales ($Million)" and the x-axis labeled "Year". The chart has bars for 2018 ($12M), 2019 ($18M), 2020 ($8M), and 2021 ($22M).
    </figure>

    <figure>
    <figure_type>Chart</figure_type>
    Figure 3: This chart shows annual sales in millions. The year 2020 was significantly down due to the COVID-19 pandemic.
    | Year | Sales ($Million) |
    |-|-|
    | 2018 | $12M |
    | 2019 | $18M |
    | 2020 | $8M |
    | 2021 | $22M |
    </figure>

    # Annual Report

    ## Financial Highlights

    <figure>
    <figure_type>Logo</figure_type>
    The logo of Apple Inc.
    </figure>

    * Revenue: $40M
    * Profit: $12M
    * EPS: $1.25

    | | Year Ended December 31, | |
    | | 2021 | 2022 |
    |-|-|-|
    | Cash provided by (used in): | | |
    | Operating activities | $ 46,327 | $ 46,752 |
    | Investing activities | (58,154) | (37,601) |
    | Financing activities | 6,291 | 9,718 |

    </markdown>
    """

def extract_images_from_pdf(pdf_path: str) -> List[Image.Image]:
    """
    PDFファイルの各ページを画像として抽出する

    Args:
        pdf_path (str): PDFファイルのパス

    Returns:
        List[Image.Image]: 抽出された画像のリスト
    """
    try:
        return convert_from_path(pdf_path)
    except Exception as e:
        print(f"Error extracting images from {pdf_path}: {e}")
        return []

def image_to_base64(image: Image.Image, format: str = "PNG") -> str:
    """
    PIL画像をbase64文字列に変換する

    Args:
        image (Image.Image): PIL画像オブジェクト
        format (str): 画像フォーマット(PNG, JPEG等)

    Returns:
        str: base64エンコードされた画像データ
    """
    try:
        buffer = io.BytesIO()
        image.save(buffer, format=format)
        buffer.seek(0)
        image_bytes = buffer.read()
        return base64.b64encode(image_bytes).decode('utf-8')
    except Exception as e:
        print(f"Error converting image to base64: {e}")
        return ""

def parse_image_content_with_llm(image_base64: str, model_id: str = 'us.anthropic.claude-sonnet-4-20250514-v1:0') -> str:
    """
    LLMを使用して画像の内容を解析し、テキスト説明を生成する(Converse API使用)

    Args:
        image_base64 (str): base64エンコードされた画像データ
        model_id (str): 使用するモデルID、デフォルトはClaude Sonnet 4

    Returns:
        str: 画像内容の説明テキスト
    """

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "image": {
                        "format": "png",
                        "source": {
                            "bytes": base64.b64decode(image_base64)
                        }
                    }
                },
                {
                    "text": instruction_for_image_parsing()
                }
            ]
        }
    ]

    inference_config ={
        "maxTokens": 2500,
        "temperature": 0
    }

    try:
        response = bedrock_client.converse(
            modelId=model_id,
            messages=messages,
            inferenceConfig=inference_config
        )

        if 'output' in response and 'message' in response['output']:
            content = response['output']['message']['content']
            if content and len(content) > 0 and 'text' in content[0]:
                return content[0]['text']

        # Return an empty string on failure so that no error message
        # gets chunked, embedded, and stored as if it were page content
        return ""

    except Exception as e:
        print(f"Error analyzing image content with LLM: {e}")
        return ""

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    PDFファイルからテキストを抽出する

    Args:
        pdf_path (str): PDFファイルのパス

    Returns:
        str: 抽出されたテキスト
    """
    try:
        reader = PdfReader(pdf_path)
        text = ""

        for page in reader.pages:
            text += page.extract_text() + "\n"

        return text.strip()
    except Exception as e:
        print(f"Error extracting text from {pdf_path}: {e}")
        return ""

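def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """
    Split a long text into chunks of the specified size

    Args:
        text (str): Text to split
        chunk_size (int): Maximum chunk size (characters)
        overlap (int): Number of characters shared by adjacent chunks (must be smaller than chunk_size)

    Returns:
        List[str]: List of text chunks
    """
    if len(text) <= chunk_size:
        return [text]

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size

        if end >= len(text):
            chunks.append(text[start:])
            break

        # Back off to the last space so that words are not cut in half.
        # Only a space beyond the overlap window is accepted; otherwise
        # `start` would not advance and the loop could run forever.
        last_space = text.rfind(" ", start + overlap + 1, end)
        if last_space != -1:
            end = last_space

        chunks.append(text[start:end])
        start = end - overlap

    return chunks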

def create_text_embedding(text: str) -> List[float]:
    """
    埋め込みモデルを使用してテキストの埋め込みベクトルを作成する

    Args:
        text (str): 埋め込みベクトルを作成するテキスト

    Returns:
        List[float]: 埋め込みベクトル
    """
    try:
        payload = json.dumps({
            "inputText": text
        })

        response = bedrock_client.invoke_model(
            modelId=TEXT_EMBEDDING_MODEL_ID,
            body=payload,
            accept='application/json',
            contentType='application/json'
        )

        response_body = json.loads(response['body'].read())
        return response_body['embedding']

    except Exception as e:
        print(f"Error creating text embedding: {e}")
        return []

@dataclass
class ContentItem:
    """
    Data class representing one chunk of content

    Args:
        content (str): Content text
        source_type (str): Source type ("pdf_text" or "image_markdown")
        source_index (int): Source index
        metadata (Dict[str, Any]): Metadata attached to the chunk
    """
    content: str
    source_type: str
    source_index: int
    metadata: Dict[str, Any]

def process_content_chunks(content_text: str, source_type: str, source_index: int,
                          chunk_size: int = 1000, overlap: int = 100,
                          extra_metadata: Optional[Dict[str, Any]] = None) -> List[ContentItem]:
    """
    コンテンツをチャンク化してContentItemのリストを作成する（共通処理）

    Args:
        content_text (str): チャンク化するテキスト
        source_type (str): ソースタイプ（"pdf_text" or "image_markdown"）
        source_index (int): ソースインデックス
        chunk_size (int): チャンクの最大サイズ
        overlap (int): チャンク間の重複文字数
        extra_metadata (Optional[Dict[str, Any]]): 追加メタデータ

    Returns:
        List[ContentItem]: ContentItemのリスト
    """
    if not content_text or len(content_text.strip()) < 10:
        return []

    # テキストをチャンク化
    chunks = chunk_text(content_text, chunk_size=chunk_size, overlap=overlap)

    content_items = []
    for i, chunk in enumerate(chunks):
        metadata = {
            "source_type": source_type,
            "source_index": source_index,
            "chunk_index": i,
            "length": len(chunk)
        }
        if extra_metadata:
            metadata.update(extra_metadata)

        content_items.append(ContentItem(
            content=chunk,
            source_type=source_type,
            source_index=source_index,
            metadata=metadata
        ))

    return content_items

def process_content_items_to_vectors(content_items: List[ContentItem], result: Dict[str, Any]) -> None:
    """
    ContentItemsを埋め込みベクトルに変換して結果に追加する（共通処理）

    Args:
        content_items (List[ContentItem]): ContentItemのリスト
        result (Dict[str, Any]): 処理結果の辞書
    """
    for item in content_items:
        print(f"        Processing {item.source_type} chunk {item.metadata['chunk_index']+1}")
        text_embedding = create_text_embedding(item.content)
        if text_embedding:
            chunk_info = {
                "index": len(result["text_chunks"]),
                "text": item.content,
                "length": len(item.content),
                "type": item.source_type
            }
            chunk_info.update(item.metadata)

            result["text_chunks"].append(chunk_info)
            result["text_embeddings"].append(text_embedding)

def store_embeddings_to_s3vectors(pdf_path: str, result: Dict[str, Any], store_to_s3: bool = True) -> Dict[str, Any]:
    """
    埋め込みベクトルデータをS3Vectorsに保存する

    Args:
        pdf_path (str): PDFファイルのパス
        result (Dict[str, Any]): process_pdf_fileの結果
        store_to_s3 (bool): S3Vectorsに実際に保存するかどうか

    Returns:
        Dict[str, Any]: 保存結果の統計情報
    """
    if not store_to_s3:
        return {"skipped": True, "reason": "store_to_s3=False"}

    filename = os.path.basename(pdf_path)
    vectors_to_store = []
    stats: Dict[str, Any] = {"text_vectors": 0, "total": 0}

    # テキスト埋め込みベクトルを追加
    for i, (chunk_info, embedding) in enumerate(zip(result["text_chunks"], result["text_embeddings"])):
        vector_key = f"{filename}_text_chunk_{i}"
        vectors_to_store.append({
            "key": vector_key,
            "data": {"float32": embedding},
            "metadata": {
                "id": vector_key,
                "source_file": filename,
                "type": "text",
                "chunk_index": i,
                "chunk_length": chunk_info["length"],
                "source_text": chunk_info["text"][:1024],
                "full_text_length": len(chunk_info["text"])
            }
        })
        stats["text_vectors"] += 1

    stats["total"] = len(vectors_to_store)

    if not vectors_to_store:
        return {"error": "No vectors to store"}

    try:
        # S3Vectorsに保存
        response = s3vectors_client.put_vectors(
            vectorBucketName=VECTOR_BUCKET_NAME,
            indexName=INDEX_NAME,
            vectors=vectors_to_store
        )

        stats["success"] = True
        stats["response"] = response

        return stats

    except Exception as e:
        print(f"Error storing vectors to S3Vectors: {e}")
        stats["error"] = str(e)
        return stats

def process_pdf_file(pdf_path: str, chunk_size: int = 1000, overlap: int = 100) -> Dict[str, Any]:
    """
    PDFファイルを処理してテキストを抽出し、埋め込みベクトルを作成する
    マルチモーダル処理：ページを画像化し、画像の内容を解析してテキストに変換する。テキストから埋め込みベクトルを作成する。

    Args:
        pdf_path (str): PDFファイルのパス
        chunk_size (int): チャンクの最大サイズ（文字数）、デフォルト: 1000
        overlap (int): チャンク間の重複文字数、デフォルト: 100

    Returns:
        Dict[str, Any]: 処理結果（テキストチャンクと埋め込みベクトル）
    """
    if not os.path.exists(pdf_path):
        print(f"Error: File {pdf_path} does not exist")
        return {"error": f"File {pdf_path} does not exist"}

    print(f"Processing PDF: {pdf_path}")

    # 結果データ構造を初期化
    result = {
        "file_path": pdf_path,
        "text_chunks": [],
        "text_embeddings": []
    }

    print("Processing PDF with multimodal approach...")

    # テキスト処理: PDFファイルからテキストを抽出し、チャンク化して埋め込みベクトルを作成する
    text = extract_text_from_pdf(pdf_path)
    if text:
        print(f"Extracted text length: {len(text)} characters")
        print(f"Processing PDF text content...")
        content_items = process_content_chunks(
            text,
            "pdf_text",
            0,
            chunk_size,
            overlap
        )
        print(f"Split into {len(content_items)} chunks")
        process_content_items_to_vectors(content_items, result)

    # Image processing: render each page as an image, analyze it with the LLM, then chunk and embed the result
    images = extract_images_from_pdf(pdf_path)
    for i, image in enumerate(images):
        print(f"Processing image page {i+1}/{len(images)}")

        # 画像をbase64に変換
        image_base64 = image_to_base64(image)
        if not image_base64:
            continue

        # 画像の内容をLLMで解析
        print(f"    Parsing image content with LLM...")
        image_description = parse_image_content_with_llm(image_base64, model_id=IMAGE_PARSING_MODEL_ID)

        # 画像の内容からテキストをチャンク化
        print(f"    Chunking text extracted from image description...")
        extra_metadata = {"page_number": i + 1, "format": "PNG"}
        content_items = process_content_chunks(
            image_description,
            "image_markdown",
            i,
            chunk_size,
            overlap,
            extra_metadata
        )
        # テキストから埋め込みベクトルを作成する。
        process_content_items_to_vectors(content_items, result)

    print(f"Processing completed: {len(result['text_chunks'])} text chunks (includes image-derived text)")
    return result

if __name__ == "__main__":
    pdf_files = [
        "files/n1120000.pdf"
    ]

    print(f"\n{'='*60}")
    print(f"PDF Processing Configuration:")
    print(f"  Processing Mode: Multimodal (text + image)")
    print(f"  Chunk Size: {CHUNK_SIZE}")
    print(f"  Overlap: {OVERLAP}")
    print(f"  Performance Measurement: {MEASURE_PERFORMANCE}")
    print('='*60)

    for pdf_file in pdf_files:
        if os.path.exists(pdf_file):
            print(f"\n{'='*50}")
            print(f"Processing: {pdf_file}")

            start_time = time.time() if MEASURE_PERFORMANCE else None

            # PDFファイルを処理してテキスト抽出とページの画像化を行い、埋め込みベクトルを作成する
            result = process_pdf_file(
                pdf_file,
                chunk_size=CHUNK_SIZE,
                overlap=OVERLAP
            )

            # 処理時間を計測
            if MEASURE_PERFORMANCE and start_time is not None:
                processing_time = time.time() - start_time
                print(f"Processing time: {processing_time:.2f} seconds")

            if "error" in result:
                print(f"Error: {result['error']}")
                continue

            # S3Vectorsに埋め込みベクトルを保存
            storage_result = store_embeddings_to_s3vectors(pdf_file, result, store_to_s3=True)
            if "error" in storage_result:
                print(f"\n Storage failed: {storage_result['error']}")

        else:
            print(f"File not found: {pdf_file}")

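The instruction_for_image_parsing() function in the code builds the instructions given to the parser that analyzes the PDF pages rendered as images.

def instruction_for_image_parsing() -> str:
    return """
    Extract the content from an image page and output in Markdown syntax. ...
    """

These instructions are taken verbatim from what is displayed when you select a foundation model as the parser while creating a knowledge base in Amazon Bedrock Knowledge Bases. The Bedrock Knowledge Bases parser renders non-text data such as PDFs as images before analyzing them, so using these instructions lets you reproduce processing similar to that of Bedrock Knowledge Bases.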
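
The instructions ask the model to wrap its output in <markdown></markdown> tags. The article's code embeds the model output as returned; if you prefer to strip the wrapper before chunking, a helper along the following lines could be used. This is an optional sketch, and the function name is hypothetical.

```python
import re

# Non-greedy match across newlines for the first <markdown>...</markdown> pair.
_MARKDOWN_TAG_RE = re.compile(r"<markdown>(.*?)</markdown>", re.DOTALL)


def extract_markdown_body(llm_output: str) -> str:
    """Return the text inside <markdown> tags, or the whole output if absent."""
    match = _MARKDOWN_TAG_RE.search(llm_output)
    if match is None:
        # For example, the closing tag is missing because the output was truncated.
        return llm_output.strip()
    return match.group(1).strip()


print(extract_markdown_body("<markdown>\n# Annual Report\n</markdown>"))
print(repr(extract_markdown_body("<markdown></markdown>")))
```

An empty page then yields an empty string, which process_content_chunks() already skips because it ignores text shorter than 10 characters.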

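Run the code.

uv run python ./embedding.py

Running the code prints results like the following. Based on the configured chunk size, the extracted PDF text was split into 31 chunks; together with the chunks derived from the page images, 66 chunks were created in total. Each chunk was embedded with Amazon Titan Text Embeddings V2 and stored in S3 Vectors.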

============================================================
PDF Processing Configuration:
  Processing Mode: Multimodal (text + image)
  Chunk Size: 1024
  Overlap: 102
  Performance Measurement: True
============================================================

==================================================
Processing: files/n1120000.pdf
Processing PDF: files/n1120000.pdf
Processing PDF with multimodal approach...
Extracted text length: 25629 characters
Processing PDF text content...
Split into 31 chunks
        Processing pdf_text chunk 1
        Processing pdf_text chunk 2
...
Processing image page 1/12
    Parsing image content with LLM...
    Chunking text extracted from image description...
        Processing image_markdown chunk 1
        Processing image_markdown chunk 2
...
Processing image page 12/12
    Parsing image content with LLM...
    Chunking text extracted from image description...
        Processing image_markdown chunk 1
        Processing image_markdown chunk 2
Processing completed: 66 text chunks (includes image-derived text)
Processing time: 463.51 seconds

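Checking the registered vectors

Run the aws command to check the registered vectors. Use the latest version of the AWS CLI.
For the vector bucket name and index name, specify the names of the vector bucket and index you created earlier. For the Region name, specify the Region you use. --max-items specifies the maximum number of vectors to retrieve.

aws s3vectors list-vectors --vector-bucket-name "<vector-bucket-name>" --index-name "<vector-index-name>" --return-metadata --return-data --max-items 1 --region "<region-name>"

Running this command shows the data and metadata of a registered vector, as follows. (The vector data is partly omitted.)
source_text stores the text of the chunk (the code keeps at most the first 1,024 characters). Because the text is stored with the vector, it comes back whenever you search or retrieve vectors. The source_text from the vector search results can then be passed to the LLM to generate an answer.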

{
    "vectors": [
        {
            "key": "n1120000.pdf_text_chunk_19",
            "data": {
                "float32": [
                    0.021636517718434334,
                    0.1118633821606636,
                    ...
                ]
            },
            "metadata": {
                "full_text_length": 1007,
                "source_text": "研究力・開発力醸成に貢献する取組を行っている。\nΠɹAI޲\n特に人型ロボットの研究開発・社会実装においては米国と中国等が先行している状況であるが、...",
                "chunk_index": 19,
                "source_file": "n1120000.pdf",
                "id": "n1120000.pdf_text_chunk_19",
                "chunk_length": 1007,
                "type": "text"
            }
        }
    ],
    "NextToken": "eyJuZXh0VG9rZW4iOiBudWxsLCAiYm90b190cnVuY2F0ZV9hbW91bnQiOiAxfQ=="
}
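
The CLI call above returns a single item. To walk through every vector in an index from Python, a paginated loop like the following could be used. It is a sketch that assumes the boto3 `s3vectors` `list_vectors` operation with `returnMetadata` and `nextToken` parameters; the function name is hypothetical, and the client is passed in so the loop can be exercised without AWS credentials.

```python
from typing import Any, Iterator


def iter_vectors(client: Any, bucket_name: str, index_name: str) -> Iterator[dict]:
    """Yield every vector (key and metadata) in the index, following nextToken."""
    kwargs: dict[str, Any] = {
        "vectorBucketName": bucket_name,
        "indexName": index_name,
        "returnMetadata": True,
    }
    while True:
        page = client.list_vectors(**kwargs)
        yield from page.get("vectors", [])
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("s3vectors", region_name="us-west-2")
#   for v in iter_vectors(client, "<vector-bucket-name>", "<vector-index-name>"):
#       print(v["key"], v["metadata"]["source_text"][:40])
```

Vector data is omitted here because only the keys and source_text are needed for inspection; skipping it keeps the responses small.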

Deleting vectors

To delete registered vectors, use the aws s3vectors delete-vectors command, or delete them in bulk with a script.
For details, see the "ベクトルの削除" (Deleting vectors) section of the previous article.
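
A bulk deletion script could look like the following sketch. It assumes the boto3 `s3vectors` `delete_vectors` operation with a `keys` parameter; the default batch size of 500 is an assumption about the per-request limit, so confirm it in the official documentation. The function name is hypothetical.

```python
from typing import Any, Sequence


def delete_vectors_in_batches(
    client: Any,
    bucket_name: str,
    index_name: str,
    keys: Sequence[str],
    batch_size: int = 500,  # assumed per-request limit
) -> int:
    """Delete the given vector keys in batches; return the number of requests."""
    requests = 0
    for i in range(0, len(keys), batch_size):
        client.delete_vectors(
            vectorBucketName=bucket_name,
            indexName=index_name,
            keys=list(keys[i:i + batch_size]),
        )
        requests += 1
    return requests


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("s3vectors", region_name="us-west-2")
#   keys = [f"n1120000.pdf_text_chunk_{i}" for i in range(66)]
#   delete_vectors_in_batches(client, "<vector-bucket-name>", "<vector-index-name>", keys)
```

The keys in the usage comment follow the naming scheme of embedding.py, where each key is the file name followed by _text_chunk_ and the chunk index.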

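Overwriting vectors

According to the official documentation, Inserting vectors into a vector index, vector keys are unique within an index, and registering data under a key that already exists overwrites the existing data with the new data. As long as the keys are unique, the same vector data is never registered twice.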
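
One consequence of key-based overwriting is worth illustrating. embedding.py derives keys from the file name and chunk index, so re-running it with the same settings overwrites the same keys. If a re-run produces fewer chunks, for example after increasing the chunk size, the higher-numbered keys from the earlier run are not overwritten and remain in the index. The sketch below computes those leftover keys; the helper names and the chunk count of 40 are hypothetical.

```python
def vector_keys(filename: str, chunk_count: int) -> list[str]:
    """Keys in the same format embedding.py uses when storing vectors."""
    return [f"{filename}_text_chunk_{i}" for i in range(chunk_count)]


def stale_keys(filename: str, previous_count: int, new_count: int) -> list[str]:
    """Keys written by the previous run that the new run will not overwrite."""
    return vector_keys(filename, previous_count)[new_count:]


# First run: 66 chunks (as in this article). Hypothetical re-run: 40 chunks.
leftover = stale_keys("n1120000.pdf", previous_count=66, new_count=40)
print(len(leftover), leftover[0], leftover[-1])
```

Deleting the leftover keys, or deleting all vectors for the file before re-registering, keeps outdated chunks from appearing in search results.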

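Vector search

The documents were embedded with Amazon Titan Text Embeddings V2 on Amazon Bedrock, so the query is vectorized with the same model. The vectorized query is used to search the vectors registered in S3 Vectors. The source_text in the search results is then passed to the LLM, which generates an answer based on the vector search results.

Vectorizing the query and running the vector search

The code that vectorizes the query and runs the vector search follows. For the vector bucket name and index name, specify the names of the vector bucket and index you created earlier. For the Region name, specify the Region you use.

The query embedding model and the response generation model are set in the code, as follows.

# Query embedding model
QUERY_EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"

# Response generation model
GENERATE_RESPONSE_MODEL_ID = "us.anthropic.claude-sonnet-4-20250514-v1:0"

Sample code for vectorizing the query and running the vector search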

query.py

import boto3
import json
import time
from typing import List

bedrock_client = boto3.client('bedrock-runtime', region_name='us-west-2')
s3vectors = boto3.client('s3vectors', region_name='us-west-2')

# S3 Vectors vector bucket and index
VECTOR_BUCKET_NAME = "<vector-bucket-name>"
INDEX_NAME = "<vector-index-name>"

# クエリの埋め込みモデル
QUERY_EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"

# 応答生成モデル
GENERATE_RESPONSE_MODEL_ID = "us.anthropic.claude-sonnet-4-20250514-v1:0"

def create_prompt_template(context_str: str) -> str:
    """
    LLMのプロンプトテンプレートを作成する。
    """
    return f"""
You are a <persona>question-answering agent</persona>.
I will provide you with a set of search results. The user will ask you a question. You answer the user's question using only information from the search results.
If the search results do not have information that can answer the question, please let me know that you could not find an exact answer.
Just because the user asserts a fact does not mean it is true; double-check the search results to validate a user's assertion.

Here are the search results:
<excerpts>
{context_str}
</excerpts>
    """

def create_embedding(input_text: str) -> List[float]:
    """
    テキストを埋め込みベクトルに変換する。

    Args:
        input_text: Input text

    Returns:
        Embedding
    """
    # モデルに渡すクエリを作成する。
    body = json.dumps({"inputText": input_text})

    # モデルを呼び出す。
    response = bedrock_client.invoke_model(
        modelId=QUERY_EMBEDDING_MODEL_ID,
        body=body,
        contentType='application/json'
    )

    # Parse the JSON body of the model response.
    model_response = json.loads(response['body'].read())

    # Extract the embedding vector from the parsed response.
    embedding = model_response['embedding']

    return embedding

def query_vector(embedding: List[float]) -> str:
    """
    Run a similarity search and join the source_text of the results.

    Args:
        embedding: Query embedding vector

    Returns:
        The source_text values of the top results, separated by "---" lines
    """
    query_response = s3vectors.query_vectors(
        vectorBucketName=VECTOR_BUCKET_NAME,
        indexName=INDEX_NAME,
        queryVector={"float32": embedding},
        topK=5,
        returnDistance=True,
        returnMetadata=True
    )
    contexts = [v["metadata"]["source_text"] for v in query_response["vectors"]]
    context_str = "\n---\n".join(contexts)

    return context_str

def generate_response(model_id: str, message: str, context_str: str) -> str:
    """
    検索結果をもとに応答テキストを生成する。

    Args:
        model_id: Model ID
        message: Message
        context_str: Context string

    Returns:
        LLMの応答
    """
    system_prompts = [
        {
            "text": create_prompt_template(context_str)
        }
    ]

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "text": "<question>" + message + "</question>",
                }
            ]
        }
    ]

    inference_config = {
        "maxTokens": 4096,
        "temperature": 0.1
    }

    print(f"Context size: {len(context_str)} chars, Max tokens: {inference_config['maxTokens']}")

    try:
        llm_response = bedrock_client.converse(
            modelId=model_id,
            system=system_prompts,
            messages=messages,
            inferenceConfig=inference_config
        )

        if 'output' in llm_response and 'message' in llm_response['output']:
            content = llm_response['output']['message']['content']
            if content and len(content) > 0 and 'text' in content[0]:
                return content[0]['text']

        return "No response from LLM"

    except Exception as e:
        print(f"Error generating response: {e}")
        return f"Error generating response: {str(e)}"

def main():

    message = "AI活力ランキングで5位の国はどこですか? そのほかの順位も教えてください。また、それぞれの国はどのような分野で優れているかも教えてください。"

    print(f"Processing message: {message}")

    # 埋め込み生成の処理時間を計測
    start_time = time.time()
    embedding = create_embedding(message)
    embedding_time = time.time() - start_time
    print(f"Embedding creation time: {embedding_time:.3f} seconds")

    # ベクトル検索の処理時間を計測
    start_time = time.time()
    context_str = query_vector(embedding)
    query_time = time.time() - start_time
    print(f"Vector query time: {query_time:.3f} seconds")

    # LLM応答生成の処理時間を計測
    start_time = time.time()
    response = generate_response(GENERATE_RESPONSE_MODEL_ID, message, context_str)
    response_time = time.time() - start_time
    print(f"Response generation time: {response_time:.3f} seconds")

    # 合計時間
    total_time = embedding_time + query_time + response_time
    print(f"Total processing time: {total_time:.3f} seconds")

    print("*" * 50)
    print("Response: ", response)

if __name__ == "__main__":
    main()

コード内のcreate_prompt_template()関数で、ベクトル検索の結果をもとに応答テキストを生成するためのプロンプトを作成しています。

def create_prompt_template(context_str: str) -> str:
    """
    LLMのプロンプトテンプレートを作成する。
    """
    return f"""
You are a <persona>question-answering agent</persona>. ...
"""

This prompt reuses, unchanged, the generation prompt that is used when testing a knowledge base in the Amazon Bedrock Knowledge Bases management console.

Run the code.

uv run python ./query.py

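Running this code prints the timings and the generated answer, as shown below: the time taken to embed the question, the time taken by the S3 Vectors vector search, the time taken by the LLM to generate the response, and the total. Each of these times depends on the model used, and there is a trade-off: prioritizing accuracy tends to cost speed.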

Processing message: AI活力ランキングで5位の国はどこですか? そのほかの順位も教えてください。また、それぞれの国はどのような分野で優れているかも教えてください。
Embedding creation time: 1.263 seconds
Vector query time: 1.818 seconds
Context size: 3953 chars, Max tokens: 4096
Response generation time: 7.764 seconds
Total processing time: 10.845 seconds
**************************************************
Response:  検索結果の図表1-1-2-4「AI活力ランキング上位10カ国（2023年）」によると、5位の国は**アラブ首長国連邦（United Arab Emirates）**です。

以下が上位10カ国の順位です：

1. **アメリカ（United States）**
2. **中国（China）**
3. **イギリス（United Kingdom）**
4. **インド（India）**
5. **アラブ首長国連邦（United Arab Emirates）**
6. **フランス（France）**
7. **韓国（South Korea）**
8. **ドイツ（Germany）**
9. **日本（Japan）**
10. **シンガポール（Singapore）**

各国の優れている分野については、図表のバーチャートから以下のような特徴が読み取れます：

- **アメリカ（1位）**: R&D、Economy、Policy and Governance、Infrastructure で特に強い
- **中国（2位）**: R&D、Economy で強い
- **イギリス（3位）**: Education で特に強く、バランスが良い
- **インド（4位）**: R&D、Economy、Education で強い
- **アラブ首長国連邦（5位）**: R&D、Diversity、Policy and Governance で強い
- **フランス（6位）**: Education、Diversity、Policy and Governance で強い
- **韓国（7位）**: Education、Diversity、Policy and Governance で強い
- **ドイツ（8位）**: Education、Diversity、Policy and Governance で強い
- **日本（9位）**: Education、Diversity、Policy and Governance で比較的強い
- **シンガポール（10位）**: Diversity、Policy and Governance で強い

このランキングは、スタンフォード大学のHAI（Human-Centered Artificial Intelligence）が2024年11月に発表した2023年のデータに基づいています。

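The "AI Vibrancy Ranking" (AI活力ランキング) asked about in the question appears as an image on page 6 of the PDF that was loaded: Section 2, "AIの爆発的な進展の動向" (trends in the explosive advancement of AI), of the PDF edition of the Reiwa 7 (2025) Information and Communications White Paper published by the Ministry of Internal Affairs and Communications ( https://www.soumu.go.jp/johotsusintokei/whitepaper/ja/r07/pdf/n1120000.pdf ).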

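As described in the embedding.py processing flow for vectorizing the PDF file, each page is extracted as an image and converted to Markdown by an LLM. That Markdown is then vectorized with Amazon Titan Text Embeddings V2, which is why the content of images can be searched.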

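To see how the figure was converted to Markdown, I ran the aws s3vectors list-vectors command and checked source_text in the metadata. For readability, the newline escape codes have been replaced with actual line breaks. You can see that the bar chart is represented with █ characters. Whether the proportions are represented correctly remains questionable, but the chart appears to be captured broadly correctly.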

# AI活力ランキング上位10カ国 (2023年)
<figure>
<figure_type>Chart</figure_type>
図表1-1-2-4 AI活力ランキング上位10カ国 (2023年)
| 順位 | 国名 | R&D | Responsible AI | Economy | Education | Diversity | Policy and Governance | Public Opinion | Infrastructure |
|-|-|-|-|-|-|-|-|-|-|
| 1 | United States | ████████████ | ██ | ████████████ | ██████ | ████ | ████████ | ████ | ████████████ |
| 2 | China | ████████████ | ██ | ████████████ | ██ | ██ | ████ | ██ | ████████ |
| 3 | United Kingdom | ████ | ██ | ████████ | ████████████ | ██ | ████ | ██ | ████████ |
| 4 | India | ████████████ | ██ | ████████ | ████████ | ██ | ██ | ██ | ████ |
| 5 | United Arab Emirates | ████ | ██ | ████ | ████████ | ██ | ████████ | ██ | ████████ |
| 6 | France | ████ | ██ | ████ | ████████████ | ██ | ████████ | ██ | ████████ |
| 7 | South Korea | ████ | ██ | ████ | ████████ | ██ | ████████ | ██ | ████████ |
| 8 | Germany | ████ | ██ | ████ | ████████████ | ██ | ████████ | ██ | ████████ |
| 9 | Japan | ████ | ██ | ████ | ████████ | ██ | ████████ | ██ |
| 10 | Singapore | ████ | ██ | ████ | ████████ | ██ | ████████ | ██ | ████ |
</figure>
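
The manual newline replacement described above can be avoided by decoding the command's JSON output in Python. Below is a minimal sketch, assuming the `list-vectors` output (with metadata included) was saved as JSON and has the shape `{"vectors": [{"key": ..., "metadata": {"source_text": ...}}]}`. That shape, the sample keys, and the helper name `iter_source_texts` are assumptions for illustration. `source_text` is the metadata key mentioned above.

```python
import json


def iter_source_texts(list_vectors_json: str):
    """Yield (key, source_text) pairs from a saved list-vectors JSON document.

    json.loads() turns escaped "\\n" sequences into real newlines,
    so the Markdown can be printed as-is.
    """
    payload = json.loads(list_vectors_json)
    for vector in payload.get("vectors", []):
        metadata = vector.get("metadata") or {}
        text = metadata.get("source_text")
        if text is not None:
            yield vector.get("key"), text


if __name__ == "__main__":
    # Hypothetical sample standing in for the real command output.
    sample = json.dumps({
        "vectors": [
            {"key": "page-6-image", "metadata": {"source_text": "# Title\n<figure>\n</figure>"}},
            {"key": "no-metadata"},
        ]
    })
    for key, text in iter_source_texts(sample):
        print(f"--- {key} ---")
        print(text)
```

Entries without a `source_text` value are skipped, so the helper also works on output that mixes vectors with and without metadata.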

Next, I asked about a chart on another page. Here too, the answer includes information from inside the chart.

Processing message: 国別の生成AIサービス利用経験について教えてください。
Embedding creation time: 1.493 seconds
Vector query time: 1.305 seconds
Context size: 4701 chars, Max tokens: 4096
Response generation time: 7.601 seconds
Total processing time: 10.399 seconds
**************************************************
Response:  検索結果に基づいて、国別の生成AIサービス利用経験についてお答えします。

## 国別の生成AIサービス利用経験（2024年度調査）

**利用経験がある割合：**
- **中国**: 81.2%（最も高い）
- **米国**: 68.8%
- **ドイツ**: 59.2%
- **日本**: 26.7%（最も低い）

## 前年度からの変化

各国とも2023年度から2024年度にかけて利用経験が大幅に拡大しています：

- **中国**: 56.3% → 81.2%（+24.9ポイント）
- **米国**: 46.3% → 68.8%（+22.5ポイント）
- **ドイツ**: 34.6% → 59.2%（+24.6ポイント）
- **日本**: 9.1% → 26.7%（+17.6ポイント）

## 特徴

1. **日本の利用率は他国と比較して低い**：日本は26.7%で、他の3か国（59.2%〜81.2%）と比べて大きく下回っています。

2. **全ての国で利用が拡大**：調査対象の4か国すべてで、前年度から大幅に利用経験が増加しています。

3. **中国が最も高い利用率**：中国は81.2%と、調査対象国の中で最も高い利用経験率を示しています。

出典：総務省（2025）「国内外における最新の情報通信技術の研究開発及びデジタル活用の動向に関する調査研究」

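The "generative AI service usage experience by country" (国別の生成AIサービスの利用経験) asked about in the question appears as an image on page 9 of the PDF. As with the first question, the content of the bar chart was recognized correctly.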
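
Using the timings printed in the two runs above, a short calculation shows where the time goes. This is only a sketch over two samples, and the dictionary layout and the `stage_shares` helper are illustrative names, not part of query.py.

```python
def stage_shares(timings: dict) -> dict:
    """Return each stage's share of the total time, in percent."""
    total = sum(timings.values())
    return {stage: 100.0 * seconds / total for stage, seconds in timings.items()}


# Timings (seconds) copied from the two runs shown above.
runs = {
    "AI vibrancy ranking question": {"embedding": 1.263, "vector_query": 1.818, "response": 7.764},
    "Generative AI usage question": {"embedding": 1.493, "vector_query": 1.305, "response": 7.601},
}

for name, timings in runs.items():
    print(name)
    for stage, share in stage_shares(timings).items():
        print(f"  {stage}: {share:.1f}%")
```

In both runs the response generation step accounts for a little over 70% of the total, so in this setup the choice of generation model affects end-to-end latency more than the vector query does.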

Summary

This article examined how to build a simple RAG with Amazon S3 Vectors for PDF files that contain charts and figures. The main results are as follows.

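Multimodal processing: By combining text extraction and image analysis of the PDF, I confirmed that the content of charts and figures, not just the text, can be made searchable. This makes it possible to use visual information that was hard to retrieve with a conventional text-based RAG.
Accurate analysis of chart data: Image analysis with Claude 4 converted bar charts and tabular data into Markdown appropriately, which made it possible to answer questions based on chart information such as the "AI Vibrancy Ranking" and "generative AI service usage experience by country".
Practicality of Amazon S3 Vectors: The vector search response time was about 1.3 to 1.8 seconds in the runs above, which is a practical level for an environment that can be tried at low cost, and I confirmed that flexible data processing is possible. Compared with Amazon Bedrock Knowledge Bases it requires manual implementation, but its advantage is that data analysis is highly customizable.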

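Through this verification, I confirmed an approach to building a multimodal RAG system with S3 Vectors and showed that a high-accuracy information retrieval system for composite documents that include charts and figures is feasible.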
