
Trying out the newly added custom chunking in Knowledge Bases for Amazon Bedrock

Posted at 2024-07-12

Knowledge Bases for Amazon Bedrock now supports custom chunking.

According to What's New:

"With this, customers can write their own chunking code as a Lambda function, and even use off the shelf components from frameworks like LangChain and LlamaIndex."

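Until now the only choices were "default chunking" (automatic splitting), "fixed-size chunking", and "no chunking". Now a Lambda function can be invoked to split documents into chunks however you like.

At the same time, "hierarchical chunking" and "semantic chunking" were also added.

In this article I try out custom chunking.

Official documentation I referred to: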
https://docs.aws.amazon.com/bedrock/latest/userguide/kb-chunking-parsing.html#kb-custom-transformation

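When would you use it?

I cannot confirm this, but I sometimes get the impression that chunking in Knowledge Bases for Amazon Bedrock does not handle Japanese well.

With the default chunking setting the upper limit should be 300 tokens, yet when I ingested Japanese text of more than 5,000 characters there were cases where it was not chunked at all.

If you need Japanese text to be chunked properly, custom chunking may be the better choice.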
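As an illustration of what "chunking Japanese properly" could look like inside such a Lambda, here is a minimal sketch that splits on Japanese sentence-ending marks and packs sentences into chunks. The function name `chunk_japanese` and the `max_chars` parameter are my own assumptions, and a character count is only a rough stand-in for the token limit mentioned above.

```python
import re


def chunk_japanese(text, max_chars=200):
    """Split text after Japanese sentence endings, then pack sentences into chunks."""
    # Zero-width split right after each sentence-ending mark, so the mark stays
    # attached to its sentence (requires Python 3.7 or later).
    sentences = [s.strip() for s in re.split(r"(?<=[。！？])", text) if s.strip()]

    chunks = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        # A single sentence longer than max_chars still becomes one oversized chunk.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Packing whole sentences keeps each chunk readable on its own, at the cost of chunk sizes that vary with sentence length.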

Configuration

You configure this partway through the data source creation flow.

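The setting sits alongside the chunking strategy, so both can be specified at the same time. The invoked Lambda can also attach metadata, so uses like the following come to mind:

Set the chunking strategy to "no chunking" and do your own chunking in the Lambda
Let "default chunking" (or another strategy) do the chunking, and attach metadata in the Lambda

The two can also be combined, so chunks produced by default chunking can be split further in the Lambda.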
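For the metadata-only pattern, the Lambda could leave each chunk body untouched and only fill in its `contentMetadata` field (the chunk record format appears later in this article). The following is a minimal sketch; the helper name `add_metadata` and the metadata key in the usage example are hypothetical.

```python
def add_metadata(file_contents, extra):
    """Return copies of the chunk records with `extra` merged into contentMetadata."""
    return [
        {
            **chunk,
            # Keep any metadata already present and add the new keys on top.
            "contentMetadata": {**chunk.get("contentMetadata", {}), **extra},
        }
        for chunk in file_contents
    ]


# Usage with a hypothetical metadata key:
chunks = [{"contentType": "PLAIN_TEXT", "contentMetadata": {}, "contentBody": "text"}]
tagged = add_metadata(chunks, {"language": "ja"})
```

Returning copies rather than mutating the input keeps the original records available for logging or comparison.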

The text data to be chunked is written out to S3, so you need to specify an S3 bucket.

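What I tested

I saved the body of the post "Knowledge Bases for Amazon Bedrock now supports advanced RAG capabilities" to a text file.

I uploaded this file to S3.
For the chunking strategy I selected "default chunking".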

Processing flow

I drew a diagram. It looks complicated, but it really is not.

Chunking starts when you sync the data source.

Fetch the files to ingest
The files to be chunked are fetched from S3
(Optional) Chunking
If a chunking strategy is specified, that chunking is performed

Store the custom chunking input
JSON in the following format is stored in S3

knowledge-bases-amazon-bedrock-advanced-rag-capabilities.txt_1.JSON

{
    "fileContents": [
        {
            "contentType": "PLAIN_TEXT",
            "contentMetadata": {},
            "contentBody": "Knowledge Bases for Amazon Bedrock now supports advanced RAG capabilities\r Posted on: Jul 10, 2024\r Knowledge Bases for Amazon Bedrock is a fully managed Retrieval-Augmented Generation (RAG) capability that allows you to connect foundation models (FMs) to internal company data sources to deliver relevant and accurate responses. Chunking allows processing long documents by breaking them into smaller chunks, enabling accurate knowledge retrieval from a user’s question. Today, we are launching advanced chunking options. The first is custom chunking. With this, customers can write their own chunking code as a Lambda function, and even use off the shelf components from frameworks like LangChain and LlamaIndex. Additionally, we are launching built-in chunking options such as semantic and hierarchical chunking.\r \r Additionally, customers can enable smart parsing to extract information from more complex data such as tables. This capability uses Amazon Bedrock foundation models to parse tabular content in file formats such as PDF to improve retrieval accuracy. You can customize parsing prompts to extract data in the format of your choice. Knowledge Bases now also supports query reformulation. This capability breaks down queries into simpler sub-queries, retrieves relevant information for each, and combines the results into a final comprehensive answer. With these new accuracy improvements for chunking, parsing, and advanced query handling, Knowledge Bases empowers users to build highly accurate and relevant knowledge resources suited for enterprise use cases.\r \r These capabilities are supported in the all AWS Regions where Knowledge Bases is available."
        },
        {
            "contentType": "PLAIN_TEXT",
            "contentMetadata": {},
            "contentBody": "With these new accuracy improvements for chunking, parsing, and advanced query handling, Knowledge Bases empowers users to build highly accurate and relevant knowledge resources suited for enterprise use cases.\r \r These capabilities are supported in the all AWS Regions where Knowledge Bases is available. To learn more about these features and how to get started, refer to the Knowledge Bases for Amazon Bedrock documentation and visit the Amazon Bedrock console."
        }
    ]
}

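The text is not long, but following the default chunking settings it has been split into two chunks.

Invoke the custom chunking process
Bedrock invokes the Lambda function

The event passed at invocation contains the following.

"originalFileLocation" holds the path of the S3 object being ingested.
"contentBatches" holds the path of the S3 object with the chunking input, which was created before the Lambda was invoked.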

event

{
    "version": "1.0",
    "knowledgeBaseId": "VYRECPTAFA",
    "dataSourceId": "XJG3M5DC8O",
    "ingestionJobId": "QYIM1V2ZJ7",
    "bucketName": "bedrock-kb-custom-transformation",
    "priorTask": "CHUNKING",
    "inputFiles": [
        {
            "contentBatches": [
                {
                    "key": "aws/bedrock/knowledge_bases/VYRECPTAFA/XJG3M5DC8O/QYIM1V2ZJ7/knowledge-bases-amazon-bedrock-advanced-rag-capabilities.txt_1.JSON"
                }
            ],
            "originalFileLocation": {
                "type": "S3",
                "s3_location": {
                    "uri": "s3://bedrock-637423213562/knowledge-base-quick-start-p3ig1-data-source/knowledge-bases-amazon-bedrock-advanced-rag-capabilities.txt"
                }
            }
        }
    ]
}
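To make the event structure concrete, the sketch below walks an event of the shape shown above and pairs each original file URI with its content batch keys. The function name `list_batches` is my own; the field names are taken from the sample event.

```python
def list_batches(event):
    """Return (original file URI, content batch key) pairs found in the event."""
    pairs = []
    for input_file in event["inputFiles"]:
        uri = input_file["originalFileLocation"]["s3_location"]["uri"]
        # One original file can reference several content batches.
        for batch in input_file["contentBatches"]:
            pairs.append((uri, batch["key"]))
    return pairs
```

Note that the batch keys are relative to the bucket named in `bucketName`, while the original file location is a full `s3://` URI.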

With hierarchical chunking

Event

event

{
    "fileContents": [
        {
            "contentType": "PLAIN_TEXT",
            "contentMetadata": {
                "x-amz-bedrock-kb-chunk-parent": "parent chunk text"
            },
            "contentBody": "child chunk text"
        }
    ]
}
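If the Lambda needs to know which child chunks share a parent, it can read the `x-amz-bedrock-kb-chunk-parent` key shown above. This is a minimal sketch that assumes every record from hierarchical chunking carries that key as in the sample; records without it are grouped under `None`. The helper name `group_by_parent` is hypothetical.

```python
from collections import defaultdict

PARENT_KEY = "x-amz-bedrock-kb-chunk-parent"


def group_by_parent(file_contents):
    """Map each parent chunk text to the list of its child chunk bodies."""
    groups = defaultdict(list)
    for chunk in file_contents:
        # Records without the parent key (an assumption: non-hierarchical input)
        # end up under the None key.
        parent = chunk.get("contentMetadata", {}).get(PARENT_KEY)
        groups[parent].append(chunk["contentBody"])
    return dict(groups)
```

When splitting child chunks further, copying `contentMetadata` to each new record (as the Lambda at the end of this article does) carries the parent text along.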

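Fetch the target file
Only path information is passed to the Lambda, so fetch the file using Boto3 or similar
Run custom chunking
This time I split on the line-break character (\r in this file).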

Store the file produced by custom chunking
Save the chunked result to S3 as JSON
The JSON format is the same as that of the chunking input file that was passed in

If needed, metadata can also be attached.

knowledge-bases-amazon-bedrock-advanced-rag-capabilities.txt_1.JSON_out_0

{
    "fileContents": [
        {
            "contentMetadata": {},
            "contentBody": "Knowledge Bases for Amazon Bedrock now supports advanced RAG capabilities",
            "contentType": "PLAIN_TEXT"
        },
        {
            "contentMetadata": {},
            "contentBody": " Posted on: Jul 10, 2024",
            "contentType": "PLAIN_TEXT"
        },
        {
            "contentMetadata": {},
            "contentBody": " Knowledge Bases for Amazon Bedrock is a fully managed Retrieval-Augmented Generation (RAG) capability that allows you to connect foundation models (FMs) to internal company data sources to deliver relevant and accurate responses. Chunking allows processing long documents by breaking them into smaller chunks, enabling accurate knowledge retrieval from a user\u2019s question. Today, we are launching advanced chunking options. The first is custom chunking. With this, customers can write their own chunking code as a Lambda function, and even use off the shelf components from frameworks like LangChain and LlamaIndex. Additionally, we are launching built-in chunking options such as semantic and hierarchical chunking.",
            "contentType": "PLAIN_TEXT"
        },
        {
            "contentMetadata": {},
            "contentBody": " ",
            "contentType": "PLAIN_TEXT"
        },
        {
            "contentMetadata": {},
            "contentBody": " Additionally, customers can enable smart parsing to extract information from more complex data such as tables. This capability uses Amazon Bedrock foundation models to parse tabular content in file formats such as PDF to improve retrieval accuracy. You can customize parsing prompts to extract data in the format of your choice. Knowledge Bases now also supports query reformulation. This capability breaks down queries into simpler sub-queries, retrieves relevant information for each, and combines the results into a final comprehensive answer. With these new accuracy improvements for chunking, parsing, and advanced query handling, Knowledge Bases empowers users to build highly accurate and relevant knowledge resources suited for enterprise use cases.",
            "contentType": "PLAIN_TEXT"
        },
        {
            "contentMetadata": {},
            "contentBody": " ",
            "contentType": "PLAIN_TEXT"
        },
        {
            "contentMetadata": {},
            "contentBody": " These capabilities are supported in the all AWS Regions where Knowledge Bases is available.",
            "contentType": "PLAIN_TEXT"
        },
        {
            "contentMetadata": {},
            "contentBody": "With these new accuracy improvements for chunking, parsing, and advanced query handling, Knowledge Bases empowers users to build highly accurate and relevant knowledge resources suited for enterprise use cases.",
            "contentType": "PLAIN_TEXT"
        },
        {
            "contentMetadata": {},
            "contentBody": " ",
            "contentType": "PLAIN_TEXT"
        },
        {
            "contentMetadata": {},
            "contentBody": " These capabilities are supported in the all AWS Regions where Knowledge Bases is available. To learn more about these features and how to get started, refer to the Knowledge Bases for Amazon Bedrock documentation and visit the Amazon Bedrock console.",
            "contentType": "PLAIN_TEXT"
        }
    ]
}
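The output above contains several chunks whose body is a single space, because the source text has blank lines between paragraphs. If such chunks are unwanted, the split step could skip them. The following is a minimal sketch under that assumption; the function name `split_and_drop_blank` is my own, and unlike the Lambda used in this test it also strips leading and trailing whitespace from each chunk.

```python
def split_and_drop_blank(file_contents, sep="\r"):
    """Split each chunk body on `sep` and drop pieces that are only whitespace."""
    result = []
    for chunk in file_contents:
        for piece in chunk["contentBody"].split(sep):
            text = piece.strip()
            if not text:
                continue  # skip empty or whitespace-only pieces
            result.append(
                {
                    "contentMetadata": chunk["contentMetadata"],
                    "contentBody": text,
                    "contentType": chunk["contentType"],
                }
            )
    return result
```

This does not remove the text that default chunking duplicated across the two input chunks; that overlap would need separate handling.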

Return the custom chunking completion response
When processing finishes, return a Lambda response in the following format

Lambda response

{
  "outputFiles": [
    {
      "originalFileLocation": {
        "type": "S3",
        "s3_location": {
          "uri": "s3://bedrock-637423213562/knowledge-base-quick-start-p3ig1-data-source/knowledge-bases-amazon-bedrock-advanced-rag-capabilities.txt"
        }
      },
      "fileMetadata": {},
      "contentBatches": [
        {
          "key": "aws/bedrock/knowledge_bases/VYRECPTAFA/XJG3M5DC8O/N6LM5LYABW/knowledge-bases-amazon-bedrock-advanced-rag-capabilities.txt_1.JSON_out_0"
        }
      ]
    }
  ]
}
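The response mirrors the incoming event: each input file becomes one entry in `outputFiles`, with `contentBatches` pointing at the files the Lambda wrote. As a sketch of that mapping, the function below builds the response from an event without touching S3. The name `build_response` is hypothetical, and the `_out_` suffix simply follows the key naming used in this test, not a required convention.

```python
def build_response(event, suffix="_out_"):
    """Build the Lambda response, deriving each output key from its input key."""
    output_files = []
    for input_file in event["inputFiles"]:
        output_files.append(
            {
                "originalFileLocation": input_file["originalFileLocation"],
                "fileMetadata": input_file.get("fileMetadata", {}),
                "contentBatches": [
                    # Output key = input key + suffix + index of the batch.
                    {"key": f"{batch['key']}{suffix}{n}"}
                    for n, batch in enumerate(input_file["contentBatches"])
                ],
            }
        )
    return {"outputFiles": output_files}
```

Keeping this step free of I/O makes it easy to check the response shape locally before deploying.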

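Fetch the file produced by custom chunking
Bedrock fetches the stored file
Registration
The custom-chunked content is registered in the vector database

With this, the content is registered in the data source using your own chunking.

Lambda source used for the test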

import json
import boto3

s3 = boto3.resource("s3")


def lambda_handler(event, context):
    # Log the incoming event for debugging.
    print(event)

    # Intermediate bucket that holds the chunking input and output files.
    bucket = s3.Bucket(event["bucketName"])

    response_event = {"outputFiles": []}

    for file in event["inputFiles"]:

        output_file = {
            "originalFileLocation": file["originalFileLocation"],
            "fileMetadata": file.get("fileMetadata", {}),
            "contentBatches": [],
        }
        response_event["outputFiles"].append(output_file)

        for n, content in enumerate(file["contentBatches"]):

            content_input_key = content["key"]
            content_object = bucket.Object(content_input_key)

            content_text = content_object.get()["Body"].read().decode("utf-8")
            content_json = json.loads(content_text)

            print(content_json)

            output_file_content = []
            for file_content in content_json["fileContents"]:
                body = file_content["contentBody"]

                for split_body in body.split("\r"):  # split on line breaks (\r)
                    output_file_content.append(
                        {
                            "contentMetadata": file_content["contentMetadata"],
                            "contentBody": split_body,
                            "contentType": file_content["contentType"],
                        }
                    )

            output = {"fileContents": output_file_content}

            content_output_key = f"{content_input_key}_out_{n}"
            content_output_object = bucket.Object(content_output_key)

            content_output_object.put(
                Body=json.dumps(output).encode("utf-8"),
                ContentEncoding="utf-8",
                ContentType="application/json",
            )

            output_file["contentBatches"].append({"key": content_output_key})

    return response_event
