
[Amazon Kendra] How to use Field Mapping, added in Web Crawler connector v2.0

Posted at 2023-11-27

Field Mapping is one of the features added in "Web Crawler connector v2.0" (hereafter Crawler v2), an Amazon Kendra data source.

What it can do

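It lets you pull any meta tag out of the HTML, register it as an Index Field, and use it.
- You define the mapping when you create the data source, and at the same time the field is also created under "Index Field", which you can check from the index's "Facet definition".
  - Note that the name and data type of a created "Index Field" can be neither deleted nor changed (as of 2023/11/28).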

With the settings below, crawling sample_string_site.html registers the String value "Amazon Kendra" under the Field Name "Servise" in the Index Field.

On the Kendra Crawler v2 create or edit screen, use "Add field" under Step 4, "Set field mappings".

sample_string_site.html

.
.
<head>

  <meta name="keyword" content="Amazon Kendra"> 
  
  <title>Document</title>
</head>
.
.
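To make the idea concrete, here is a minimal local sketch of what a field mapping does conceptually: read a meta tag by name and store its content under an index field name. This is only an illustration in plain Python, not how Kendra implements it. The names `MetaTagCollector` and `apply_field_mapping` are hypothetical, and the mapping from the meta name `keyword` to the field `Servise` is my assumption based on the sample above.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta name=... content=...> pairs, keeping the first value per name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        # Keep only the first tag seen for a given name (illustrative choice).
        if name is not None and content is not None and name not in self.meta:
            self.meta[name] = content


def apply_field_mapping(html, mapping):
    """Return {index_field_name: value} for each mapped meta tag found in html."""
    collector = MetaTagCollector()
    collector.feed(html)
    return {
        index_field: collector.meta[meta_name]
        for meta_name, index_field in mapping.items()
        if meta_name in collector.meta
    }


SAMPLE_HTML = """
<head>
  <meta name="keyword" content="Amazon Kendra">
  <title>Document</title>
</head>
"""

# Assumed mapping: meta name "keyword" -> Index Field "Servise".
print(apply_field_mapping(SAMPLE_HTML, {"keyword": "Servise"}))
```

The dictionary it prints pairs the field name with the meta tag's content, which is the shape of the result the console mapping is expected to produce.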

Date data type

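The Crawler v2 create and edit screens do not offer Date as a data type (as of 2023/11/28).
You therefore need to create a Date-type Index Field in advance, register the value in a temporary String-type field, and then re-register it into the Date-type Index Field with Custom Document Enrichment (hereafter CDE).
- Crawler v2 keeps only the contents of the body tag as index information, so the value first has to be pulled from the meta tag with Field Mapping to make it available inside CDE.
- Although the type is called Date, a value in ISO 8601 format can be registered down to the minute, and the Query API can filter at minute granularity or finer.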
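As a rough sketch of the minute-level filtering mentioned above, the following builds an `AttributeFilter` for the Query API that selects a one-minute window on a Date field. The field name `sample_datetime` matches the sample later in this article. The helper names `build_datetime_filter` and `query_by_datetime` are hypothetical, and the query function is defined but not called here because it needs real AWS credentials and an index ID.

```python
from datetime import datetime, timezone


def build_datetime_filter(field_name, start, end):
    """Build an AttributeFilter selecting documents with start <= field < end."""
    return {
        "AndAllFilters": [
            {"GreaterThanOrEquals": {"Key": field_name, "Value": {"DateValue": start}}},
            {"LessThan": {"Key": field_name, "Value": {"DateValue": end}}},
        ]
    }


def query_by_datetime(index_id, query_text, attribute_filter):
    """Run the query. boto3 is imported here so the filter can be built without it."""
    import boto3

    kendra = boto3.client("kendra")
    return kendra.query(
        IndexId=index_id, QueryText=query_text, AttributeFilter=attribute_filter
    )


# One-minute window around the timestamp used in the sample page.
window = build_datetime_filter(
    "sample_datetime",
    datetime(2023, 11, 7, 12, 30, tzinfo=timezone.utc),
    datetime(2023, 11, 7, 12, 31, tzinfo=timezone.utc),
)
print(window)
```

Passing timezone-aware `datetime` objects keeps the comparison unambiguous. Whether this exact window matches your documents depends on how the values were registered.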
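Below are a sample site and a sample Lambda that performs the CDE processing.
- The Date-type Index Field on the index and the temporary String-type Field Mapping on the data source are omitted.
- The value of tmp_sample_datetime, which temporarily holds the date and time, is registered into the Date-type sample_datetime.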
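For reference, the omitted Date-type Index Field could be created in advance with boto3 along these lines. This is a sketch under assumptions: the `Search` flags are only an example to adjust to your needs, `build_date_field` and `add_index_fields` are hypothetical helper names, and the function that calls AWS is defined but not invoked.

```python
def build_date_field(name):
    """Return a DocumentMetadataConfiguration entry for a Date-type index field."""
    return {
        "Name": name,
        "Type": "DATE_VALUE",
        # Example flags only; choose what your index needs.
        "Search": {"Displayable": True, "Sortable": True},
    }


def add_index_fields(index_id, fields):
    """Add the given field definitions to an existing index (needs AWS credentials)."""
    import boto3

    kendra = boto3.client("kendra")
    return kendra.update_index(
        Id=index_id, DocumentMetadataConfigurationUpdates=fields
    )


date_field = build_date_field("sample_datetime")
print(date_field)
```

Since a created Index Field cannot be renamed or retyped afterwards, it is worth double-checking the name and type before calling `add_index_fields`.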

sample_datetime_site.html

.
.
<head>

  <meta name="test_datetime" content="2023-11-07T12:30:00Z"> 
  
  <title>Document</title>
</head>
.
.

sample_PreExtractionHookConfiguration.py

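def lambda_handler(event, context):
    # Attributes already attached to the document, including Field Mapping values.
    attributes = event.get("metadata", {}).get("attributes") or []
    updatedMeta = []

    for attribute in attributes:
        # Copy the temporary String field into the Date-type Index Field.
        if attribute["name"] == "tmp_sample_datetime":
            updatedMeta.append(
                {
                    "name": "sample_datetime",
                    "value": {"dateValue": attribute["value"].get("stringValue")},
                }
            )

    # Leave the document itself untouched and return only the metadata updates.
    return {
        "version": "v0",
        "s3ObjectKey": event.get("s3ObjectKey"),
        "metadataUpdates": updatedMeta,
    }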

This gets the value registered for now, but the stored value is a StringValue rather than a DateValue, which still bothers me and remains an open issue.
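A Lambda like the one above is attached to the data source as a CDE pre-extraction hook. The following is a minimal boto3 sketch of that step, not a record of my exact setup. The ARNs and bucket name are placeholders, `build_cde_configuration` and `attach_cde` are hypothetical helper names, and the function that calls AWS is defined but not invoked.

```python
def build_cde_configuration(lambda_arn, s3_bucket, role_arn):
    """Build a CustomDocumentEnrichmentConfiguration with a pre-extraction hook."""
    return {
        "PreExtractionHookConfiguration": {
            "LambdaArn": lambda_arn,
            "S3Bucket": s3_bucket,
        },
        "RoleArn": role_arn,
    }


def attach_cde(index_id, data_source_id, cde_configuration):
    """Attach the CDE configuration to a data source (needs AWS credentials)."""
    import boto3

    kendra = boto3.client("kendra")
    return kendra.update_data_source(
        Id=data_source_id,
        IndexId=index_id,
        CustomDocumentEnrichmentConfiguration=cde_configuration,
    )


# Placeholder values; replace them with your own resources.
example = build_cde_configuration(
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
    "BUCKET_NAME",
    "arn:aws:iam::ACCOUNT_ID:role/ROLE_NAME",
)
print(example)
```

The role passed as `RoleArn` needs permission to invoke the Lambda and to access the bucket.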

String List data type

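When several meta tags share the same name, Crawler v2 picks up only the first one (see below). So, as in the following example, you need to separate the values with commas, make the value available in CDE, and then update it again in Lambda.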

sample_string_list_site.html

.
.
<head>

  <meta name="test_string_list" content="test_string_list_1,test_string_list_2" />
  
  <title>Document</title>
</head>
.
.

sample_PreExtractionHookConfiguration.py

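def lambda_handler(event, context):
    # Attributes already attached to the document, including Field Mapping values.
    attributes = event.get("metadata", {}).get("attributes") or []
    updatedMeta = []

    for attribute in attributes:
        if attribute["name"] == "string_list":
            # The first element holds the comma-separated string; split it into items.
            updatedMeta.append(
                {
                    "name": "categories",
                    "value": {
                        "stringListValue": attribute["value"]
                        .get("stringListValue")[0]
                        .split(",")
                    },
                }
            )

    # Leave the document itself untouched and return only the metadata updates.
    return {
        "version": "v0",
        "s3ObjectKey": event.get("s3ObjectKey"),
        "metadataUpdates": updatedMeta,
    }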

Without the CDE processing, the comma-separated value is not split into separate items.
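The difference can be checked locally with a small sketch. The attribute shape below is my assumption of what the Lambda receives (a string list with one unsplit element), and `split_string_list` is a hypothetical helper that also trims whitespace and drops empty entries, which the Lambda above does not do.

```python
def split_string_list(raw_values):
    """Split every comma-separated entry, trimming whitespace and dropping empties."""
    return [
        part.strip()
        for raw in raw_values
        for part in raw.split(",")
        if part.strip()
    ]


# Assumed attribute as handed to the Lambda: one unsplit, comma-separated element.
crawled = {
    "name": "string_list",
    "value": {"stringListValue": ["test_string_list_1,test_string_list_2"]},
}

before = crawled["value"]["stringListValue"]
after = split_string_list(before)
print(before)  # one element containing a comma
print(after)   # two separate elements
```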

When multiple meta tags share the same name

Crawling with the same Field Mapping picks up only the first tag.

sample_string_list_site_case.html

.
.
<head>

  <meta name="test_string_list" content="test_string_list_1" />
  <meta name="test_string_list" content="test_string_list_2" />
  
  <title>Document</title>
</head>
.
.
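If you control how the pages are generated, one way to avoid duplicate tags is to render all values into a single comma-separated meta tag, which is the form the String List example expects. This is a minimal sketch. `render_list_meta` is a hypothetical helper, and it rejects values that themselves contain a comma because they would be split incorrectly later.

```python
from html import escape


def render_list_meta(name, values):
    """Render one <meta> tag whose content joins the values with commas."""
    for value in values:
        if "," in value:
            raise ValueError(f"value must not contain a comma: {value!r}")
    joined = ",".join(values)
    return f'<meta name="{escape(name)}" content="{escape(joined)}" />'


print(render_list_meta("test_string_list", ["test_string_list_1", "test_string_list_2"]))
```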

Other patterns I tested that did not work

Pattern 1

<meta
  name="test_string_list"
  content="['test_string_list_1','test_string_list_2']"
/>

Pattern 2

<meta
  name="test_string_list"
  content="[test_string_list_1,test_string_list_2]"
/>

