[Amazon Kendra] How to manage Web Crawler connector v2.0 as IaC

Posted at 2023-12-20

Introduction

This is the Day 21 article of the Japan AWS Jr. Champions Advent Calendar!

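As of 2023/12/21, Web Crawler connector v2.0, a data source for Amazon Kendra, is not supported by CDK, CloudFormation, or the SDK. (Web Crawler connector v1.0 is already supported.)

It therefore has to be created with either the CLI or the Management Console. This article uses the CLI, because CLI commands can be managed as code.

Kendra is a paid service, so this approach is also recommended if you find it tedious to create and delete the resources by hand every time.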

Creating the index

Index creation is supported by CDK, CloudFormation, and the SDK, but this article uses the CLI because it makes the index ID easy to obtain.

$ aws kendra create-index --name index_name --role-arn role_arn_for_index
{
    "Id": "INDEX_ID"
}
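
Index creation is asynchronous, so it can help to capture the ID in a shell variable and wait until the index is ready before creating the data source. The following is a minimal sketch, not part of the original procedure. It relies on the AWS CLI global options `--query` and `--output text`; the helper name `wait_for_index_active` and the 30-second polling interval are assumptions made for illustration.

```shell
# Hypothetical helper: poll describe-index until the index is ACTIVE.
# $1 = index ID, $2 = polling interval in seconds (assumed default: 30).
wait_for_index_active() {
  local index_id="$1"
  local interval="${2:-30}"
  local status

  while true; do
    # --query / --output text print only the Status field as plain text.
    status="$(aws kendra describe-index --id "$index_id" \
      --query 'Status' --output text)" || return 1

    case "$status" in
      ACTIVE) return 0 ;;
      FAILED)
        echo "Index $index_id is in FAILED state" >&2
        return 1
        ;;
    esac
    sleep "$interval"
  done
}

# Example usage (same placeholders as the create-index command above):
#   INDEX_ID="$(aws kendra create-index --name index_name \
#     --role-arn role_arn_for_index --query 'Id' --output text)"
#   wait_for_index_active "$INDEX_ID"
```

With the ID held in `INDEX_ID`, later commands can reference the variable instead of a value pasted in by hand.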

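Creating the data source (Web Crawler connector v2.0)

Replace INDEX_ID, ACCOUNT_ID, and ROLE_NAME with appropriate values.
To set fieldMappings for webPage, the target fields must be defined beforehand in the index's Facet definition.
- Fields whose names start with _ are created by default, so you can ignore them.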

$ aws kendra create-data-source \
--cli-input-json file://template.json \
--type TEMPLATE \
--index-id INDEX_ID

template.json

{
  "Name": "data_source_name",
  "Configuration": {
    "TemplateConfiguration": {
      "Template": {
        "connectionConfiguration": {
          "repositoryEndpointMetadata": {
            "s3SeedUrl": null,
            "siteMapUrls": [
              "https://example.com/sitemap.xml"
            ],
            "seedUrlConnections": null,
            "s3SiteMapUrl": null,
            "authentication": "NoAuthentication"
          }
        },
        "syncMode": "FORCED_FULL_CRAWL",
        "additionalProperties": {
          "rateLimit": "300",
          "maxFileSize": "50",
          "crawlDepth": "0",
          "crawlDomainsOnly": true,
          "crawlSubDomain": false,
          "crawlAllDomain": false,
          "inclusionURLCrawlPatterns": [],
          "exclusionURLCrawlPatterns": [],
          "inclusionURLIndexPatterns": [],
          "exclusionURLIndexPatterns": [],
          "inclusionFileIndexPatterns": [],
          "exclusionFileIndexPatterns": [],
          "proxy": {},
          "crawlAttachments": true,
          "honorRobots": true,
          "maxLinksPerUrl": "100"
        },
        "type": "WEBCRAWLERV2",
        "version": "1.0.0",
        "repositoryConfigurations": {
          "attachment": {
            "fieldMappings": [
              {
                "dataSourceFieldName": "category",
                "indexFieldName": "_category",
                "indexFieldType": "STRING"
              },
              {
                "dataSourceFieldName": "sourceUrl",
                "indexFieldName": "_source_uri",
                "indexFieldType": "STRING"
              }
            ]
          },
          "webPage": {
            "fieldMappings": [
              {
                "dataSourceFieldName": "category",
                "indexFieldName": "_category",
                "indexFieldType": "STRING"
              },
              {
                "dataSourceFieldName": "sourceUrl",
                "indexFieldName": "_source_uri",
                "indexFieldType": "STRING"
              },
              {
                "dataSourceFieldName": "string_list_field_name",
                "indexFieldName": "string_list_field",
                "indexFieldType": "STRING_LIST"
              },
              {
                "dataSourceFieldName": "string_field_name",
                "indexFieldName": "string_field",
                "indexFieldType": "STRING"
              }
            ]
          }
        }
      }
    }
  },
  "VpcConfiguration": {
    "SubnetIds": ["subnet-id_a", "subnet-id_b"],
    "SecurityGroupIds": ["sg-id_a", "sg-id_b"]
  },
  "Description": "description",
  "Schedule": "",
  "LanguageCode": "ja",
  "RoleArn": "role_arn_for_data_source",
  "CustomDocumentEnrichmentConfiguration": {
    "InlineConfigurations": [
      {
        "Condition": {
          "ConditionDocumentAttributeKey": "_source_uri",
          "Operator": "Contains",
          "ConditionOnValue": {
            "StringValue": "kendra"
          }
        },
        "Target": {
          "TargetDocumentAttributeKey": "_category",
          "TargetDocumentAttributeValueDeletion": false,
          "TargetDocumentAttributeValue": {
            "StringValue": "Kendra"
          }
        },
        "DocumentContentDeletion": false
      }
    ],
    "PreExtractionHookConfiguration": {
      "InvocationCondition": {
        "ConditionDocumentAttributeKey": "_source_uri",
        "Operator": "Contains",
        "ConditionOnValue": {
          "StringValue": "kendra"
        }
      },
      "LambdaArn": "lambda_arn_PreExtractionHookConfiguration",
      "S3Bucket": "s3_bucket"
    },
    "PostExtractionHookConfiguration": {
      "InvocationCondition": {
        "ConditionDocumentAttributeKey": "_source_uri",
        "Operator": "Contains",
        "ConditionOnValue": {
          "StringValue": "kendra"
        }
      },
      "LambdaArn": "lambda_arn_PostExtractionHookConfiguration",
      "S3Bucket": "s3_bucket"
    },
    "RoleArn": "role_arn_for_custom_document_enrichment"
  }
}

If the default values are fine for you, a JSON file like the following also works.

minimum_sample.json

{
  "Name": "data_source_name",
  "Configuration": {
    "TemplateConfiguration": {
      "Template": {
        "connectionConfiguration": {
          "repositoryEndpointMetadata": {
            "siteMapUrls": [
              "https://example.com/sitemap.xml"
            ]
          }
        },
        "syncMode": "FORCED_FULL_CRAWL",
        "additionalProperties": {
          "rateLimit": "300",
          "maxFileSize": "50",
          "crawlDepth": "0",
          "crawlDomainsOnly": true,
          "crawlSubDomain": false,
          "crawlAllDomain": false,
          "honorRobots": true,
          "maxLinksPerUrl": "100"
        },
        "type": "WEBCRAWLERV2"
      }
    }
  },
  "RoleArn": "role_arn_for_data_source"
}
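
To reuse the same JSON files across indexes, the create command can be wrapped in a small function. This is only a sketch under stated assumptions: the function name `create_web_crawler_v2` is hypothetical, and `--query 'Id' --output text` is added so that the function prints just the data source ID from the response.

```shell
# Hypothetical wrapper around the create-data-source command shown earlier.
# $1 = index ID, $2 = path to the JSON file (e.g. template.json).
create_web_crawler_v2() {
  if [ "$#" -ne 2 ]; then
    echo "usage: create_web_crawler_v2 INDEX_ID CONFIG_FILE" >&2
    return 2
  fi
  local index_id="$1"
  local config_file="$2"

  # Prints only the ID of the newly created data source.
  aws kendra create-data-source \
    --cli-input-json "file://${config_file}" \
    --type TEMPLATE \
    --index-id "$index_id" \
    --query 'Id' --output text
}

# Example usage:
#   DATA_SOURCE_ID="$(create_web_crawler_v2 "$INDEX_ID" minimum_sample.json)"
```

Keeping the index ID as an argument rather than inside the JSON file means the file does not need to be edited each time the index is recreated.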

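If you are unsure how to configure the data source, create one manually first and then run the describe-data-source CLI command to check its settings.
Writing the output to a JSON file at the same time makes later creation easier.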

$ aws kendra describe-data-source --id DATA_SOURCE_ID --index-id INDEX_ID > sample.json

Deletion

$ aws kendra delete-index --id INDEX_ID
