[Amazon Kendra] How to manage Web Crawler connector v2.0 as IaC

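Introduction

This is the Day 21 article of the Japan AWS Jr. Champions Advent Calendar!

As of 2023/12/21, Web Crawler connector v2.0, a data source for Amazon Kendra, is not supported by CDK, CloudFormation, or the SDK. (Web Crawler connector v1.0 is already supported.)

It therefore has to be created with either the CLI or the Management Console. This article uses the CLI, since that approach can be managed as code.

Kendra is a paid service, so this is also recommended for anyone who finds it tedious to create and delete the resources by hand every time.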

Creating the index

  • Index creation is supported by CDK, CloudFormation, and the SDK, but the CLI is used here because it makes the index ID easy to obtain.
$ aws kendra create-index --name index_name --role-arn role_arn_for_index
{
    "Id": "INDEX_ID"
}
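
Every later command needs the index ID, so it can help to capture it in a shell variable instead of copying it from the JSON output. The following is a minimal sketch, assuming bash and the AWS CLI's global `--query` and `--output` options. The function names `create_index` and `wait_for_index` are hypothetical helpers, not part of the CLI. The polling loop assumes the index should reach the `ACTIVE` status reported by `describe-index` before a data source is added.

```shell
#!/usr/bin/env bash
# Sketch only: create_index and wait_for_index are hypothetical helper names.

# Create an index and print only its ID.
create_index() {
  local name="$1" role_arn="$2"
  # --query/--output are global AWS CLI options; here they reduce the
  # JSON response {"Id": "..."} to the bare ID string.
  aws kendra create-index \
    --name "$name" \
    --role-arn "$role_arn" \
    --query 'Id' \
    --output text
}

# Poll describe-index until the index is ACTIVE (assumed prerequisite
# for creating a data source). Returns non-zero on FAILED or DELETING.
wait_for_index() {
  local index_id="$1" interval="${2:-60}" status
  while true; do
    status="$(aws kendra describe-index --id "$index_id" \
      --query 'Status' --output text)"
    case "$status" in
      ACTIVE) return 0 ;;
      FAILED|DELETING)
        echo "index $index_id is in state $status" >&2
        return 1 ;;
    esac
    sleep "$interval"
  done
}

# Usage (placeholders as in the article):
#   INDEX_ID="$(create_index index_name role_arn_for_index)"
#   wait_for_index "$INDEX_ID"
```

Defining functions keeps the snippet free of side effects until it is called, which makes it easier to reuse in a larger setup script.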

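Creating the data source (Web Crawler connector v2.0)

  • Replace INDEX_ID, ACCOUNT_ID, and ROLE_NAME with the appropriate values.
  • To set fieldMappings for webPage, the fields must be defined beforehand in the index's Facet definition.
    • Fields whose names start with _ are created by default, so they need no extra setup.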
$ aws kendra create-data-source \
--cli-input-json file://template.json \
--type TEMPLATE \
--index-id {INDEX_ID}

template.json
{
  "Name": "data_source_name",
  "Configuration": {
    "TemplateConfiguration": {
      "Template": {
        "connectionConfiguration": {
          "repositoryEndpointMetadata": {
            "s3SeedUrl": null,
            "siteMapUrls": [
              "https://example.com/sitemap.xml"
            ],
            "seedUrlConnections": null,
            "s3SiteMapUrl": null,
            "authentication": "NoAuthentication"
          }
        },
        "syncMode": "FORCED_FULL_CRAWL",
        "additionalProperties": {
          "rateLimit": "300",
          "maxFileSize": "50",
          "crawlDepth": "0",
          "crawlDomainsOnly": true,
          "crawlSubDomain": false,
          "crawlAllDomain": false,
          "inclusionURLCrawlPatterns": [],
          "exclusionURLCrawlPatterns": [],
          "inclusionURLIndexPatterns": [],
          "exclusionURLIndexPatterns": [],
          "inclusionFileIndexPatterns": [],
          "exclusionFileIndexPatterns": [],
          "proxy": {},
          "crawlAttachments": true,
          "honorRobots": true,
          "maxLinksPerUrl": "100"
        },
        "type": "WEBCRAWLERV2",
        "version": "1.0.0",
        "repositoryConfigurations": {
          "attachment": {
            "fieldMappings": [
              {
                "dataSourceFieldName": "category",
                "indexFieldName": "_category",
                "indexFieldType": "STRING"
              },
              {
                "dataSourceFieldName": "sourceUrl",
                "indexFieldName": "_source_uri",
                "indexFieldType": "STRING"
              }
            ]
          },
          "webPage": {
            "fieldMappings": [
              {
                "dataSourceFieldName": "category",
                "indexFieldName": "_category",
                "indexFieldType": "STRING"
              },
              {
                "dataSourceFieldName": "sourceUrl",
                "indexFieldName": "_source_uri",
                "indexFieldType": "STRING"
              },
              {
                "dataSourceFieldName": "string_list_field_name",
                "indexFieldName": "string_list_field",
                "indexFieldType": "STRING_LIST"
              },
              {
                "dataSourceFieldName": "string_field_name",
                "indexFieldName": "string_field",
                "indexFieldType": "STRING"
              }
            ]
          }
        }
      }
    }
  },
  "VpcConfiguration": {
    "SubnetIds": ["subnet-id_a", "subnet-id_b"],
    "SecurityGroupIds": ["sg-id_a", "sg-id_b"]
  },
  "Description": "description",
  "Schedule": "",
  "LanguageCode": "ja",
  "RoleArn": "role_arn_for_data_source",
  "CustomDocumentEnrichmentConfiguration": {
    "InlineConfigurations": [
      {
        "Condition": {
          "ConditionDocumentAttributeKey": "_source_uri",
          "Operator": "Contains",
          "ConditionOnValue": {
            "StringValue": "kendra"
          }
        },
        "Target": {
          "TargetDocumentAttributeKey": "_category",
          "TargetDocumentAttributeValueDeletion": false,
          "TargetDocumentAttributeValue": {
            "StringValue": "Kendra"
          }
        },
        "DocumentContentDeletion": false
      }
    ],
    "PreExtractionHookConfiguration": {
      "InvocationCondition": {
        "ConditionDocumentAttributeKey": "_source_uri",
        "Operator": "Contains",
        "ConditionOnValue": {
          "StringValue": "kendra"
        }
      },
      "LambdaArn": "lambda_arn_PreExtractionHookConfiguration",
      "S3Bucket": "s3_bucket"
    },
    "PostExtractionHookConfiguration": {
      "InvocationCondition": {
        "ConditionDocumentAttributeKey": "_source_uri",
        "Operator": "Contains",
        "ConditionOnValue": {
          "StringValue": "kendra"
        }
      },
      "LambdaArn": "lambda_arn_PostExtractionHookConfiguration",
      "S3Bucket": "s3_bucket"
    },
    "RoleArn": "role_arn_for_custom_document_enrichment"
  }
}
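
The webPage mappings above point at two custom index fields, `string_field` and `string_list_field`, which have to exist in the index before the data source is created. One possible way to register them from the CLI, instead of the console's Facet definition screen, is `update-index` with `--document-metadata-configuration-updates`. This is a sketch under assumptions: `add_index_fields` is a hypothetical helper name, and the `Search` settings are only an example choice.

```shell
#!/usr/bin/env bash
# Sketch only: add_index_fields is a hypothetical helper name.

# Register the custom fields used by the webPage fieldMappings.
# Fields starting with "_" (e.g. _category, _source_uri) already exist.
add_index_fields() {
  local index_id="$1"
  aws kendra update-index \
    --id "$index_id" \
    --document-metadata-configuration-updates '[
      {
        "Name": "string_field",
        "Type": "STRING_VALUE",
        "Search": {"Facetable": true, "Displayable": true}
      },
      {
        "Name": "string_list_field",
        "Type": "STRING_LIST_VALUE",
        "Search": {"Facetable": true, "Displayable": true}
      }
    ]'
}

# Usage:
#   add_index_fields "$INDEX_ID"
```

Note that the index-side type names (`STRING_VALUE`, `STRING_LIST_VALUE`) differ from the `indexFieldType` values (`STRING`, `STRING_LIST`) used inside the template.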

If the default values are acceptable, a JSON file like the following is enough.

minimum_sample.json
{
    "Name": "data_source_name",
    "Configuration": {
      "TemplateConfiguration": {
        "Template": {
          "connectionConfiguration": {
            "repositoryEndpointMetadata": {
              "siteMapUrls": [
                "https://example.com/sitemap.xml"
              ]
            }
          },
          "syncMode": "FORCED_FULL_CRAWL",
          "additionalProperties": {
            "rateLimit": "300",
            "maxFileSize": "50",
            "crawlDepth": "0",
            "crawlDomainsOnly": true,
            "crawlSubDomain": false,
            "crawlAllDomain": false,
            "honorRobots": true,
            "maxLinksPerUrl": "100"
          },
          "type": "WEBCRAWLERV2"
        }
      }
    },
    "RoleArn": "role_arn_for_data_source"
  }
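
Either JSON file can be passed to `create-data-source`, and the new data source's ID is needed for later `describe-data-source` calls. A small wrapper that returns only that ID might look like the sketch below. The function name `create_data_source` is a hypothetical helper; the flags are the same ones used in the command earlier in this article, plus the global `--query` and `--output` options.

```shell
#!/usr/bin/env bash
# Sketch only: create_data_source is a hypothetical helper name.

# Create a TEMPLATE-type data source from a JSON file and print its ID.
create_data_source() {
  local index_id="$1" json_file="$2"
  # Explicit options such as --type and --index-id take precedence over
  # values in the --cli-input-json file, so the file can stay generic.
  aws kendra create-data-source \
    --cli-input-json "file://$json_file" \
    --type TEMPLATE \
    --index-id "$index_id" \
    --query 'Id' \
    --output text
}

# Usage:
#   DATA_SOURCE_ID="$(create_data_source "$INDEX_ID" minimum_sample.json)"
```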

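If you are unsure about the data source settings, create a data source manually and then run the describe-data-source CLI command to check them.
Writing the output to a JSON file while you are at it makes later creation easier.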

$ aws kendra describe-data-source --id DATA_SOURCE_ID --index-id INDEX_ID > sample.json
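
The raw `describe-data-source` output also contains response-only fields such as `Id`, `Status`, `CreatedAt`, and `UpdatedAt`, which `create-data-source` does not take as input. If the exported file is meant to be fed straight back into `--cli-input-json`, one option is to keep only the reusable fields with a JMESPath `--query`. This is a sketch: `export_data_source` is a hypothetical helper name, and the field selection assumes those five fields are set on the data source. Optional fields such as `Description` or `VpcConfiguration` can be added to the projection when they are in use.

```shell
#!/usr/bin/env bash
# Sketch only: export_data_source is a hypothetical helper name.

# Print a data source definition reduced to fields that
# create-data-source accepts as input.
export_data_source() {
  local index_id="$1" data_source_id="$2"
  aws kendra describe-data-source \
    --index-id "$index_id" \
    --id "$data_source_id" \
    --query '{Name: Name, Type: Type, Configuration: Configuration, RoleArn: RoleArn, LanguageCode: LanguageCode}' \
    --output json
}

# Usage:
#   export_data_source INDEX_ID DATA_SOURCE_ID > template.json
```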

Deletion

$ aws kendra delete-index --id INDEX_ID
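
If only the data source should be removed while the index is kept, for example to recreate it from an updated JSON file, `delete-data-source` can be used instead. This is a minimal sketch, and `delete_data_source` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: delete_data_source is a hypothetical helper name.

# Delete a single data source and leave the index in place.
delete_data_source() {
  local index_id="$1" data_source_id="$2"
  aws kendra delete-data-source \
    --index-id "$index_id" \
    --id "$data_source_id"
}

# Usage:
#   delete_data_source INDEX_ID DATA_SOURCE_ID
```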
