Enterprise Generative AI Architecture with AWS CDK Custom Resources: Lessons from the GenU Implementation

Last updated at 2025-10-27 Posted at 2025-10-27

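Practical Patterns for AWS CDK Custom Resources: Learning Enterprise Generative AI Infrastructure from GenU

Introduction

Infrastructure is a central concern when you run generative AI applications in production. AWS CDK lets you manage that infrastructure as code, but you can still run into obstacles like these:

Unsupported new services: services such as Amazon Bedrock AgentCore that CloudFormation did not support when GenU adopted them
Limits of fine-grained control: detailed control over resource internals, such as OpenSearch Serverless index settings
Operational automation: deploy-time operational tasks, such as controlling when tags are applied

This article uses the implementation of GenU, AWS's reference architecture for generative AI use cases, to show practical custom resource patterns that address these problems.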

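GenU, the subject of this article, is an open-source project published by AWS.
https://github.com/aws-samples/generative-ai-use-cases

For GenU setup instructions, see the following blog post.
https://www.fsi.co.jp/blog/12335/

What you will learn

New-service pattern: dynamically provisioning a Bedrock AgentCore Runtime
Resource-internals pattern: customizing an OpenSearch index
Operations-automation pattern: automating tag management
Enterprise implementation: idempotency, error handling, and the Singleton pattern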

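What is a custom resource?

Limitations of CloudFormation

AWS CloudFormation (CFn) is a powerful service for managing infrastructure as code, but it has the following limitations:

Limitation	Example	Impact
Support lags behind new services	Bedrock AgentCore (at the time GenU implemented it)	Latest features are unavailable
No control over resource internals	OpenSearch Serverless index settings	Fine-grained customization is not possible
Hard to embed operational logic	Conditional tagging	No flexible processing at deploy time

How custom resources work

A custom resource extends CloudFormation so that it can call arbitrary AWS APIs or run custom logic. During a stack operation, CloudFormation sends a Create, Update, or Delete request to a handler (typically a Lambda function) and waits for the handler to report success or failure.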
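The request/response flow is easier to see as types. Below is a minimal sketch of the contract a handler implements behind the Provider framework. CloudFormation sends a `RequestType` plus properties. The handler returns a `PhysicalResourceId` and optional `Data`. The `dispatch` function and the `ResourceOps` callbacks are hypothetical names used for illustration, not GenU code.

```typescript
// Minimal model of the custom resource request and result
// (field names follow the CloudFormation custom resource request).
type RequestType = 'Create' | 'Update' | 'Delete';

interface CustomResourceEvent {
  RequestType: RequestType;
  ResourceProperties: Record<string, string>;
  OldResourceProperties?: Record<string, string>; // present on Update
  PhysicalResourceId?: string; // present on Update and Delete
}

interface CustomResourceResult {
  PhysicalResourceId: string;
  Data?: Record<string, string>;
}

// Hypothetical callbacks that do the real work for each request type.
interface ResourceOps {
  create(props: Record<string, string>): Promise<string>; // returns the new ID
  update(id: string, props: Record<string, string>): Promise<void>;
  remove(id: string): Promise<void>;
}

// Route each request type to the matching operation.
async function dispatch(
  event: CustomResourceEvent,
  ops: ResourceOps,
): Promise<CustomResourceResult> {
  switch (event.RequestType) {
    case 'Create': {
      const id = await ops.create(event.ResourceProperties);
      return { PhysicalResourceId: id, Data: { Id: id } };
    }
    case 'Update': {
      const id = event.PhysicalResourceId!;
      await ops.update(id, event.ResourceProperties);
      return { PhysicalResourceId: id }; // same ID: update in place
    }
    case 'Delete': {
      const id = event.PhysicalResourceId!;
      await ops.remove(id);
      return { PhysicalResourceId: id };
    }
  }
}
```

The patterns that follow are variations on this skeleton. They differ in what `create`, `update`, and `remove` actually call.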

Implementing custom resources in CDK

In AWS CDK, a custom resource is built from three components:

// packages/cdk/lib/construct/generic-agent-core.ts
import { CustomResource, Duration } from 'aws-cdk-lib';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Provider } from 'aws-cdk-lib/custom-resources';

// (inside a Construct's constructor)

// 1. Lambda function (performs the actual work)
const handler = new NodejsFunction(this, 'Handler', {
  entry: './custom-resources/handler.ts',
  handler: 'handler',
  timeout: Duration.minutes(10),
});

// 2. Provider (bridges CloudFormation and the Lambda function)
const provider = new Provider(this, 'Provider', {
  onEventHandler: handler,
});

// 3. CustomResource (declared as a CloudFormation resource)
new CustomResource(this, 'MyCustomResource', {
  serviceToken: provider.serviceToken,
  properties: {
    ResourceName: 'my-resource',
    Configuration: { key: 'value' },
  },
});

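What each component does:

Component	Role	Where it lives
Lambda function	Implements Create/Update/Delete handling	`custom-resources/*/index.ts`
Provider	Manages communication with CloudFormation	CDK construct
CustomResource	Declares the resource to CloudFormation	CDK construct

Pattern 1: Supporting a new service (AgentCore Runtime)

Why it is needed

Amazon Bedrock AgentCore is a new service, announced in preview in July 2025. It can run agent runtimes as containers. However, CloudFormation did not yet support it when GenU implemented this feature.

To use the service anyway, GenU provisions the AgentCore Runtime dynamically through a custom resource.

What AgentCore Runtime is:

Runs agent logic in Docker containers
Integrates with Bedrock APIs
A scalable foundation for generative AI applications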

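Overall structure

The AgentCore Runtime custom resource has three parts: a container image built by CDK, a custom resource Lambda function that calls the AgentCore control-plane API, and two IAM roles.

Why two IAM roles?

AgentCore Runtime uses two distinct IAM roles:

CustomResource Lambda Role:

Create, update, and delete the AgentCore Runtime
Pass the Runtime Role to AgentCore (iam:PassRole)

AgentCore Runtime Role:

Invoke Bedrock models
Access files in S3
Write to CloudWatch Logs

This separation follows the principle of least privilege. The custom resource Lambda function never needs the permissions used at run time, such as invoking Bedrock.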
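The two-role split can be written down as two IAM policy documents. This is a sketch built on assumptions: the action lists, the `Resource` values, and the `iam:PassedToService` value `bedrock-agentcore.amazonaws.com` are illustrative and are not copied from GenU. What matters is the shape. Only the runtime role grants `bedrock:InvokeModel`. The Lambda role can only manage the runtime and pass the runtime role.

```typescript
// Illustrative policy documents for the two roles described above.
// Action lists, resources, and the PassRole condition are assumptions.
interface Statement {
  Effect: 'Allow';
  Action: string[];
  Resource: string;
  Condition?: Record<string, Record<string, string>>;
}

interface PolicyDocument {
  Version: '2012-10-17';
  Statement: Statement[];
}

// Custom resource Lambda role: lifecycle calls plus PassRole only.
function customResourceLambdaPolicy(runtimeRoleArn: string): PolicyDocument {
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Action: [
          'bedrock-agentcore:CreateAgentRuntime',
          'bedrock-agentcore:UpdateAgentRuntime',
          'bedrock-agentcore:DeleteAgentRuntime',
        ],
        Resource: '*',
      },
      {
        // May hand over exactly one role, and only to the AgentCore service.
        Effect: 'Allow',
        Action: ['iam:PassRole'],
        Resource: runtimeRoleArn,
        Condition: {
          StringEquals: { 'iam:PassedToService': 'bedrock-agentcore.amazonaws.com' },
        },
      },
    ],
  };
}

// Runtime role: model calls, file access, and logging.
function runtimeRolePolicy(bucketArn: string): PolicyDocument {
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Action: ['bedrock:InvokeModel', 'bedrock:InvokeModelWithResponseStream'],
        Resource: '*',
      },
      { Effect: 'Allow', Action: ['s3:GetObject'], Resource: `${bucketArn}/*` },
      {
        Effect: 'Allow',
        Action: ['logs:CreateLogGroup', 'logs:CreateLogStream', 'logs:PutLogEvents'],
        Resource: '*',
      },
    ],
  };
}

// Every action a policy grants, for quick comparisons between roles.
function actionsOf(policy: PolicyDocument): Set<string> {
  return new Set(policy.Statement.flatMap((s) => s.Action));
}
```

Comparing `actionsOf(...)` for the two documents makes the separation visible. The invoke permissions appear only on the runtime side.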

Implementation files:

CDK: packages/cdk/lib/construct/generic-agent-core.ts
Lambda: packages/cdk/custom-resources/agent-core-runtime/index.ts

CDK implementation details

Creating the Docker image

// packages/cdk/lib/construct/generic-agent-core.ts

private createDockerImageAsset(): {
  repository: Repository;
  imageUri: string;
} {
  const dockerPath = 'lambda-python/generic-agent-core-runtime';
  
  // Create the ECR repository
  const repository = new Repository(this, 'AgentCoreRuntimeRepository', {
    repositoryName: `generic-agent-core-runtime-${Stack.of(this).stackName.toLowerCase()}`,
    imageScanOnPush: true,
    removalPolicy: RemovalPolicy.DESTROY,
  });

  // Build the Docker image; CDK pushes it to the bootstrap asset repository in ECR
  const dockerAsset = new DockerImageAsset(
    this,
    'AgentCoreRuntimeDockerAsset',
    {
      directory: path.join(__dirname, `../../${dockerPath}`),
      platform: Platform.LINUX_ARM64, // AgentCore Runtime requires ARM64 images
    }
  );

  return {
    repository,
    imageUri: dockerAsset.imageUri,
  };
}

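Key points:

Platform.LINUX_ARM64: AgentCore Runtime runs ARM64 containers
imageScanOnPush: true: enables image scanning on push for this repository
DockerImageAsset: CDK builds and pushes the image automatically. It pushes to the asset repository created by `cdk bootstrap`, not to the `repository` defined above. The scan setting therefore applies only to images that are pushed to that repository separately.

Lambda implementation

Create handling in detail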

// packages/cdk/custom-resources/agent-core-runtime/index.ts

import {
  BedrockAgentCoreControlClient,
  CreateAgentRuntimeCommand,
} from '@aws-sdk/client-bedrock-agentcore-control';

async function createAgentRuntime(
  client: BedrockAgentCoreControlClient,
  config: AgentCoreRuntimeConfig
): Promise<{ agentRuntimeId: string; agentRuntimeArn: string }> {
  console.log(`Creating AgentCore Runtime: ${config.name}`);

  const createParams = {
    agentRuntimeName: config.name,
    agentRuntimeArtifact: {
      containerConfiguration: {
        containerUri: config.containerImageUri,
      },
    },
    roleArn: config.roleArn,
    networkConfiguration: {
      networkMode: config.networkMode === 'DEFAULT' ? 'PUBLIC' : config.networkMode,
    },
    protocolConfiguration: {
      serverProtocol: config.serverProtocol,
    },
    ...(config.environmentVariables && {
      environmentVariables: config.environmentVariables,
    }),
  };

  const command = new CreateAgentRuntimeCommand(createParams);
  const response = await client.send(command);

  if (!response.agentRuntimeId || !response.agentRuntimeArn) {
    throw new Error('Failed to create AgentCore Runtime - missing ID or ARN in response');
  }

  return {
    agentRuntimeId: response.agentRuntimeId,
    agentRuntimeArn: response.agentRuntimeArn,
  };
}

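Implementation points:

Detailed logging: makes troubleshooting easier
Response validation: checks that the required fields are present
Error handling: fails with a clear error message

Why PhysicalResourceId matters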

// packages/cdk/custom-resources/agent-core-runtime/index.ts

return {
  PhysicalResourceId: agentRuntimeId, // ← this is the key part
  Data: {
    AgentCoreRuntimeId: agentRuntimeId,
    AgentCoreRuntimeArn: agentRuntimeArn,
  },
};

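What PhysicalResourceId does:

Scenario	PhysicalResourceId returned	Result
Initial Create	`runtime-abc123`	New resource created
Update (same ID)	`runtime-abc123`	Existing resource updated in place
Update (different ID)	`runtime-xyz789`	Treated as a replacement: the new resource is kept, and CloudFormation sends a Delete for the old ID during cleanup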
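The table reduces to one rule that CloudFormation applies to the ID returned by an Update. The sketch below models that rule. `interpretUpdate` is a hypothetical helper, and it calls no AWS API.

```typescript
// Models how CloudFormation treats the ID returned by an Update request.
// interpretUpdate is a hypothetical helper; no AWS API is called.
type UpdateOutcome =
  | { kind: 'in-place'; id: string }
  | { kind: 'replacement'; newId: string; deleteInCleanup: string };

function interpretUpdate(previousId: string, returnedId: string): UpdateOutcome {
  if (returnedId === previousId) {
    // Same ID: the existing resource was updated in place.
    return { kind: 'in-place', id: previousId };
  }
  // Different ID: the new resource is kept, and CloudFormation sends a
  // Delete request for the old ID during UPDATE_COMPLETE_CLEANUP_IN_PROGRESS.
  return { kind: 'replacement', newId: returnedId, deleteInCleanup: previousId };
}
```

A replacement always triggers a Delete for the old ID. Returning a new ID should therefore be a deliberate choice, not a side effect.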

Implementation caveats:

// packages/cdk/custom-resources/agent-core-runtime/index.ts

// ❌ Bad: returns a different ID every time
PhysicalResourceId: `runtime-${Date.now()}`
// → every Update replaces the resource

// ✅ Good: returns a unique, stable ID
PhysicalResourceId: response.agentRuntimeId
// → Update modifies the existing resource

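Pattern 2: Controlling resource internals (OpenSearch index)

Why it is needed

CloudFormation can create an Amazon OpenSearch Serverless collection, but it cannot configure the indexes inside that collection.

GenU's RAG feature has the following requirements:

knn_vector: field settings for vector search
Japanese analyzer: Japanese morphological analysis with Kuromoji
Custom settings: number of dimensions, distance function, and binary vector support

Meeting these requirements means calling the OpenSearch API directly, so GenU implements them as a custom resource.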


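Why not use the Provider framework?

The OpenSearch index custom resource sends its response to CloudFormation manually. The reasons for not using the Provider framework are as follows.

Comparison:

Aspect	Manual implementation	Provider framework
Implementation complexity	High	Low
Sending the response	Implemented by hand	Automatic
Waiting	`sleep(60 seconds)`	Polling with `isComplete`
Lambda invocations	1	N (one per poll)
Cost	Lower	Higher (polling invocations)

GenU's reasoning:

The wait time is fixed (60 seconds)
The processing is simple
Cost optimization takes priority

When the Provider framework is the better choice:

Completion time is unpredictable (several minutes to tens of minutes)
Complex error handling is required
Multiple resources must be created in sequence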
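For the cases above, the Provider framework accepts an `isCompleteHandler`. After `onEvent` returns, the framework calls this handler repeatedly at `queryInterval` until it reports completion or `totalTimeout` expires. Below is a minimal sketch of such a handler. The status values and the `getStatus` lookup are hypothetical stand-ins for a real describe or get API call.

```typescript
// Sketch of an isComplete handler for the CDK Provider framework.
// The framework calls it after onEvent returns, every queryInterval,
// until it returns IsComplete: true or totalTimeout expires.
type ResourceStatus = 'CREATING' | 'READY' | 'FAILED' | 'NOT_FOUND';

// Hypothetical lookup standing in for a real "describe" API call.
type StatusLookup = (physicalResourceId: string) => Promise<ResourceStatus>;

interface IsCompleteEvent {
  RequestType: 'Create' | 'Update' | 'Delete';
  PhysicalResourceId: string;
}

interface IsCompleteResult {
  IsComplete: boolean;
  Data?: Record<string, string>;
}

async function isComplete(
  event: IsCompleteEvent,
  getStatus: StatusLookup,
): Promise<IsCompleteResult> {
  const status = await getStatus(event.PhysicalResourceId);

  if (event.RequestType === 'Delete') {
    // Deletion is finished once the resource can no longer be found.
    return { IsComplete: status === 'NOT_FOUND' };
  }
  if (status === 'FAILED') {
    // Throwing makes the framework report FAILED to CloudFormation.
    throw new Error(`Resource ${event.PhysicalResourceId} failed to stabilize`);
  }
  // Keep polling until the resource is ready.
  return status === 'READY'
    ? { IsComplete: true, Data: { Status: status } }
    : { IsComplete: false };
}
```

Each poll is a separate short Lambda invocation. This is where the cost difference in the comparison table comes from.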

Lambda function implementation

Create: creating the index

// packages/cdk/custom-resources/opensearch-index/oss-index.js

case 'Create':
  const vectorDimension = Number(props.vectorDimension);
  const ragKnowledgeBaseBinaryVector =
    props.ragKnowledgeBaseBinaryVector.toLowerCase() === 'true';
  
  await client.indices.create({
    index: props.vectorIndexName,
    body: {
      mappings: {
        properties: {
          // Metadata field (stored but not indexed for search)
          [props.metadataField]: {
            type: 'text',
            index: false,
          },
          // Text field (Japanese full-text search)
          [props.textField]: {
            type: 'text',
            analyzer: 'custom_kuromoji_analyzer',
          },
          // Vector field (for k-NN search)
          [props.vectorField]: {
            type: 'knn_vector',
            dimension: vectorDimension,
            ...(ragKnowledgeBaseBinaryVector
              ? { data_type: 'binary' }
              : {}),
            method: {
              engine: 'faiss',
              space_type: ragKnowledgeBaseBinaryVector ? 'hamming' : 'l2',
              name: 'hnsw',
              parameters: {},
            },
          },
        },
      },
      settings: {
        index: {
          knn: true,
          analysis: {
            analyzer: {
              custom_kuromoji_analyzer: {
                type: 'custom',
                tokenizer: 'kuromoji_tokenizer',
                filter: [
                  'kuromoji_baseform',
                  'kuromoji_part_of_speech',
                  'kuromoji_stemmer',
                  'lowercase',
                  'ja_stop',
                ],
                char_filter: [
                  'kuromoji_iteration_mark',
                  'icu_normalizer',
                  'html_strip',
                ],
              },
            },
          },
        },
      },
    },
  });
  
  await sleep(60 * 1000); // wait 60 seconds
  break;

Configuration details:

knn_vector settings

{
  type: 'knn_vector',
  dimension: vectorDimension,        // number of vector dimensions (e.g. 1024)
  data_type: 'binary',               // binary vectors (optional)
  method: {
    engine: 'faiss',                 // use the FAISS engine
    space_type: 'hamming',           // distance function (hamming or l2)
    name: 'hnsw',                    // HNSW algorithm
    parameters: {},
  },
}
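Custom resource properties reach the Lambda function as strings. It can therefore help to build the vector field mapping in a single validated function. This sketch uses a hypothetical function name, `buildVectorFieldMapping`. It reproduces the mapping above and adds input checks. The multiple-of-8 check reflects OpenSearch's requirement for binary vector dimensions.

```typescript
// Builds the knn_vector mapping shown above from string properties.
// buildVectorFieldMapping is a hypothetical helper name.
interface KnnVectorMapping {
  type: 'knn_vector';
  dimension: number;
  data_type?: 'binary';
  method: {
    engine: 'faiss';
    space_type: 'hamming' | 'l2';
    name: 'hnsw';
    parameters: Record<string, unknown>;
  };
}

function buildVectorFieldMapping(dimensionProp: string, binaryProp: string): KnnVectorMapping {
  // Custom resource properties arrive as strings, so parse them explicitly.
  const dimension = Number(dimensionProp);
  if (!Number.isInteger(dimension) || dimension <= 0) {
    throw new Error(`Invalid vector dimension: ${dimensionProp}`);
  }
  const binary = binaryProp.toLowerCase() === 'true';
  // OpenSearch requires binary vector dimensions to be a multiple of 8.
  if (binary && dimension % 8 !== 0) {
    throw new Error(`Binary vector dimension must be a multiple of 8: ${dimension}`);
  }
  return {
    type: 'knn_vector',
    dimension,
    ...(binary ? { data_type: 'binary' as const } : {}),
    method: {
      engine: 'faiss',
      // Binary vectors use Hamming distance; float vectors use L2, as above.
      space_type: binary ? 'hamming' : 'l2',
      name: 'hnsw',
      parameters: {},
    },
  };
}
```

For example, `buildVectorFieldMapping('1024', 'false')` yields an `l2` mapping without `data_type`. A malformed dimension fails the deployment early instead of producing a broken index.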

Japanese analyzer settings

{
  analyzer: {
    custom_kuromoji_analyzer: {
      type: 'custom',
      tokenizer: 'kuromoji_tokenizer',      // morphological analysis
      filter: [
        'kuromoji_baseform',                // convert to base form
        'kuromoji_part_of_speech',          // part-of-speech filter
        'kuromoji_stemmer',                 // stemming (drops trailing long-vowel marks in katakana)
        'lowercase',                        // lowercasing
        'ja_stop',                          // stopword removal
      ],
      char_filter: [
        'kuromoji_iteration_mark',          // normalize iteration marks
        'icu_normalizer',                   // Unicode normalization
        'html_strip',                       // strip HTML tags
      ],
    },
  },
}

Implementing the wait

const sleep = (msec) => new Promise((resolve) => setTimeout(resolve, msec));

await client.indices.create({ ... });
await sleep(60 * 1000); // wait 60 seconds

Why wait?

OpenSearch Serverless creates the index asynchronously, so creation does not finish immediately. Waiting lets the next steps, such as data ingestion, run safely.
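A fixed 60-second sleep is simple, but it may waste time or turn out to be too short. One alternative is to poll a readiness check against a deadline. In this sketch, `waitUntil` and the `check` callback are hypothetical. A real check might wrap the OpenSearch client's `indices.exists` call. Whether an index existing on OpenSearch Serverless means it is ready for writes is an assumption you should verify against your own workload.

```typescript
// Poll a readiness check instead of sleeping for a fixed time.
// `check` is a hypothetical callback supplied by the caller.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function waitUntil(
  check: () => Promise<boolean>,
  intervalMs: number,
  timeoutMs: number,
): Promise<number> {
  const start = Date.now();
  let attempts = 0;
  for (;;) {
    attempts += 1;
    if (await check()) {
      return attempts; // ready: report how many checks it took
    }
    // Give up if waiting another interval would pass the deadline.
    if (Date.now() - start + intervalMs > timeoutMs) {
      throw new Error(`Not ready after ${attempts} checks`);
    }
    await sleep(intervalMs);
  }
}
```

The deadline must leave room within the Lambda function's own timeout. Otherwise the function times out before it can report a clear error.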

What happens when sending the response fails

// packages/cdk/custom-resources/opensearch-index/oss-index.js

request.on('error', (error) => {
  console.log('send() error:', error);
  resolve(); // ← why resolve() rather than reject()?
});


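What this choice actually does:

If the response never reaches the ResponseURL, CloudFormation receives nothing either way. It waits until the custom resource times out, which is one hour by default and can be shortened with the ServiceTimeout property.
reject() makes the Lambda invocation fail. CloudFormation invokes the function asynchronously, so Lambda may retry it and run the handler again, for example trying to create the index a second time.
resolve() ends the invocation successfully and avoids those retries. It does not make CloudFormation detect the failure any sooner.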
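Sending the response manually involves two steps: building the JSON body CloudFormation expects, and uploading it with an HTTPS PUT to the presigned `ResponseURL`. The sketch below builds that body. It wraps the upload so that a failure is logged rather than thrown, which matches the `resolve()` choice above. The `put` helper is a hypothetical injected function. In a real handler it would perform the PUT, the same job the `cfn-response` module does for inline Lambda code.

```typescript
// Sketch of a manually sent custom resource response.
interface CfnRequest {
  RequestType: 'Create' | 'Update' | 'Delete';
  ResponseURL: string;
  StackId: string;
  RequestId: string;
  LogicalResourceId: string;
  PhysicalResourceId?: string;
}

function buildResponseBody(
  event: CfnRequest,
  status: 'SUCCESS' | 'FAILED',
  physicalResourceId: string,
  reason?: string,
  data?: Record<string, string>,
): string {
  // JSON.stringify drops fields whose value is undefined.
  return JSON.stringify({
    Status: status,
    // CloudFormation requires a Reason when Status is FAILED.
    Reason: reason ?? (status === 'FAILED' ? 'See CloudWatch Logs for details' : undefined),
    PhysicalResourceId: physicalResourceId,
    StackId: event.StackId,
    RequestId: event.RequestId,
    LogicalResourceId: event.LogicalResourceId,
    Data: data,
  });
}

// Hypothetical HTTPS PUT helper for the presigned ResponseURL.
type Put = (url: string, body: string) => Promise<void>;

// A failed send is logged, not thrown, so the invocation itself succeeds
// and Lambda does not retry the whole handler.
async function sendResponse(put: Put, event: CfnRequest, body: string): Promise<boolean> {
  try {
    await put(event.ResponseURL, body);
    return true;
  } catch (error) {
    console.log('send() error:', error);
    return false;
  }
}
```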

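Pattern 3: Operations automation (Apply Tags)

Why it is needed

CloudFormation alone can tag OpenSearch Serverless resources. However, GenU has the following requirements:

🏷️ Conditional tagging: apply a tag only under specific conditions
🔄 Dynamic tag management: add or remove a tag depending on whether a tag value is set
⏱️ Timing control: apply the tag at the right moment, after the collection has been created

To get this flexibility, GenU implements tagging as a custom resource.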
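The conditional logic can be isolated in a pure function that decides whether to add or remove a tag. Several parts of this sketch are assumptions: the function name, the tag key, and the choice to do nothing on Delete. The returned objects are shaped like the inputs of the OpenSearch Serverless `TagResource` and `UntagResource` APIs, with `tags` as `{ key, value }` pairs and `tagKeys` as a list. Check those shapes against the SDK version you use.

```typescript
// Decide which tag operation to run, based on whether a tag value is set.
// planTagChange and the Delete behavior are assumptions for this sketch.
type TagPlan =
  | { action: 'tag'; resourceArn: string; tags: { key: string; value: string }[] }
  | { action: 'untag'; resourceArn: string; tagKeys: string[] }
  | { action: 'none' };

function planTagChange(
  requestType: 'Create' | 'Update' | 'Delete',
  resourceArn: string,
  tagKey: string,
  tagValue: string | undefined,
): TagPlan {
  // Assumption: the collection is deleted by the same stack, so its tags go with it.
  if (requestType === 'Delete') {
    return { action: 'none' };
  }
  const value = tagValue?.trim() ?? '';
  // A non-empty value adds or overwrites the tag; an empty value removes it.
  return value !== ''
    ? { action: 'tag', resourceArn, tags: [{ key: tagKey, value }] }
    : { action: 'untag', resourceArn, tagKeys: [tagKey] };
}
```

Keeping the decision pure lets you test it without AWS. The handler then only translates the plan into the matching SDK command.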

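Use case: automating cost management.

Enterprise patterns

So far we have looked at three custom resource patterns. This section covers enterprise-level implementation patterns that matter when you run them in production.

6.1 When to use the Provider framework

AWS CDK provides the Provider framework, which makes custom resources easier to implement.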


6.2 Real-world problems and fixes

Case 1: Resource deletion fails on Delete

// packages/cdk/custom-resources/agent-core-runtime/index.ts

// ❌ Problematic code
async function deleteAgentRuntime(client, agentRuntimeId) {
  await client.send(new DeleteAgentRuntimeCommand({ agentRuntimeId }));
  // If this throws, the stack operation fails and the stack gets stuck
}

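What happened:

The AgentCore Runtime had already been deleted manually
CloudFormation's Delete request raised ResourceNotFoundException
The stack stopped in DELETE_FAILED
The stack had to be cleaned up manually

Fix:

// packages/cdk/custom-resources/agent-core-runtime/index.ts

// ✅ Make the delete idempotent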
async function deleteAgentRuntime(client, agentRuntimeId) {
  try {
    await client.send(new DeleteAgentRuntimeCommand({ agentRuntimeId }));
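  } catch (error) {
    // In TypeScript, `error` is `unknown`, so narrow it before reading `name`
    if (error instanceof Error && error.name === 'ResourceNotFoundException') {
      console.log('Already deleted, continuing');
      return; // treat as success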
    }
    throw error;
  }
}
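The same idempotency rule applies to any delete call, so it can be factored into a small wrapper. This is a sketch, not GenU code. `ignoreNotFound` is a hypothetical helper. It relies on AWS SDK for JavaScript v3 exposing the service exception type through `error.name`.

```typescript
// Run an operation and treat "not found" errors as success.
// ignoreNotFound is a hypothetical helper, not part of the AWS SDK.
async function ignoreNotFound<T>(
  operation: () => Promise<T>,
  notFoundNames: string[] = ['ResourceNotFoundException'],
): Promise<T | undefined> {
  try {
    return await operation();
  } catch (error) {
    if (error instanceof Error && notFoundNames.includes(error.name)) {
      console.log(`Already deleted (${error.name}), continuing`);
      return undefined; // the resource is gone, which is what Delete wanted
    }
    throw error; // anything else (permissions, throttling) must still fail
  }
}
```

With this helper, the delete path becomes `await ignoreNotFound(() => client.send(new DeleteAgentRuntimeCommand({ agentRuntimeId })))`. Other errors still surface, so genuine failures are not hidden.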

Case 2: Unintended replacement on every Update

// packages/cdk/custom-resources/agent-core-runtime/index.ts

// ❌ Problematic code
export async function handler(event) {
  if (event.RequestType === 'Update') {
    const newId = await createNewResource(); // creates a resource with a new ID
    return {
      PhysicalResourceId: newId, // ← the ID changes
    };
  }
}

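What happened:

Update returned a new PhysicalResourceId
CloudFormation treated it as a resource replacement
CloudFormation kept the new resource and sent a Delete for the old one during cleanup
Every later Update repeated the same replacement, so the resource was never updated in place

Fix:

// packages/cdk/custom-resources/agent-core-runtime/index.ts

// ✅ Keep the PhysicalResourceId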
export async function handler(event) {
  if (event.RequestType === 'Update') {
    const existingId = event.PhysicalResourceId;
    await updateExistingResource(existingId);
    return {
      PhysicalResourceId: existingId, // ← return the same ID
    };
  }
}
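Replacement is sometimes what you want, for example when a property that the service cannot change in place is modified. Here is a minimal sketch that assumes the runtime name is such an immutable property. `IMMUTABLE_PROPS`, `requiresReplacement`, and `planUpdate` are hypothetical. The idea is to compare `OldResourceProperties` with `ResourceProperties`. The handler creates a new resource and returns a new ID only when an immutable property changed.

```typescript
// Decide between an in-place update and an intentional replacement.
// Treating the runtime name as immutable is an assumption for illustration.
type Props = Record<string, string>;

const IMMUTABLE_PROPS = ['Name'];

function requiresReplacement(
  oldProps: Props,
  newProps: Props,
  immutable: string[] = IMMUTABLE_PROPS,
): boolean {
  return immutable.some((key) => oldProps[key] !== newProps[key]);
}

// 'update-in-place': update the resource and return event.PhysicalResourceId.
// 'create-new': create a new resource and return its new ID; CloudFormation
// then deletes the old one during cleanup.
function planUpdate(oldProps: Props, newProps: Props): 'update-in-place' | 'create-new' {
  return requiresReplacement(oldProps, newProps) ? 'create-new' : 'update-in-place';
}
```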

6.3 Pitfalls of the Singleton pattern

Problem: the wrong scope

// packages/cdk/lib/construct/generic-agent-core.ts

// ❌ Bad: Construct scope
export class MyConstruct extends Construct {
  constructor(scope: Construct, id: string) {
    super(scope, id);
    
    // Each instance of this construct creates its own Lambda function
    const handler = new NodejsFunction(this, 'Handler', { ... });
  }
}

Result:

Stack
├── Construct1
│   └── Handler (Lambda 1)
├── Construct2
│   └── Handler (Lambda 2)  ← duplicate!
└── Construct3
    └── Handler (Lambda 3)  ← duplicate!

Fix:

// packages/cdk/lib/construct/generic-agent-core.ts

// ✅ Good: Stack scope
export class MyConstruct extends Construct {
  constructor(scope: Construct, id: string) {
    super(scope, id);
    
    const stack = Stack.of(this);
    const singletonId = 'Singleton-Handler';
    
    // Share one Lambda function across the whole stack
    const existing = stack.node.tryFindChild(singletonId) as NodejsFunction | undefined;
    const handler = existing ?? new NodejsFunction(stack, singletonId, { ... });
  }
}

Result:

Stack
├── Singleton-Handler (Lambda 1)  ← only one
├── Construct1 (references Handler)
├── Construct2 (references Handler)
└── Construct3 (references Handler)

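Deployment results and verification

Now let's deploy the custom resources covered so far and check that they work.

7.1 Preparing to deploy

Prerequisites

AWS CLI configured
Node.js 18 or later
Docker (to build the AgentCore Runtime image)
Appropriate IAM permissions

Setting up GenU

# Clone the repository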
git clone https://github.com/aws-samples/generative-ai-use-cases.git
cd generative-ai-use-cases

# Install dependencies
npm ci

7.2 Running the deployment

# Bootstrap CDK (first time only)
npm run cdk:bootstrap

# Deploy
npm run cdk:deploy -- --require-approval never

Approximate deployment times:

Resource	Time	Reason
Docker image build	5-10 min	Building the AgentCore Runtime image
AgentCore Runtime	2-3 min	Bedrock API calls
OpenSearch Collection	10-15 min	Serverless collection startup
OpenSearch Index	1-2 min	Index creation + 60-second wait
Other resources	5-10 min	Lambda, API Gateway, etc.
Total	About 30-40 min

7.3 Checking the deployment

Checking the CloudFormation stack

# Check the stack status
aws cloudformation describe-stacks \
  --stack-name GenerativeAiUseCasesStack \
  --query 'Stacks[0].StackStatus'

# Output: "CREATE_COMPLETE" (or "UPDATE_COMPLETE" after an update)

7.4 Checking the AgentCore Runtime

# List AgentCore Runtimes (control-plane API)
aws bedrock-agentcore-control list-agent-runtimes \
  --region us-west-2 \
  --query 'agentRuntimes[?contains(agentRuntimeName, `GenericAgentCoreRuntime`)].{Name:agentRuntimeName,Id:agentRuntimeId,Status:status}' \
  --output table

7.5 Checking CloudWatch Logs

# Store the log group name in a variable
LOG_GROUP=$(aws logs describe-log-groups \
  --log-group-name-prefix /aws/lambda/GenerativeAiUseCasesStack-GenericAgentCore \
  --query 'logGroups[0].logGroupName' \
  --output text)

# Follow the logs in real time
aws logs tail "$LOG_GROUP" --follow

7.6 Troubleshooting

Common errors and fixes

1. AgentCore Runtime creation error

Error: Failed to create AgentCore Runtime - missing ID or ARN in response

Fix:

# Check that AgentCore is available in the Region
aws bedrock-agentcore-control list-agent-runtimes --region us-west-2

# Check IAM permissions
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/CustomResourceRole \
  --action-names bedrock-agentcore:CreateAgentRuntime

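2. OpenSearch index creation timeout

Error: Lambda timeout after 15 minutes

Fix:

// Lambda's maximum timeout is 15 minutes, so it cannot be raised further.
// For longer waits, poll for completion with the Provider framework instead.
const provider = new Provider(this, 'CreateIndexProvider', {
  onEventHandler: createIndexFunction,   // starts index creation and returns quickly
  isCompleteHandler: isCompleteFunction, // hypothetical: checks whether the index is ready
  queryInterval: Duration.seconds(30),
  totalTimeout: Duration.hours(1),
});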

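7.10 Production operations considerations

Deployment strategy

Blue/Green deployment considerations:

Caveats for custom resources:

A resource may be replaced during an Update
A changed PhysicalResourceId means the resource is recreated
Recreation can cause downtime

Recommended approach:

// packages/cdk/bin/generative-ai-use-cases.ts

// Create a separate stack per environment.
// Illustrative only: in CDK's standard StackProps, `env` is an { account, region } object,
// so passing an environment name like this assumes a stack-specific prop.
const blueStack = new GenerativeAiUseCasesStack(app, 'Blue', { env: 'blue' });
const greenStack = new GenerativeAiUseCasesStack(app, 'Green', { env: 'green' });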

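Monitoring and alerts

CloudWatch alarm configuration:

// packages/cdk/lib/construct/generic-agent-core.ts
import { Alarm } from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';

// Alert on custom resource errors
// (customResourceFunction and topic are defined elsewhere in the construct)
const errorAlarm = new Alarm(this, 'CustomResourceError', {
  metric: customResourceFunction.metricErrors(),
  threshold: 1,
  evaluationPeriods: 1,
  alarmDescription: 'Custom Resource execution failed',
});

// Notify an SNS topic
errorAlarm.addAlarmAction(new SnsAction(topic));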

Metrics to monitor:

Metric	Threshold	Action
Lambda Errors	> 0	Notify immediately
Lambda Duration	> 10 min	Warning
Lambda Throttles	> 0	Consider scaling

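Summary

Applying the patterns to other AWS services

The patterns in this article also apply to the following services:

Service	Use case	Pattern
Amazon SageMaker	Fine-grained control of endpoint configuration	Resource internals control
AWS Step Functions	Dynamically generating complex workflows	New service support
Amazon EventBridge	Conditionally creating rules	Operations automation
AWS Glue	Dynamically managing the Data Catalog	Resource internals control

Choosing an implementation pattern

Criteria for deciding how to implement a custom resource: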
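As a minimal sketch, the Provider framework criteria listed under Pattern 2 can be written down as a function. The field names are hypothetical labels for those three criteria. The function covers only one choice: the Provider framework with `isComplete` polling, or a manually sent response with a fixed wait as in GenU's OpenSearch index resource.

```typescript
// Encodes the Provider framework criteria from Pattern 2.
// Field names are hypothetical labels for those criteria.
interface CustomResourceNeeds {
  completionTimeVaries: boolean; // several minutes to tens of minutes, unpredictable
  needsComplexErrorHandling: boolean;
  createsResourcesInSequence: boolean;
}

type Approach = 'provider-framework-with-isComplete' | 'manual-response-with-fixed-wait';

function chooseApproach(needs: CustomResourceNeeds): Approach {
  // Any one of the criteria is enough to justify the framework's polling cost.
  const anyCriterion =
    needs.completionTimeVaries ||
    needs.needsComplexErrorHandling ||
    needs.createsResourcesInSequence;
  return anyCriterion ? 'provider-framework-with-isComplete' : 'manual-response-with-fixed-wait';
}
```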

