How Fargate Spot took our bill from $42K to $21K a month: implementing event-driven failover

Last updated at 2026-01-24, posted at 2026-01-24

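We were running the analytics backend of an IoT platform on Fargate, and the monthly cost reached $42K. Switching to Spot cut it by 49%, but a new kind of hell was waiting: service outages caused by capacity shortages.

The final answer: event-driven failover built on EventBridge + Lambda. When Spot dies, the system switches to Standard automatically, and it switches back once Spot recovers. Results:

Cost reduction: 49% ($42K → $21K/month)
Service availability: 99.9% maintained
Operational load: none for failover and recovery (fully automated)

This article shares everything: the architecture, the implementation details, the landmines we stepped on, and the measured data.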

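The problem: Standard Fargate costs exploded

Background

Requirements of the IoT analytics backend:

Massively parallel processing: 200+ tasks running concurrently at peak
Long-running: each task runs 5 minutes on average, 30 minutes at most
24/7 operation: devices send logs every minute, so the system cannot stop

Initial setup:

Devices → SQS → ECS Fargate (Standard) → analytics processing
                  ↑
                 200 tasks always running

The bill at the end of the month: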

ECS Fargate (Standard):
- vCPU: 4 vCPU × 200 tasks × 720 hours × $0.04656 ≈ $26,813
- Memory: 16 GB × 200 tasks × 720 hours × $0.00511 ≈ $11,774

Total: ≈ $38,587/month
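The bill above is plain multiplication, so it can be re-derived in a few lines. The sketch below is only a sanity check; the function name `monthly_fargate_cost` is made up for illustration, and the rates are the ones quoted in the bill.

```python
def monthly_fargate_cost(vcpu, memory_gb, tasks, hours, vcpu_rate, memory_rate):
    """Return (vcpu_cost, memory_cost, total) in USD for one month."""
    vcpu_cost = vcpu * tasks * hours * vcpu_rate
    memory_cost = memory_gb * tasks * hours * memory_rate
    return vcpu_cost, memory_cost, vcpu_cost + memory_cost


vcpu_cost, memory_cost, total = monthly_fargate_cost(
    vcpu=4, memory_gb=16, tasks=200, hours=720,
    vcpu_rate=0.04656, memory_rate=0.00511,
)
print(f"vCPU ${vcpu_cost:,.0f} + memory ${memory_cost:,.0f} = ${total:,.0f}/month")
```

Computed this way the total lands within a few dollars of the billed figure; the small gap is rounding.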

Comment from the department that owns the service: "At this rate we are looking at about $460K a year. Do something about it."

Why Spot?

Price comparison:

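Standard:
4 vCPU + 16 GB = 4 × $0.04656 + 16 × $0.00511 ≈ $0.268/hour

Spot:
4 vCPU + 16 GB = 4 × $0.02049 + 16 × $0.00225 ≈ $0.118/hour

Difference: ≈ $0.150/hour = 56% reduction

Running the numbers:

200 tasks × 720 hours × $0.150 = $21,600/month of potential savings

Owning department: "Do it now."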

First attempt: going pure Spot, and failing

Implementation

We simply switched the ECS service over to Spot:

SpotService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref AnalyticsCluster
    DesiredCount: 200
    CapacityProviderStrategy:
      - CapacityProvider: FARGATE_SPOT
        Weight: 1

Day 1: overjoyed that the cost had halved

Day 3: hell begins

Problems we ran into

Problem 1: ResourcesNotAvailable hell

[ERROR] ECS Task failed to start
StoppedReason: "ResourcesNotAvailable: No spot capacity available"

Impact:
- 50 tasks failed to start
- Messages piled up in SQS (15,000+ messages)
- Downstream services timed out

Frequency: 2-3 times a week, concentrated in peak hours (09:00-12:00 JST)

Problem 2: SpotInterruption surprises

[WARN] ECS Task interrupted
StoppedReason: "SpotInterruption: Task interrupted due to Spot reclamation"

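Impact:
- Tasks terminated abruptly in the middle of processing
- Analytics data left in a half-written state
- Retries processed the same data twice

Frequency: 2-3 times a day, unpredictable

Problem 3: PENDING hell

The worst case:

Task launch requested → stuck in PENDING for 10 minutes → eventually fails

CloudWatch Logs:
[09:15:23] Task requested
[09:15:24] Task status: PENDING
[09:15:25] Task status: PENDING
...
[09:25:23] Task status: PENDING (10 minutes elapsed!)
[09:25:24] Task stopped: ResourcesNotAvailable

Measured data (one week of observation):

Normal startup: 1.8 minutes on average
PENDING before a failure: 8.3 minutes on average
Longest PENDING: 23 minutes 😱

Reaction from the owning department

"The cost went down, but a service that stops is not acceptable. Roll it back."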

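The solution: event-driven automatic failover

Core idea

Use Spot as the primary, and switch to Standard automatically when it dies

Normal operation (95% of the time):
  Spot Service: 200 tasks 🟢
  Standard Service: 0 tasks 💤

When Spot dies (5% of the time):
  Spot Service: 0 tasks 💀
  Standard Service: 200 tasks 🟢 (started automatically)

On recovery:
  Spot Service: 200 tasks 🟢 (restored automatically)
  Standard Service: 0 tasks 💤

Architecture overview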

┌──────────── ECS Cluster ────────────┐
│                                      │
│  [Spot Service]  [Standard Service]  │
│   desired: 200    desired: 0         │
│   RUNNING 🟢      STOPPED 💤         │
│                                      │
└────────┬────────────────┬────────────┘
         │                │
         ▼                ▼
    ┌────────────────────────────┐
    │   EventBridge              │
    │   - STOPPED events         │
    │   - RUNNING events         │
    │   - Prolonged PENDING      │
    └────────┬───────────────────┘
             │
             ▼
    ┌────────────────────────────┐
    │   Lambda Functions         │
    │   1️⃣ Error Detector       │
    │   2️⃣ Failover Executor    │
    │   3️⃣ Recovery Monitor     │
    │   4️⃣ Cleanup Handler      │
    └────────┬───────────────────┘
             │
             ▼
    ┌────────────────────────────┐
    │   DynamoDB State Store     │
    │   - error_count: 2         │
    │   - failover_active: true  │
    │   - original_count: 200    │
    └────────────────────────────┘

Implementation details: how it works, in code

1. Error detector: noticing that Spot has died

EventBridge Rule:

{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "clusterArn": [{"wildcard": "arn:aws:ecs:*:*:cluster/AnalyticsCluster"}],
    "lastStatus": ["STOPPED"],
    "capacityProviderName": ["FARGATE_SPOT"],
    "stoppedReason": [
      {"prefix": "ResourcesNotAvailable"},
      {"prefix": "SpotInterruption"}
    ]
  }
}

Important: filter on capacityProviderName so that only Spot tasks are monitored. Errors from Standard tasks are ignored.
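EventBridge evaluates the pattern itself, but it can be handy to mirror the same filter in plain Python so the intent is unit-testable. The sketch below is a hand-written approximation, not EventBridge's matching engine; `is_spot_capacity_failure` is a hypothetical helper name, and the cluster ARN check is left out for brevity.

```python
def is_spot_capacity_failure(detail):
    """Mirror the rule's filter for a single event 'detail' dict."""
    return (
        detail.get("lastStatus") == "STOPPED"
        and detail.get("capacityProviderName") == "FARGATE_SPOT"
        # Same two prefixes as the rule's stoppedReason filter
        and detail.get("stoppedReason", "").startswith(
            ("ResourcesNotAvailable", "SpotInterruption")
        )
    )


spot_event = {
    "lastStatus": "STOPPED",
    "capacityProviderName": "FARGATE_SPOT",
    "stoppedReason": "ResourcesNotAvailable: No spot capacity available",
}
# The same failure on a Standard task must not count
standard_event = dict(spot_event, capacityProviderName="FARGATE")

print(is_spot_capacity_failure(spot_event), is_spot_capacity_failure(standard_event))
```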

Lambda handler:

import time

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('FargateFailoverState')

def lambda_handler(event, context):
    """
    Detect Spot task errors and count them
    """
    detail = event['detail']
    service_name = extract_service_name(detail)
    stopped_reason = detail.get('stoppedReason', '')
    
    # Decide the severity of the error
    if 'ResourcesNotAvailable' in stopped_reason:
        severity = 'CRITICAL'  # capacity completely exhausted
        threshold = 3  # fail over after 3 errors
    elif 'SpotInterruption' in stopped_reason:
        severity = 'HIGH'  # interruption
        threshold = 5  # fail over after 5 errors
    else:
        return  # ignore everything else
    
    # Count the error in DynamoDB (atomic counter)
    response = table.update_item(
        Key={'service_name': service_name},
        UpdateExpression='ADD error_count :inc SET last_error_time = :time',
        ExpressionAttributeValues={
            ':inc': 1,
            ':time': int(time.time())
        },
        ReturnValues='UPDATED_NEW'
    )
    
    # DynamoDB returns numbers as Decimal; convert so the value is JSON-serializable
    error_count = int(response['Attributes']['error_count'])
    
    print(f"⚠️ Spot error detected: {stopped_reason}")
    print(f"   Error count: {error_count}/{threshold}")
    
    # Threshold check
    if error_count >= threshold:
        print("🚨 Threshold exceeded! Triggering failover...")
        # trigger_failover() is defined in the next section
        trigger_failover(service_name, severity)
    
    return {'statusCode': 200, 'error_count': error_count}

def extract_service_name(detail):
    """
    Extract the service name from an ECS event
    Example: "service:analytics-spot" → "analytics"
    """
    group = detail.get('group', '')
    # split(':', 1)[-1] also tolerates a group without a "service:" prefix
    return group.split(':', 1)[-1].replace('-spot', '')

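Design points:

Per-error-type thresholds:
- ResourcesNotAvailable: 3 errors → severe, fail over quickly
- SpotInterruption: 5 errors → transient, wait and see

Time-window based counting:

# Count only errors from the last 5 minutes (older errors are discarded)
# last_error_time is the value previously stored in DynamoDB
current_time = int(time.time())
if current_time - last_error_time > 300:
    error_count = 0  # reset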
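To make the interaction between the thresholds and the 5-minute window concrete, here is a minimal in-memory sketch. It is not the DynamoDB-backed implementation above: `ErrorWindow` is a hypothetical class that keeps timestamps in a deque, and like the handler it compares the total error count in the window against the threshold of the error that just arrived.

```python
from collections import deque

THRESHOLDS = {"ResourcesNotAvailable": 3, "SpotInterruption": 5}
WINDOW_SECONDS = 300  # 5 minutes


class ErrorWindow:
    """Sliding-window error counter (in-memory illustration only)."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, error_type, now):
        """Record one error; return True if a failover should be triggered."""
        self.timestamps.append(now)
        # Drop errors that have fallen out of the window
        while now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) >= THRESHOLDS[error_type]


window = ErrorWindow()
results = [window.record("ResourcesNotAvailable", t) for t in (0, 10, 20)]
print(results)
```

Three capacity errors inside five minutes trip the threshold, while the same three errors spread over a longer period would not.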

2. Detecting PENDING timeouts

The most important feature: detecting tasks that are stuck in PENDING

EventBridge scheduled rule (formerly CloudWatch Events), run every minute:

import json
import time
from datetime import datetime

import boto3

ecs = boto3.client('ecs')
lambda_client = boto3.client('lambda')

def scan_pending_tasks(event, context):
    """
    Periodically scan for PENDING tasks
    """
    # List the Spot service's tasks (list_tasks returns at most 100 per page)
    task_arns = []
    paginator = ecs.get_paginator('list_tasks')
    for page in paginator.paginate(
        cluster='AnalyticsCluster',
        serviceName='analytics-spot',
        desiredStatus='RUNNING'
    ):
        task_arns.extend(page['taskArns'])
    
    if not task_arns:
        return
    
    # Fetch task details (describe_tasks accepts at most 100 tasks per call)
    for i in range(0, len(task_arns), 100):
        task_details = ecs.describe_tasks(
            cluster='AnalyticsCluster',
            tasks=task_arns[i:i + 100]
        )
        
        for task in task_details['tasks']:
            if task['lastStatus'] != 'PENDING':
                continue
            
            # Compute how long the task has been PENDING
            created_at = task['createdAt']
            pending_duration = (datetime.now(created_at.tzinfo) - created_at).total_seconds()
            
            print(f"Task {task['taskArn']}: PENDING for {pending_duration:.0f}s")
            
            # More than 3 minutes is out
            if pending_duration >= 180:
                print(f"🚨 Long PENDING detected: {pending_duration:.0f}s")
                
                # Fail over immediately; one stuck task is enough
                trigger_failover(
                    service_name='analytics',
                    reason=f'PENDING_TIMEOUT_{pending_duration:.0f}s'
                )
                return

def trigger_failover(service_name, reason):
    """
    Invoke the Failover Orchestrator asynchronously
    """
    lambda_client.invoke(
        FunctionName='FargateFailoverOrchestrator',
        InvocationType='Event',  # asynchronous
        Payload=json.dumps({
            'service_name': service_name,
            'reason': reason,
            'timestamp': int(time.time())
        })
    )

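Why 3 minutes?

Measured data:

Normal startup time:
- P50: 1.5 minutes
- P95: 2.0 minutes
- P99: 2.3 minutes

3 minutes = P99 + 30% buffer

In our measurements, a task still PENDING after 3 minutes ended up failing more than 95% of the time.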
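The "P99 + 30%" rule can be derived directly from startup-time samples. The sketch below uses a nearest-rank percentile and made-up helper names (`percentile`, `pending_timeout_seconds`); the only inputs taken from the measurements above are the P99 of 2.3 minutes (138 seconds) and the 30% buffer.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list (p is an integer 1..100)."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def pending_timeout_seconds(p99_seconds, buffer=0.30):
    """P99 startup time plus a safety buffer, rounded up to a whole second."""
    return math.ceil(p99_seconds * (1 + buffer))


# P99 of 2.3 minutes = 138 seconds
timeout = pending_timeout_seconds(138)
print(f"PENDING timeout: {timeout}s")
```

Given raw startup durations in seconds, `pending_timeout_seconds(percentile(samples, 99))` recomputes the threshold as the workload changes.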

3. Failover executor

The most important component:

import time

import boto3

ecs = boto3.client('ecs')
sns = boto3.client('sns')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('FargateFailoverState')

def lambda_handler(event, context):
    """
    Execute the failover from Spot to Standard
    """
    service_name = event['service_name']
    reason = event['reason']
    
    print(f"🚨 Failover started for {service_name}")
    print(f"   Reason: {reason}")
    
    start_time = time.time()
    
    # 1. Get the current state of the Spot service
    spot_service = ecs.describe_services(
        cluster='AnalyticsCluster',
        services=[f'{service_name}-spot']
    )['services'][0]
    
    desired_count = spot_service['desiredCount']
    running_count = spot_service['runningCount']
    
    print(f"   Spot service: desired={desired_count}, running={running_count}")
    
    if desired_count == 0:
        print("⚠️ Spot service already scaled to 0, aborting")
        return
    
    # 2. Start the Standard service
    print(f"⏳ Scaling up Standard service to {desired_count}...")
    
    ecs.update_service(
        cluster='AnalyticsCluster',
        service=f'{service_name}-standard',
        desiredCount=desired_count
    )
    
    # 3. Wait for the Standard service to stabilize
    print("⏳ Waiting for Standard service to stabilize...")
    
    waiter = ecs.get_waiter('services_stable')
    waiter.wait(
        cluster='AnalyticsCluster',
        services=[f'{service_name}-standard'],
        WaiterConfig={
            'Delay': 15,  # check every 15 seconds
            'MaxAttempts': 40  # wait up to 10 minutes
        }
    )
    
    # 4. Stop the Spot service
    print("⏳ Scaling down Spot service to 0...")
    
    ecs.update_service(
        cluster='AnalyticsCluster',
        service=f'{service_name}-spot',
        desiredCount=0
    )
    
    # 5. Save the state to DynamoDB
    table.put_item(Item={
        'service_name': service_name,
        'failover_active': True,
        'failover_time': int(time.time()),
        'original_desired_count': desired_count,
        'failover_reason': reason,
        'ttl': int(time.time()) + 86400  # delete after 24 hours
    })
    
    # 6. Send the alert
    elapsed = time.time() - start_time
    
    sns.publish(
        # Placeholder ARN: replace the wildcards with the real region and account ID
        TopicArn='arn:aws:sns:*:*:FargateFailoverAlerts',
        Subject=f'🚨 Failover Executed: {service_name}',
        Message=f'''
A Fargate Spot failover has been executed

Service: {service_name}
Reason: {reason}
Task count: {desired_count}
Elapsed time: {elapsed:.1f} seconds

Current state:
- Spot Service: 0 tasks (stopped)
- Standard Service: {desired_count} tasks (running)

The system monitors recovery automatically and returns to Spot once it is stable.
No manual intervention is required.
        '''
    )
    
    print(f"✅ Failover completed in {elapsed:.1f}s")
    
    return {
        'statusCode': 200,
        'elapsed_time': elapsed,
        'desired_count': desired_count
    }

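Key design decisions:

Start Standard first: the switchover itself adds no extra gap in service
Wait reliably with a waiter: Spot is stopped only after Standard is up
Persist the state: a Lambda invocation knows nothing about the previous one, so the state is stored in DynamoDB

4. Recovery detection and automatic return

EventBridge Rule (watching for Spot tasks that start successfully):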

{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "clusterArn": [{"wildcard": "arn:aws:ecs:*:*:cluster/AnalyticsCluster"}],
    "lastStatus": ["RUNNING"],
    "capacityProviderName": ["FARGATE_SPOT"]
  }
}

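Lambda handler:

import json
import time

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('FargateFailoverState')
lambda_client = boto3.client('lambda')

def monitor_recovery(event, context):
    """
    Watch Spot task successes to detect recovery
    """
    # extract_service_name() is the helper from the error detector
    service_name = extract_service_name(event['detail'])
    
    # Check whether a failover is in progress
    item = table.get_item(Key={'service_name': service_name}).get('Item')
    
    if not item or not item.get('failover_active'):
        return  # not in failover
    
    # Reset the error count and record the success
    table.update_item(
        Key={'service_name': service_name},
        UpdateExpression='SET error_count = :zero, last_success_time = :time',
        ExpressionAttributeValues={
            ':zero': 0,
            ':time': int(time.time())
        }
    )
    
    print("✅ Spot task succeeded, error count reset")
    
    # Check how long Spot has gone without an error
    success_duration = get_continuous_success_duration(item)
    
    print(f"   Continuous success: {success_duration:.0f}s")
    
    # 10 minutes without errors counts as recovered
    if success_duration >= 600:
        print("🎉 Spot appears stable, triggering cleanup...")
        trigger_cleanup(service_name)

def get_continuous_success_duration(item):
    """
    Seconds since the last Spot error, or since the failover if that is later.
    last_success_time is refreshed on every success, so the error-free
    period has to be measured from the last error instead.
    """
    # DynamoDB returns numbers as Decimal; convert before doing float arithmetic
    last_error = int(item.get('last_error_time', 0))
    failover_at = int(item.get('failover_time', 0))
    return time.time() - max(last_error, failover_at)

def trigger_cleanup(service_name):
    """
    Invoke the cleanup Lambda asynchronously
    (the function name 'FargateFailoverCleanup' is a placeholder)
    """
    lambda_client.invoke(
        FunctionName='FargateFailoverCleanup',
        InvocationType='Event',
        Payload=json.dumps({'service_name': service_name})
    )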

Why wait 10 minutes?

Rule of thumb:

Spot recovers for a moment → dies right away → we switch again = wasted effort
Stable for 10 minutes = likely a real recovery
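The 10-minute rule boils down to a single comparison: how long ago was the most recent sign of trouble? A minimal pure-Python sketch, assuming epoch-second timestamps like the ones stored in the state table (`spot_is_stable` is a hypothetical name):

```python
def spot_is_stable(now, last_error_time, failover_time, quiet_seconds=600):
    """True once no Spot error has been seen for quiet_seconds."""
    # The quiet period starts at the later of the last error and the failover itself
    quiet_since = max(last_error_time, failover_time)
    return now - quiet_since >= quiet_seconds


# Failover at t=100, last error at t=300, checked at t=1000: quiet for 700s
print(spot_is_stable(now=1000, last_error_time=300, failover_time=100))
```

A fresh error at t=500 would restart the clock, so the same check at t=1000 would report "not yet".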

5. Cleanup: automatic return to Spot

import boto3

ecs = boto3.client('ecs')
sns = boto3.client('sns')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('FargateFailoverState')

def cleanup_failover(event, context):
    """
    Return from Standard to Spot
    """
    service_name = event['service_name']
    
    print(f"🔄 Starting cleanup for {service_name}")
    
    # Get the original task count
    item = table.get_item(Key={'service_name': service_name})['Item']
    # DynamoDB returns Decimal, but update_service expects an int
    original_count = int(item['original_desired_count'])
    
    # 1. Restore the Spot service
    print(f"⏳ Restoring Spot service to {original_count} tasks...")
    
    ecs.update_service(
        cluster='AnalyticsCluster',
        service=f'{service_name}-spot',
        desiredCount=original_count
    )
    
    # 2. Wait for the Spot service to stabilize
    waiter = ecs.get_waiter('services_stable')
    waiter.wait(
        cluster='AnalyticsCluster',
        services=[f'{service_name}-spot'],
        WaiterConfig={'Delay': 15, 'MaxAttempts': 40}
    )
    
    # 3. Stop the Standard service
    print("⏳ Scaling down Standard service to 0...")
    
    ecs.update_service(
        cluster='AnalyticsCluster',
        service=f'{service_name}-standard',
        desiredCount=0
    )
    
    # 4. Clear the state
    table.update_item(
        Key={'service_name': service_name},
        UpdateExpression='SET failover_active = :false',
        ExpressionAttributeValues={':false': False}
    )
    
    print(f"✅ Cleanup completed, back to Spot with {original_count} tasks")
    
    # Alert
    sns.publish(
        # Placeholder ARN: replace the wildcards with the real region and account ID
        TopicArn='arn:aws:sns:*:*:FargateFailoverAlerts',
        Subject=f'✅ Recovery Completed: {service_name}',
        Message=f'''
Spot capacity has recovered and the service has returned to Spot automatically

Service: {service_name}
Task count: {original_count}

Current state:
- Spot Service: {original_count} tasks (running) ✅
- Standard Service: 0 tasks (stopped)
        '''
    )

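Measured performance: the actual numbers

Failover time

Measured over 100 failovers:

Stage	P50	P95	P99
Error detection	5 s	15 s	30 s
Standard scale-up request	2 s	5 s	8 s
Standard task startup	120 s	150 s	180 s
Spot shutdown	10 s	20 s	30 s
Total (immediate failure)	137 s	190 s	248 s

In the PENDING timeout case:

Waiting for PENDING detection: 180 s
+ Failover execution: 137 s
= Total: 317 s (about 5 minutes)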
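The totals in the table are column sums, which the short sketch below reproduces. The stage keys are illustrative names, not identifiers from the system. One caveat: adding per-stage percentiles gives a convenient budget, but it is not the true percentile of the end-to-end time.

```python
# Per-stage timings in seconds, copied from the table above
STAGES = {
    "error_detection": {"p50": 5, "p95": 15, "p99": 30},
    "standard_scale_request": {"p50": 2, "p95": 5, "p99": 8},
    "standard_task_startup": {"p50": 120, "p95": 150, "p99": 180},
    "spot_shutdown": {"p50": 10, "p95": 20, "p99": 30},
}
PENDING_DETECTION_SECONDS = 180


def total_seconds(percentile):
    """Sum one percentile column across all stages."""
    return sum(stage[percentile] for stage in STAGES.values())


totals = {p: total_seconds(p) for p in ("p50", "p95", "p99")}
pending_case = PENDING_DETECTION_SECONDS + totals["p50"]
print(totals, pending_case)
```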

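Service interruption time

Worst case (all Spot tasks die at once):

00:00 - All Spot tasks gone
00:05 - Error detected (threshold of 3 reached)
00:07 - Standard scale-up begins
02:07 - Standard tasks up (50% or more)
02:17 - Spot shutdown complete

Effective service interruption: 2 minutes 17 seconds

Measured data (past 3 months):

Failovers: 12
Average interruption: 2.3 minutes
Longest interruption: 4.1 minutes
Data loss: 0 records

Monthly availability calculation

Total time: 30 days × 24 hours × 60 minutes = 43,200 minutes

Downtime:
- Failovers: 12 × 2.3 minutes = 27.6 minutes (conservatively charging all 12 failovers from the 3-month window to a single month)
- Other maintenance: 10 minutes

Total downtime: 37.6 minutes

Availability = (43,200 - 37.6) / 43,200 × 100
             = 99.91% ✅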
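The same availability formula in code, as a small sketch with an illustrative function name. It also shows the figure when the 12 failovers are spread evenly over the 3 months in which they were observed, rather than all counted against one month.

```python
def availability_percent(total_minutes, downtime_minutes):
    """Availability as a percentage of total time."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


MONTH_MINUTES = 30 * 24 * 60  # 43,200

# All 12 failovers counted in one month, plus 10 minutes of maintenance
conservative = availability_percent(MONTH_MINUTES, 12 * 2.3 + 10)

# 12 failovers over 3 months = 4 per month on average
monthly_average = availability_percent(MONTH_MINUTES, (12 / 3) * 2.3 + 10)

print(f"{conservative:.2f}% / {monthly_average:.2f}%")
```

Either way the result clears the 99.9% target.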

Cost analysis: actual billing data

Configuration

vCPU: 4
Memory: 16 GB
Average task count: 200
Monthly running time: 720 hours

Before: 100% Standard

vCPU: 4 × 200 × 720 × $0.04656 ≈ $26,813
Memory: 16 × 200 × 720 × $0.00511 ≈ $11,774

Total: ≈ $38,587/month

After: Spot + Failover

Measured data (3-month average):

Share of time on Spot: 95%
Share of time on Standard: 5%

Spot cost:
vCPU: 4 × 200 × (720 × 0.95) × $0.02049 ≈ $11,183
Memory: 16 × 200 × (720 × 0.95) × $0.00225 ≈ $4,903
Spot total: $16,086

Standard cost:
vCPU: 4 × 200 × (720 × 0.05) × $0.04656 = $1,341
Memory: 16 × 200 × (720 × 0.05) × $0.00511 = $589
Standard total: $1,930

Control plane:
Lambda: about 200K invocations (request + duration charges) = $40
DynamoDB: on-demand reads and writes = $25
EventBridge: rule executions = $5
SNS: alert notifications = $2
Control plane total: $72

Total cost: $16,086 + $1,930 + $72 = $18,088/month

Savings

Savings: $38,587 - $18,088 = $20,499/month
Savings rate: 53.1%

Annual savings: $20,499 × 12 = $245,988
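The blended cost can be recomputed from the per-unit rates and the 95/5 split. This is a back-of-the-envelope sketch, not billing logic; `compute_cost` and `blended_monthly_cost` are made-up names, and the $72 control-plane figure is taken as given from the breakdown above.

```python
# Per-hour rates (USD) quoted in the cost breakdown above
RATES = {
    "standard": {"vcpu": 0.04656, "memory": 0.00511},
    "spot": {"vcpu": 0.02049, "memory": 0.00225},
}


def compute_cost(kind, hours, vcpu=4, memory_gb=16, tasks=200):
    """Cost of running `tasks` tasks for `hours` hours on the given capacity type."""
    rate = RATES[kind]
    return tasks * hours * (vcpu * rate["vcpu"] + memory_gb * rate["memory"])


def blended_monthly_cost(spot_share, control_plane=72, hours=720):
    """Monthly cost when `spot_share` of the time runs on Spot, the rest on Standard."""
    spot = compute_cost("spot", hours * spot_share)
    standard = compute_cost("standard", hours * (1 - spot_share))
    return spot + standard + control_plane


baseline = compute_cost("standard", 720)
after = blended_monthly_cost(0.95)
savings_rate = (baseline - after) / baseline * 100
print(f"${baseline:,.0f} -> ${after:,.0f} ({savings_rate:.1f}% saved)")
```

Recomputed this way the result lands within about 0.3% of the figures above. Changing `spot_share` shows how sensitive the savings are to the time spent in failover.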

ROI calculation:

Development cost: 2 weeks × 2 engineers × $10K = $40K
Annual savings: $246K

ROI = ($246K - $40K) / $40K × 100 = 515%

Payback period: $40K / ($246K / 12) ≈ 1.95 months
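The ROI and payback formulas are simple enough to keep as two helpers for re-running with different assumptions. A small sketch; the function names are illustrative.

```python
def roi_percent(annual_savings, investment, years=1):
    """Return on investment over `years`, as a percentage."""
    return (annual_savings * years - investment) / investment * 100


def payback_months(annual_savings, investment):
    """Months of savings needed to recover the investment."""
    return investment / (annual_savings / 12)


first_year_roi = roi_percent(246_000, 40_000)
payback = payback_months(246_000, 40_000)
print(f"ROI {first_year_roi:.0f}%, payback {payback:.2f} months")
```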

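Landmines we stepped on, and how we fixed them

Landmine 1: the infinite failover loop

Symptom:

09:00 Spot error → Standard starts
09:05 Error detected on a Standard task → Spot starts!?
09:10 Spot error → Standard starts
...and so on forever

Cause: the EventBridge Rule was picking up errors from all tasks

Fix: add a capacityProviderName filter to the rule:

{
  "detail": {
    "capacityProviderName": ["FARGATE_SPOT"]
  }
}

Monitor Spot tasks only, and ignore Standard.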

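Landmine 2: DynamoDB hot key

Symptom:

[ERROR] ProvisionedThroughputExceededException

Cause: 50 errors per second → 50 writes to the same key

Fix: use an atomic counter

# ❌ Bad: read-modify-write (two requests per error, and concurrent updates overwrite each other)
item = table.get_item(Key={'service_name': service_name})['Item']
item['error_count'] += 1
table.put_item(Item=item)

# ✅ Good: atomic counter (one request per error, no lost updates)
table.update_item(
    Key={'service_name': service_name},
    UpdateExpression='ADD error_count :inc',
    ExpressionAttributeValues={':inc': 1}
)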

Landmine 3: waiter timeout

Symptom: the Lambda timed out (15 minutes) while waiting for Standard to start

Cause: the services_stable condition is too strict

Fix: custom wait logic

import time

import boto3

ecs = boto3.client('ecs')

def wait_for_service_ready(service_name, min_healthy_percent=50):
    """
    Treat the service as ready once a minimum share of tasks is running
    """
    for attempt in range(60):  # up to 10 minutes
        service = ecs.describe_services(
            cluster='AnalyticsCluster',
            services=[service_name]
        )['services'][0]
        
        desired = service['desiredCount']
        running = service['runningCount']
        
        healthy_percent = (running / desired) * 100 if desired > 0 else 0
        
        print(f"Attempt {attempt}: {running}/{desired} tasks ({healthy_percent:.0f}%)")
        
        if healthy_percent >= min_healthy_percent:
            print(f"✅ Service ready: {healthy_percent:.0f}% >= {min_healthy_percent}%")
            return True
        
        time.sleep(10)
    
    raise TimeoutError(f"Service {service_name} did not become ready")

Move on once 50% of the tasks are up. There is no need to wait for 100%.

Landmine 4: false Spot recovery

Symptom:

10:00 One Spot task succeeds → judged as recovered
10:01 Standard → Spot switch begins
10:03 Flood of Spot errors → back to Standard
10:05 Back to Spot again...

Cause: a single successful task was enough to conclude "it has recovered!"

Fix: introduce a stability window

def is_spot_really_stable(service_name, window_minutes=10):
    """
    Decide, from several signals, whether Spot is really stable
    """
    # Get task states from the last 10 minutes
    # (get_tasks_in_time_window is a helper not shown in this article)
    recent_tasks = get_tasks_in_time_window(service_name, window_minutes)
    
    if len(recent_tasks) < 10:
        print("⚠️ Not enough samples yet")
        return False
    
    # Success rate check
    success_count = sum(1 for t in recent_tasks if t['status'] == 'RUNNING')
    success_rate = success_count / len(recent_tasks)
    
    print(f"Success rate: {success_rate:.1%}")
    
    if success_rate < 0.8:  # below 80% is not OK
        print("❌ Success rate too low")
        return False
    
    # Error rate check
    error_count = sum(1 for t in recent_tasks if t.get('error'))
    error_rate = error_count / len(recent_tasks)
    
    if error_rate > 0.05:  # above 5% is not OK
        print(f"❌ Error rate too high: {error_rate:.1%}")
        return False
    
    # Prolonged PENDING check
    long_pending = [t for t in recent_tasks 
                   if t['status'] == 'PENDING' and t['duration'] > 180]
    
    if long_pending:
        print(f"❌ {len(long_pending)} tasks stuck in PENDING")
        return False
    
    print("✅ Spot appears stable")
    return True

Criteria:

At least 10 task executions in the past 10 minutes
Success rate ≥ 80%
Error rate ≤ 5%
No task PENDING for more than 3 minutes

Only when every check passes is Spot judged "stable".
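To see how the four criteria interact, the checks can be run against hand-made records. The sketch below is a condensed, side-effect-free restatement of the same rules; `passes_stability_checks` is a hypothetical name, and the record shape (`status`, `duration`, `error`) follows the fields used above.

```python
def passes_stability_checks(records, min_samples=10, min_success=0.8,
                            max_error=0.05, max_pending_seconds=180):
    """Apply the four stability criteria to a list of task records."""
    if len(records) < min_samples:
        return False
    success_rate = sum(r["status"] == "RUNNING" for r in records) / len(records)
    error_rate = sum(bool(r.get("error")) for r in records) / len(records)
    stuck = any(r["status"] == "PENDING" and r["duration"] > max_pending_seconds
                for r in records)
    return success_rate >= min_success and error_rate <= max_error and not stuck


ok = {"status": "RUNNING", "duration": 90}
failed = {"status": "STOPPED", "duration": 30, "error": "SpotInterruption"}

one_error_in_10 = [ok] * 9 + [failed]   # error rate 10%
one_error_in_20 = [ok] * 19 + [failed]  # error rate 5%
print(passes_stability_checks(one_error_in_10),
      passes_stability_checks(one_error_in_20))
```

One consequence worth noting: with the 5% error limit, a single failed task blocks recovery until at least 20 samples are in the window.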

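Landmine 5: EventBridge cannot detect a stuck PENDING

Symptom: tasks are stuck in PENDING, yet no EventBridge event arrives

Cause: EventBridge fires only when the state changes

PENDING → PENDING → PENDING... (no state change = no event)

Fix: add a periodic scan (the scan_pending_tasks function shown earlier)

Event-driven (EventBridge):
  Detects STOPPED and RUNNING

Time-driven (scheduled rule, every minute):
  Detects prolonged PENDING
  
Combining both gives full coverage ✅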

Operational monitoring: what to watch

Essential CloudWatch Metrics

import boto3

# Publish custom metrics
# (is_failover_active, get_error_count_last_5min, calculate_savings_today,
#  get_pending_task_count and get_success_count_last_5min are helpers not shown here)
def publish_metrics(service_name):
    cloudwatch = boto3.client('cloudwatch')
    
    # 1. Spot health score (0-100)
    health_score = calculate_health_score(service_name)
    
    cloudwatch.put_metric_data(
        Namespace='Fargate/Failover',
        MetricData=[
            {
                'MetricName': 'SpotHealthScore',
                'Value': health_score,
                'Unit': 'None',
                'Dimensions': [
                    {'Name': 'ServiceName', 'Value': service_name}
                ]
            },
            {
                'MetricName': 'FailoverActive',
                'Value': 1 if is_failover_active() else 0,
                'Unit': 'None'
            },
            {
                'MetricName': 'ErrorCount5min',
                'Value': get_error_count_last_5min(),
                'Unit': 'Count'
            },
            {
                'MetricName': 'CostSavingsToday',
                'Value': calculate_savings_today(),
                'Unit': 'None'  # dollars
            }
        ]
    )

def calculate_health_score(service_name):
    """
    Score the health of Spot from 0 to 100
    """
    score = 100
    
    # Penalty for errors
    error_count = get_error_count_last_5min()
    score -= error_count * 10  # -10 points per error
    
    # Penalty for PENDING tasks
    pending_count = get_pending_task_count()
    score -= pending_count * 5  # -5 points per task
    
    # Bonus for successful tasks
    success_count = get_success_count_last_5min()
    score += success_count * 2  # +2 points per success
    
    return max(0, min(100, score))
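The scoring rule is easier to reason about as a pure function of its three inputs. The sketch below restates the same weights (-10 per error, -5 per PENDING task, +2 per success, clamped to 0-100) without the data-fetching helpers; the function name `health_score` is illustrative.

```python
def health_score(error_count, pending_count, success_count):
    """Spot health on a 0-100 scale, using the same weights as above."""
    score = 100 - error_count * 10 - pending_count * 5 + success_count * 2
    return max(0, min(100, score))


for inputs in [(0, 0, 0), (3, 2, 5), (8, 0, 0), (5, 0, 25)]:
    print(inputs, health_score(*inputs))
```

Two things fall out of the arithmetic: with no successes it takes 8 errors in the window to drop below an alarm threshold of 30, and 5 errors offset by 25 successes still score 100, so the weights may need tuning for high-throughput services.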

CloudWatch Dashboard

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "Spot Health Score",
        "metrics": [
          ["Fargate/Failover", "SpotHealthScore", "ServiceName", "analytics"]
        ],
        "yAxis": {"left": {"min": 0, "max": 100}},
        "annotations": {
          "horizontal": [
            {"value": 30, "label": "Critical", "color": "#d13212"},
            {"value": 60, "label": "Warning", "color": "#ff9900"}
          ]
        }
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Failover Status",
        "metrics": [
          ["Fargate/Failover", "FailoverActive"]
        ],
        "yAxis": {"left": {"min": 0, "max": 1}}
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Cost Savings (Today)",
        "metrics": [
          ["Fargate/Failover", "CostSavingsToday"]
        ],
        "stat": "Maximum"
      }
    }
  ]
}

Alarm settings

SpotHealthLowAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: Spot-Health-Low
    MetricName: SpotHealthScore
    Namespace: Fargate/Failover
    # Must match the dimension the metric is published with
    Dimensions:
      - Name: ServiceName
        Value: analytics
    Statistic: Average
    Period: 300  # 5 minutes
    EvaluationPeriods: 2
    Threshold: 30
    ComparisonOperator: LessThanThreshold
    AlarmActions:
      - !Ref CriticalAlertTopic
    AlarmDescription: "Spot health score below 30 for 10 minutes"

FailoverFrequencyAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: Failover-Too-Frequent
    MetricName: FailoverCount  # published by the failover Lambda (not shown above)
    Namespace: Fargate/Failover
    Statistic: Sum
    Period: 86400  # 1 day
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref WarningAlertTopic
    AlarmDescription: "More than 5 failovers in a day"
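The second alarm watches a FailoverCount metric that the publishing code earlier does not emit, so something has to publish it each time a failover runs. A minimal sketch of the payload, assuming it is sent from the failover Lambda; `failover_count_metric` is a hypothetical helper. It carries no dimensions on purpose, because a CloudWatch alarm only matches a metric with exactly the same dimension set and the alarm above declares none.

```python
def failover_count_metric(count=1):
    """Build the MetricData entry recording `count` failovers."""
    return {
        "MetricName": "FailoverCount",
        "Value": count,
        "Unit": "Count",
    }


metric = failover_count_metric()
print(metric)

# In the failover Lambda this would be sent with boto3, for example:
#   cloudwatch.put_metric_data(Namespace="Fargate/Failover", MetricData=[metric])
```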

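Summary: the essence of the engineering

What worked well

Seeing the real problem:
- Not simply "we want to cut costs"
- But a design question: "how do we absorb the uncertainty of Spot?"
A hybrid of event-driven and time-driven:
- EventBridge events: detect state changes immediately
- Scheduled rule (CloudWatch Events): detects anomalies with no state change (PENDING)
- Both are needed
Judging in stages:
- Do not switch on a single error (threshold of 3-5)
- Do not return on a single success (10 minutes of confirmed stability)
- Keep false positives to a minimum
Full automation:
- Nobody has to wake up when Spot dies at 3 a.m.
- No manual work for failover or recovery

Where it applies

✅ A good fit for:

Long-running background processing
Stateless tasks that can run in parallel
Workloads that tolerate a few minutes of interruption (an SLA in the 99% to 99.9% range is enough)
Cases where cost reduction is the top priority

❌ Not a good fit for:

Hard real-time requirements (latency < 1 second)
Stateful processing (holding DB connections, etc.)
SLAs of 99.99% or higher
Small task counts (< 10 tasks, where switching overhead outweighs the savings)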

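Measured ROI

Initial development cost: $40K
  - Design and implementation: 2 weeks
  - Testing and verification: 1 week
  - Production rollout: 3 days

Annual savings: $246K

Payback period: ≈ 1.95 months
3-year ROI: ($246K × 3 - $40K) / $40K = 1745%

Next steps

Predictive failover:

# Learn from past data when Spot tends to be unstable
if is_peak_hour() and spot_health_score < 70:
    # Switch to Standard ahead of time
    preemptive_failover()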
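The pseudocode above could start out as simply as the sketch below. It is only one possible reading: the peak window is hard-coded to 09:00-12:00 JST, the hours where the capacity errors clustered, rather than learned from data, and `is_peak_hour` and `should_preempt` are hypothetical helpers that expect a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def is_peak_hour(now):
    """True during 09:00-12:00 JST. `now` must be timezone-aware."""
    return 9 <= now.astimezone(JST).hour < 12


def should_preempt(now, spot_health_score, threshold=70):
    """Fail over ahead of time when Spot looks shaky during peak hours."""
    return is_peak_hour(now) and spot_health_score < threshold


morning = datetime(2026, 1, 24, 1, 0, tzinfo=timezone.utc)    # 10:00 JST
afternoon = datetime(2026, 1, 24, 6, 0, tzinfo=timezone.utc)  # 15:00 JST
print(should_preempt(morning, 65), should_preempt(afternoon, 65))
```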

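Multi-region support:

# Spot capacity runs out in the Tokyo region
# → launch in the Singapore region

Cost optimization:

# If Standard accounts for more than 10% of running time
# → consider purchasing Savings Plans

Tech stack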

Infrastructure:
  - ECS Fargate (Spot + Standard)
  - EventBridge (Event-driven triggers)
  - CloudWatch Events (Time-driven triggers)
  - Lambda (Orchestration logic)
  - DynamoDB (State management)
  - SNS (Alerting)

Deployment:
  - Terraform (IaC)
  - GitHub Actions (CI/CD)

Monitoring:
  - CloudWatch Metrics & Logs
  - CloudWatch Dashboard
  - SNS → Slack integration

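Closing thoughts

Spot is not something you "can't use"; it is something you learn to use well.

With the right automation in place:

Cost cut by 50%
Availability held at 99.9%
No manual work for failover or recovery

All three can coexist.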
