
How to set up task- and container-level resource monitoring on AWS ECS

Last updated at 2023-05-08, posted at 2023-04-24

Introduction

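This article explains how to set up task- and container-level resource monitoring on AWS Elastic Container Service (ECS). With this monitoring in place, you can continuously track the performance and resource usage of an ECS cluster and respond quickly when a problem occurs.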

Prerequisites

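Fargate (Linux platform version 1.4.0 or later) is used, and the containers to monitor are:
- web
- api
- worker
Only hard limits are set; soft limits are not set.
Alarms are integrated with SNS and Chatbot and sent to a notification service of your choice.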
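
The metrics used later in this article come from Container Insights, so the cluster needs it turned on. The following is a minimal sketch of how that might look, assuming the cluster is managed in the same Terraform configuration. The resource label `example` is an assumption; the cluster name `example` matches the `ClusterName` dimension used in the alarms.

```hcl
# Sketch only: enable Container Insights on the cluster being monitored.
# The resource label is hypothetical; the name matches the ClusterName
# dimension used by the alarms in this article.
resource "aws_ecs_cluster" "example" {
  name = "example"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}
```

If the cluster is created elsewhere, the same setting can be enabled there instead.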

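Resource monitoring at the task and container level

1. Monitoring the task hard limit

Amazon ECS Container Insights metrics

Use the aws_cloudwatch_metric_alarm resource to create alarms for CPU, memory, and disk utilization, task count, and service count.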

Hard limits
- cpu 2048 (2 vCPU)
- memory 4096 (4 GiB)
Thresholds
- cpu >= 80%
- memory >= 80%
- disk >= 80%
- task count < 2
- service count < 1
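
For reference, the hard limits above could correspond to a task definition like the one below. This is a sketch under assumptions, not the original configuration: it assumes the hard limits are the task-level size, and the family name and image names are hypothetical. The container definitions set no container-level `cpu` or `memoryReservation`, which matches the "hard limits only" prerequisite.

```hcl
# Sketch only: a Fargate task sized at 2 vCPU / 4 GiB with no soft limits.
# The family and image names are hypothetical placeholders.
resource "aws_ecs_task_definition" "example" {
  family                   = "example"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 2048 # task-level hard limit (CPU units)
  memory                   = 4096 # task-level hard limit (MiB)

  container_definitions = jsonencode([
    for name in ["web", "api", "worker"] : {
      name      = name
      image     = "example/${name}:latest" # hypothetical image
      essential = true
      # No container-level cpu or memoryReservation (soft limit) is set.
    }
  ])
}
```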

data.tf

data "aws_sns_topic" "example" {
  name = "example"
}

local.tf

locals {
  container_names = ["web", "api", "worker"]
}

locals {
  alert_arn = data.aws_sns_topic.example.arn
}

locals {
  example_cluster_hardlimit = {
    cpu_usage = {
      title         = "example-cluster-hardlimit-cpuutilized"
      metric_name_1 = "CpuReserved"
      metric_name_2 = "CpuUtilized"
      stat          = "Average"
    }
    memory_usage = {
      title         = "example-cluster-hardlimit-memoryutilized"
      metric_name_1 = "MemoryReserved"
      metric_name_2 = "MemoryUtilized"
      stat          = "Average"
    }
    disk_usage = {
      title         = "example-cluster-hardlimit-diskspace"
      metric_name_1 = "EphemeralStorageReserved"
      metric_name_2 = "EphemeralStorageUtilized"
      stat          = "Maximum"
    }
  }
}

locals {
  example_cluster_metrics_data = {
    task_count = {
      region    = "ap-northeast-1"
      title     = "example-cluster-task-count"
      threshold = "2"
      metrics = [
        ["ECS/ContainerInsights", "TaskCount", "ClusterName", "example", { stat = "Average", region = "ap-northeast-1" }]
      ]
    }
    service_count = {
      region    = "ap-northeast-1"
      title     = "example-cluster-service-count"
      threshold = "1"
      metrics = [
        ["ECS/ContainerInsights", "ServiceCount", "ClusterName", "example", { stat = "Average", region = "ap-northeast-1" }]
      ]
    }
  }
}

cluster_cloudwatch_metric_alarm.tf

// cluster
resource "aws_cloudwatch_metric_alarm" "example_cluster_hardlimit" {
  for_each = local.example_cluster_hardlimit

  alarm_name          = each.value.title
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  alarm_description   = "This metric checks for ${each.value.title}"
  alarm_actions       = [local.alert_arn]
  ok_actions          = [local.alert_arn]
  threshold           = "80"

  // Utilization as a percentage: utilized / reserved * 100
  metric_query {
    id          = "expr1"
    expression  = "mm1m0 * 100 / mm0m0"
    label       = each.value.title
    return_data = true
  }

  // Reserved capacity (denominator)
  metric_query {
    id = "mm0m0"
    metric {
      metric_name = each.value.metric_name_1
      namespace   = "ECS/ContainerInsights"
      period      = "60"
      stat        = each.value.stat // use the per-metric stat defined in locals
      dimensions = {
        ClusterName = "example"
      }
    }
  }

  // Utilized capacity (numerator)
  metric_query {
    id = "mm1m0"
    metric {
      metric_name = each.value.metric_name_2
      namespace   = "ECS/ContainerInsights"
      period      = "60"
      stat        = each.value.stat // use the per-metric stat defined in locals
      dimensions = {
        ClusterName = "example"
      }
    }
  }
}

resource "aws_cloudwatch_metric_alarm" "example_cluster" {
  for_each = local.example_cluster_metrics_data

  alarm_name          = each.value.title
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = "1"
  metric_name         = element(each.value.metrics[0], 1)
  namespace           = element(each.value.metrics[0], 0)
  period              = "60"
  statistic           = "Sum"
  threshold           = each.value.threshold
  alarm_description   = "This metric checks for ${each.key}"
  alarm_actions       = [local.alert_arn]
  ok_actions          = [local.alert_arn]
  dimensions = {
    ClusterName = element(each.value.metrics[0], 3)
  }
}

dashboard
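
The entries in `local.example_cluster_metrics_data` already carry a `title`, `region`, and `metrics` array in the shape that CloudWatch dashboard widgets expect, so a dashboard could be built from them roughly as follows. This is a sketch, not the original dashboard definition: the resource label, dashboard name, widget sizes, and `view` setting are assumptions.

```hcl
# Sketch only: one metric widget per entry in example_cluster_metrics_data.
# The dashboard name and widget layout values are hypothetical.
resource "aws_cloudwatch_dashboard" "example_cluster" {
  dashboard_name = "example-cluster"

  dashboard_body = jsonencode({
    widgets = [
      for key, widget in local.example_cluster_metrics_data : {
        type   = "metric"
        width  = 12
        height = 6
        properties = {
          title   = widget.title
          region  = widget.region
          metrics = widget.metrics
          period  = 60
          view    = "timeSeries"
        }
      }
    ]
  })
}
```

Because the widget positions are omitted, CloudWatch places the widgets automatically.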

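2. Container-level server resource monitoring

Container Insights does not publish metrics at the level of individual tasks, but metric filters on the log group that holds its performance logs can extract them (here, per container name).
The CpuUtilized value in the Container Insights performance logs, which the metric filter reads, is a number of CPU units, not a CPU utilization percentage. When no soft limit is set in the container definition parameters, a container can use CPU up to the value set as the hard limit. A container therefore has no CPU ceiling of its own, and a per-container CPU utilization percentage cannot be calculated (the CpuReserved value is 0).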

Thresholds (notify before the threshold set for the hard limit is reached)
- cpu >= 2048 (CPU units)
- memory >= 512 MB
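
Since a per-container CPU percentage is not available, one possible alternative is to express a container's CPU units as a share of the task hard limit with metric math. The sketch below assumes the `CPUUtilization` metric in the `example` namespace produced by the metric filter in this article and the `container_names` local from local.tf. The local `task_cpu_units`, the alarm name, and the 80% threshold are hypothetical values for illustration, not part of the original setup.

```hcl
# Sketch only: alarm when one container uses >= 80% of the task's CPU units.
# task_cpu_units and the 80% threshold are assumptions for illustration.
locals {
  task_cpu_units = 2048 # task-level hard limit from section 1
}

resource "aws_cloudwatch_metric_alarm" "example_container_cpu_share" {
  for_each = toset(local.container_names)

  alarm_name          = "example-container-cpu-share-${each.key}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = 1
  threshold           = 80

  # CPU units used by the container, as a percentage of the task hard limit
  metric_query {
    id          = "share"
    expression  = "used * 100 / ${local.task_cpu_units}"
    label       = "CPU share of task hard limit (%)"
    return_data = true
  }

  # Custom metric extracted from the performance logs by the metric filter
  metric_query {
    id = "used"
    metric {
      metric_name = "CPUUtilization"
      namespace   = "example"
      period      = 60
      stat        = "Average"
      dimensions = {
        ContainerName = each.key
      }
    }
  }
}
```

This keeps the alarm threshold readable as a percentage while still being based on CPU units, at the cost of hard-coding the task size in one more place.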

cloudwatch_log_metric_filter.tf

resource "aws_cloudwatch_log_metric_filter" "example_container_cpu_utilization" {
  log_group_name = "/aws/ecs/containerinsights/example/performance"
  name           = "example-container-cpu-utilization"
  pattern        = "{$.ContainerName=\"*\" && $.CpuUtilized=\"*\"}"

  metric_transformation {
    dimensions = {
      "ContainerName" = "$.ContainerName"
    }
    name      = "CPUUtilization"
    namespace = "example"
    unit      = "Count"
    value     = "$.CpuUtilized"
  }
}

resource "aws_cloudwatch_log_metric_filter" "example_container_memory_utilization" {
  log_group_name = "/aws/ecs/containerinsights/example/performance"
  name           = "example-container-memory-utilization"
  pattern        = "{$.ContainerName=\"*\" && $.MemoryUtilized=\"*\"}"

  metric_transformation {
    dimensions = {
      "ContainerName" = "$.ContainerName"
    }
    name      = "MemoryUtilization"
    namespace = "example"
    unit      = "Megabytes" // Container Insights reports MemoryUtilized in megabytes
    value     = "$.MemoryUtilized"
  }
}

task_cloudwatch_metric_alarm.tf

// task
resource "aws_cloudwatch_metric_alarm" "example_container_cpu_utilization_alarms" {
  for_each            = toset(local.container_names)
  alarm_name          = "example-container-cpu-utilization-alarm-${each.key}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "CPUUtilization"
  namespace           = "example"
  period              = "60"
  statistic           = "Average"
  threshold           = "2048"
  alarm_description   = "This metric checks for high CPU utilization on container ${each.key}"
  alarm_actions       = [local.alert_arn]
  ok_actions          = [local.alert_arn]
  dimensions = {
    ContainerName = each.key
  }
}

resource "aws_cloudwatch_metric_alarm" "example_container_memory_utilization_alarms" {
  for_each            = toset(local.container_names)
  alarm_name          = "example-container-memory-utilization-alarm-${each.key}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "MemoryUtilization"
  namespace           = "example"
  period              = "60"
  statistic           = "Average"
  threshold           = "512" // 512 MB; MemoryUtilized is reported in megabytes, not bytes
  alarm_description   = "This metric checks for high memory utilization on container ${each.key}"
  alarm_actions       = [local.alert_arn]
  ok_actions          = [local.alert_arn]
  dimensions = {
    ContainerName = each.key
  }
}

dashboard

Summary

This article covered resource monitoring at the task and container level. By using Terraform to create AWS CloudWatch dashboards, metric filters, and alarms, you can monitor resource usage and respond quickly when a problem occurs.
