
How to Set Up Task- and Container-Level Resource Monitoring on AWS ECS

Posted at 2023-04-24

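Introduction

This article explains how to set up task- and container-level resource monitoring on AWS Elastic Container Service (ECS). With this monitoring in place, you can continuously track the performance and resource usage of an ECS cluster and respond quickly when a problem occurs.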

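Prerequisites

  • Fargate (Linux platform version 1.4.0 or later) is used, and the following containers are monitored:
    • web
    • api
    • worker
  • Only the hard limit is set; the soft limit is not set
  • Alarms are delivered to a service of your choice through SNS and AWS Chatbot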
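The alarms in this article read metrics from the ECS/ContainerInsights namespace, so Container Insights has to be enabled on the cluster. The article does not show that step. The following is a minimal sketch, assuming the cluster is also managed by Terraform; the resource label and the cluster name "example" are assumptions chosen to match the ClusterName dimension used in the alarms.

```hcl
# Sketch only: enables Container Insights on the monitored cluster.
# The label "example" and the name "example" are assumptions that match
# the ClusterName dimension used by the alarms in this article.
resource "aws_ecs_cluster" "example" {
  name = "example"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}
```

Once the setting is enabled, performance log events are written to the /aws/ecs/containerinsights/example/performance log group that the metric filters in section 2 read.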

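Task- and container-level resource monitoring

1. Monitoring against the task hard limit

Amazon ECS Container Insights metrics

Use the aws_cloudwatch_metric_alarm resource to create alarms for CPU, memory, and disk utilization, and for the task count and service count.

  • Hard limit

    • cpu 2048 (2 vCPU)
    • memory 4096 (4 GiB)
  • Thresholds

    • cpu >= 80%
    • memory >= 80%
    • disk >= 80%
    • task count < 2
    • service count < 1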
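For context, the hard limit above corresponds to the task-level cpu and memory values of a Fargate task definition. The article does not include its task definition, so the following is only a sketch under assumptions: the family name, the resource label, and the image names are hypothetical placeholders.

```hcl
# Sketch only: a Fargate task definition with task-level limits and no
# per-container limits. Family, label, and image names are hypothetical.
resource "aws_ecs_task_definition" "example" {
  family                   = "example"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "2048" # hard limit: 2 vCPU in CPU units
  memory                   = "4096" # hard limit: 4 GiB in MiB

  container_definitions = jsonencode([
    for name in ["web", "api", "worker"] : {
      name      = name
      image     = "example/${name}:latest" # placeholder image
      essential = true
      # No "cpu", "memory", or "memoryReservation" keys are set, so the
      # containers share the task-level limits defined above.
    }
  ])
}
```

Because the three containers share one task-level budget, the percentage alarms in this section are computed from the reserved and utilized metrics rather than from per-container limits.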
data.tf
data "aws_sns_topic" "example" {
  name = "example"
}
local.tf
locals {
  container_names = ["web", "api", "worker"]
}

locals {
  alert_arn = data.aws_sns_topic.example.arn
}

locals {
  example_cluster_hardlimit = {
    cpu_usage = {
      title         = "example-cluster-hardlimit-cpuutilized"
      metric_name_1 = "CpuReserved"
      metric_name_2 = "CpuUtilized"
      stat          = "Average"
    }
    memory_usage = {
      title         = "example-cluster-hardlimit-memoryutilized"
      metric_name_1 = "MemoryReserved"
      metric_name_2 = "MemoryUtilized"
      stat          = "Average"
    }
    disk_usage = {
      title         = "example-cluster-hardlimit-diskspace"
      metric_name_1 = "EphemeralStorageReserved"
      metric_name_2 = "EphemeralStorageUtilized"
      stat          = "Maximum"
    }
  }
}

locals {
  example_cluster_metrics_data = {
    task_count = {
      region    = "ap-northeast-1"
      title     = "example-cluster-task-count"
      threshold = "2"
      metrics = [
        ["ECS/ContainerInsights", "TaskCount", "ClusterName", "example", { stat = "Average", region = "ap-northeast-1" }]
      ]
    }
    service_count = {
      region    = "ap-northeast-1"
      title     = "example-cluster-service-count"
      threshold = "1"
      metrics = [
        ["ECS/ContainerInsights", "ServiceCount", "ClusterName", "example", { stat = "Average", region = "ap-northeast-1" }]
      ]
    }
  }
}
cluster_cloudwatch_metric_alarm.tf
// cluster
resource "aws_cloudwatch_metric_alarm" "example_cluster_hardlimit" {
  for_each = local.example_cluster_hardlimit

  alarm_name          = each.value.title
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  alarm_description   = "This metric checks for ${each.value.title}"
  alarm_actions       = [local.alert_arn]
  ok_actions          = [local.alert_arn]
  threshold           = "80"

  // Utilization as a percentage: utilized (mm1m0) * 100 / reserved (mm0m0)
  metric_query {
    id          = "expr1"
    expression  = "mm1m0 * 100 / mm0m0"
    label       = each.value.title
    return_data = true
  }

  metric_query {
    id = "mm0m0"
    metric {
      metric_name = each.value.metric_name_1
      namespace   = "ECS/ContainerInsights"
      period      = "60"
      stat        = "Average"
      dimensions = {
        ClusterName = "example"
      }
    }
  }

  metric_query {
    id = "mm1m0"
    metric {
      metric_name = each.value.metric_name_2
      namespace   = "ECS/ContainerInsights"
      period      = "60"
      stat        = "Average"
      dimensions = {
        ClusterName = "example"
      }
    }
  }
}

resource "aws_cloudwatch_metric_alarm" "example_cluster" {
  for_each = local.example_cluster_metrics_data

  alarm_name          = each.value.title
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = "1"
  metric_name         = element(each.value.metrics[0], 1)
  namespace           = element(each.value.metrics[0], 0)
  period              = "60"
  statistic           = "Sum"
  threshold           = each.value.threshold
  alarm_description   = "This metric checks for ${each.key}"
  alarm_actions       = [local.alert_arn]
  ok_actions          = [local.alert_arn]
  dimensions = {
    ClusterName = element(each.value.metrics[0], 3)
  }
}
  • dashboard
    [Image: Screenshot 2023-04-24 16.45.58.png]
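The metrics arrays in local.tf already follow the CloudWatch dashboard widget format, so a dashboard like the one pictured could be generated from the same locals. The article does not show its dashboard code; the following is a minimal sketch in which the resource label, the dashboard name, and the widget sizes are assumptions.

```hcl
# Sketch only: builds one time-series widget per entry of
# local.example_cluster_metrics_data (defined in local.tf above).
# The label, dashboard name, and widget sizes are assumptions.
resource "aws_cloudwatch_dashboard" "example_cluster" {
  dashboard_name = "example-cluster"

  dashboard_body = jsonencode({
    widgets = [
      for key, item in local.example_cluster_metrics_data : {
        type   = "metric"
        width  = 12
        height = 6
        properties = {
          title   = item.title
          region  = item.region
          metrics = item.metrics
          period  = 60
          view    = "timeSeries"
        }
      }
    ]
  })
}
```

The widgets omit x and y, which leaves their placement to CloudWatch.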

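2. Container-level resource monitoring

Container Insights does not publish metrics for individual tasks. Its performance logs are written to a log group, however, and a metric filter on that log group can extract metrics at that finer granularity. The filters below use ContainerName as the dimension.
Note that CpuUtilized in the Container Insights performance log, which the metric filter reads, is a number of CPU units, not a CPU utilization percentage. When the soft limit is not set in the container definition parameters, a container can use CPU up to the value set as the hard limit. A container therefore has no CPU ceiling of its own, and a per-container CPU utilization percentage cannot be calculated (CpuReserved is 0).

  • Thresholds (notify before usage reaches the threshold set as the hard limit)
    • cpu >= 2048 CPU units
    • memory >= 512 MB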
cloudwatch_log_metric_filter.tf
resource "aws_cloudwatch_log_metric_filter" "example_container_cpu_utilization" {
  log_group_name = "/aws/ecs/containerinsights/example/performance"
  name           = "example-container-cpu-utilization"
  pattern        = "{$.ContainerName=\"*\" && $.CpuUtilized=\"*\"}"

  metric_transformation {
    dimensions = {
      "ContainerName" = "$.ContainerName"
    }
    name      = "CPUUtilization"
    namespace = "example"
    unit      = "Count"
    value     = "$.CpuUtilized"
  }
}

resource "aws_cloudwatch_log_metric_filter" "example_container_memory_utilization" {
  log_group_name = "/aws/ecs/containerinsights/example/performance"
  name           = "example-container-memory-utilization"
  pattern        = "{$.ContainerName=\"*\" && $.MemoryUtilized=\"*\"}"

  metric_transformation {
    dimensions = {
      "ContainerName" = "$.ContainerName"
    }
    name      = "MemoryUtilization"
    namespace = "example"
    unit      = "Megabytes" // Container Insights reports MemoryUtilized in megabytes
    value     = "$.MemoryUtilized"
  }
}
task_cloudwatch_metric_alarm.tf
// task
resource "aws_cloudwatch_metric_alarm" "example_container_cpu_utilization_alarms" {
  for_each            = toset(local.container_names)
  alarm_name          = "example-container-cpu-utilization-alarm-${each.key}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "CPUUtilization"
  namespace           = "example"
  period              = "60"
  statistic           = "Average"
  threshold           = "2048"
  alarm_description   = "This metric checks for high CPU utilization on container ${each.key}"
  alarm_actions       = [local.alert_arn]
  ok_actions          = [local.alert_arn]
  dimensions = {
    ContainerName = each.key
  }
}

resource "aws_cloudwatch_metric_alarm" "example_container_memory_utilization_alarms" {
  for_each            = toset(local.container_names)
  alarm_name          = "example-container-memory-utilization-alarm-${each.key}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "MemoryUtilization"
  namespace           = "example"
  period              = "60"
  statistic           = "Average"
  threshold           = "512" // 512 MB: Container Insights reports MemoryUtilized in megabytes
  alarm_description   = "This metric checks for high memory utilization on container ${each.key}"
  alarm_actions       = [local.alert_arn]
  ok_actions          = [local.alert_arn]
  dimensions = {
    ContainerName = each.key
  }
}
  • dashboard
    [Image: Screenshot 2023-05-08 9.42.27.png]
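Since CpuReserved is 0 for each container, one way to still get a percentage is to divide the extracted CPU units by the task hard limit with metric math. This is not part of the original setup. It is a sketch under assumptions: the resource label, the alarm name, the 80% threshold, and the SNS topic ARN variable are hypothetical, and the value 2048 is the task-level cpu from section 1.

```hcl
# Sketch only: per-container CPU as a share of the task hard limit.
# The label, alarm name, threshold, and variable are hypothetical.
variable "alert_topic_arn" {
  type        = string
  description = "ARN of the SNS topic that receives alarm notifications"
}

resource "aws_cloudwatch_metric_alarm" "example_container_cpu_share" {
  for_each = toset(["web", "api", "worker"])

  alarm_name          = "example-container-cpu-share-${each.key}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = 1
  threshold           = 80
  alarm_description   = "Container ${each.key} uses 80% or more of the task CPU limit"
  alarm_actions       = [var.alert_topic_arn]

  # CPU units used by the container * 100 / task hard limit (2048 units)
  metric_query {
    id          = "share"
    expression  = "used * 100 / 2048"
    label       = "CPU share of task limit (%)"
    return_data = true
  }

  metric_query {
    id = "used"
    metric {
      metric_name = "CPUUtilization" # custom metric from the log metric filter
      namespace   = "example"
      period      = 60
      stat        = "Average"
      dimensions = {
        ContainerName = each.key
      }
    }
  }
}
```

The result is the container's share of the whole task budget, not a utilization figure against a container-specific limit, because no such limit exists in this configuration.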

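Summary

This article described resource monitoring at the task and container level. By using Terraform to create AWS CloudWatch dashboards, metric filters, and alarms, you can monitor resource usage and respond promptly when a problem occurs.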
