I set up Datadog to automatically notify us about Cloud SQL instance type optimization

Posted at 2024-09-01

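Who this is for

People who want to optimize Cloud SQL instance types using more precise thresholds
People who want an automatic notification when a Cloud SQL instance has a suboptimal instance type
People who use Datadog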

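Introduction

Cost optimization. It matters, right? GCP's Active Assist has a Cost category: when a GCE or Cloud SQL instance type is not a good fit, it recommends a more suitable one, and it analyzes collected historical and recent usage metrics to suggest the best committed use discounts (CUDs). If you don't know where to start, looking at the Cost and Performance categories in Active Assist is a good first step. At the moment it makes recommendations in the following areas.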

VM machine type Recommender
Committed use discount Recommender
Idle VM Recommender
Cloud SQL overprovisioned instance Recommender

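That said, the Cloud SQL recommendations are fairly coarse, and sometimes an instance that is not being used at all never gets flagged.

So this time I'll set my own thresholds and build something that sends a Slack notification when a Cloud SQL instance's resources are going unused.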

Implementation

First, set up the Datadog GCP integration. With Terraform it looks something like this (sample).


// Service account should have compute.viewer, monitoring.viewer, cloudasset.viewer, and browser roles (the browser role is only required in the default project of the service account).
resource "google_service_account" "datadog_integration" {
  account_id   = "datadogintegration"
  display_name = "Datadog Integration"
  project      = "gcp-project"
}

// Grant token creator role to the Datadog principal account.
resource "google_service_account_iam_member" "sa_iam" {
  service_account_id = google_service_account.datadog_integration.name
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = format("serviceAccount:%s", datadog_integration_gcp_sts.foo.delegate_account_email)
}

resource "datadog_integration_gcp_sts" "foo" {
  client_email    = google_service_account.datadog_integration.email
  host_filters    = ["filter_one", "filter_two"]
  automute        = true
  is_cspm_enabled = false
}
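
The comment at the top of the sample lists the roles the service account needs, but the sample itself does not grant them. A minimal sketch of how those grants might look, assuming the same placeholder project ID `gcp-project` and that project-level bindings are what you want (the resource name `datadog_integration_roles` and the local `datadog_roles` are made up for illustration):

```hcl
# Sketch only: grants the roles named in the sample's comment to the
# integration service account. "gcp-project" is the sample's placeholder.
locals {
  datadog_roles = [
    "roles/compute.viewer",
    "roles/monitoring.viewer",
    "roles/cloudasset.viewer",
    "roles/browser",
  ]
}

# One IAM binding per role; for_each needs a set, hence toset().
resource "google_project_iam_member" "datadog_integration_roles" {
  for_each = toset(local.datadog_roles)

  project = "gcp-project"
  role    = each.value
  member  = "serviceAccount:${google_service_account.datadog_integration.email}"
}
```

`google_project_iam_member` is additive, so it should leave other bindings on the project untouched.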

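Next, set the alert conditions. I want to catch only instances where CPU is underused, memory is allocated beyond what is needed, and the instance is above a certain size, so as a trial I went with the following settings.

CPU utilization has stayed at or below 10% for at least one week
Memory utilization has stayed at or below 30% for at least one week
Memory size is 6 GB or more
Only instances that match all of these conditions go into the Alert state. In Datadog this is implemented with a composite monitor.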

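# Main alert: fires only when all three sub-monitors below are in Alert for the same database_id.
resource "datadog_monitor" "cost_alert" {
  name  = "Cost Alert"
  type  = "composite"
  query = "${datadog_monitor.cpu_utilization.id} && ${datadog_monitor.memory_utilization.id} && ${datadog_monitor.memory_quota.id}"

  require_full_window = false
  notify_no_data      = false
  timeout_h           = 0

  include_tags = false
  # Left empty in this sample; put the notification text and an @-handle (for example a Slack channel) here.
  message      = ""
}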

# Alert if CPU utilization never exceeds 10% during the past week.
resource "datadog_monitor" "cpu_utilization" {
  name  = "CPU Utilization"
  type  = "query alert"
  query = "max(last_1w):max:gcp.cloudsql.database.cpu.utilization{*} by {database_id} < 0.1"

  monitor_thresholds {
    critical = 0.1
  }

  require_full_window = false
  notify_no_data      = false
  timeout_h           = 0
  evaluation_delay    = 660

  include_tags = false
  message      = ""
}

# Alert if memory utilization (usage / quota, as a percentage) never exceeds 30% during the past week.
resource "datadog_monitor" "memory_utilization" {
  name  = "Memory Utilization"
  type  = "query alert"
  query = "max(last_1w):( avg:gcp.cloudsql.database.memory.total_usage{*} by {database_id} / max:gcp.cloudsql.database.memory.quota{*} by {database_id} ) * 100 < 30"

  monitor_thresholds {
    critical = 30
  }

  require_full_window = false
  notify_no_data      = false
  timeout_h           = 0
  evaluation_delay    = 660

  include_tags = false
  message      = ""
}

# Match instances with roughly 6 GB of memory or more (keeps small instances from alerting).
resource "datadog_monitor" "memory_quota" {
  name  = "Memory Quota Over 6GB"
  type  = "query alert"
  query = "max(last_1h):max:gcp.cloudsql.database.memory.quota{*} by {database_id} > 6144000000"

  monitor_thresholds {
    critical = 6144000000
  }

  require_full_window = false
  notify_no_data      = false
  timeout_h           = 0
  evaluation_delay    = 660

  include_tags = false
  message      = ""
}
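
Every monitor in the sample has an empty `message`, so as written there is no recipient for the alert. A minimal sketch of a notification body, assuming the Datadog Slack integration is already installed and that a channel called `cost-alerts` (a hypothetical name) is configured in it:

```hcl
# Sketch only: "cost-alerts" is a hypothetical Slack channel name, and the
# wording of the message is illustrative.
locals {
  cost_alert_message = <<-EOT
    {{#is_alert}}
    A Cloud SQL instance has shown low CPU and memory utilization for the past week
    and has 6 GB or more of memory. Consider moving it to a smaller instance type.
    {{/is_alert}}
    @slack-cost-alerts
  EOT
}
```

Setting `message = local.cost_alert_message` on the `cost_alert` composite monitor would route the alert to that channel. The three sub-monitors can keep an empty message so that only the composite notifies.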

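Afterword

So, how was it? In our environment, 5 out of more than 100 instances matched the alert. The nice thing about this monitor is that you no longer have to stare at a spreadsheet or check instances one by one; you just respond once the alert arrives in Slack. The evaluation window is still set to one week on a trial basis, but I'm thinking a month or so might also work. Also, I used Datadog this time, but it looks like the same thing could be implemented with GCP's native Monitoring and Alerting features.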
