0
0

GCP Cloud RunのSLO monitoringをterraformで作成

Last updated at Posted at 2024-02-23

Overview

Cloud Runを使ってサービスをデプロイするのはとても簡単だが、SLO monitoringも簡単に作れることがわかったのでメモ

今回はすでにCloud Run Serviceがデプロイされている状態を想定

手順

Monitoring Serviceの設定: google_monitoring_service

次にgoogle_monitoring_serviceを定義

resource "google_monitoring_service" "sample" {
  service_id   = "sample"
  display_name = "Cloud Run SLO sample"

  basic_service {
    service_type = "CLOUD_RUN"
    service_labels = {
      service_name = "sample" # Cloud Run service name
      location     = "asia-northeast1" # Cloud Run service location
    }
  }
}

サポートされているbasic_serviceはいかの4つ (ref: SLO Monitoring)

  1. APP_ENGINE
  2. CLOUD_ENDPOINTS
  3. CLUSTER_ISTIO
  4. ISTIO_CANONICAL_SERVICE
  5. CLOUD_RUN

今回はCLOUD_RUNを指定して、service_nameとlocationを指定。

SLOの設定: google_monitoring_slo

次にgoogle_monitoring_sloを定義

resource "google_monitoring_slo" "sample_availability" {
  service      = google_monitoring_service.sample.service_id
  slo_id       = "sample-availability"
  display_name = "Availability of Cloud Run sample"

  goal                = 0.99
  rolling_period_days = 30

  basic_sli {
    availability {
      enabled = true
    }
  }
}

もしもrequest baseで自分で設定したい場合には、以下のように設定も可能:

request_based_sli
resource "google_monitoring_slo" "sample_availability" {
  service      = google_monitoring_service.sample.service_id
  slo_id       = "sample-availability"
  display_name = "Availability of Cloud Run sample"

  goal                = 0.99
  rolling_period_days = 30

  request_based_sli {
    good_total_ratio {
      good_service_filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        "resource.label.service_name=\"sample\"",
        "metric.label.response_code_class=\"2xx\""
      ])
      total_service_filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        "resource.label.service_name=\"sample\"",
      ])
    }
  }
}

定義:

  1. SLO: 99%
  2. rolling_period_days: 30日
  3. SLI: (status code 200のリクエストカウント) / (すべてのリクエストカウント)

availabilityの他にもlatencyのSLOを同じサービスに追加することも可

latencyの場合
  basic_sli {
    latency {
      threshold = "1s"
    }
  }

SLO burn rate alertの設定

Slackの連携 (もしもすでに連携済みであればスキップ)

Slack連携する際は https://api.slack.com/apps を作っておいて、OAuth & PermissionsからBot User OAuth Tokenを取得して secret.tfvarsなどにいれる (またはSecretManagerに入れておいて参照する)

variables.tf
variable "slack_bot_user_oauth_token" {
  description = "slack bot user token"
  type        = string
  sensitive   = true
  default     = "dummy"
}
secret.tfvars
slack_bot_user_oauth_token = "xoxb-aaaaaaaaaaa"

google_monitoring_notification_channel でSlack連携をする

resource "google_monitoring_notification_channel" "slack" {
  display_name = "slack"
  type         = "slack"
  labels = {
    "channel_name" = "#alert-channel"
  }
  sensitive_labels {
    auth_token = var.slack_bot_user_oauth_token # need secret.tfvars
  }
}

Slack appは連携するチャンネルに事前に追加しておく必要がある

monitoring_alert_policyの定義

resource "google_monitoring_alert_policy" "availability_long_window" {
  display_name = "SLO burn rate alert"
  combiner     = "AND"
  conditions {
    display_name = "SLO burn rate with long window"
    condition_threshold {
      filter          = "select_slo_burn_rate(\"${google_monitoring_slo.sample_availability.name}\", 60m)"
      threshold_value = "10"
      duration        = "0s"
      comparison      = "COMPARISON_GT"
    }
  }
  notification_channels = [google_monitoring_notification_channel.slack.name]
}

これでCloud Run serviceのSLOタブで確認できるようになる

おまけ

multi-window multi-burn rateの設定

少し冗長ですが、以下のようにavailabilityとlatencyのmulti-window multi-burn rateアラートの設定をすることができます。

locals {
  slo_alert_policies = {
    for policy in [
      {
        service_name = "sample"
        type         = "availability"
        long_window  = "60m"
        short_window = "5m"
        threshold    = "14.4"
        slo_name     = google_monitoring_slo.sample_availability.name
      },
      {
        service_name = "sample"
        type         = "availability"
        long_window  = "6h"
        short_window = "30m"
        threshold    = "6"
        slo_name     = google_monitoring_slo.sample_availability.name
      },
      {
        service_name = "sample"
        type         = "latency"
        long_window  = "60m"
        short_window = "5m"
        threshold    = "14.4"
        slo_name     = google_monitoring_slo.sample_latency.name
      },
      {
        service_name = "sample"
        type         = "latency"
        long_window  = "6h"
        short_window = "30m"
        threshold    = "6"
        slo_name     = google_monitoring_slo.sample_latency.name
      },
    ] : "${policy.service_name} ${policy.type} - ${policy.threshold}" => policy
  }
}
resource "google_monitoring_alert_policy" "cloud_run_slo_burn_rate" {
  for_each     = local.slo_alert_policies
  display_name = each.key
  combiner     = "AND"
  conditions {
    display_name = "SLO burn rate with short window"
    condition_threshold {
      filter          = "select_slo_burn_rate(\"${each.value["slo_name"]}\", ${each.value["short_window"]})"
      threshold_value = each.value["threshold"]
      duration        = "0s"
      comparison      = "COMPARISON_GT"
    }
  }

  conditions {
    display_name = "SLO burn rate with long window"
    condition_threshold {
      filter          = "select_slo_burn_rate(\"${each.value["slo_name"]}\", ${each.value["long_window"]})"
      threshold_value = each.value["threshold"]
      duration        = "0s"
      comparison      = "COMPARISON_GT"
    }
  }
  enabled = true
  notification_channels = [
    data.google_monitoring_notification_channel.slack_incident_channel.name,
  ]
}

SlackのIntegration設定

SlackのIntegration部分はSlack Bot Tokenをvarから渡したが常にこのVarに依存するのが嫌なので、一回連携ができたら、以下のように lifecycle で ignore_changesを設定することで、毎回 Slack Bot Tokenを渡さなくて良くなる

resource "google_monitoring_notification_channel" "slack" {
  display_name = "slack"
  type         = "slack"
  labels = {
    "channel_name" = "#alert-channel"
  }

  # This is necessary only the first time
  # sensitive_labels {
  #   auth_token = var.slack_bot_user_oauth_token # need secret.tfvars
  # }

  # After successfully integrated, you can ignore changes for sensitive_labels
  lifecycle {
    ignore_changes = [sensitive_labels]
  }
}

Link

  1. https://cloud.google.com/stackdriver/docs/solutions/slo-monitoring/api/api-structures
  2. https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_notification_channel
  3. https://zenn.dev/sshota0809/articles/google-cloud-platform-sli-slo
0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0