Creating SLO monitoring for GCP Cloud Run with Terraform

Last updated at 2024-04-11, posted at 2024-02-23

Overview

Deploying a service with Cloud Run is very easy, and it turns out that SLO monitoring is easy to set up too, so here are my notes.

This article assumes that a Cloud Run service has already been deployed.

Steps

Configuring the Monitoring Service: google_monitoring_service

First, define google_monitoring_service.

resource "google_monitoring_service" "sample" {
  service_id   = "sample"
  display_name = "Cloud Run SLO sample"

  basic_service {
    service_type = "CLOUD_RUN"
    service_labels = {
      service_name = "sample" # Cloud Run service name
      location     = "asia-northeast1" # Cloud Run service location
    }
  }
}

The article's reference lists the following five basic_service types as supported (ref: SLO Monitoring):

APP_ENGINE
CLOUD_ENDPOINTS
CLUSTER_ISTIO
ISTIO_CANONICAL_SERVICE
CLOUD_RUN

Here we specify CLOUD_RUN and set service_name and location in service_labels.
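Since the service is assumed to be deployed already, the two labels could also be read from the existing service rather than typed by hand. The following is a minimal sketch, assuming the google provider's `google_cloud_run_service` data source; the local name `cloud_run_service_labels` is hypothetical and does not appear in the original setup.

```hcl
# Assumption: look up the already-deployed Cloud Run service.
# The plan fails early if the name or location is wrong.
data "google_cloud_run_service" "sample" {
  name     = "sample"          # Cloud Run service name
  location = "asia-northeast1" # Cloud Run service location
}

locals {
  # Hypothetical helper: the label map expected by basic_service.service_labels
  cloud_run_service_labels = {
    service_name = data.google_cloud_run_service.sample.name
    location     = data.google_cloud_run_service.sample.location
  }
}
```

With this, `service_labels = local.cloud_run_service_labels` could replace the hard-coded map in the basic_service block.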

Configuring the SLO: google_monitoring_slo

Next, define google_monitoring_slo.

resource "google_monitoring_slo" "sample_availability" {
  service      = google_monitoring_service.sample.service_id
  slo_id       = "sample-availability"
  display_name = "Availability of Cloud Run sample"

  goal                = 0.99
  rolling_period_days = 30

  basic_sli {
    availability {
      enabled = true
    }
  }
}

If you want to define a request-based SLI yourself, it can also be configured as follows:

request_based_sli

resource "google_monitoring_slo" "sample_availability" {
  service      = google_monitoring_service.sample.service_id
  slo_id       = "sample-availability"
  display_name = "Availability of Cloud Run sample"

  goal                = 0.99
  rolling_period_days = 30

  request_based_sli {
    good_total_ratio {
      good_service_filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        "resource.label.service_name=\"sample\"",
        "metric.label.response_code_class=\"2xx\""
      ])
      total_service_filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        "resource.label.service_name=\"sample\"",
      ])
    }
  }
}

Definition:

SLO: 99%
rolling_period_days: 30 days
SLI: (count of requests with a 2xx response code) / (count of all requests)

In addition to availability, a latency SLO can also be added to the same service.

For latency

  basic_sli {
    latency {
      threshold = "1s"
    }
  }
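The fragment above only shows the basic_sli block. As a sketch of how it might fit into a complete resource, the following assumes the same goal and rolling period as the availability SLO; the `slo_id` and `display_name` values are made up for illustration. The resource name `sample_latency` matches the name referenced by the multi-window alert configuration later in this article.

```hcl
# Assumption: same 99% goal and 30-day rolling window as the availability SLO
resource "google_monitoring_slo" "sample_latency" {
  service      = google_monitoring_service.sample.service_id
  slo_id       = "sample-latency"              # hypothetical ID
  display_name = "Latency of Cloud Run sample" # hypothetical name

  goal                = 0.99
  rolling_period_days = 30

  basic_sli {
    latency {
      threshold = "1s" # requests faster than this count as good
    }
  }
}
```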

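Configuring the SLO burn rate alert

Slack integration (skip this if it is already set up)

For the Slack integration, create an app at https://api.slack.com/apps, get the Bot User OAuth Token from OAuth & Permissions, and put it in secret.tfvars or similar (or store it in Secret Manager and reference it from there).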

variables.tf

variable "slack_bot_user_oauth_token" {
  description = "slack bot user token"
  type        = string
  sensitive   = true
  default     = "dummy"
}

secret.tfvars

slack_bot_user_oauth_token = "xoxb-aaaaaaaaaaa"

Set up the Slack integration with google_monitoring_notification_channel.

resource "google_monitoring_notification_channel" "slack" {
  display_name = "slack"
  type         = "slack"
  labels = {
    "channel_name" = "#alert-channel"
  }
  sensitive_labels {
    auth_token = var.slack_bot_user_oauth_token # need secret.tfvars
  }
}

The Slack app must be added to the target channel beforehand.
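For the Secret Manager option mentioned earlier, a minimal sketch might look like the following. It assumes a secret already exists in Secret Manager; the secret name `slack-bot-user-oauth-token` and the resource name `slack_from_secret` are hypothetical.

```hcl
# Assumption: the Bot User OAuth Token was stored in Secret Manager beforehand
data "google_secret_manager_secret_version" "slack_token" {
  secret = "slack-bot-user-oauth-token" # hypothetical secret name
}

resource "google_monitoring_notification_channel" "slack_from_secret" {
  display_name = "slack"
  type         = "slack"
  labels = {
    "channel_name" = "#alert-channel"
  }
  sensitive_labels {
    # Read the token from the latest secret version instead of a variable
    auth_token = data.google_secret_manager_secret_version.slack_token.secret_data
  }
}
```

Note that the token value is still recorded in the Terraform state either way, so the state backend needs to be protected accordingly.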

Defining google_monitoring_alert_policy

resource "google_monitoring_alert_policy" "availability_long_window" {
  display_name = "SLO burn rate alert"
  combiner     = "AND"
  conditions {
    display_name = "SLO burn rate with long window"
    condition_threshold {
      filter          = "select_slo_burn_rate(\"${google_monitoring_slo.sample_availability.name}\", 60m)"
      threshold_value = "10"
      duration        = "0s"
      comparison      = "COMPARISON_GT"
    }
  }
  notification_channels = [google_monitoring_notification_channel.slack.name]
}

With this in place, the SLO can be checked on the SLO tab of the Cloud Run service.

Bonus

Configuring multi-window, multi-burn-rate alerts

It is a little verbose, but multi-window, multi-burn-rate alerts for availability and latency can be configured as follows.

locals {
  slo_alert_policies = {
    for policy in [
      {
        service_name = "sample"
        type         = "availability"
        long_window  = "60m"
        short_window = "5m"
        threshold    = "14.4"
        slo_name     = google_monitoring_slo.sample_availability.name
      },
      {
        service_name = "sample"
        type         = "availability"
        long_window  = "6h"
        short_window = "30m"
        threshold    = "6"
        slo_name     = google_monitoring_slo.sample_availability.name
      },
      {
        service_name = "sample"
        type         = "latency"
        long_window  = "60m"
        short_window = "5m"
        threshold    = "14.4"
        slo_name     = google_monitoring_slo.sample_latency.name
      },
      {
        service_name = "sample"
        type         = "latency"
        long_window  = "6h"
        short_window = "30m"
        threshold    = "6"
        slo_name     = google_monitoring_slo.sample_latency.name
      },
    ] : "${policy.service_name} ${policy.type} - ${policy.threshold}" => policy
  }
}
resource "google_monitoring_alert_policy" "cloud_run_slo_burn_rate" {
  for_each     = local.slo_alert_policies
  display_name = each.key
  combiner     = "AND"
  conditions {
    display_name = "SLO burn rate with short window"
    condition_threshold {
      filter          = "select_slo_burn_rate(\"${each.value["slo_name"]}\", ${each.value["short_window"]})"
      threshold_value = each.value["threshold"]
      duration        = "0s"
      comparison      = "COMPARISON_GT"
    }
  }

  conditions {
    display_name = "SLO burn rate with long window"
    condition_threshold {
      filter          = "select_slo_burn_rate(\"${each.value["slo_name"]}\", ${each.value["long_window"]})"
      threshold_value = each.value["threshold"]
      duration        = "0s"
      comparison      = "COMPARISON_GT"
    }
  }
  enabled = true
  notification_channels = [
    google_monitoring_notification_channel.slack.name,
  ]
}
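To make the thresholds above easier to read: a burn rate held for a given window consumes (burn rate × window / SLO period) of the error budget. The sketch below works through that arithmetic for the two long windows used here, assuming the 30-day rolling period from the SLO definition. The local and output names are hypothetical and exist only for illustration.

```hcl
locals {
  slo_period_hours = 30 * 24 # rolling_period_days = 30, i.e. 720 hours

  # Fraction of the error budget consumed = burn rate * window hours / period hours
  budget_consumed_by_window = {
    "14.4 over 60m" = 14.4 * 1 / local.slo_period_hours # about 2% of the budget
    "6 over 6h"     = 6 * 6 / local.slo_period_hours    # about 5% of the budget
  }
}

output "budget_consumed_by_window" {
  value = local.budget_consumed_by_window
}
```

In other words, the first pair of policies fires when roughly 2% of the 30-day budget is burned within an hour, and the second when roughly 5% is burned within six hours; the short window in each policy only confirms that the burn is still ongoing.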

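Slack integration settings

The Slack integration above receives the Slack Bot Token through a variable, but I don't like depending on that variable forever. Once the integration has been set up, adding ignore_changes in a lifecycle block as shown below means the Slack Bot Token no longer has to be passed on every run.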

resource "google_monitoring_notification_channel" "slack" {
  display_name = "slack"
  type         = "slack"
  labels = {
    "channel_name" = "#alert-channel"
  }

  # This is necessary only the first time
  # sensitive_labels {
  #   auth_token = var.slack_bot_user_oauth_token # need secret.tfvars
  # }

  # After the integration succeeds, changes to sensitive_labels can be ignored
  lifecycle {
    ignore_changes = [sensitive_labels]
  }
}

