Overview
Cloud Runを使ってサービスをデプロイするのはとても簡単だが、SLO monitoringも簡単に作れることがわかったのでメモ
今回はすでにCloud Run Serviceがデプロイされている状態を想定
手順
Monitoring Serviceの設定: google_monitoring_service
次にgoogle_monitoring_serviceを定義
resource "google_monitoring_service" "sample" {
service_id = "sample"
display_name = "Cloud Run SLO sample"
basic_service {
service_type = "CLOUD_RUN"
service_labels = {
service_name = "sample" # Cloud Run service name
location = "asia-northeast1" # Cloud Run service location
}
}
}
サポートされているbasic_service
はいかの4つ (ref: SLO Monitoring)
APP_ENGINE
CLOUD_ENDPOINTS
CLUSTER_ISTIO
ISTIO_CANONICAL_SERVICE
CLOUD_RUN
今回はCLOUD_RUNを指定して、service_nameとlocationを指定。
SLOの設定: google_monitoring_slo
resource "google_monitoring_slo" "sample_availability" {
service = google_monitoring_service.sample.service_id
slo_id = "sample-availability"
display_name = "Availability of Cloud Run sample"
goal = 0.99
rolling_period_days = 30
basic_sli {
availability {
enabled = true
}
}
}
もしもrequest baseで自分で設定したい場合には、以下のように設定も可能:
request_based_sli
resource "google_monitoring_slo" "sample_availability" {
service = google_monitoring_service.sample.service_id
slo_id = "sample-availability"
display_name = "Availability of Cloud Run sample"
goal = 0.99
rolling_period_days = 30
request_based_sli {
good_total_ratio {
good_service_filter = join(" AND ", [
"metric.type=\"run.googleapis.com/request_count\"",
"resource.type=\"cloud_run_revision\"",
"resource.label.service_name=\"sample\"",
"metric.label.response_code_class=\"2xx\""
])
total_service_filter = join(" AND ", [
"metric.type=\"run.googleapis.com/request_count\"",
"resource.type=\"cloud_run_revision\"",
"resource.label.service_name=\"sample\"",
])
}
}
}
定義:
- SLO: 99%
- rolling_period_days: 30日
- SLI: (status code 200のリクエストカウント) / (すべてのリクエストカウント)
availabilityの他にもlatencyのSLOを同じサービスに追加することも可
basic_sli {
latency {
threshold = "1s"
}
}
SLO burn rate alertの設定
Slackの連携 (もしもすでに連携済みであればスキップ)
Slack連携する際は https://api.slack.com/apps を作っておいて、OAuth & PermissionsからBot User OAuth Tokenを取得して secret.tfvars
などにいれる (またはSecretManagerに入れておいて参照する)
variable "slack_bot_user_oauth_token" {
description = "slack bot user token"
type = string
sensitive = true
default = "dummy"
}
slack_bot_user_oauth_token = "xoxb-aaaaaaaaaaa"
google_monitoring_notification_channel
でSlack連携をする
resource "google_monitoring_notification_channel" "slack" {
display_name = "slack"
type = "slack"
labels = {
"channel_name" = "#alert-channel"
}
sensitive_labels {
auth_token = var.slack_bot_user_oauth_token # need secret.tfvars
}
}
Slack appは連携するチャンネルに事前に追加しておく必要がある
monitoring_alert_policyの定義
resource "google_monitoring_alert_policy" "availability_long_window" {
display_name = "SLO burn rate alert"
combiner = "AND"
conditions {
display_name = "SLO burn rate with long window"
condition_threshold {
filter = "select_slo_burn_rate(\"${google_monitoring_slo.sample_availability.name}\", 60m)"
threshold_value = "10"
duration = "0s"
comparison = "COMPARISON_GT"
}
}
notification_channels = [google_monitoring_notification_channel.slack.name]
}
これでCloud Run serviceのSLOタブで確認できるようになる
おまけ
multi-window multi-burn rateの設定
少し冗長ですが、以下のようにavailabilityとlatencyのmulti-window multi-burn rateアラートの設定をすることができます。
locals {
slo_alert_policies = {
for policy in [
{
service_name = "sample"
type = "availability"
long_window = "60m"
short_window = "5m"
threshold = "14.4"
slo_name = google_monitoring_slo.sample_availability.name
},
{
service_name = "sample"
type = "availability"
long_window = "6h"
short_window = "30m"
threshold = "6"
slo_name = google_monitoring_slo.sample_availability.name
},
{
service_name = "sample"
type = "latency"
long_window = "60m"
short_window = "5m"
threshold = "14.4"
slo_name = google_monitoring_slo.sample_latency.name
},
{
service_name = "sample"
type = "latency"
long_window = "6h"
short_window = "30m"
threshold = "6"
slo_name = google_monitoring_slo.sample_latency.name
},
] : "${policy.service_name} ${policy.type} - ${policy.threshold}" => policy
}
}
resource "google_monitoring_alert_policy" "cloud_run_slo_burn_rate" {
for_each = local.slo_alert_policies
display_name = each.key
combiner = "AND"
conditions {
display_name = "SLO burn rate with short window"
condition_threshold {
filter = "select_slo_burn_rate(\"${each.value["slo_name"]}\", ${each.value["short_window"]})"
threshold_value = each.value["threshold"]
duration = "0s"
comparison = "COMPARISON_GT"
}
}
conditions {
display_name = "SLO burn rate with long window"
condition_threshold {
filter = "select_slo_burn_rate(\"${each.value["slo_name"]}\", ${each.value["long_window"]})"
threshold_value = each.value["threshold"]
duration = "0s"
comparison = "COMPARISON_GT"
}
}
enabled = true
notification_channels = [
data.google_monitoring_notification_channel.slack_incident_channel.name,
]
}
SlackのIntegration設定
SlackのIntegration部分はSlack Bot Tokenをvarから渡したが常にこのVarに依存するのが嫌なので、一回連携ができたら、以下のように lifecycle で ignore_changesを設定することで、毎回 Slack Bot Tokenを渡さなくて良くなる
resource "google_monitoring_notification_channel" "slack" {
display_name = "slack"
type = "slack"
labels = {
"channel_name" = "#alert-channel"
}
# This is necessary only the first time
# sensitive_labels {
# auth_token = var.slack_bot_user_oauth_token # need secret.tfvars
# }
# After successfully integrated, you can ignore changes for sensitive_labels
lifecycle {
ignore_changes = [sensitive_labels]
}
}