
Figuring out how Stackdriver Monitoring aggregates the Total Latency metric

Posted at 2020-02-09

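TL;DR

The https/total_latencies metric is of type DISTRIBUTION, so the data being aggregated is already a histogram
ALIGN_SUM on histograms merges the histograms (or so it appears)
ALIGN_SUM + REDUCE_PERCENTILE_99 looks like the right way to aggregate https/total_latencies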

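How it started

Stackdriver Monitoring's Dashboards come with ready-made dashboards for Google Cloud Load Balancers and the like, where you can see response time (Total Latency), status codes (Response by Response Code Class), and so on.

The built-in _Total Latency_ chart is configured like this (Aligner: sum, Aggregator: 99th percentile),

and I ~~copied it wholesale~~ used it as a reference to build an alert policy (I first created it in the web console, then ran terraform import and fiddled with the result until I had a .tf file. Not pretty...)

But something about its behavior was off.

The alert fired because latency above the threshold was detected, yet no matter how I dug through logs outside Stackdriver, I could not find any sign of slow responses.

So I looked into it, and

the Aggregator and Aligner settings were swapped.

Once I fixed that, the chart started to look plausibly correct, but I still did not understand what had actually been going on, so I investigated properly.

In particular, Aligner : sum was a complete mystery.

Surely you can't just add response times together???

Findings

Conclusion: it turns out to be fine.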

The type of `https/total_latencies`

Google Cloud metrics | Stackdriver Monitoring

For https/total_latencies, the _metric kind_ is DELTA and the _value type_ is **DISTRIBUTION**.

What's that?

What is the `Distribution` type?

TimeSeries | Stackdriver Monitoring | Google Cloud

Distribution contains summary statistics for a population of values. It optionally contains a histogram representing the distribution of those values across a set of buckets.

Apparently the format looks like this:

JSON representation

{
  "count": string,
  "mean": number,
  "sumOfSquaredDeviation": number,
  "range": {
    object (Range)
  },
  "bucketOptions": {
    object (BucketOptions)
  },
  "bucketCounts": [
    string
  ],
  "exemplars": [
    {
      object (Exemplar)
    }
  ]
}

Meaning what, exactly?

Looking directly at the `DISTRIBUTION`-typed `https/total_latencies` data

Method: projects.timeSeries.list | Stackdriver Monitoring

Call projects.timeSeries.list from the Google API Explorer

name : projects/{projectId}

filter : (in practice the clauses are separated by spaces, not newlines)

metric.type="loadbalancing.googleapis.com/https/total_latencies"
resource.type="https_lb_rule"
resource.label."url_map_name"="target url_map_name"
resource.label."project_id"="target project ID"

interval.startTime/interval.endTime : RFC3339 (ex. 2020-02-09T14:00:00+09:00)
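
The parameters above can also be assembled programmatically. The following is a minimal sketch that only builds the filter and query string and sends no request; the function names `build_filter` and `build_query` and the sample values are made up for illustration.

```python
from urllib.parse import urlencode


def build_filter(url_map_name: str, project_id: str) -> str:
    """Join the filter clauses listed above with single spaces (not newlines)."""
    clauses = [
        'metric.type="loadbalancing.googleapis.com/https/total_latencies"',
        'resource.type="https_lb_rule"',
        f'resource.label."url_map_name"="{url_map_name}"',
        f'resource.label."project_id"="{project_id}"',
    ]
    return " ".join(clauses)


def build_query(url_map_name: str, project_id: str, start: str, end: str) -> str:
    """Return a URL-encoded query string for projects.timeSeries.list."""
    params = {
        "filter": build_filter(url_map_name, project_id),
        "interval.startTime": start,  # RFC3339
        "interval.endTime": end,      # RFC3339
    }
    return urlencode(params)


# Hypothetical values, for illustration only.
print(build_filter("my-url-map", "my-project"))
print(build_query("my-url-map", "my-project",
                  "2020-02-09T14:00:00+09:00", "2020-02-09T15:00:00+09:00"))
```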

Result

{
  "timeSeries": [
    {
      "metric": {
        "labels": {
          "client_country": "Japan",
          "response_code_class": "200",
          "protocol": "HTTP/1.1",
          "response_code": "200",
          "cache_result": "DISABLED",
          "proxy_continent": "Asia"
        },
        "type": "loadbalancing.googleapis.com/https/total_latencies"
      },
      "resource": {
        "type": "https_lb_rule",
        "labels": {
          "backend_target_type": "BACKEND_SERVICE",
          "project_id": "project ID",
          "backend_type": "INSTANCE_GROUP",
          "forwarding_rule_name": "k8s-fw-default-my-ingress--dba157e117f099a8",
          "backend_target_name": "",
          "backend_name": "",
          "backend_scope": "asia-northeast1-b",
          "matched_url_path_rule": "/",
          "region": "global",
          "url_map_name": "k8s-um-default-my-ingress--dba157e117f099a8",
          "backend_scope_type": "ZONE",
          "target_proxy_name": "k8s-tp-default-my-ingress--dba157e117f099a8"
        }
      },
      "metricKind": "DELTA",
      "valueType": "DISTRIBUTION",
      "points": [
        {
          "interval": {
            "startTime": "2020-02-09T07:02:00Z",
            "endTime": "2020-02-09T07:03:00Z"
          },
          "value": {
            "distributionValue": {
              "count": "2",
              "mean": 64,
              "bucketOptions": {
                "exponentialBuckets": {
                  "numFiniteBuckets": 66,
                  "growthFactor": 1.4,
                  "scale": 1
                }
              },
              "bucketCounts": [
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "2"
              ]
            }
          }
        }
      ]
    },
    {
        ...
    }
  ]
}

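Decoding the `DISTRIBUTION`-typed `https/total_latencies` data

bucketOptions: exponentialBuckets
- A histogram whose bucket widths grow exponentially
- The other available layouts are linearBuckets and explicitBuckets
- Bucket i covers the range from scale * (growthFactor ^ (i - 1)) to scale * (growthFactor ^ i)
- scale and growthFactor had the same values in all the data I retrieved
- They may be fixed for https/total_latencies, or they might vary with traffic volume
bucketCounts
- The distribution of the data over the histogram defined by bucketOptions
metric, resource and points
- metric, resource : the data is first split by the various labels of the request
- Each of those is then further split by time (points)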

So the data above says:

scale: 1, growthFactor: 1.4, and the count at i=13 is 2

→ 2 responses with a latency between 1.4^12 (=56.6939124) ms and 1.4^13 (=79.3714773) ms
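
The bucket reading above can be reproduced with a short sketch. It follows the exponentialBuckets bounds described in the BucketOptions documentation, where index 0 is the underflow bucket and index numFiniteBuckets + 1 is the overflow bucket; the function name `exponential_bucket_bounds` is made up for illustration.

```python
from __future__ import annotations


def exponential_bucket_bounds(i: int, scale: float, growth_factor: float,
                              num_finite_buckets: int) -> tuple[float, float]:
    """Return (lower, upper) bounds of bucket index i for exponentialBuckets."""
    if i == 0:
        # Underflow bucket: everything below `scale`.
        return (float("-inf"), scale)
    if i == num_finite_buckets + 1:
        # Overflow bucket: everything at or above the last finite bound.
        return (scale * growth_factor ** num_finite_buckets, float("inf"))
    if not 0 < i <= num_finite_buckets:
        raise ValueError(f"bucket index out of range: {i}")
    return (scale * growth_factor ** (i - 1), scale * growth_factor ** i)


# bucketCounts from the response above: trailing zero buckets are omitted.
bucket_counts = [int(c) for c in ["0"] * 13 + ["2"]]

for index, count in enumerate(bucket_counts):
    if count:
        lower, upper = exponential_bucket_bounds(index, 1, 1.4, 66)
        print(f"i={index}: {count} responses between {lower:.7f} ms and {upper:.7f} ms")
```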

I can read it... I can actually read it...!

`ALIGN_SUM`

REST Resource: projects.alertPolicies | Stackdriver Monitoring

First, perSeriesAligner is applied, aggregating along the time axis (alignmentPeriod),

then crossSeriesReducer is applied, aggregating along the resource-classification axis (groupByFields).

And

perSeriesAligner : enum (Aligner)

ALIGN_SUM : Align the time series by returning the sum of the values in each alignment period. This aligner is valid for GAUGE and DELTA metrics with numeric and distribution values. The valueType of the aligned result is the same as the valueType of the input.

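The output type of ALIGN_SUM is the same as the input type.

If the input is DISTRIBUTION, the output is DISTRIBUTION too.

In other words, "sum" here never meant adding response times together; it meant adding up (merging?) histograms.

(Note: the documentation does not state this explicitly, so this is my guess.)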
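
As a rough illustration of that guess, merging two histograms that share the same bucketOptions amounts to adding their bucketCounts element by element. This is only a sketch of the assumed behavior, not Stackdriver's documented implementation; `merge_bucket_counts` and the sample data are hypothetical.

```python
from __future__ import annotations

from itertools import zip_longest


def merge_bucket_counts(*histograms: list[int]) -> list[int]:
    """Add bucketCounts element-wise.

    Assumes every histogram uses the same bucketOptions. Shorter lists are
    padded with zeros because trailing empty buckets are omitted in the API
    response.
    """
    return [sum(column) for column in zip_longest(*histograms, fillvalue=0)]


# Two hypothetical one-minute points from the same time series.
minute_1 = [0] * 13 + [2]            # 2 responses in bucket 13
minute_2 = [0] * 12 + [1, 3, 0, 1]   # responses in buckets 12, 13 and 15

merged = merge_bucket_counts(minute_1, minute_2)
print(merged)
print(sum(merged))  # total count is preserved by the merge
```

The result is still a histogram, so a percentile can be taken from it afterwards, which is what makes the sum-then-percentile order meaningful.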

`REDUCE_PERCENTILE_99`

Finally, the "Aggregator : 99th percentile" part.

The documentation did not describe how a percentile is derived from a histogram, so I checked the arithmetic myself.

99th percentile

1.4^12*0.01 + 1.4^13*0.99 = 79.1447017

95th percentile

1.4^12*0.05 + 1.4^13*0.95 = 78.2375991

50th percentile

1.4^12*0.50 + 1.4^13*0.50 = 68.0326949

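A perfect match.

But I have no idea how it is calculated when the data is spread across multiple buckets...

I suspect there is a standard mathematical definition for it.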
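
One common way to estimate a percentile from a histogram is to walk the cumulative counts and interpolate linearly inside the bucket that contains the target rank. The sketch below reproduces the three values above for the single-bucket case. Whether Stackdriver does the same when several buckets are populated is an assumption, not something confirmed by the documentation, and `percentile_from_histogram` is a hypothetical name.

```python
from __future__ import annotations


def percentile_from_histogram(bucket_counts: list[int], p: float,
                              scale: float, growth_factor: float) -> float:
    """Estimate the p-quantile (0 <= p <= 1) of an exponentialBuckets histogram.

    Assumption: values are spread uniformly within each bucket. The underflow
    bucket (index 0) is treated as [0, scale); the overflow bucket is not
    handled specially.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    total = sum(bucket_counts)
    if total == 0:
        raise ValueError("empty histogram")
    target = p * total  # rank we are looking for
    cumulative = 0
    for i, count in enumerate(bucket_counts):
        if count and cumulative + count >= target:
            lower = scale * growth_factor ** (i - 1) if i > 0 else 0.0
            upper = scale * growth_factor ** i
            fraction = (target - cumulative) / count
            return lower + (upper - lower) * fraction
        cumulative += count
    raise ValueError("target rank not reached")


counts = [0] * 13 + [2]  # the point shown earlier
for p in (0.99, 0.95, 0.50):
    print(p, round(percentile_from_histogram(counts, p, 1, 1.4), 7))
```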

What if you do `ALIGN_PERCENTILE_99` and then `REDUCE_SUM`?

Here is what was happening while the settings were wrong.

ALIGN_PERCENTILE_99

Align the time series by using percentile aggregation. The resulting data point in each alignment period is the 99th percentile of all data points in the period. This aligner is valid for GAUGE and DELTA metrics with distribution values. The output is a GAUGE metric with valueType DOUBLE.

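The output is of type DOUBLE.

At this point it is no longer a histogram.

In other words:

First, ALIGN_PERCENTILE_99
- For each resource, the 99th percentile value, i.e. a response time (ms), is computed per alignmentPeriod
Then, REDUCE_SUM
- The results above are "summed" according to groupByFields
- This sum really does add the response time values together

In this case, exactly what I pictured when I thought "surely you can't just add response times together???" actually happens.

In practice, whenever resource.labels.backend_scope was spread across "asia-northeast1-a", "-b", and "-c", the addition kicked in, and in Stackdriver alone it looked as though there had been responses twice as slow.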
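
The difference between the two orders can be seen with a small sketch. It uses hypothetical data for two backend_scope values in the same minute and assumes, as guessed earlier, that summing distributions merges their bucket counts and that percentiles are interpolated linearly within a bucket; none of the helper names come from the Monitoring API.

```python
from itertools import zip_longest

SCALE, GROWTH = 1.0, 1.4  # values observed in the data shown earlier


def p99(bucket_counts):
    """99th percentile by linear interpolation inside the target bucket (assumed method)."""
    target = 0.99 * sum(bucket_counts)
    cumulative = 0
    for i, count in enumerate(bucket_counts):
        if count and cumulative + count >= target:
            lower = SCALE * GROWTH ** (i - 1) if i > 0 else 0.0
            upper = SCALE * GROWTH ** i
            return lower + (upper - lower) * (target - cumulative) / count
        cumulative += count
    raise ValueError("empty histogram")


def merge(*histograms):
    """Element-wise sum of bucketCounts (assumed meaning of summing distributions)."""
    return [sum(column) for column in zip_longest(*histograms, fillvalue=0)]


# Hypothetical: the same minute, split across two backend_scope values.
zone_a = [0] * 13 + [2]
zone_b = [0] * 13 + [2]

wrong = p99(zone_a) + p99(zone_b)   # percentile per series first, then sum the numbers
right = p99(merge(zone_a, zone_b))  # merge the histograms first, then take the percentile

print(f"percentile then sum: {wrong:.4f} ms")
print(f"sum then percentile: {right:.4f} ms")
```

With two series the first order reports roughly double the latency that any single request actually had, which is consistent with the spurious alerts described above.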

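Miscellaneous notes

Thanks to the misconfiguration, I finally felt like investigating something I had left alone without really understanding it, and I learned a lot that I had not known.
Might it be better to monitor _Backend Latency_ rather than _Total Latency_...?
For the latency metrics, copying the default dashboard as-is seems fine, but for the Request Count metrics, wouldn't ALIGN_SUM be better than ALIGN_RATE?
Be careful: changing the time scale (1h, 6h, 1d, ...) of a chart in the web console automatically changes alignmentPeriod, ignoring the setting in the left panel.
Is this kind of data format standard in the monitoring world, or is it specific to Stackdriver?
In Datadog you can't get the 99th percentile or even max, only avg; I wonder how everyone deals with that.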
