More than 1 year has passed since last update.

LIFULL Advent Calendar 2022

@hir00in

株式会社LIFULL

OpenTelemetryとGrafanaでLogsとMetricsとTracesを接続する

Last updated at 2022-12-19Posted at 2022-12-19

OpenTelemetryとGrafanaを利用してPrimary Signalsを接続してみました
この記事で紹介したものはDocker Composeで必要なソフトウェアが起動するようにしてGitHubにpushしてあります

はじめに

CNCF TAG Observabilityのwhitepapperでも紹介されている
以下のPrimary Signalsは可観測性を語る上では欠かせない要素になっています

DatadogやSentry、Splunkなど各種プラットフォームでもLogsとTracesの接続など
このPrimary Signals同士を互いに参照させることで優れた体験を得ることが出来ます

一概に対抗というわけではないですがGrafana Labsでもこの領域ではアクティブに活動していてソフトウェアは多岐に渡ります
LGTM stackと呼ばれるようにログにはLoki、可視化のプラットフォームとしてのGrafana、分散トレーシングのバックエンドとしてのTempo、Prometheusが扱うメトリクスを長期保存し参照可能にするMimirといったように幅広く可観測性向上のために取り組みアクティブに更新され魅力的な機能に溢れています

それとOpenTelemetry

以前はObservabilityのバックエンドにデータを送信するための標準化されたデータ形式がなくJaegerやZipkinなどに向けた計測ライブラリやエージェントを実装する必要があり、データを送信するバックエンドを切り替える際にコードの修正が必要になってしまっていました
そこでテレメトリデータのオープンスタンダードを提供しユーザの負担を抑えようとしているのがOpenTelemetryのコミュニティです
さらにOpenTelemetryの関心事はTracesだけにはとどまらずMetrics、Logsの標準化も視野にいれて活動しています。
OpenTelemetryでは各言語のSDKが提供され豊富な計測用のライブラリなども利用することが出来ます

では、Primary Signalsが計測され、それらが連携されることでどういった体験が得られるでしょうか。
OpenTelemetryとGrafana Labsの各種ソフトウェアを利用してこれらのSignalsを連携し検証してみました

構成

作成したものの構成ですが以下のようになっています

Goの参考実装にはOpenTelemetryの実装とpromhttp、それからInstrumentationの検証としてgormのInstrumentationを実装しています

GrafanaのDatasourceにはLoki,Prometheus,Tempoを設定しGrafanaから各Signalsを閲覧できるように整備しています

準備

次を実行すると各ソフトウェアが起動します

$ docker-compose up

次に、ログを作成するためにリクエストを流していきます
別に何でもいいのですがk6を使ってリクエストをしていきます

k6もまたGrafana Labsが開発している負荷テスト用のソフトウェアです

k6 x TempoによってPerformance Testを行った結果に加え、分散トレーシングの結果を確認していくことでより解像度が増すことも紹介されています

k6をインストールし次を実行するとscript.jsに書かれたエンドポイントに対してリクエストが流れていきます

$ k6 run script.js

次はEnvoyの紹介ですが、Goの参考実装の前段にいるEnvoyはアクセスログを収集しています
まだOpenTelemetry tracerはWIPということですがEnvoyにHTTP tracerを設定してありtraceparentヘッダをロギングしています

infra/envoy/config/envoy.yaml

              tracing:
                provider:
                  name: envoy.tracers.opentelemetry
                  typed_config:
                    "@type": type.googleapis.com/envoy.config.trace.v3.OpenTelemetryConfig
                    grpc_service:
                      envoy_grpc:
                        cluster_name: opentelemetry_collector
                      timeout: 3s
                    service_name: envoy

それでは各Signal同士の連携を確認していきます

Signalの連携

Logs to Traces

前述のEnvoyによってリクエストにtraceparentが記録されます

LogsをGrafanaから参照する際にtraceparentヘッダをparseしてTempoへのリンクを表示するGrafanaのderived fieldsに設定を行っています

infra/grafana/datasources/datasource.yaml

  - name: 'Loki'
    type: 'loki'
    access: 'proxy'
    orgId: 1
    version: 1
    url: 'http://${LOKI_HOST}:${LOKI_PORT}'
    basicAuth: false
    isDefault: false
    readOnly: false
    editable: false
    uid: 'loki'
    jsonData:
      derivedFields:
        - datasourceUid: 'tempo'
          matcherRegex: '"(?:trace_id|traceID|^traceparent)":"(\w+)"'
          name: 'TraceID'
          url: '$${__value.raw}'
        - datasourceUid: tempo
          matcherRegex: '"traceparent":"[\d]{2}-([\da-f]{32})-[\da-f]{16}-[\da]{2}"'
          name: TraceIDInTraceparent
          url: $${__value.raw}

するとこのようにアクセスログに含まれるtraceIdからTempoに送信されたTracesを閲覧することができます

LogsからTracesが参照できることで、遅くなっているリクエストはどの部分にボトルネックを抱えているかを明らかにすることが出来ます。こうして解像度をあげていきます。例えば問題が同期的な通信の結果であれば並列処理を検討したり次の改善計画が立てやすくなります

Traces to Logs

続いてTracesからLogsへの接続ですがGrafana datasourceの設定にtracesToLogsという設定があります
こちらをTempoに対して設定します

infra/grafana/datasources/datasource.yaml

  - name: 'Tempo'
    type: 'tempo'
    access: 'proxy'
    orgId: 1
    version: 1
    apiVersion: 1
    url: 'http://${TEMPO_HOST}:${TEMPO_PORT}'
    basicAuth: false
    isDefault: false
    readOnly: false
    editable: false
    uid: 'tempo'
    jsonData:
      lokiSearch:
        datasourceUid: 'loki'
      tracesToLogs:
        datasourceUid: 'loki'
        tags: ['container']
        mappedTags: [{ key: 'service.name', value: 'workload' }]
        mapTagNamesEnabled: true
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        filterByTraceID: true
        filterBySpanID: false

この設定によって閲覧したTracesからLogsへの遷移をサポートするためのButtonが出現します
それではTracesを検索していきます

GrafanaにはtempoSearchというfeature_toggleが用意されていてこちらを有効化しています
k6を使って流したリクエストのエンドポイントではランダムでsleepを入れる処理をいれるようにしてあり次の例では5000ms以上かかっているTracesを検索しています

TempoのTace Viewの結果にはLogs for this spanというボタンが出現しています
ここからLokiに渡すクエリが生成されて遷移がサポートされます
例ではTracesに含まれるservice.nameの値をworkloadというLabelに代入しtrace_idをFilterしてログをHitさせています

このtrace_idをaccess_logやerror_logなどにも含めておくことでひとまとめにして追うことができます。参考実装ではrequest contextから取り出したSpan Contextからerror_logに対して記録するような処理を入れています

ここでjsのinstrumentationを覗いてみるとwinstonの実装を拡張して透過的にtrace_id,span_id,trace_flagsを記録してくれたりします。registerInstrumentationsにWinstonInstrumentationを登録しておくだけなのでとても簡単に拡張してくれますね。

Metrics to Traces

こちらにはExemplarを利用しています

Metricsのグラフに描画されたpointerにhoverするとTracesへのリンクがついたモーダルが出現しMetricsとTracesをつなげる事ができます

参考実装の/metricsエンドポイントを除いてみるとこのようになっていて一部のMetricsとTraceIdが対になって記録されているのが分かります

❯ curl -H "Accept: application/openmetrics-text" localhost:8080/metrics
# HELP http_request_duration_seconds Duration of all HTTP requests
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="0.005"} 99 # {TraceID="4a6f844d7b171af54d1d2a9c9448d685"} 0.000463 1.6710666554619744e+09
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="0.01"} 99
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="0.025"} 99
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="0.05"} 99
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="0.1"} 99
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="0.25"} 99
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="0.5"} 99
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="1.0"} 99
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="2.5"} 279 # {TraceID="e466e0c81efae9705b70f15e40668d0e"} 1.001827501 1.671066648410268e+09
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="5.0"} 459 # {TraceID="f949aaa6558a9fbfd490b958d0d57d78"} 3.001288709 1.6710666564504805e+09
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="10.0"} 911 # {TraceID="108c6a9cd29b9104492474e012e76a16"} 7.001952045 1.6710666634743974e+09
http_request_duration_seconds_bucket{code="200",handler="random-sleep",method="get",le="+Inf"} 911
http_request_duration_seconds_sum{code="200",handler="random-sleep",method="get"} 4077.941841458995
http_request_duration_seconds_count{code="200",handler="random-sleep",method="get"} 911
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="0.005"} 3 # {TraceID="949396202ebcd19b0deaee6ee0e9477d"} 4.0292e-05 1.6710721438980134e+09
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="0.01"} 3
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="0.025"} 3
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="0.05"} 3
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="0.1"} 3
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="0.25"} 3
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="0.5"} 3
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="1.0"} 3
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="2.5"} 3
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="5.0"} 3
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="10.0"} 3
http_request_duration_seconds_bucket{code="200",handler="top",method="get",le="+Inf"} 3
http_request_duration_seconds_sum{code="200",handler="top",method="get"} 0.000126542
http_request_duration_seconds_count{code="200",handler="top",method="get"} 3
# HELP http_requests_in_flight A gauge of requests currently being served by the wrapped handler.
# TYPE http_requests_in_flight gauge
http_requests_in_flight 0.0
# HELP http_requests Count of all HTTP requests
# TYPE http_requests counter
http_requests_total{code="200",handler="random-sleep",method="get"} 911.0 # {TraceID="108c6a9cd29b9104492474e012e76a16"} 1.0 1.671066663474393e+09
http_requests_total{code="200",handler="top",method="get"} 3.0 # {TraceID="949396202ebcd19b0deaee6ee0e9477d"} 1.0 1.6710721438980112e+09
# HELP response_size_bytes A histogram of response sizes for requests.
# TYPE response_size_bytes histogram
response_size_bytes_bucket{le="200.0"} 914 # {TraceID="949396202ebcd19b0deaee6ee0e9477d"} 2.0 1.671072143897989e+09
response_size_bytes_bucket{le="500.0"} 914
response_size_bytes_bucket{le="900.0"} 914
response_size_bytes_bucket{le="1500.0"} 914
response_size_bytes_bucket{le="+Inf"} 914
response_size_bytes_sum 65598.0
response_size_bytes_count 914
# EOF

参考実装ではpromhttpを利用していてWithExemplarFromContextを使ってExemplarの設定を行っています

他と比べてLatencyが高くなっているリクエストはMetricsから見ると明らかで、それがTracesへと連携されているのは非常に使い勝手が良いです。前述のtempoSearchは便利な機能でApache Parquetをbackendにすることで高速な検索を実現しようとしています。こちらで一定の利用シーンはまかなえそうですがmax_durationなどの設定によっては限られた範囲の中でしか検索できません。広いtime rangeを対象にした検索はMetricsの得意とすることなので得意なことは得意なものに任せるのが良いのかなと思います

Traces to Metrics

こちらにはTrace to meticsの機能を利用しています

infra/grafana/datasources/datasource.yaml

      tracesToMetrics:
        datasourceUid: 'Prometheus'
        tags: [{ key: 'service.name', value: 'workload' }]
        spanStartTimeShift: '-10m'
        spanEndTimeShift: '10m'
        queries:
          - name: 'p99 latency'
            query: histogram_quantile(.99, sum(rate(http_request_duration_seconds_bucket{$$__tags}[1m])) by (le))

PromeQLは自由に設定できますが、例としてp99のlatencyを閲覧できるようにしてみました
TraceViewを見たあと、Metricsから全体感を眺めるといったことが出来ます

この話題に関連するものとしてTempoにはMetrics-generatorという機能があり
これを利用することで送信されたSpanからメトリクスが作成されます

参考実装ではotelhttpが送信するSpanにhandlerを特定するための名前をつけています
TempoのServiceGraphタブからFilterをかけて対象を抽出すると左のようなAPM Tableが表示されます。そこから特定のリクエストのp90と連携するといったことも可能です

Metrics to Logs, Logs to Metrics

これが強い関係性だという主張は難しいのですが強いて言えばGrafanaのSplit Viewを使い、
timerangeを同期させることでmetricsとlogsが連携されます。
また、これまで紹介してきたものを考えると少し遠いですがMetrics to Traces to Logs, Logs to Traces to Metricsといったように間接的には連携されます

最後に

OpenTelemetryとGrafanaを利用してPrimary Signalsを接続してみました
LogsからTracesがつながる体験やMeticsとTracesがつながる体験は開発者としては整備したいところで意図せず発生する問題の解像度をあげてくれることに役立つと思います

余談ですがObservabilityCON 2022でお披露目されたcontinuous profilingのGrafana Phlare、real user monitoringのGrafana Faroも加えるとさらにパワーアップした可観測性が得られますね。これらも楽しみです

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up