
Trying Out Progressive Delivery on OpenShift! (Argo Rollouts Advanced: Integrating with Prometheus Metrics)

Posted at 2021-07-13

Background

This article is a continuation of the following articles; see them for the background.
Trying Out Progressive Delivery on OpenShift! (Argo Rollouts Fundamentals: CLI)
Trying Out Progressive Delivery on OpenShift! (Argo Rollouts Basics: GUI)

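Prerequisites

Argo Rollouts has been installed by following the steps in this article:

Trying Out Progressive Delivery on OpenShift! (Argo Rollouts Fundamentals: CLI)

Prometheus Operator has been installed on the OpenShift cluster by following the steps in this article:

Installing Prometheus Operator on OpenShift v4.6 and Using It to Monitor User-Defined Projects!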

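Trial

Switch to the verification project

Note: Following the steps in "Installing Prometheus Operator on OpenShift v4.6 and Using It to Monitor User-Defined Projects!", the Operator was installed only in the prometheus-operator project, so the verification is done there.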

PS D:\git> oc project prometheus-operator
Now using project "prometheus-operator" on server "https://c103-e.us-south.containers.cloud.ibm.com:31989".
PS D:\git>

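Create the AnalysisTemplate manifest file

- Define the AnalysisTemplate resource with the following settings
    - successCondition: the result of the query below is 5 or less
    - provider: prometheus
        - address: the endpoint of the Prometheus Service exposed by following [Installing Prometheus Operator on OpenShift v4.6 and Using It to Monitor User-Defined Projects!](https://qiita.com/strada/items/23bccc596be587fe003c)
        - query: the increase in HTTP requests that match the following conditions
            - the HTTP response code is anything other than 200
            - the target is service="rollout-canary" in namespace="prometheus-operator"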

prometheus-anlysis-template.yaml

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: failure-count
spec:
  metrics:
  - name: http_error_count
    successCondition: result[0] <= 5
    provider:
      prometheus:
        address: "http://prometheus.prometheus-operator.svc:9090"
        query: |
          delta(http_requests_total{code!~"200",namespace="prometheus-operator",service="rollout-canary"}[5m])
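
To make the success condition concrete, here is a minimal Python sketch of how a Prometheus instant-query response could be checked against `result[0] <= 5`. The function name `analysis_succeeds` and the sample payloads are hypothetical illustrations. Argo Rollouts evaluates the condition internally, and this is not its implementation; treating an empty result as an error is also an assumption of this sketch.

```python
import json


def analysis_succeeds(response_body: str, threshold: float = 5.0) -> bool:
    """Return True when the first sample of an instant vector is <= threshold."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    samples = payload["data"]["result"]
    if not samples:
        # An empty vector has no result[0]; this sketch treats that as an error (assumption).
        raise ValueError("query returned no samples")
    # Each sample's "value" is [unix_timestamp, "<value as a string>"].
    first_value = float(samples[0]["value"][1])
    return first_value <= threshold


def make_sample(value: str) -> str:
    """Build a hypothetical instant-vector response carrying a single sample."""
    return json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {
                    "metric": {"service": "rollout-canary", "code": "404"},
                    "value": [1626160478.0, value],
                }
            ],
        },
    })


PASSING = make_sample("2")   # 2 errors in the window: condition holds
FAILING = make_sample("10")  # 10 errors in the window: condition fails
```

With these payloads, `analysis_succeeds(PASSING)` is true and `analysis_succeeds(FAILING)` is false, which corresponds to the two Rollout runs later in this article.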

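Create the manifest file for the Rollout, Service, and ServiceMonitor

Define the Rollout resource with the following settings
- strategy: canary
  - steps:
    - Shift 20% to the canary
    - Pause
    - Once promoted, run the analysis defined by the failure-count AnalysisTemplate
    - If it succeeds, shift 40% to the canary
    - Wait 40 seconds
    - Shift 60% to the canary
    - Wait 20 seconds
    - Shift 80% to the canary
    - Wait 20 seconds
    - Shift 100% to the canary; the Rollout completes and the canary becomes stable.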

rollout-canary.yaml

# This example demonstrates a Rollout using the canary update strategy with a customized rollout
# plan. The prescribed steps initially sets a canary weight of 20%, then pauses indefinitely. Once
# resumed, the rollout performs a gradual, automated 20% weight increase until it reaches 100%.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollout-canary
spec:
  replicas: 5
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: rollout-canary
  template:
    metadata:
      labels:
        app: rollout-canary
    spec:
      containers:
      - name: rollouts-demo
        image: quay.io/brancz/prometheus-example-app:v0.2.0
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
  strategy:
    canary:
      steps:
      - setWeight: 20
      # The following pause step will pause the rollout indefinitely until manually resumed.
      # Rollouts can be manually resumed by running `kubectl argo rollouts promote ROLLOUT`
      - pause: {}
      - analysis:
          templates:
          - templateName: failure-count
      - setWeight: 40
      - pause: {duration: 40s}
      - setWeight: 60
      - pause: {duration: 20s}
      - setWeight: 80
      - pause: {duration: 20s}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: rollout-canary
  name: rollout-canary
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
    name: web
  selector:
    app: rollout-canary
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    team: frontend
    k8s-app: rollout-canary
  name: rollout-canary
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
  selector:
    matchLabels:
      app: rollout-canary
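
The canary steps in the manifest above can be pictured as a simple walk over a list. The following Python sketch is only a mental model under that assumption: `STEPS` and `simulate` are hypothetical names, pauses are not actually waited on, and the real controller handles many more cases (manual aborts, inconclusive analyses, and so on).

```python
from typing import List, Tuple

# Mirrors the canary steps in rollout-canary.yaml (illustrative encoding only).
STEPS = [
    ("setWeight", 20),
    ("pause", None),               # indefinite pause until promoted
    ("analysis", "failure-count"),
    ("setWeight", 40),
    ("pause", 40),
    ("setWeight", 60),
    ("pause", 20),
    ("setWeight", 80),
    ("pause", 20),
]


def simulate(analysis_ok: bool) -> Tuple[str, List[int]]:
    """Walk the steps and return (final status, canary weights seen in order)."""
    weights: List[int] = []
    for kind, arg in STEPS:
        if kind == "setWeight":
            weights.append(arg)
        elif kind == "analysis" and not analysis_ok:
            # A failed analysis aborts the rollout and sends all traffic back to stable.
            weights.append(0)
            return "aborted", weights
        # Pause steps only wait, so this sketch skips over them.
    # After the last step the canary takes 100% and becomes the new stable.
    weights.append(100)
    return "completed", weights
```

Calling `simulate(True)` walks through 20, 40, 60, 80, and 100 percent, while `simulate(False)` stops right after the 20 percent step and drops the canary weight back to 0.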

Apply the manifest files

PS D:\git> oc apply -f .\prometheus-anlysis-template.yaml
analysistemplate.argoproj.io/failure-count created
PS D:\git> oc apply -f .\rollout-canary.yaml
rollout.argoproj.io/rollout-canary created
service/rollout-canary created
servicemonitor.monitoring.coreos.com/rollout-canary created
PS D:\git>

Launch the UI Dashboard

PS D:\git> docker run -p 3100:3100 -v C:\Users\YASUYUKIKUBOTA\.kube\config:/.kube/config quay.io/argoproj/kubectl-argo-rollouts:master dashboard --insecure-skip-tls-verify
time="2021-07-13T07:14:38Z" level=info msg="Argo Rollouts Dashboard is now available at localhost 3100"

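Access the Dashboard and view the Rollout

Open localhost:3100 and select the verification project in the NS (namespace) field.

The initial Rollout completes without running the analysis.

Access the sample application

Run the following command to forward port 8080 of the rollout-canary Service to localhost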

PS D:\git> oc port-forward svc/rollout-canary 8080:8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080

Access localhost:8080/err about twice

Note: It is fine that nothing is displayed on the screen.

Access the Prometheus UI

Run the following command to forward port 9090 of the Prometheus Service to localhost

PS D:\git> oc port-forward svc/prometheus 9090:9090 -n prometheus-operator
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090

Open localhost:9090 and run the following query

http_requests_total{code!~"200",namespace="prometheus-operator",service="rollout-canary"}

If the result shows the number of times /err was accessed, everything is working.
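
The same check can also be scripted against the Prometheus HTTP API instead of the UI. The sketch below assumes the port-forward above is still running, so that Prometheus is reachable at `http://localhost:9090`; the helper names `build_query_url` and `run_query` are hypothetical and only the Python standard library is used.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"  # assumes the port-forward above is active

QUERY = (
    'http_requests_total{code!~"200",'
    'namespace="prometheus-operator",service="rollout-canary"}'
)


def build_query_url(base: str, query: str) -> str:
    """Build the instant-query URL (/api/v1/query) for a PromQL expression."""
    return base.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": query})


def run_query(base: str, query: str) -> list:
    """Send the query and return the list of samples from the response."""
    with urllib.request.urlopen(build_query_url(base, query), timeout=10) as resp:
        payload = json.load(resp)
    return payload["data"]["result"]


# Example (requires a reachable Prometheus):
#   for sample in run_query(PROMETHEUS, QUERY):
#       print(sample["metric"].get("code"), sample["value"][1])
```

Each returned sample carries its labels under `metric` and a `[timestamp, value]` pair under `value`, so the count for the 404 series should match the number of /err accesses.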

Run the actual query used by the AnalysisTemplate

delta(http_requests_total{code!~"200",namespace="prometheus-operator",service="rollout-canary"}[5m])

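There has been no increase in 404 errors over the last 5 minutes, so the result should be 0.
Note: If the accesses happened just before and just after a Prometheus scrape, the result may be 1.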
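
Why the result is 0 or 1 here, and why it grows in the failure scenario, is easier to see with a simplified model of `delta`. The Python sketch below just subtracts the first sample in the window from the last one. This is an assumption made for illustration: the real Prometheus `delta` also extrapolates towards the window boundaries, so actual results may be non-integer. The name `simple_delta` and the sample series are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_delta(samples: List[Sample], now: float, window: float = 300.0) -> float:
    """Last-minus-first value of the samples within [now - window, now]."""
    in_window = [value for ts, value in samples if now - window <= ts <= now]
    if len(in_window) < 2:
        raise ValueError("need at least two samples in the window")
    return in_window[-1] - in_window[0]


# Both /err accesses happened before the first scrape: the series first
# appears with value 2 and never changes afterwards.
first_seen = [(0.0, 2.0), (30.0, 2.0), (60.0, 2.0)]

# One access before the first scrape and one after it.
straddled = [(0.0, 1.0), (30.0, 2.0), (60.0, 2.0)]

# Failure scenario: five more errors before each of the next two scrapes.
failure = [(0.0, 2.0), (30.0, 7.0), (60.0, 12.0)]
```

Under this model the three series give 0, 1, and 10 respectively at `now=60.0` with a 30-second scrape interval, which matches the behaviour described in this article.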

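Run the Rollout (success case)

Open the Dashboard (localhost:3100), change the tag of the rollouts-demo container under Containers to v0.3.0, and start the Rollout
- For detailed steps, see Trying Out Progressive Delivery on OpenShift! (Argo Rollouts Basics: GUI)
When the Rollout pauses with 20% shifted to the canary, click PROMOTE

The Analysis Run succeeds, and after a while the Rollout completes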

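Run the Rollout (analysis fails and triggers an automatic rollback)

Open the Dashboard (localhost:3100), change the tag of the rollouts-demo container under Containers to v0.2.0, and start the Rollout
When the Rollout pauses with 20% shifted to the canary, access /err on the sample application (localhost:8080) about 5 times
Wait about 30 seconds for the Prometheus scrape to complete, then access /err on the sample application (localhost:8080) about 5 more times
Wait about 30 seconds again for the next Prometheus scrape to complete, then run the following query in the Prometheus UI (localhost:9090)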

delta(http_requests_total{code!~"200",namespace="prometheus-operator",service="rollout-canary"}[5m])

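The Analysis Run succeeds if the result of this query is 5 or less, so confirm that it is 6 or more.

Open the Dashboard (localhost:3100) and click PROMOTE; the Analysis Run fails

The rollback runs immediately and the Rollout returns to the state of Revision 2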

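Summary

This trial showed that, by specifying a Prometheus query in an Argo Rollouts AnalysisTemplate, the decision to promote or abort a Rollout can be made automatically based on the query result.
However, the trial also revealed the following issues.

Argo Rollouts supports only unauthenticated HTTP Prometheus endpoints, so it cannot integrate with the user-defined project monitoring of OpenShift v4.6, and Prometheus Operator has to be installed separately
Ideally the Prometheus metrics would come only from the canary Pods, but how to select them in the query is unclear
- Would using a PodMonitor instead of a ServiceMonitor solve this?

These issues need to be resolved before this can be used in production, but since the basic behavior has been confirmed, I plan to actively propose this to customers who are interested in this area.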
