node-problem-detector: Changing a Kubernetes node's Condition with an arbitrary monitoring script

Last updated at 2020-08-05. Posted at 2020-08-05.

Introduction

This article explains the custom plugin of node-problem-detector (NPD), which runs an arbitrary script to change a node's Condition or create an Event. The NPD configuration files used here were verified with v0.8.2.

What is node-problem-detector?

node-problem-detector is a daemon that runs on every node as a DaemonSet. When it detects a problem on a node, it changes the Kubernetes node's Condition or creates an Event.

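The following plugins are provided as means of detecting node problems.

System log monitor: detects problems by watching log files
- filelog plugin: arbitrary log files
- journald plugin: journald logs
- kmsg plugin: kernel logs (/dev/kmsg)
Custom plugin monitor: detects problems with arbitrary monitoring scripts
- custom plugin: reports problems through the standard output and exit code of an arbitrary monitoring script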

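For an explanation of the system log monitor, the article "[Kubernetes] node-problem-detector で node の監視をしてみる | AI tech studio" is a helpful reference.

Metrics output is also supported.

System stats monitor: outputs statistics for components such as CPU, disk, and memory

Of these, this article covers the custom plugin of the custom plugin monitor.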

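What a custom plugin can do

The custom plugin monitor runs an arbitrary monitoring script and decides whether a problem was found from the script's exit code and standard output. In principle this means it can do anything, but because the monitoring script runs inside the NPD container, everything the script needs must be present in the NPD container image. For that reason most plugins are implemented as shell scripts, although Go also seems like a good fit. To implement one in a language such as Python, you need to build your own image that uses the NPD image as the base image and adds Python.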

Configuring a custom plugin

A custom plugin is configured with a JSON file like the following.

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s",
    "max_output_length": 80,
    "concurrency": 3,
    "enable_message_change_based_condition_update": false
  },
  "source": "upfile-monitor",
  "metricsReporting": true,
  "conditions": [
    {
      "type": "UpFile",
      "reason": "UpFileExists",
      "message": "up file exists"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "UPFileDoesNotExist",
      "path": "/custom-config/plugin/check_up_file.sh",
      "timeout": "3s"
    },
    {
      "type": "permanent",
      "condition": "UpFile",
      "reason": "UpFileDoesNotExist",
      "path": "/custom-config/plugin/check_up_file.sh",
      "timeout": "3s"
    }
  ]
}

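plugin is the type of plugin. For a custom plugin it is custom.

pluginConfig holds the plugin's settings. The available items differ for each plugin type.

invoke_interval: the interval at which the custom plugin is invoked. Defaults to 30s.
timeout: the time after which a custom plugin invocation is terminated and regarded as timed out. A timeout is treated as status Unknown. Defaults to 5s.
max_output_length: the size at which the custom plugin's standard output is cut. The output after cutting is used as the Condition's status message. Defaults to 80.
concurrency: the number of plugin workers. When one custom plugin has multiple rules, this is the number of them that can be invoked concurrently. Defaults to 3.
enable_message_change_based_condition_update: whether to update the Condition when the message (the custom plugin's standard output) changes. Defaults to false.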
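
To get a feel for what max_output_length does to a long message, the following is a minimal local sketch. It only imitates the cut with Bash substring expansion; the function name truncate_message is hypothetical, and NPD's internal handling (for example of multi-byte characters or trailing whitespace) may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: imitates cutting a plugin's stdout to max_output_length.
# This is a local approximation, not NPD's actual implementation.
truncate_message() {
  local message="$1"
  local max_output_length="${2:-80}"  # 80 is the value used in the example configuration
  printf '%s' "${message:0:max_output_length}"
}

long_message="$(printf 'x%.0s' {1..100})"   # 100-character dummy message
truncated="$(truncate_message "$long_message" 80)"
echo "original: ${#long_message} chars, truncated: ${#truncated} chars"
```

Running a real monitoring script's output through a helper like this is a quick way to see whether the important part of the message survives the cut.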

source is used as the reporting component when an Event is created. Specifically, it becomes the value of the Event object's .source.component field.

metricsReporting controls whether the problems are exported as Prometheus metrics. When set to true, they are included in NPD's Prometheus metrics.

# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
problem_counter{reason="UPFileDoesNotExist"} 17
problem_counter{reason="UpFileDoesNotExist"} 1
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="UpFileDoesNotExist",type="UpFile"} 1
problem_gauge{reason="UpFileExists",type="UpFile"} 0
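
When metrics reporting is enabled, the gauge lines can be filtered to list only the problems currently affecting the node. A minimal sketch, assuming the metrics text has already been fetched; the function name active_problems is hypothetical, and the parsing assumes label values contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads Prometheus text format on stdin and prints the
# problem_gauge series whose value is 1 (problem currently affecting the node).
active_problems() {
  awk 'index($0, "problem_gauge{") == 1 && $NF == 1 { print $1 }'
}

# Sample input copied from the metrics output shown above.
sample='problem_gauge{reason="UpFileDoesNotExist",type="UpFile"} 1
problem_gauge{reason="UpFileExists",type="UpFile"} 0'

printf '%s\n' "$sample" | active_problems
```

Against a running NPD, the input would come from something like curl against the address and port passed to --prometheus-address and --prometheus-port, assuming the exporter serves the conventional /metrics path.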

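conditions defines the default Conditions, that is, the Conditions in the healthy state. A Condition defined here is set with Status False.

rules defines the rules for running monitoring scripts.

type: either temporary or permanent
- temporary: creates an Event
- permanent: changes a node Condition
condition: for a permanent rule, the type of the Condition to change (UpFile in the example above)
reason: the string used as the Reason of the Event and of the node Condition
path: the path of the monitoring script to run
timeout: overrides the /pluginConfig/timeout setting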
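
To check locally how a script behaves against a rule's timeout, the time limit can be approximated with the GNU coreutils timeout command. This is only a sketch: NPD enforces the timeout itself, and the function name run_with_rule_timeout is hypothetical. Following the description of timeout above, a timed-out run is reported here as Unknown (exit code 2).

```shell
#!/usr/bin/env bash
# Hypothetical helper: runs a command under a time limit and maps a timeout
# to exit code 2 (Unknown), approximating the behavior described above.
run_with_rule_timeout() {
  local limit="$1"; shift
  local status=0
  timeout "$limit" "$@" || status=$?
  if [[ $status -eq 124 ]]; then   # GNU timeout exits with 124 on timeout
    echo "timed out after $limit"
    return 2
  fi
  return "$status"
}

# Demo: a command that sleeps longer than the 1s limit.
rc=0
run_with_rule_timeout 1s sleep 3 || rc=$?
echo "exit code: $rc"
```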

Implementing a monitoring script

To work with NPD, a monitoring script must follow the rules below.

The exit code of the monitoring script depends on whether a problem was found.

0: OK. No problem was found
- temporary: no Event is created
- permanent: the node Condition is set to False
1: NONOK. A problem was found
- temporary: an Event is created (Severity is Warning)
- permanent: the node Condition is set to True
2: Unknown. Anything else
- temporary: an Event is created (Severity is Warning)
- permanent: the node Condition is set to Unknown

Standard output is used as the message. If it is longer than the length set in /pluginConfig/max_output_length of the custom plugin configuration, it is cut.

temporary: the string in the Event's message field
permanent: the string in the node Condition's message field
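
The exit code and standard output rules above can be exercised locally before deploying a script. The following is a minimal sketch of such a harness; the function names interpret_exit_code and check_plugin are hypothetical, nothing here talks to NPD, and treating every code other than 0 and 1 as Unknown is an assumption based on the "anything else" wording above.

```shell
#!/usr/bin/env bash
# Hypothetical local harness: runs a monitoring script and prints how its exit
# code and stdout would be interpreted under the rules above.
interpret_exit_code() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "NonOK" ;;
    *) echo "Unknown" ;;   # assumption: any other code counts as Unknown
  esac
}

check_plugin() {
  local output status=0
  output="$("$@")" || status=$?
  echo "status=$(interpret_exit_code "$status") message=${output}"
}

# Demo with two inline stand-ins for real monitoring scripts.
check_plugin bash -c 'echo "disk looks fine"; exit 0'
check_plugin bash -c 'echo "disk is failing"; exit 1'
```

In practice the stand-in commands would be replaced with the path to the actual monitoring script.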

Below is an example monitoring script that checks whether the file /custom-data/up exists on the node.

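#!/usr/bin/env bash

# Abort on errors and failed pipelines; trace commands when DEBUG is set.
set -e -o pipefail; [[ -n "$DEBUG" ]] && set -x

# Exit codes that NPD interprets as the plugin's status.
readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2

# File whose presence is checked on the node.
readonly UP_FILE="/custom-data/up"

# Standard output becomes the Event or Condition message.
if [[ -f "$UP_FILE" ]]; then
  echo "$UP_FILE exists"
  exit $OK
fi

echo "$UP_FILE does not exist"
exit $NONOK
# vim: ai ts=2 sw=2 et sts=2 ft=sh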
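
The example above only ever returns OK or NONOK. As a variation, a script could return UNKNOWN when it cannot decide at all, for instance when the directory that should contain the file is not available. The following is a sketch under that assumption; the function name check_up_file and the directory check are illustrative additions, not part of the original script.

```shell
#!/usr/bin/env bash
readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2

# Prints a message and returns OK, NONOK, or UNKNOWN for the given file path.
check_up_file() {
  local up_file="$1"
  local dir
  dir="$(dirname "$up_file")"

  if [[ ! -d "$dir" ]]; then
    # The parent directory is missing, so the check itself is inconclusive.
    echo "$dir is not available"
    return "$UNKNOWN"
  fi
  if [[ -f "$up_file" ]]; then
    echo "$up_file exists"
    return "$OK"
  fi
  echo "$up_file does not exist"
  return "$NONOK"
}

# As a plugin, the script would end with:
#   check_up_file "/custom-data/up"; exit $?
```

Keeping the check in a function that returns the status, rather than calling exit directly, makes it easier to try the logic against temporary directories before deploying it.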

With the custom plugin configuration and monitoring script shown so far, the resulting Event and node Condition look like this.

Event

$ kubectl get events
LAST SEEN   TYPE      REASON               OBJECT          MESSAGE
46m         Normal    UpFileDoesNotExist   node/minikube   Node condition UpFile is now: True, reason: UpFileDoesNotExist
116s        Warning   UPFileDoesNotExist   node/minikube   /custom-data/up does not exist

Node Condition

$ kubectl describe node minikube | grep -A 9 Conditions
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  KernelDeadlock       False   Wed, 05 Aug 2020 14:03:52 +0900   Wed, 05 Aug 2020 13:17:44 +0900   KernelHasNoDeadlock          kernel has no deadlock
  ReadonlyFilesystem   False   Wed, 05 Aug 2020 14:03:52 +0900   Wed, 05 Aug 2020 13:17:44 +0900   FilesystemIsNotReadOnly      Filesystem is not read-only
  UpFile               True    Wed, 05 Aug 2020 14:03:52 +0900   Wed, 05 Aug 2020 13:18:44 +0900   UpFileDoesNotExist           /custom-data/up does not exist
  MemoryPressure       False   Wed, 05 Aug 2020 14:01:33 +0900   Thu, 30 Jul 2020 14:40:50 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 05 Aug 2020 14:01:33 +0900   Thu, 30 Jul 2020 14:40:50 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 05 Aug 2020 14:01:33 +0900   Thu, 30 Jul 2020 14:40:50 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 05 Aug 2020 14:01:33 +0900   Thu, 30 Jul 2020 14:40:52 +0900   KubeletReady                 kubelet is posting ready status

$ kubectl get node minikube -o yaml | grep -B 5 "type: UpFile"
  - lastHeartbeatTime: "2020-08-05T05:03:52Z"
    lastTransitionTime: "2020-08-05T04:18:44Z"
    message: /custom-data/up does not exist
    reason: UpFileDoesNotExist
    status: "True"
    type: UpFile
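
Instead of grepping the YAML, the status of a single Condition can also be read with kubectl's JSONPath output. A minimal sketch; the function name node_condition_status is hypothetical, and it assumes kubectl is configured for the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the status ("True", "False", or "Unknown") of one
# Condition type on a node, using kubectl's JSONPath filter syntax.
node_condition_status() {
  local node="$1" type="$2"
  kubectl get node "$node" \
    -o jsonpath="{.status.conditions[?(@.type==\"${type}\")].status}"
}

# Example (requires access to a cluster):
#   node_condition_status minikube UpFile
```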

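Registering the custom plugin and running NPD

Once the custom plugin configuration and the monitoring script are ready, register them with NPD and run it. Below is a sample manifest; only the relevant parts are excerpted and explained.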

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
spec:
    ...
    spec:
      containers:
      - name: node-problem-detector
        image:  "k8s.gcr.io/node-problem-detector:v0.8.2"
        command:
        - /node-problem-detector
        - --logtostderr
        - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json
        - --prometheus-address=0.0.0.0
        - --prometheus-port=20257
        - --k8s-exporter-heartbeat-period=5m0s
        # Specify the custom plugin configuration file
        - --custom-plugin-monitors=/custom-config/custom-plugin-monitor.json
        volumeMounts:
        ...
        - name: custom-data
          mountPath: /custom-data
        - name: custom-config
          mountPath: /custom-config
          readOnly: true
        - name: custom-plugin
          mountPath: /custom-config/plugin
          readOnly: true
      ...
      # The monitoring script here checks whether a file exists on the host, so the directory is mounted with hostPath
      - name: custom-data
        hostPath:
          path: /custom-data
          type: Directory
      # The custom plugin configuration file is taken from a ConfigMap
      - name: custom-config
        configMap:
          name: node-problem-detector-custom-config
      # The monitoring script here is a shell script, so it is taken from a ConfigMap
      # The monitoring script must be executable, so set the file mode to something like 555 with configMap.defaultMode
      # If it is implemented in Go, use an init container and a directory shared via emptyDir so the NPD container can reference the binary
      - name: custom-plugin
        configMap:
          name: node-problem-detector-custom-plugin
          defaultMode: 0555
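
The two ConfigMaps referenced by the manifest could be created from local files along these lines. This is a sketch with assumptions: the function name create_npd_configmaps is hypothetical, the namespace is passed in by the caller, and the local file names custom-plugin-monitor.json and check_up_file.sh are chosen so that the resulting keys match the paths used in the manifest and in the rules.

```shell
#!/usr/bin/env bash
# Hypothetical helper: creates the two ConfigMaps referenced by the manifest.
# Assumes the files exist in the current directory under these names.
create_npd_configmaps() {
  local namespace="$1"
  kubectl -n "$namespace" create configmap node-problem-detector-custom-config \
    --from-file=custom-plugin-monitor.json
  kubectl -n "$namespace" create configmap node-problem-detector-custom-plugin \
    --from-file=check_up_file.sh
}

# Example (requires access to a cluster; replace the namespace with the one
# NPD is deployed into):
#   create_npd_configmaps my-namespace
```

With --from-file, each key in the ConfigMap takes the file's base name, which is why the names need to line up with /custom-config/custom-plugin-monitor.json and /custom-config/plugin/check_up_file.sh.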

The complete version of the manifest above is at https://github.com/superbrothers-sandbox/try-node-problem-detector-custom-plugin-monitor/blob/master/node-problem-detector.yaml. Steps for trying it on a minikube cluster are at https://github.com/superbrothers-sandbox/try-node-problem-detector-custom-plugin-monitor, so give it a try if you are interested.

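Running operations in response to node Condition changes

With NPD creating Events and changing node Conditions, the state of a node becomes visible, and by monitoring it you can notice problems and respond to them. Once the way to handle a problem is known, you will probably also want to automate the response.

To run kubectl drain against a node when a specific Condition changes, you can use planetlabs/draino. It automatically drains nodes based on their labels and Conditions.

To run arbitrary processing against a node when a specific Condition changes, you can use pfnet-research/node-operation-controller. This makes it possible, for example, to automatically reboot a problematic node. This controller is actually used in PFN's clusters.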

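Conclusion

This article explained how to use custom plugins with NPD. NPD is convenient because it lets you set node Conditions in a variety of ways. Combined with the controllers briefly introduced at the end, it also lets you work on automating node operations. Please give it a try.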
