
What to do when the Prometheus bundled with IBM Cloud Private keeps dying unexpectedly

Posted at 2018-06-19

Background

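IBM Cloud Private (ICP) installs Prometheus by default. Shortly after installation, I noticed that the Prometheus container kept dying unexpectedly and restarting. The container log showed the service starting up normally, but after a while the container itself crashed and was restarted.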

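The cause turned out to be the OOM Killer running in the node's OS. As the "Memory cgroup out of memory" line in the output shows, the kill was triggered by the container's cgroup memory limit, not by the node as a whole running out of memory.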

# dmesg | grep prometheus
[287787.255395] prometheus invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=984
[287787.255401] prometheus cpuset=2a3fd55af7e6b229ba978ee840e193c473c1b65a45ca9afbb5aba61073115f2f mems_allowed=0
[287787.255405] CPU: 0 PID: 24987 Comm: prometheus Kdump: loaded Not tainted 3.10.0-862.3.2.el7.x86_64 #1
[287787.255661] [24933]     0 24933   212094   129001     302        0           984 prometheus
[287787.255664] Memory cgroup out of memory: Kill process 25005 (prometheus) score 1969 or sacrifice child
[287787.258035] Killed process 24933 (prometheus) total-vm:848376kB, anon-rss:516004kB, file-rss:0kB, shmem-rss:0kB
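
To compare the kernel's numbers with a container memory limit, which Kubernetes expresses in Mi, it helps to convert the anon-rss value from kB to MiB. The following is a minimal sketch that parses the "Killed process" line copied from the output above; the variable names are only illustrative.

```shell
# Sample line copied from the dmesg output above.
line='[287787.258035] Killed process 24933 (prometheus) total-vm:848376kB, anon-rss:516004kB, file-rss:0kB, shmem-rss:0kB'

# Extract the anon-rss value (in kB) from the line.
anon_rss_kb=$(printf '%s\n' "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')

# Convert kB to MiB using integer division.
anon_rss_mib=$((anon_rss_kb / 1024))

echo "anon-rss: ${anon_rss_mib} MiB"
# prints "anon-rss: 503 MiB"
```

The process was using roughly 503 MiB of anonymous memory when it was killed.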

Looking at the Prometheus Pod definition, the memory limit is 512 MiB. Let's raise it.

$ kubectl get deployment monitoring-prometheus -n kube-system -o yaml
(omitted)
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
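
Instead of scanning the full YAML, the limit can probably be pulled out directly with a JSONPath query. This is a sketch under the assumption that the Deployment is named monitoring-prometheus in kube-system as shown above; the function name is made up for illustration. It prints one line per container, so it works without knowing the container names in advance.

```shell
# Print "<container name><TAB><memory limit>" for each container
# in the monitoring-prometheus Deployment.
show_prometheus_memory_limits() {
  kubectl get deployment monitoring-prometheus -n kube-system \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}

# Usage:
# show_prometheus_memory_limits
```

A container with no memory limit set would show an empty second column.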

Procedure

$ kubectl edit deployment monitoring-prometheus -n kube-system
-> Change the 512Mi shown earlier to 1Gi or similar, then save
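
If an interactive editor is inconvenient (for example in a script), `kubectl set resources` should be able to make the same change. The sketch below wraps it in a hypothetical helper function. The container name passed to `-c` is an assumption, since the article does not show it, so check the actual name in your Deployment first.

```shell
# Raise the memory limit of one container without opening an editor.
#   $1: container name (an assumption; verify it in your Deployment)
#   $2: new memory limit, e.g. 1Gi
raise_prometheus_memory_limit() {
  container="$1"
  limit="$2"
  kubectl set resources deployment monitoring-prometheus -n kube-system \
    -c "$container" --limits="memory=${limit}"
}

# Example (the container name "prometheus" is a guess):
# raise_prometheus_memory_limit prometheus 1Gi
```

Only the memory limit is changed here; the existing CPU limit of 500m is left as it is.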

Once the change is saved, the Deployment automatically replaces the old Pod with a new one that starts with the new memory limit. Easy.
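
To confirm that the replacement actually happened, something like the following could be used. It is a rough sketch with a hypothetical function name: it waits for the rollout to finish and then lists the matching Pods so that the RESTARTS column can be watched over time.

```shell
# Wait for the rollout to complete, then list the Prometheus Pods.
verify_prometheus_rollout() {
  # Blocks until the new Pod is ready, or returns non-zero if the rollout fails.
  kubectl rollout status deployment/monitoring-prometheus -n kube-system || return 1
  # Show only Pods whose name contains monitoring-prometheus.
  kubectl get pods -n kube-system | grep monitoring-prometheus
}

# Usage:
# verify_prometheus_rollout
```

If the RESTARTS count keeps climbing after the change, the new limit is likely still too small for the workload.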

That's all.
