How to handle Pod eviction when backing up a GitLab instance deployed with Helm

Posted at 2020-02-23

Environment

EKS: 1.13
kubectl: v1.13.11-eks-5876d6
Helm: 2.14.3
GitLab Helm Chart: 2.3.0 (appVersion: 12.3.0)

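Symptom

When a CronJob starts a task-runner Pod to take a GitLab application backup, the Pod is evicted partway through and the backup fails.

Running kubectl describe on the Node where the task-runner Pod was scheduled gives the following output.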

$ kubectl describe node ip-10-10-101-14.ap-northeast-1.compute.internal 
...
...
Events:
  Type     Reason                   Age                From                                                         Message
  ----     ------                   ----               ----                                                         -------
  Normal   Starting                 56m                kubelet, ip-10-10-101-14.ap-northeast-1.compute.internal     Starting kubelet.
  Normal   NodeHasSufficientMemory  56m (x2 over 56m)  kubelet, ip-10-10-101-14.ap-northeast-1.compute.internal     Node ip-10-10-101-14.ap-northeast-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    56m (x2 over 56m)  kubelet, ip-10-10-101-14.ap-northeast-1.compute.internal     Node ip-10-10-101-14.ap-northeast-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     56m (x2 over 56m)  kubelet, ip-10-10-101-14.ap-northeast-1.compute.internal     Node ip-10-10-101-14.ap-northeast-1.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  56m                kubelet, ip-10-10-101-14.ap-northeast-1.compute.internal     Updated Node Allocatable limit across pods
  Normal   Starting                 56m                kube-proxy, ip-10-10-101-14.ap-northeast-1.compute.internal  Starting kube-proxy.
  Normal   NodeReady                56m                kubelet, ip-10-10-101-14.ap-northeast-1.compute.internal     Node ip-10-10-101-14.ap-northeast-1.compute.internal status is now: NodeReady
  Warning  FreeDiskSpaceFailed      11m                kubelet, ip-10-10-101-14.ap-northeast-1.compute.internal     failed to garbage collect required amount of images. Wanted to free 17639540326 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed      6m52s              kubelet, ip-10-10-101-14.ap-northeast-1.compute.internal     failed to garbage collect required amount of images. Wanted to free 23642674790 bytes, but freed 0 bytes
  Warning  ImageGCFailed            6m52s              kubelet, ip-10-10-101-14.ap-northeast-1.compute.internal     failed to garbage collect required amount of images. Wanted to free 23642674790 bytes, but freed 0 bytes
  Warning  EvictionThresholdMet     3m55s              kubelet, ip-10-10-101-14.ap-northeast-1.compute.internal     Attempting to reclaim ephemeral-storage
  Normal   NodeHasDiskPressure      3m46s              kubelet, ip-10-10-101-14.ap-northeast-1.compute.internal     Node ip-10-10-101-14.ap-northeast-1.compute.internal status is now: NodeHasDiskPressure

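The FreeDiskSpaceFailed warnings, followed by EvictionThresholdMet for ephemeral-storage and NodeHasDiskPressure, give a fairly clear hint about the cause: the Node is running out of disk space.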
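To make the byte counts in those warnings easier to read, the following is a minimal sketch that converts them to GiB. The function name wanted_gib is made up for illustration; it only assumes the message wording "Wanted to free N bytes" shown in the events above.

```shell
#!/bin/sh
# Hypothetical helper (not part of kubectl): reads `kubectl describe node`
# output on stdin and prints, for every "Wanted to free N bytes" message,
# the amount converted to GiB (1 GiB = 1024^3 bytes).
wanted_gib() {
  sed -n 's/.*Wanted to free \([0-9][0-9]*\) bytes.*/\1/p' |
    awk '{ printf "%.2f GiB\n", $1 / (1024 * 1024 * 1024) }'
}
```

It can be used as kubectl describe node NODE_NAME | wanted_gib. For the events above, the kubelet wanted to free roughly 16.43 GiB and then 22.02 GiB, while image garbage collection freed nothing.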

Cause

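By design, the GitLab application backup bundles the Git repository data, user data, image data, and so on into a single tar archive and transfers it to S3.
The temporary data generated during this process is stored on the task-runner Pod's filesystem.
In other words, if there is not enough disk space available, the Pod gets evicted.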
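One way to catch this before the backup starts is a pre-flight check on free space. The following is only a sketch: the function names, the scratch path, and the required size are assumptions for illustration, not something the GitLab chart provides.

```shell
#!/bin/sh
# Hypothetical pre-flight check: is there enough free space at a path?
# free_kib prints the available space in KiB, using POSIX `df -P` output
# (column 4 of the second line is "Available"). This assumes the
# filesystem name contains no spaces.
free_kib() {
  df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }'
}

# has_room PATH NEEDED_KIB: succeeds only when PATH has at least
# NEEDED_KIB KiB available. Fails if df could not report a value.
has_room() {
  avail=$(free_kib "$1")
  [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

For example, has_room /tmp 52428800 || echo "not enough scratch space" would warn when less than 50 GiB is free under /tmp. Both the path and the threshold are placeholders and depend on where the backup writes its temporary files and how large the instance is.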

Solution

The fix is to attach a PersistentVolume to the task-runner Pod.

In the GitLab Helm Chart values, add the gitlab.task-runner.backups.cron.persistence block as shown below.

my-values.yaml

gitlab:
  task-runner:
    backups:
      cron:
        enabled: true
        schedule: "0 19 * * *"
+       persistence:
+         enabled: true
+         size: 50Gi

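Adding this block causes a PersistentVolume to be created.
Note that the PersistentVolume is not deleted when the backup finishes; it stays around indefinitely.
If cost is a concern, you need to build in logic that deletes the PersistentVolume after a successful backup.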
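Such cleanup logic could look roughly like the sketch below. It is an illustration under several assumptions: the function name, the KUBECTL override variable, and the claim name passed in are all hypothetical (check the real name with kubectl get pvc), and deleting the claim only removes the underlying volume when its reclaim policy is Delete. The claim would also have to exist again before the next scheduled run, which this sketch does not handle.

```shell
#!/bin/sh
# Hypothetical helper: delete the backup PersistentVolumeClaim only when
# the backup succeeded. Set KUBECTL=echo to print the command instead of
# running it (dry run).
cleanup_backup_pvc() {
  backup_status="$1"   # exit status of the backup command
  namespace="$2"       # namespace of the release
  pvc_name="$3"        # assumed claim name; verify with `kubectl get pvc`

  if [ "$backup_status" -ne 0 ]; then
    echo "backup failed (status $backup_status); keeping PVC $pvc_name" >&2
    return 1
  fi
  ${KUBECTL:-kubectl} delete pvc "$pvc_name" --namespace "$namespace"
}
```

A dry run such as KUBECTL=echo cleanup_backup_pvc 0 default SOME_PVC_NAME prints the kubectl command that would be executed, which is a safe way to check the arguments first.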

Run helm upgrade to update the release. The release name here is dev-gitlab.

$ helm upgrade \
  --namespace default \
  --values my-values.yaml \
  --version 2.3.0 \
  dev-gitlab gitlab/gitlab
