
When memory-related evictions happen to Pods in Kubernetes

Last updated at 2023-03-24, posted at 2023-03-24

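Introduction

Every now and then an eviction like the one below occurred in my cluster. Because the Pod ran normally after it restarted, I kept putting off the investigation even though I knew I should look into the cause.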

The node was low on resource: memory. Container xxx was using xxxKi, which exceeds its request of xxGi.

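I finally investigated the cause, and in the process I went back over when evictions actually occur, so I am writing this article as a memo.
As the message above suggests, it mainly covers memory-related evictions.

Patterns in which eviction occurs

When an eviction occurs, the Pod is terminated and removed from the Node.
Memory-related Pod evictions occur in the following patterns.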

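1. The Node's free memory falls below the configured memory threshold (memory.available of eviction-hard)
2. A high-priority Pod is being scheduled and its requests exceed the resources left on the Node
Note: the k8s.io documentation writes this as preempt (evict) and calls it Preemption.

Let's look at how each of these occurs.
Both EKS and AKS were checked on version 1.24.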

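1. The Node's free memory falls below the configured memory threshold (memory.available of eviction-hard)

This was the cause of the eviction I ran into. The evicted Pod's memory usage had not reached its limits, but memory usage across the whole Node rose, free memory dropped below the threshold, and the Pod was evicted.

About the threshold setting

The threshold is specified by memory.available in the kubelet's --eviction-hard option. (Eviction Thresholds)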
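As a small illustration of the option format, the sketch below splits an --eviction-hard value such as memory.available<750Mi into a signal name and a byte count. The helper names parse_quantity and parse_eviction_hard are hypothetical, and only the binary suffixes used in this article (Ki, Mi, Gi) are handled; percentage thresholds are not.

```python
# Binary-suffix multipliers for the memory quantities used in this article.
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_quantity(text: str) -> int:
    """Convert a quantity such as '750Mi' to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # no suffix: treat as a plain byte count


def parse_eviction_hard(flag_value: str) -> dict:
    """Split comma-separated 'signal<quantity' pairs into a dict of bytes."""
    thresholds = {}
    for pair in flag_value.split(","):
        signal, quantity = pair.split("<")
        thresholds[signal.strip()] = parse_quantity(quantity.strip())
    return thresholds


aks = parse_eviction_hard("memory.available<750Mi")
print(aks["memory.available"] // UNITS["Mi"])  # 750
```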

On EKS and AKS it is set to the following values.

Environment	memory eviction threshold	Reference
AKS	750Mi	Core Kubernetes concepts for Azure Kubernetes Service (AKS) - Azure Kubernetes Service
EKS	100Mi	amazon-eks-ami/bootstrap.sh at master · awslabs/amazon-eks-ami

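Even on Nodes with the same 8 GB of memory, the memory usage at which eviction starts differs by environment: about 7.25 GB on AKS and about 7.9 GB on EKS. Check the value for your environment.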
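The arithmetic behind those figures can be sketched as follows. This is a rough calculation that treats the 8 GB Node as exactly 8192 Mi, which is an assumption (real Nodes report slightly less); eviction_start_usage_mi is a hypothetical helper name.

```python
MI_PER_GI = 1024


def eviction_start_usage_mi(capacity_mi: float, eviction_hard_mi: float) -> float:
    """Node memory usage (Mi) at which memory.available reaches the threshold."""
    return capacity_mi - eviction_hard_mi


capacity_mi = 8 * MI_PER_GI  # assumption: the 8 GB Node is exactly 8192 Mi
for name, threshold_mi in {"AKS": 750, "EKS": 100}.items():
    usage_gi = eviction_start_usage_mi(capacity_mi, threshold_mi) / MI_PER_GI
    print(f"{name}: eviction starts at about {usage_gi:.2f} Gi used")
```

Working in Mi gives roughly 7.27 and 7.90; the rounder 7.25 GB figure for AKS comes from treating 750Mi as 0.75 GB.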
Incidentally, when a memory-related eviction occurs, the Node's MemoryPressure condition becomes True and Pods can no longer be scheduled onto it.

About Node resources

To understand eviction, it also helps to review how a Node's resources are divided up.
A Node's memory and CPU resources break down as follows.

Item	Description
Capacity	Total amount of resources installed on the Node
kube-reserved	Resources reserved for the kubelet and the container runtime
system-reserved	Resources reserved for OS daemons and the like
eviction-hard	Resources set aside as the eviction threshold
Allocatable	Resources Kubernetes can allocate to Pods: Capacity - (kube-reserved + system-reserved + eviction-hard)
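The Allocatable formula in the table can be written as a one-line function. The sketch below uses a hypothetical helper name, and the input numbers are illustrative values in Mi rather than measurements from a real Node.

```python
def allocatable_mi(capacity_mi: int, kube_reserved_mi: int = 0,
                   system_reserved_mi: int = 0, eviction_hard_mi: int = 0) -> int:
    """Allocatable = Capacity - (kube-reserved + system-reserved + eviction-hard)."""
    return capacity_mi - (kube_reserved_mi + system_reserved_mi + eviction_hard_mi)


# Illustrative values: 8192 Mi capacity, 574 Mi kube-reserved,
# no system-reserved, 100 Mi eviction-hard.
print(allocatable_mi(8192, kube_reserved_mi=574, eviction_hard_mi=100))  # 7518
```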

On EKS/AKS the eviction-hard value is fixed, so let's look at how kube-reserved is set.
system-reserved was not specified on either, so it is 0.

EKS
It is 11 * MAX_POD_NUM + 255 Mi. (amazon-eks-ami/bootstrap.sh at master · awslabs/amazon-eks-ami)
The value of MAX_POD_NUM depends on the instance size. The default values are listed in amazon-eks-ami/eni-max-pods.txt at master · awslabs/amazon-eks-ami.
For example, this gives 893Mi on an m5.xlarge (16 GB of memory) and 574Mi on an m5.large (8 GB of memory).
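The EKS formula is easy to check by hand. In the sketch below the function name is hypothetical, and the max-pods values (58 for m5.xlarge, 29 for m5.large) are assumptions inferred from the results quoted above; confirm them against eni-max-pods.txt for your AMI version.

```python
def eks_kube_reserved_memory_mi(max_pods: int) -> int:
    """kube-reserved memory in Mi: 11 Mi per Pod slot plus a fixed 255 Mi."""
    return 11 * max_pods + 255


# Assumed default max-pods values per instance type.
for instance_type, max_pods in {"m5.xlarge": 58, "m5.large": 29}.items():
    print(instance_type, eks_kube_reserved_memory_mi(max_pods), "Mi")
```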
AKS
Core Kubernetes concepts for Azure Kubernetes Service (AKS) - Azure Kubernetes Service describes it as follows.

A regressive rate of memory reservations for the kubelet daemon to properly function (kube-reserved).
・25% of the first 4 GB of memory
・20% of the next 4 GB of memory (up to 8 GB)
・10% of the next 8 GB of memory (up to 16 GB)
・6% of the next 112 GB of memory (up to 128 GB)
・2% of any memory above 128 GB

So, for example, a B2MS VM (8 GB of memory) reserves 1.8 GB.
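The tiered rates above can be applied mechanically. The following is a minimal sketch of that calculation; the function and constant names are hypothetical, and the tiers are copied from the list above.

```python
# (tier size in GB, reserved fraction); the last tier is open-ended.
TIERS = [(4, 0.25), (4, 0.20), (8, 0.10), (112, 0.06), (float("inf"), 0.02)]


def aks_kube_reserved_memory_gb(total_gb: float) -> float:
    """Sum the reservation for each tier that the Node's memory reaches."""
    reserved = 0.0
    remaining = total_gb
    for size_gb, rate in TIERS:
        portion = min(remaining, size_gb)  # part of this tier actually present
        reserved += portion * rate
        remaining -= portion
        if remaining <= 0:
            break
    return reserved


print(round(aks_kube_reserved_memory_gb(8), 2))   # 1.8
print(round(aks_kube_reserved_memory_gb(16), 2))  # 2.6
```

The smaller the Node, the larger the reserved share: 1.8 GB out of 8 GB is 22.5%, while 2.6 GB out of 16 GB is about 16%.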

Compared with EKS, AKS has a larger kube-reserved, so Allocatable is smaller.
Checking an 8 GB VM with kubectl describe node shows, as below, that Allocatable is about 5.2 GB, considerably less than the installed memory.
As with eviction-hard, AKS is configured to leave more headroom.

AKS - B2MS VM

Capacity:
  cpu:                2
  ・・・
  memory:             8148456Ki
  pods:               30
Allocatable:
  cpu:                1900m
  ・・・
  memory:             5493224Ki
  pods:               30

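A Pod whose requests fit and could be scheduled on EKS may not fit on AKS, so be careful.
Also, on VM sizes with relatively little memory, such as 4 GB or 8 GB, Allocatable memory is an especially small fraction of the installed memory, so take particular care there.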
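The kubectl describe node output for the B2MS VM can be cross-checked against the reservation values. The sketch below assumes that the 1.8 Gi kube-reserved is truncated to whole Mi and that system-reserved is 0; that rounding is my assumption, not something the AKS documentation states.

```python
KI_PER_MI = 1024

capacity_ki = 8148456     # "memory" under Capacity in the output above
allocatable_ki = 5493224  # "memory" under Allocatable in the output above

# Memory withheld from Pods, converted from Ki to Mi.
withheld_mi = (capacity_ki - allocatable_ki) / KI_PER_MI

# Assumption: 1.8 Gi of kube-reserved truncated to whole Mi, plus the
# 750 Mi eviction-hard threshold.
expected_mi = int(1.8 * 1024) + 750

print(withheld_mi, expected_mi)
print(f"Allocatable is {allocatable_ki / capacity_ki:.0%} of Capacity")
```

Under that assumption both values come to 2593 Mi, and only about two thirds of the installed memory is left for Pod requests.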

Is the threshold relative to Capacity?

Before reviewing eviction again, I was not sure whether the eviction-hard threshold was relative to Capacity or to Capacity minus system-reserved and kube-reserved.
Since the word "reserved" is in the names, I assumed it was the latter, so that eviction would start at the following usage.

Capacity - (system-reserved + kube-reserved + eviction-hard) => example for an 8 GB AKS VM: 8 - (0 + 1.8 + 0.75) = 5.45 GB

However, as described above, the threshold is Capacity minus the eviction value.

Capacity - eviction-hard => example for an 8 GB AKS VM: 8 - 0.75 = 7.25 GB

This behavior depends on the kubelet's --enforce-node-allocatable setting.
When both system-reserved and kube-reserved are listed in --enforce-node-allocatable, the threshold is the first formula:
Capacity - (system-reserved + kube-reserved + eviction-hard)
Neither EKS nor AKS lists them, however, so the threshold is:
Capacity - eviction-hard
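The two cases described in this section can be summarized in a small function. This is a sketch of the article's description only, not of the kubelet's actual implementation; the function name is hypothetical, and the default of ("pods",) reflects the kubelet's default value for --enforce-node-allocatable.

```python
def eviction_threshold_usage_gb(capacity_gb, eviction_hard_gb,
                                kube_reserved_gb=0.0, system_reserved_gb=0.0,
                                enforce_node_allocatable=("pods",)):
    """Memory usage (GB) at which eviction starts, per this section's description."""
    enforced = set(enforce_node_allocatable)
    if {"kube-reserved", "system-reserved"} <= enforced:
        # Both reservations are enforced, so they are subtracted as well.
        return capacity_gb - (system_reserved_gb + kube_reserved_gb + eviction_hard_gb)
    return capacity_gb - eviction_hard_gb


# 8 GB AKS VM with the default setting (reservations not enforced).
print(eviction_threshold_usage_gb(8, 0.75, kube_reserved_gb=1.8))  # 7.25

# Same VM if both reservations were enforced.
enforced_case = eviction_threshold_usage_gb(
    8, 0.75, kube_reserved_gb=1.8,
    enforce_node_allocatable=("pods", "kube-reserved", "system-reserved"))
print(round(enforced_case, 2))  # 5.45
```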

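Supplement: which Pods are evicted first

Pods are selected for eviction by the following criteria, in order.
1. Whether the Pod's resource usage exceeds its requests
2. Pod Priority
3. The Pod's resource usage relative to its requests
The kubelet does not use the QoS class directly to decide this order, but the class is a useful way to estimate it.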
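Each Pod is assigned a QoS class according to its resource requests/limits settings, as follows.
Configure Quality of Service for Pods | Kubernetes

QoS	Settings
Guaranteed	Every container in the Pod has memory and CPU requests and limits set, and each request equals its limit
Burstable	The Pod does not meet the Guaranteed criteria, and at least one container has a memory or CPU request or limit set
BestEffort	No container in the Pod has any memory or CPU requests or limits set

As a rule of thumb, Pods are evicted in the order BestEffort > Burstable > Guaranteed.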
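The QoS rules in the table can be expressed as a short classifier. This is a simplified sketch with a hypothetical function name: containers are plain dicts with optional requests and limits, and quantities are compared as strings without normalization, so "1Gi" and "1024Mi" would count as different.

```python
def qos_class(containers):
    """Classify a Pod from its containers' requests/limits (simplified)."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, Kubernetes uses it as the request too.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# A container with a 50Mi memory request and a 1.5Gi memory limit.
print(qos_class([{"requests": {"memory": "50Mi"}, "limits": {"memory": "1.5Gi"}}]))  # Burstable
print(qos_class([{}]))  # BestEffort
```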

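To summarize briefly: when memory usage on the Node crosses the threshold, eviction occurs and Pods are terminated.

2. A high-priority Pod is being scheduled and its requests exceed the resources left on the Node

A Pod can be given a Priority, as described below.

From Pod Priority and Preemption | Kubernetes:
Pods can have priority. Priority indicates the importance of a Pod relative to other Pods. If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the pending Pod possible.

In this case, when a high-priority Pod is being scheduled and the total memory requests would exceed 100% of Allocatable, lower-priority Pods on that Node are evicted to make room.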
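The idea can be sketched for a single Node and memory requests only. This is a deliberately simplified model, not the scheduler's actual algorithm (which also considers other resources, other Nodes, and PodDisruptionBudgets); the Pod type, the function name, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pod:
    name: str
    priority: int
    memory_request_mi: int
    preemption_policy: str = "PreemptLowerPriority"


def pick_victims(allocatable_mi: int, running: List[Pod], incoming: Pod) -> Optional[List[Pod]]:
    """Return the Pods to preempt so that `incoming` fits, or None if it cannot fit."""
    free_mi = allocatable_mi - sum(p.memory_request_mi for p in running)
    if free_mi >= incoming.memory_request_mi:
        return []  # fits as-is, nothing to preempt
    if incoming.preemption_policy == "Never":
        return None  # this Pod is not allowed to preempt others
    victims = []
    for pod in sorted(running, key=lambda p: p.priority):  # lowest priority first
        if pod.priority >= incoming.priority:
            break  # only lower-priority Pods may be preempted
        victims.append(pod)
        free_mi += pod.memory_request_mi
        if free_mi >= incoming.memory_request_mi:
            return victims
    return None


running = [Pod("batch", 0, 1500), Pod("system", 2000000000, 100)]
incoming = Pod("web", 1000000, 1000)
print([p.name for p in pick_victims(2000, running, incoming)])  # ['batch']
```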

Priority is defined by a PriorityClass; the larger its value, the higher the priority.
Two PriorityClasses, system-cluster-critical and system-node-critical, exist by default and are present on both EKS and AKS.

$ kubectl describe priorityclasses.scheduling.k8s.io -A
Name:              system-cluster-critical
Value:             2000000000
GlobalDefault:     false
PreemptionPolicy:  PreemptLowerPriority
Description:       Used for system critical pods that must run in the cluster, but can be moved to another node if necessary.
Annotations:       <none>
Events:            <none>


Name:              system-node-critical
Value:             2000001000
GlobalDefault:     false
PreemptionPolicy:  PreemptLowerPriority
Description:       Used for system critical pods that must not be moved from their current node.
Annotations:       <none>
Events:            <none>

When PreemptionPolicy is PreemptLowerPriority (the default), preemption occurs; when it is set to Never, it does not.

Basically, the Pods running in the kube-system Namespace had one of these two PriorityClasses set.

$ kubectl describe deploy,ds -n kube-system
# excerpt
Name:                   coredns
  Priority Class Name:  system-cluster-critical

Name:                   ebs-csi-controller
  Priority Class Name:  system-cluster-critical

Name:           aws-node
  Priority Class Name:  system-node-critical

Name:           ebs-csi-node
  Priority Class Name:  system-node-critical

Name:           ebs-csi-node-windows
  Priority Class Name:  system-node-critical

Name:           kube-proxy
  Priority Class Name:  system-node-critical
・・・

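So if a cluster is running with little spare capacity, it is worth checking resource usage before enabling a cloud provider's monitoring agent or adding an EKS add-on, because Pods with these PriorityClasses can preempt your own workloads.

Bonus (trying it out)

eviction-hard behavior

The Node has 2 GB of memory, and eviction-hard triggers when free memory falls below 750 MB (memory.available<750Mi).
Create a Pod for running the stress command.
For an image that includes the stress command, I used the one from the sample in Assign Memory Resources to Containers and Pods | Kubernetes.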

stress

apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress
  labels:
    app: stress
    role: sample
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
      role: sample
  template:
    metadata:
      labels:
        app: stress
        role: sample
    spec:
      containers:
      - name: memory-stress
        image: polinux/stress
        resources:
          requests:
            memory: "50Mi"
          limits:
            memory: "1.5Gi"
        command: ["sh", "-c", "while true; do sleep 10; done"]
        ports:
        - containerPort: 80

After creating the Deployment above, exec into the stress Pod and apply memory load.
(The Node's memory usage was about 400 MB at the time, so I applied 1 GB of load.)

$ kubectl exec -it stress-5bcd98cb59-zmppw -- bash
bash-5.0# stress --vm 1 --vm-bytes 1G --vm-hang 2
stress: info: [161] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

Looking at kubectl describe node, you can see that MemoryPressure is True.

$ kubectl describe node
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                         Message
  ----                 ------  -----------------                 ------------------                ------                         -------
・・・
  MemoryPressure       True    Fri, 24 Mar 2023 00:17:19 +0000   Fri, 24 Mar 2023 00:17:19 +0000   KubeletHasInsufficientMemory   kubelet has insufficient memory available
  Ready                True    Fri, 24 Mar 2023 00:17:19 +0000   Thu, 23 Mar 2023 23:51:26 +0000   KubeletReady                   kubelet is posting ready status. AppArmor enabled
・・・
Events:
  Type     Reason                     Age   From     Message
  ----     ------                     ----  ----     -------
・・・
  Warning  EvictionThresholdMet       2m3s  kubelet  Attempting to reclaim memory
  Normal   NodeHasInsufficientMemory  115s  kubelet  Node node01 status is now: NodeHasInsufficientMemory

The Pod status changed as follows.

$ kubectl get po -o wide -w
NAME                                READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
stress-5bcd98cb59-zmppw             1/1     Running   0          20m   192.168.1.4   node01   <none>           <none>
stress-5bcd98cb59-zmppw             1/1     Running   0          25m   192.168.1.4   node01   <none>           <none>
stress-5bcd98cb59-zmppw             0/1     Error     0          25m   192.168.1.4   node01   <none>           <none>
stress-5bcd98cb59-6b4vq             0/1     Pending   0          0s    <none>        <none>   <none>           <none>
stress-5bcd98cb59-zmppw             0/1     Error     0          26m   192.168.1.4   node01   <none>           <none>
stress-5bcd98cb59-6b4vq             0/1     Pending   0          4m34s   <none>        node01   <none>           <none>
stress-5bcd98cb59-6b4vq             0/1     ContainerCreating   0          4m34s   <none>        node01   <none>           <none>
stress-5bcd98cb59-6b4vq             0/1     ContainerCreating   0          4m34s   <none>        node01   <none>           <none>
stress-5bcd98cb59-6b4vq             1/1     Running             0          4m35s   192.168.1.5   node01   <none>           <none>

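Because free memory ran low, the Pod was evicted and rescheduled.
However, the Node could not accept Pods because of MemoryPressure, so the new Pod stayed Pending for a while.
After some time MemoryPressure cleared, the Node became schedulable again, and the stress Pod started.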
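The numbers in this experiment explain why the threshold was crossed. The sketch below is a rough calculation: it assumes the 2 GB Node is exactly 2048 Mi, takes the "about 400 MB" baseline at face value, and uses a hypothetical helper name.

```python
def memory_available_mi(node_memory_mi: int, baseline_usage_mi: int, load_mi: int) -> int:
    """Free memory left on the Node after adding the stress load."""
    return node_memory_mi - baseline_usage_mi - load_mi


node_memory_mi = 2 * 1024   # assumption: the 2 GB Node is exactly 2048 Mi
baseline_usage_mi = 400     # approximate usage before the test
load_mi = 1024              # stress --vm-bytes 1G
threshold_mi = 750          # memory.available<750Mi
container_limit_mi = 1536   # limits.memory of 1.5Gi in the Deployment

available_mi = memory_available_mi(node_memory_mi, baseline_usage_mi, load_mi)
print(available_mi, available_mi < threshold_mi)  # 624 True
print(load_mi < container_limit_mi)               # True
```

The load stays under the container's own 1.5Gi limit, so the Pod is removed by node-pressure eviction rather than by hitting its limit.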

Preemption

As with the eviction test, this uses a Node with 2 GB of memory.
Change the memory requests of the stress Pod from earlier to 1.5 GB and recreate it.

stress with changed memory requests

・・・
        resources:
          requests:
            memory: "1.5Gi"

Checking the Node's resources shows that there is little room left for requests, as below.

$ kubectl describe node
・・・
Non-terminated Pods:          (4 in total)
  Namespace                   Name                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                        ------------  ----------  ---------------  -------------  ---
  default                     stress-69d5997954-x8fhg     0 (0%)        0 (0%)      1536Mi (81%)     1536Mi (81%)   24s
  kube-system                 canal-nwxks                 25m (2%)      0 (0%)      0 (0%)           0 (0%)         39m
  kube-system                 coredns-68dc769db8-bl7xg    50m (5%)      0 (0%)      50Mi (2%)        170Mi (9%)     28d
  kube-system                 kube-proxy-84r66            0 (0%)        0 (0%)      0 (0%)           0 (0%)         28d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                75m (7%)      0 (0%)
  memory             1586Mi (84%)  1706Mi (90%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)

In this state, create the following PriorityClass and Pod so that the total exceeds the Node's limit for requests.

priority-test

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: test-priority
preemptionPolicy: PreemptLowerPriority
value: 1000000000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: high-priority-nginx
  labels:
    app: high-priority-nginx
    role: sample
spec:
  replicas: 1
  selector:
    matchLabels:
      app: high-priority-nginx
      role: sample
  template:
    metadata:
      labels:
        app: high-priority-nginx
        role: sample
    spec:
      priorityClassName: test-priority
      containers:
      - name: nginx
        image: nginx:1.21.6
        resources:
          requests:
            memory: "1Gi"
          limits:
            memory: "1.5Gi"
        ports:
        - containerPort: 80

Looking at the Pod status, the stress Pod was preempted by the higher-priority Pod, as shown below.

$ kubectl get po -o wide -w
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
stress-69d5997954-x8fhg   1/1     Running   0          4m54s   192.168.1.3   node01   <none>           <none>
high-priority-nginx-5788c84cd8-swbqz   0/1     Pending   0          0s      <none>        <none>   <none>           <none>
stress-69d5997954-x8fhg                1/1     Running   0          5m51s   192.168.1.3   node01   <none>           <none>
stress-69d5997954-x8fhg                1/1     Terminating   0          5m51s   192.168.1.3   node01   <none>           <none>
stress-69d5997954-ftj5k                0/1     Pending       0          0s      <none>        <none>   <none>           <none>
stress-69d5997954-x8fhg                1/1     Terminating   0          5m51s   192.168.1.3   node01   <none>           <none>
high-priority-nginx-5788c84cd8-swbqz   0/1     Pending       0          0s      <none>        <none>   node01           <none>
stress-69d5997954-ftj5k                0/1     Pending       0          0s      <none>        <none>   <none>           <none>
high-priority-nginx-5788c84cd8-swbqz   0/1     Pending       0          2s      <none>        <none>   node01           <none>
stress-69d5997954-x8fhg                1/1     Terminating   0          6m22s   192.168.1.3   node01   <none>           <none>
stress-69d5997954-x8fhg                0/1     Terminating   0          6m22s   192.168.1.3   node01   <none>           <none>
stress-69d5997954-x8fhg                0/1     Terminating   0          6m22s   192.168.1.3   node01   <none>           <none>
stress-69d5997954-x8fhg                0/1     Terminating   0          6m22s   192.168.1.3   node01   <none>           <none>
high-priority-nginx-5788c84cd8-swbqz   0/1     Pending       0          31s     <none>        node01   node01           <none>
high-priority-nginx-5788c84cd8-swbqz   0/1     ContainerCreating   0          31s     <none>        node01   <none>           <none>
high-priority-nginx-5788c84cd8-swbqz   0/1     ContainerCreating   0          32s     <none>        node01   <none>           <none>
high-priority-nginx-5788c84cd8-swbqz   1/1     Running             0          38s     192.168.1.4   node01   <none>           <none>

The events also confirm that it was preempted, as shown below.

$ kubectl get event
・・・
5m8s        Normal    Killing                   pod/stress-69d5997954-x8fhg                 Stopping container memory-stress
5m8s        Normal    Preempted                 pod/stress-69d5997954-x8fhg                 Preempted by a pod on node node01
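The request figures from the kubectl describe node output show why preemption was needed. In the sketch below, Allocatable is only estimated from the "1536Mi (81%)" line, so treat that value as a rough assumption rather than the Node's reported Allocatable.

```python
stress_request_mi = 1536   # stress Pod memory request (1.5Gi)
total_requests_mi = 1586   # "memory 1586Mi (84%)" under Allocated resources
nginx_request_mi = 1024    # high-priority-nginx memory request (1Gi)

# Rough estimate: 1536Mi was shown as 81% of Allocatable.
approx_allocatable_mi = stress_request_mi / 0.81

# Adding nginx on top of everything already running does not fit.
needed_mi = total_requests_mi + nginx_request_mi
print(needed_mi, round(approx_allocatable_mi), needed_mi > approx_allocatable_mi)  # 2610 1896 True

# Once the stress Pod is preempted, nginx fits easily.
after_preemption_mi = total_requests_mi - stress_request_mi + nginx_request_mi
print(after_preemption_mi, after_preemption_mi <= approx_allocatable_mi)  # 1074 True
```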
