
Kubernetes: Fixing the "failed to create fsnotify watcher: too many open files" error when using cgroup v2

Posted at 2023-02-21 (last updated at 2023-02-21)

Introduction

cgroup v2 reached GA in Kubernetes 1.25. Moving Kubernetes to cgroup v2 brings memory QoS, which improves stability when a Pod's memory usage grows¹. So we migrated to cgroup v2 while upgrading our cluster to Kubernetes 1.25, but after the migration the following errors started to appear when running kubectl logs -f, or journalctl -f on the node host.

$ kubectl logs -f deploy/nginx
failed to create fsnotify watcher: too many open files

$ journalctl -u kubelet -f
Insufficient watch descriptors available. Reverting to -n.

This article explains why these errors started to occur and how to fix them.

Test environment

OS: Ubuntu 22.04.1 LTS
Kubernetes: v1.25.5
containerd: v1.6.12

TL;DR

When cgroup v2 is in use, containerd consumes at least two inotify instances per Pod. On Ubuntu 22.04 the default value of fs.inotify.max_user_instances is 128, so creating inotify instances starts to fail once roughly 64 Pods exist on a node.
Raising fs.inotify.max_user_instances to a sufficiently large value resolves the problem.

Root cause investigation

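Since containerd is the component that uses cgroup v2, I suspected containerd and searched its issues. I found one reporting that the inotify instances used to monitor memory events under cgroup v2 were leaking. However, we were already running a version that includes the fix, so that was not the cause.

Even without a leak, though, containerd clearly does use inotify instances. The following command shows how many inotify instances are in use on the host.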

$ sudo find /proc/*/fd -lname anon_inode:inotify | wc -l
128

On Ubuntu 20.04 and 22.04 the default limit on the number of inotify instances is 128, so the host had already reached the limit and no new inotify instance could be created. That turned out to be the cause.

$ sudo sysctl -a | grep fs.inotify
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 8192
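
The two checks above (the current instance count and the configured limit) can be combined into a single report. The following is a minimal sketch rather than part of the original investigation: the function names and the optional proc_root argument are assumptions introduced here for illustration, and it relies on a find that supports -lname, like the command above. Note that fs.inotify.max_user_instances is a per-user limit while the count covers the whole host, so the ratio is only an upper bound for any single user.

```shell
# count_inotify_instances [PROC_ROOT]
# Count open file descriptors whose link target is anon_inode:inotify.
# PROC_ROOT defaults to /proc; the argument exists only so the sketch can be
# exercised against a fake directory tree.
count_inotify_instances() {
    proc_root="${1:-/proc}"
    find "$proc_root"/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l | tr -d ' '
}

# inotify_usage_report USED LIMIT
# Print the usage as a count and an integer percentage of the limit.
inotify_usage_report() {
    used="$1"
    limit="$2"
    echo "inotify instances in use: ${used} / ${limit} ($(( used * 100 / limit ))%)"
}

# Example (run as root so that every process's fd directory is readable):
#   inotify_usage_report "$(count_inotify_instances)" \
#       "$(cat /proc/sys/fs/inotify/max_user_instances)"
```

On the host examined above, this would report 128 / 128, that is, the limit is already exhausted.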

Next, let's check how many inotify instances each process uses. The output shows that each containerd-shim process uses at least two inotify instances. One containerd-shim process corresponds to one Pod.

$ containerd -v
containerd github.com/containerd/containerd v1.6.12 a05d175400b1145e5e6a735a6710579d181e7fb0
$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
none on /run/calico/cgroup type cgroup2 (rw,relatime)
$ sudo find /proc/*/fd -lname anon_inode:inotify | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o '%p %U %c %a %P' -p '{}' | uniq -c | sort -nr | grep containerd-shim
      7    2855 root     containerd-shim /usr/local/bin/containerd-s       1
      5    2737 root     containerd-shim /usr/local/bin/containerd-s       1
      4    2694 root     containerd-shim /usr/local/bin/containerd-s       1
      3    8229 root     containerd-shim /usr/local/bin/containerd-s       1
      3    7982 root     containerd-shim /usr/local/bin/containerd-s       1
      3    7859 root     containerd-shim /usr/local/bin/containerd-s       1
      3    7117 root     containerd-shim /usr/local/bin/containerd-s       1
      2    7720 root     containerd-shim /usr/local/bin/containerd-s       1
      2    7660 root     containerd-shim /usr/local/bin/containerd-s       1
      2    7569 root     containerd-shim /usr/local/bin/containerd-s       1
      2    7494 root     containerd-shim /usr/local/bin/containerd-s       1
      2    7433 root     containerd-shim /usr/local/bin/containerd-s       1
      2    4260 root     containerd-shim /usr/local/bin/containerd-s       1
      2    2907 root     containerd-shim /usr/local/bin/containerd-s       1
      2    2896 root     containerd-shim /usr/local/bin/containerd-s       1
      2    2721 root     containerd-shim /usr/local/bin/containerd-s       1
      2    2609 root     containerd-shim /usr/local/bin/containerd-s       1
      2    2527 root     containerd-shim /usr/local/bin/containerd-s       1
      2    2397 root     containerd-shim /usr/local/bin/containerd-s       1
      2    2280 root     containerd-shim /usr/local/bin/containerd-s       1
      2    2182 root     containerd-shim /usr/local/bin/containerd-s       1
      2    2148 root     containerd-shim /usr/local/bin/containerd-s       1
      2  200567 root     containerd-shim /usr/local/bin/containerd-s       1
      2  200501 root     containerd-shim /usr/local/bin/containerd-s       1
      2  194703 root     containerd-shim /usr/local/bin/containerd-s       1

For example, checking the kube-apiserver Pod shows that it consumes two inotify instances.

$ sudo crictl ps | grep kube-apiserver
4e45abb7544ad       5057262eb2f75       9 days ago          Running             kube-apiserver                  2                   245dd54b92896       kube-apiserver-jm00z0cm00
$ ps aux | grep 245dd54b92896
root        2182  0.0  0.0 712200  9928 ?        Sl    2022   5:23 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 245dd54b9289621daef957e47fd060cd7cb6172cf15652576fdfffe9677e6ce5 -address /run/containerd/containerd.sock
ksuda     701335  0.0  0.0   6608  2336 pts/0    S+   01:59   0:00 grep --color=auto 245dd54b92896
$ sudo ls -al /proc/2182/fd | grep anon_inode:inotify | wc -l
2

With the default OS settings only 128 inotify instances can be created, while under cgroup v2 each Pod consumes two. In other words, roughly 64 Pods are enough on their own to hit the limit. The default maximum number of Pods per node is 110, so in that case the limit has to be at least 220. Application containers may also use inotify instances, so the value needs to be comfortably larger than that.

Incidentally, under cgroup v1 containerd-shim does not use any inotify instances: the same command prints nothing.
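
The arithmetic in this paragraph can be written down as a small sketch. The function names and the optional headroom argument are hypothetical and introduced only for illustration; the figure of two instances per Pod is the minimum observed above, not a guaranteed constant.

```shell
# max_pods_for_limit LIMIT PER_POD
# Number of Pods that fit under LIMIT if each Pod consumes PER_POD instances.
max_pods_for_limit() {
    echo $(( $1 / $2 ))
}

# required_instances PODS PER_POD [HEADROOM]
# Lower bound for fs.inotify.max_user_instances; HEADROOM (default 0) is an
# allowance for inotify instances created by application containers.
required_instances() {
    pods="$1"
    per_pod="$2"
    headroom="${3:-0}"
    echo $(( pods * per_pod + headroom ))
}

# Examples:
#   max_pods_for_limit 128 2    # prints 64
#   required_instances 110 2    # prints 220
```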

$ containerd -v
containerd github.com/containerd/containerd v1.6.8 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
$ mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,inode64)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/misc type cgroup (rw,nosuid,nodev,noexec,relatime,misc)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
none on /run/calico/cgroup type cgroup2 (rw,relatime)

$ sudo find /proc/*/fd -lname anon_inode:inotify | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o '%p %U %c %a %P' -p '{}' | uniq -c | sort -nr | grep containerd-shim
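
Reading the mount output is one way to tell the two cgroup versions apart. As an alternative sketch, the filesystem type of /sys/fs/cgroup can be inspected instead: it is cgroup2fs on a cgroup v2 host and tmpfs on a cgroup v1 host, which matches the two mount listings in this article. The function name is an assumption introduced here, and stat -fc %T assumes GNU coreutils.

```shell
# classify_cgroup_fs FSTYPE
# Map the filesystem type reported for /sys/fs/cgroup to a cgroup version.
classify_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Example:
#   classify_cgroup_fs "$(stat -fc %T /sys/fs/cgroup)"
```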

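Fix

The cause was containerd using up all available inotify instances, so the fix is to raise the limit on the number of inotify instances. Any sufficiently large value should do, because an inotify instance consumes no resources until it is actually created. Here I multiply both the limit and the related watch setting by 64.

To change the values with the sysctl command: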

$ sudo sysctl fs.inotify.max_user_instances=8192
$ sudo sysctl fs.inotify.max_user_watches=524288

To keep the settings across reboots, append the following to /etc/sysctl.conf.

fs.inotify.max_user_instances=8192
fs.inotify.max_user_watches=524288
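
As an alternative to editing /etc/sysctl.conf by hand, the same two settings can be written to a drop-in file under /etc/sysctl.d and loaded with sysctl --system. This is a sketch under assumptions: the function name and the file name 99-inotify.conf are arbitrary choices made here, and the default values are simply the ones used in this article.

```shell
# write_inotify_sysctl_conf FILE [MAX_INSTANCES] [MAX_WATCHES]
# Write the two inotify settings to FILE, overwriting it, so that running the
# function repeatedly does not accumulate duplicate lines.
write_inotify_sysctl_conf() {
    conf_file="$1"
    max_instances="${2:-8192}"
    max_watches="${3:-524288}"
    printf 'fs.inotify.max_user_instances=%s\nfs.inotify.max_user_watches=%s\n' \
        "$max_instances" "$max_watches" > "$conf_file"
}

# Example (as root):
#   write_inotify_sysctl_conf /etc/sysctl.d/99-inotify.conf
#   sysctl --system
```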

As another point of reference, the "Node Tuning Operator" that OpenShift provides for node tuning uses the following values by default.

[sysctl]
net.ipv4.tcp_fastopen=3
fs.inotify.max_user_watches=65536
fs.inotify.max_user_instances=8192

https://github.com/openshift/cluster-node-tuning-operator/blob/release-4.14/assets/tuned/daemon/profiles/openshift-node/tuned.conf#L9-L12

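Bonus: a look at the source code where containerd creates inotify instances

The containerd version is v1.6.14, the latest at the time of the investigation.

Memory events are handled by the code below. To watch the memory.events file for changes, it creates an inotify instance with syscall.InotifyInit() and returns its file descriptor.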

// MemoryEventFD returns inotify file descriptor and 'memory.events' inotify watch descriptor
func (c *Manager) MemoryEventFD() (int, uint32, error) {
	fpath := filepath.Join(c.path, "memory.events")
	fd, err := syscall.InotifyInit()
	if err != nil {
		return 0, 0, errors.New("failed to create inotify fd")
	}
	wd, err := syscall.InotifyAddWatch(fd, fpath, unix.IN_MODIFY)
	if err != nil {
		syscall.Close(fd)
		return 0, 0, fmt.Errorf("failed to add inotify watch for %q: %w", fpath, err)
	}
	// monitor to detect process exit/cgroup deletion
	evpath := filepath.Join(c.path, "cgroup.events")
	if _, err = syscall.InotifyAddWatch(fd, evpath, unix.IN_MODIFY); err != nil {
		syscall.Close(fd)
		return 0, 0, fmt.Errorf("failed to add inotify watch for %q: %w", evpath, err)
	}

	return fd, uint32(wd), nil
}

https://github.com/containerd/containerd/blob/v1.6.14/vendor/github.com/containerd/cgroups/v2/manager.go#L583-L603

When the memory.events file changes, the change is written to a channel as an event.

func (c *Manager) waitForEvents(ec chan<- Event, errCh chan<- error) {
	defer close(errCh)

	fd, _, err := c.MemoryEventFD()
	if err != nil {
		errCh <- err
		return
	}
	defer syscall.Close(fd)

	for {
		buffer := make([]byte, syscall.SizeofInotifyEvent*10)
		bytesRead, err := syscall.Read(fd, buffer)
		if err != nil {
			errCh <- err
			return
		}
		if bytesRead >= syscall.SizeofInotifyEvent {
			out := make(map[string]interface{})
			if err := readKVStatsFile(c.path, "memory.events", out); err != nil {
				// When cgroup is deleted read may return -ENODEV instead of -ENOENT from open.
				if _, statErr := os.Lstat(filepath.Join(c.path, "memory.events")); !os.IsNotExist(statErr) {
					errCh <- err
				}
				return
			}
			e, err := parseMemoryEvents(out)
			if err != nil {
				errCh <- err
				return
			}
			ec <- e
			if c.isCgroupEmpty() {
				return
			}
		}
	}
}

https://github.com/containerd/containerd/blob/v1.6.14/vendor/github.com/containerd/cgroups/v2/manager.go#L648-L685

Events are processed in the following part. If the event is an OOM kill, it is published as an OOM event so that containerd can handle it.

// Add cgroups.Cgroup to the epoll monitor
func (w *watcher) Add(id string, cgx interface{}) error {
	cg, ok := cgx.(*cgroupsv2.Manager)
	if !ok {
		return fmt.Errorf("expected *cgroupsv2.Manager, got: %T", cgx)
	}
	// FIXME: cgroupsv2.Manager does not support closing eventCh routine currently.
	// The routine shuts down when an error happens, mostly when the cgroup is deleted.
	eventCh, errCh := cg.EventChan()
	go func() {
		for {
			i := item{id: id}
			select {
			case ev := <-eventCh:
				i.ev = ev
				w.itemCh <- i
			case err := <-errCh:
				// channel is closed when cgroup gets deleted
				if err != nil {
					i.err = err
					w.itemCh <- i
					// we no longer get any event/err when we got an err
					logrus.WithError(err).Warn("error from *cgroupsv2.Manager.EventChan")
				}
				return
			}
		}
	}()
	return nil
}

https://github.com/containerd/containerd/blob/v1.6.14/pkg/oom/v2/v2.go#L88-L117

// Run the loop
func (w *watcher) Run(ctx context.Context) {
	lastOOMMap := make(map[string]uint64) // key: id, value: ev.OOM
	for {
		select {
		case <-ctx.Done():
			w.Close()
			return
		case i := <-w.itemCh:
			if i.err != nil {
				delete(lastOOMMap, i.id)
				continue
			}
			lastOOM := lastOOMMap[i.id]
			if i.ev.OOMKill > lastOOM {
				if err := w.publisher.Publish(ctx, runtime.TaskOOMEventTopic, &eventstypes.TaskOOM{
					ContainerID: i.id,
				}); err != nil {
					logrus.WithError(err).Error("publish OOM event")
				}
			}
			if i.ev.OOMKill > 0 {
				lastOOMMap[i.id] = i.ev.OOMKill
			}
		}
	}
}

https://github.com/containerd/containerd/blob/v1.6.14/pkg/oom/v2/v2.go#L60-L86
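
To make the comparison in Run concrete, here is a rough shell sketch of the same idea: read the oom_kill counter from a cgroup v2 memory.events file and report an OOM only when the counter has increased since the last observation. It is an illustration and not how containerd is implemented; both function names are hypothetical.

```shell
# oom_kill_count FILE
# Print the oom_kill counter from a cgroup v2 memory.events file, which holds
# one "key value" pair per line.
oom_kill_count() {
    awk '$1 == "oom_kill" { print $2 }' "$1"
}

# should_publish_oom LAST CURRENT
# Succeed only when the counter has grown, mirroring the
# `i.ev.OOMKill > lastOOM` check in Run.
should_publish_oom() {
    [ "$2" -gt "$1" ]
}

# Example (the cgroup path is a placeholder):
#   current=$(oom_kill_count /sys/fs/cgroup/<cgroup>/memory.events)
#   if should_publish_oom "$last" "$current"; then echo "OOM kill detected"; fi
```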

¹ For details on memory QoS, see https://kubernetes.io/blog/2021/11/26/qos-memory-resources/.
