A specific Node in a Kubernetes cluster would no longer come up

Posted at 2025-06-28

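Introduction

On a Kubernetes cluster, one particular node stayed NotReady and would not switch to Ready, even after waiting an hour and even after a reboot. This is a record of how I investigated and resolved the problem.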

Investigation

I logged in to the affected server over SSH and used the following command to check whether the container processes were running:

$ sudo crictl ps
FATA[0000] validate service connection: validate CRI v1 runtime API for endpoint "unix:///var/run/containerd/containerd.sock": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/run/containerd/containerd.sock: connect: connection refused"

The output above showed that the containerd service was not running properly.
Restarting it with systemctl did not bring it back either, so I checked its logs with the following command:

journalctl -xeu containerd.service

Looking through the output, I found the following log entry:

Failed to run CRI service" error="failed to recover state: failed to get metadata for stored sandbox \"0bfea79947d71b6fe8fc96e223ad2d43d75d960e0886156face5547d8bb52c67\": not found
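When the journal is long, it can help to filter for this error and pull out the sandbox ID it names. The following is a minimal sketch, not part of the original investigation: `extract_sandbox_ids` is a hypothetical helper name, and it assumes sandbox IDs appear as 64-character lowercase hex strings, as in the log entry above.

```shell
# Hypothetical helper: read log text on stdin and print each distinct
# 64-character lowercase hex string (assumed here to be a sandbox ID).
extract_sandbox_ids() {
  grep -oE '[0-9a-f]{64}' | sort -u
}

# Example usage against the journal (needs systemd and sufficient permissions):
#   journalctl -u containerd.service --no-pager \
#     | grep 'failed to recover state' \
#     | extract_sandbox_ids
```

The helper only reads text, so it can be tried safely on a saved copy of the log before touching the node.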

Searching for this message turned up a GitHub Issue where someone had hit the same problem, along with a fix, so I ran the following commands:

sudo systemctl stop containerd
sudo mv /var/lib/containerd /var/lib/containerd.bak
sudo systemctl start containerd
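Note that moving /var/lib/containerd aside makes containerd start from an empty state, so images cached on the node have to be pulled again. The three steps above can also be wrapped so that the backup directory gets a timestamped name and an earlier backup is not overwritten. This is only a sketch under assumptions: `backup_path` and `reset_containerd_state` are hypothetical names, and the timestamp suffix is my own variation on the `.bak` name used above.

```shell
# Hypothetical helper: build a timestamped backup path for the given directory,
# e.g. /var/lib/containerd -> /var/lib/containerd.bak.20250628-101530
backup_path() {
  printf '%s.bak.%s\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

# Hypothetical wrapper around the stop / move / start steps.
# Each step runs only if the previous one succeeded.
reset_containerd_state() {
  dir=/var/lib/containerd
  sudo systemctl stop containerd &&
    sudo mv "$dir" "$(backup_path "$dir")" &&
    sudo systemctl start containerd
}
```

Defining the functions changes nothing on the node; `reset_containerd_state` has to be called explicitly on the affected server.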

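After running these commands, crictl could connect again as shown below, confirming that the container runtime was running, so the problem was solved: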

$ sudo crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD                 NAMESPACE
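Since the original symptom was a NotReady node, it may also be worth confirming the result from the cluster side. The sketch below is an addition, not something the original investigation ran: `not_ready_nodes` is a hypothetical helper that assumes the default column layout of `kubectl get nodes --no-headers`, where the first column is the node name and the second is its status.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the names of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Example usage (needs kubectl configured for the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

If the command prints nothing, every node reports a status beginning with Ready.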
