TL;DR
The cluster was dying because the cgroup driver settings used by containerd and the kubelet did not match.
They differ under the default settings, so read all the way down into the configs!!
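As a quick sanity check, the two settings can be compared directly. A minimal sketch, assuming a standard kubeadm + containerd layout (adjust the paths if your setup differs):

# kubelet side: kubeadm writes its cgroup driver here (defaults to systemd since v1.22)
$ grep cgroupDriver /var/lib/kubelet/config.yaml

# containerd side: the runc runtime should have SystemdCgroup = true to match
$ grep SystemdCgroup /etc/containerd/config.toml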
The error I hit
I ran kubeadm init on a Raspberry Pi to set up the K8s control-plane node and confirmed that kubectl get node listed it properly, but then kubectl suddenly stopped working...
Restarting gives a brief miracle window in which kubectl get xxx works (I used this window to debug).
$ kubectl get pod -A
The connection to the server 192.168.10.49:6443 was refused - did you specify the right host or port?
(restarting kubelet sometimes brings it back for a moment before it dies again, a brief miracle window)
$ systemctl restart kubelet
$ kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-6f6b679f8f-7rkfn 0/1 Pending 0 78m
kube-system coredns-6f6b679f8f-mw8lr 0/1 Pending 0 78m
kube-system etcd-control 0/1 Running 256 (30s ago) 79m
kube-system kube-apiserver-control 0/1 Running 250 (10s ago) 79m
kube-system kube-controller-manager-control 0/1 Running 259 (10s ago) 77m
kube-system kube-proxy-nxdjv 0/1 Error 39 (38s ago) 78m
kube-system kube-scheduler-control 0/1 Running 266 (10s ago) 79m
$ kubectl get pod -A
The connection to the server 192.168.10.49:6443 was refused - did you specify the right host or port?
Chasing the error
First, check kube-apiserver-control from the kubelet side
$ journalctl -u kubelet | grep kube-apiserver
.
.
.
Unable to write event (may retry after sleeping)" err="Post \"https://192.168.10.49:6443/api/v1/namespaces/kube-system/events\": dial tcp 192.168.10.49:6443: connect: connection refused" event="&Event{ObjectMeta:{kube-apiserver-control.17ed1e413f5907c9 kube-system 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:kube-system,Name:kube-apiserver-control,UID:a215b048c264f32757b87146b6148bb2,APIVersion:v1,ResourceVersion:,FieldPath:spec.containers{kube-apiserver},},Reason:Unhealthy,Message:Readiness probe failed: Get \"https://192.168.10.49:6443/readyz\": dial tcp 192.168.10.49:6443: connect: connection refused,Source:EventSource{Component:kubelet,Host:control,},FirstTimestamp:2024-08-19 20:38:17.937274825 +0900 JST m=+51.754393154,LastTimestamp:2024-08-19 20:38:17.937274825 +0900 JST m=+51.754393154,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:kubelet,ReportingInstance:control,}"
The apiserver looks like it is down and unreachable...
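To confirm that nothing is actually listening on 6443, a quick port check from the node helps (ss comes with iproute2; netstat works just as well):

$ sudo ss -tlnp | grep 6443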
Check the logs of the kube-apiserver-control pod
$ kubectl logs kube-apiserver-control -n kube-system -f
.
.
.
W0819 11:46:40.808045 1 logging.go:55] [core] [Channel #52 SubChannel #53]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
W0819 11:46:40.857145 1 logging.go:55] [core] [Channel #28 SubChannel #29]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
W0819 11:46:40.895323 1 logging.go:55] [core] [Channel #46 SubChannel #47]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
W0819 11:46:40.904756 1 logging.go:55] [core] [Channel #118 SubChannel #119]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
W0819 11:46:40.905834 1 logging.go:55] [core] [Channel #85 SubChannel #86]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
W0819 11:46:40.915920 1 logging.go:55] [core] [Channel #169 SubChannel #170]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
W0819 11:46:40.955245 1 logging.go:55] [core] [Channel #124 SubChannel #125]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
error: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
It seems the apiserver is failing because its connection to etcd keeps getting refused
Let's check the etcd pod
$ kubectl logs etcd-control -n kube-system -f
.
.
.
{"level":"info","ts":"2024-08-19T12:06:35.684472Z","caller":"osutil/interrupt_unix.go:64","msg":"received signal; shutting down","signal":"terminated"}
{"level":"info","ts":"2024-08-19T12:06:35.684601Z","caller":"embed/etcd.go:377","msg":"closing etcd server","name":"control","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://192.168.10.49:2380"],"advertise-client-urls":["https://192.168.10.49:2379"]}
{"level":"warn","ts":"2024-08-19T12:06:35.684930Z","caller":"embed/serve.go:212","msg":"stopping secure grpc server due to error","error":"accept tcp 127.0.0.1:2379: use of closed network connection"}
{"level":"warn","ts":"2024-08-19T12:06:35.685225Z","caller":"embed/serve.go:214","msg":"stopped secure grpc server due to error","error":"accept tcp 127.0.0.1:2379: use of closed network connection"}
{"level":"warn","ts":"2024-08-19T12:06:35.734420Z","caller":"embed/serve.go:212","msg":"stopping secure grpc server due to error","error":"accept tcp 192.168.10.49:2379: use of closed network connection"}
{"level":"warn","ts":"2024-08-19T12:06:35.734556Z","caller":"embed/serve.go:214","msg":"stopped secure grpc server due to error","error":"accept tcp 192.168.10.49:2379: use of closed network connection"}
{"level":"info","ts":"2024-08-19T12:06:35.734741Z","caller":"etcdserver/server.go:1521","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"2775615e805387bb","current-leader-member-id":"2775615e805387bb"}
{"level":"info","ts":"2024-08-19T12:06:35.747342Z","caller":"embed/etcd.go:581","msg":"stopping serving peer traffic","address":"192.168.10.49:2380"}
{"level":"info","ts":"2024-08-19T12:06:35.747782Z","caller":"embed/etcd.go:586","msg":"stopped serving peer traffic","address":"192.168.10.49:2380"}
{"level":"info","ts":"2024-08-19T12:06:35.748327Z","caller":"embed/etcd.go:379","msg":"closed etcd server","name":"control","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://192.168.10.49:2380"],"advertise-client-urls":["https://192.168.10.49:2379"]}
It's getting a SIGTERM from outside (almost certainly the kubelet)... what is going on here
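Since kubectl itself keeps dropping out, crictl is handy here: it talks to containerd directly, so it works even while the apiserver is down and shows the etcd container being stopped and recreated over and over. A minimal check, assuming containerd's default CRI socket:

$ sudo crictl ps -a | grep etcd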
Taking stock
What on earth is the kubelet doing...
Searching for the log line where etcd receives the SIGTERM,
"msg":"received signal; shutting down","signal":"terminated"
turns up plenty of similar questions.
Someone in the community had written an interesting article, so I read it
In the past few days, I've been trying to deploy a Kubernetes cluster using kubeadm in Debian. After kubeadm configured the control plane, something was killing the pods randomly after a while. This post follows the story of how I tried to fix this issue.
This is it!!
Summary of the article
Why the problem happens
It was caused by containerd and the kubelet being configured to use different cgroup drivers.
Under the default settings, containerd is set to cgroupfs while the kubelet is set to systemd, which is what triggered the problem.
(When Docker is used, it ends up set to systemd, so the issue never showed up.)
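For reference, the kubelet side of that pair lives in the KubeletConfiguration that kubeadm writes out, typically /var/lib/kubelet/config.yaml; kubeadm has defaulted it to systemd since v1.22. A trimmed excerpt (other fields omitted):

# /var/lib/kubelet/config.yaml (excerpt)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd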
The best fix
Set containerd's cgroup driver to systemd
# Contents of /etc/containerd/config.toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
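One way to get there, assuming the default containerd package install (the sed one-liner is a convenience, so it is worth eyeballing the file afterwards):

# Regenerate a full default config, flip SystemdCgroup to true, then restart containerd and kubelet
$ containerd config default | sudo tee /etc/containerd/config.toml > /dev/null
$ sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
$ sudo systemctl restart containerd
$ sudo systemctl restart kubelet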
How the problem unfolds
See the article for the details
- containerd creates the container's cgroup through cgroupfs
- The kubelet sends a D-Bus message to systemd to change the CPUWeight property of the QoS cgroup (it does this every minute)
- systemd writes to the cpu.weight file
- systemd removes the cpuset controller for reasons that are unclear (presumably because it is a cgroup systemd didn't create itself)
- The kubelet tries to sync the Pod and notices the controller is missing
- The kubelet decides the cgroup has "disappeared" and terminates the Pod
- The kubelet re-syncs the QoS cgroup and adds cpuset back through the file API
- The Pod restarts, but the cgroup is broken, so it doesn't work
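If you want to watch that symptom directly, you can check which controllers are enabled on the kubelet's pod cgroups. The paths below are examples and depend on the cgroup driver in use (kubepods with cgroupfs, kubepods.slice with systemd), so adjust for your node:

# cpuset disappearing from this list matches the behaviour described in the article
$ cat /sys/fs/cgroup/kubepods/cgroup.controllers
$ cat /sys/fs/cgroup/kubepods.slice/cgroup.controllers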
Checking this against what I experienced
The miracle window where kubectl get xxx worked after a restart
Right after the containers are created, until the kubelet sends its D-Bus message to systemd and systemd finishes handling it, the cgroup values are still intact, and that is what produced the miracle window.
That behaviour where the kubelet sends SIGTERM to etcd
The kubelet noticed a cgroup controller was missing, decided the cgroup had "disappeared", and was sending SIGTERM to terminate the Pod.
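The kubelet side of this should also show up in its journal; grepping for cgroup-related lines around the time etcd goes down is a quick cross-check (the exact message wording varies by kubelet version):

$ journalctl -u kubelet --since "10 min ago" | grep -i cgroup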
After changing the setting
Confirmed it keeps running for a solid two hours
$ kubectl get pod -A
kube-system coredns-6f6b679f8f-7rkfn 0/1 Pending 0 4h4m
kube-system coredns-6f6b679f8f-mw8lr 0/1 Pending 0 4h4m
kube-system etcd-control 1/1 Running 265 (133m ago) 4h5m
kube-system kube-apiserver-control 1/1 Running 257 (135m ago) 4h5m
kube-system kube-controller-manager-control 1/1 Running 269 (134m ago) 4h3m
kube-system kube-proxy-nxdjv 1/1 Running 50 (128m ago) 4h4m
kube-system kube-scheduler-control 1/1 Running 276 (130m ago) 4h5m
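To keep watching that the restart counts stay flat, a simple periodic check is enough (the 60-second interval is arbitrary):

$ watch -n 60 kubectl get pod -A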