DNSクエリの氾濫でkubernetesが突然使えなくなった件

Last updated at 2025-01-20Posted at 2024-08-27

はじめに

6月27日の早朝にuptime-kumaからエラーメールを受け取りました。

現象はゲートウェイを負荷が増大して応答が悪くなったため、このゲートウェイの再起動してから復旧と切断を繰り返すというものでした。

uptime-kumaには06/27 04:23にシステムが停止してから、10:23のゲートウェイの再起動後、断続的にサービスの復旧と切断を繰り返していることが記録されています。ゲートウェイではインターネット側、LAN側の双方の通信においてパケットのDropを確認していました。この時に分かっていたことは、dio/vppからのエラーがログに記録されている、k8sクラスターではnodelocaldnsが再起動を繰り返していることでした。

他のk8sクラスターも同様の構成で、同じタイミングでパッケージの更新が行われていましたが問題は発生していなかったため、原因の特定が構成によるものではなく他の要因も関連していることが推察される状況でした。

この教訓はK8sのノードを(MACアドレスから固定的にIPを払出すとしても)DHCPクライアントとして構成することは止めましょうということです。

環境

Kubernetse v1.28.6 (Kubespray v2.24.1) → v1.29.5 (Kubespray v2.25.0)
Node 1〜6 (Intel N100 x3, AMD 5700U x2, AMD 3350H)

ログの確認

apt/history.log

6/27の午前4時台に更新されたパッケージが怪しいと思ったため、/var/log/apt/history.logを確認したところ、次のような記録が残っていました。

/var/log/apt/history.logからの抜粋

Start-Date: 2024-06-27  04:21:56
Commandline: /usr/bin/apt-get -o quiet=1 dist-upgrade -y -o APT::Get::Show-Upgraded=true
Install: linux-modules-5.15.0-113-generic:amd64 (5.15.0-113.123, automatic
 linux-headers-5.15.0-113:amd64 (5.15.0-113.123, automatic
 linux-modules-extra-5.15.0-113-generic:amd64 (5.15.0-113.123, automatic
 linux-headers-5.15.0-113-generic:amd64 (5.15.0-113.123, automatic
 linux-image-5.15.0-113-generic:amd64 (5.15.0-113.123, automatic)
Upgrade: linux-tools-common:amd64 (5.15.0-112.122, 5.15.0-113.123
 vpp-plugin-core:amd64 (24.02-release, 24.06-release
 linux-headers-generic:amd64 (5.15.0.112.112, 5.15.0.113.113
 vpp:amd64 (24.02-release, 24.06-release
 vpp-plugin-dpdk:amd64 (24.02-release, 24.06-release
 linux-generic:amd64 (5.15.0.112.112, 5.15.0.113.113
 wget:amd64 (1.21.2-2ubuntu1, 1.21.2-2ubuntu1.1
 libvppinfra:amd64 (24.02-release, 24.06-release
 linux-image-generic:amd64 (5.15.0.112.112, 5.15.0.113.113
 netplan.io:amd64 (0.106.1-7ubuntu0.22.04.2, 0.106.1-7ubuntu0.22.04.3
 libnetplan0:amd64 (0.106.1-7ubuntu0.22.04.2, 0.106.1-7ubuntu0.22.04.3)
End-Date: 2024-06-27  04:26:47

vpp関連のファイルが更新されていることが確認できたので、とりあえずここを中心に確認します。

syslog

この時間帯のsyslogの出力は次のようになっています。

/var/log/syslogからの抜粋

Jun 27 04:23:05 ub2204 systemd[1]: Reloading.
Jun 27 04:23:06 ub2204 systemd-udevd[650433]: Using default interface naming scheme 'v249'.
Jun 27 04:23:06 ub2204 networkd-dispatcher[744]: WARNING:Unknown index 20 seen, reloading interface list
Jun 27 04:23:06 ub2204 systemd-networkd[720]: tap0: Link UP
Jun 27 04:23:06 ub2204 systemd-networkd[720]: tap0: Gained carrier
Jun 27 04:23:06 ub2204 vpp[650431]: unknown input `nat44 plugin enable sessions 6...'
Jun 27 04:23:06 ub2204 vpp[650431]: exec: CLI line error: nat44 plugin enable sessions 6...
Jun 27 04:23:06 ub2204 systemd[1]: Configuration file /run/systemd/system/netplan-ovs-cleanup.service is marked world-inaccessible. This has
no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Jun 27 04:23:08 ub2204 systemd[1]: Reloading Postfix Mail Transport Agent (instance -)...
Jun 27 04:23:08 ub2204 systemd-networkd[720]: tap0: Gained IPv6LL
...
Jun 27 04:23:12 ub2204 vpp[650431]: clib_time_verify_frequency:308: Rejecting large frequency change of 2.67%
Jun 27 04:23:12 ub2204 vpp[650431]: clib_time_verify_frequency:308: Rejecting large frequency change of 2.52%
Jun 27 04:23:12 ub2204 vpp[650431]: clib_time_verify_frequency:308: Rejecting large frequency change of 2.62%
Jun 27 04:23:21 ub2204 vpp[650431]: clib_time_verify_frequency:308: Rejecting large frequency change of 2.52%
Jun 27 04:23:21 ub2204 vpp[650431]: clib_time_verify_frequency:308: Rejecting large frequency change of 2.67%
Jun 27 04:23:21 ub2204 vpp[650431]: clib_time_verify_frequency:308: Rejecting large frequency change of 2.62%
Jun 27 04:23:29 ub2204 vpp[650431]: clib_time_verify_frequency:308: Rejecting large frequency change of 2.52%
...

これをみたところではvppの更新でネットワークに影響が出たのかと思いましたが、実際には無関係で、"Rejecting large frequency change"メッセージの原因はゲートウェイのCPU負荷の増加が原因でした。

関連資料 https://lists.fd.io/g/vpp-dev/message/20157

nodelocaldnsのログ

k8sをみるとnodelocaldnsのPodに影響が出ているようでした。

nodelocaldnsログ

2024/06/27 04:33:21 [INFO] Starting node-cache image: 1.22.28
2024/06/27 04:33:21 [INFO] Using Corefile /etc/coredns/Corefile
2024/06/27 04:33:21 [INFO] Using Pidfile 
2024/06/27 04:33:21 [ERROR] Failed to read node-cache coreFile /etc/coredns/Corefile.base - open /etc/coredns/Corefile.base: no such file or directory
2024/06/27 04:33:21 [INFO] Skipping kube-dns configmap sync as no directory was specified
.:53 on 169.254.25.10
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration SHA512 = aa809f767f97014677c4e010f69a19281bea2a25fd44a8c9172f6f43db27a70080deb3a2add822c680f580337da221a7360acac898a1e8a8827a7bda80e00c2d
CoreDNS-1.10.0
linux/amd64, go1.21.4, 
[FATAL] plugin/loop: Loop (169.254.25.10:42871 -> 169.254.25.10:53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 7975047106611664487.27050
3393689727605."

このログにあるURLを確認するとloopが発生している場合の対処法が記載されています。

https://coredns.io/plugins/loop#troubleshooting

kern.log

node1のkern.log

Jun 27 04:51:46 node1 kernel: [ 6808.354281] thp_fault_alloc 0
Jun 27 04:51:46 node1 kernel: [ 6808.354283] thp_collapse_alloc 0
Jun 27 04:51:46 node1 kernel: [ 6808.354285] Tasks state (memory values in pages):
Jun 27 04:51:46 node1 kernel: [ 6808.354287] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun 27 04:51:46 node1 kernel: [ 6808.354291] [ 105267] 65535 105267      257      160    28672        0          -998 pause
Jun 27 04:51:46 node1 kernel: [ 6808.354301] [ 146112]     0 146112   350201    49856   520192        0           998 node-cache
Jun 27 04:51:46 node1 kernel: [ 6808.354308] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-333272b557613d02521025c1f59b745a50690a7671f5ec1324be4b7a0daf8b8f.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9482d960_b335_4835_a0d2_45b4909bd15a.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9482d960_b335_4835_a0d2_45b4909bd15a.slice/cri-containerd-333272b557613d02521025c1f59b745a50690a7671f5ec1324be4b7a0daf8b8f.scope,task=node-cache,pid=146112,uid=0
Jun 27 04:51:46 node1 kernel: [ 6808.354416] Memory cgroup out of memory: Killed process 146112 (node-cache) total-vm:1400804kB, anon-rss:177024kB, file-rss:22400kB, shmem-rss:0kB, UID:0 pgtables:508kB oom_score_adj:998
Jun 27 04:51:46 node1 kernel: [ 6808.355166] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-od9482d960_b335_4835_a0d2_45b4909bd15a.slice/cri-containerd-333272b557613d02521025c1f59b745a50690a7671f5ec1324be4b7a0daf8b8f.scope are going to be killed due to memory.oom.group set
Jun 27 04:51:46 node1 kernel: [ 6808.355181] Memory cgroup out of memory: Killed process 146130 (node-cache) total-vm:1400804kB, anon-rss:177024kB, file-rss:22400kB, shmem-rss:0kB, UID:0 pgtables:508kB oom_score_adj:998

エラーメッセージからコンテナがOOMKillされていることが分かって、これはnodelocaldns Podの状況と一致しました。

https://github.com/kubernetes/dns/issues/334

対応

coredns.ioの記述に従って cm/nodelocaldns ファイルを編集し、:53 { } 内部の forward . /etc/resolv.conf を forward . 192.168.1.1 と直接LANのDNSサーバーを指定しました。

設定を反映するためにnodelocaldnsのPodsを再起動しています。

nodelocaldnsの再起動

$ kubectl -n kube-system rollout restart daemonset/nodelocaldns

これで改善がみられたため、このまま様子をみています。

状況のまとめ

原因は不明なもののnodelocaldnsでループなどによりDNSクエリが増加したことでOOMKillが発生、またゲートウェイの負荷が上昇していることでパケットのDropなど様々な不具合を生じていることが想定できました。

原因の調査

結局のところcoredns.ioに記載されていた対応によって復旧したのですが、他のシステムでも/etc/resolv.confを参照している状況は同じで、いずれのシステムもdnsmasqのDNCPdの配下にあるなど状況も一致していたためどうしてこのような状況になったのかはっきりとは分かっていません。

変更後のconfigmap/nodelocaldnsオブジェクトの抜粋

    .:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.25.10
        forward . 192.168.1.1
        prometheus :9253
    }

他のk8sクラスターのnodelocaldnsの設定を確認しましたが、いずれも/etc/resolv.confを参照して問題は発生していません。

他のクラスターのデフォルト設定のconfigmap/nodelocaldnsオブジェクトの抜粋

    .:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.25.10
        forward . /etc/resolv.conf
        prometheus :9253
    }

dnsmasqが稼動するゲートウェイのapt/history.logを確認してもdnsmasqが更新されているといった状況はありませんでした。

追加調査の結果

問題が発生したシステムの特徴としては登録したユーザーにRBACで権限を付与した作業用のnamespaceとingressからリクエストをforwardするserviceオブジェクトを自動的に作成するようになっています。

このため宣言されているものの、接続が閉じていないingressの設定が相当数ある状況でした。

Ingress(nginx)から設定ファイルにあるproxy_passの先が存在するか確認するため、多くのDNSリクエストが発生している様子が観察されました。

この環境のk8sノードはDHCPdからMACアドレスをベースにIPアドレスが払い出されるようになっているため、FQDNを構成するためのdomainnameが自動的にDHCPのsearch-domainとして渡されるようになっています。

問題が発生した環境のノードに配置されている/etc/resolv.confファイル

nameserver 127.0.0.53
options edns0 trust-ad
search default.svc.cluster.local svc.cluster.local example.org

systemd-resolvedが有効になっているため、ファイルの実体は/run/systemd/resolve/stub-resolv.confです。

このsearch行に余計なドメインが入っているためk8sが名前解決できないホスト名の名前解決がLANのdnsmasqに渡ってしまっています。

dnsmasqサーバーのsyslog行から抜粋

Aug 23 00:46:56 ub2204 dnsmasq[8403]: config user1-svc.user1.svc.cluster.local.example.org is NXDOMAIN
Aug 23 00:46:56 ub2204 dnsmasq[8403]: query[AAAA] user2-svc.user2.svc.cluster.local.example.org from 192.168.1.26
Aug 23 00:46:56 ub2204 dnsmasq[8403]: config user2-svc.user2.svc.cluster.local.example.org is NXDOMAIN
Aug 23 00:46:56 ub2204 dnsmasq[8403]: query[AAAA] user3-svc.user3.svc.cluster.local.example.org from 192.168.1.23

当初はdnsmasq側で上位DNSにqueryを投げずに、すぐにNXDOMAINを返すような設定にする対応をしていたのですが、これらの設定が障害の要因になっているような気がします。

NXDOMAINが返っているのにnodelocaldnsがOOMKIllになることは、追加の調査が必要そうです。

nodelocaldnsがなぜOOMKillやLoopを生じたような状況になってしまったのか

問題のあったノードでresolvectlコマンドを実行すると次のようになっていました。

resolvectl statusコマンドの実行結果から抜粋

$ resolvectl status
Global
         Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
  resolv.conf mode: stub
Current DNS Server: 169.254.25.10
       DNS Servers: 169.254.25.10

Link 2 (enp1s0)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.1.1
       DNS Servers: 192.168.1.1
        DNS Domain: example.org

元々のnodelocaldnsの設定で、/etc/resolv.confをみるようになっていたのでDNSのクエリーがループする可能性はあるのですが、他の正常なノードでも同様の設定になっているので単純にDNSクエリの数が多かったというのが根本原因のような気がします。

K8sクラスターのDNSアーキテクチャ

K8sにおけるDNSの仕様については下記の文書に記載があります。

https://github.com/kubernetes/dns/blob/master/docs/specification.md

実際には上記の文書は、基本的な内容のみなので他の文書も探してみます。

これらの文書から、各Podはnodelocaldnsを通してcorednsなどに名前解決を依頼する仕組みであることが分かります。

今回問題が発生した状況で/etc/resolv.confを中心にどのように設定を確認していきます。

containerdによる各PodのDNS設定

実行される各Podの/etc/resolv.confはcontainerdによってシステムのsearchを引き継ぐ形になります。

各Podに設定される/etc/resolv.confの内容

search kubecamp.svc.cluster.local svc.cluster.local cluster.local default.svc.cluster.local example.org
nameserver 169.254.25.10
options ndots:5

nodelocaldnsのDNS設定

各Podが直接アクセスするnodelocaldnsはkube-system namespaceで稼動しますが、設定されるresolv.confの内容は次のようになっています。

nodelocaldnsの/etc/resolv.conf設定

search default.svc.cluster.local svc.cluster.local example.org
nameserver 169.254.25.10
nameserver 192.168.110.10

この192.168.110.10はLANで稼動しているdnsmasqを指しています。

nodelocaldnsが参照するCorefileは通常の内容で一時的に変更したforward行は元に戻しています。

cm/nodelocaldnsに格納されたCorefileの抜粋

...
    cluster.local:53 {
        errors
        cache {
            success 9984 30
            denial 9984 5
        }
        reload
        loop
        bind 169.254.25.10
        forward . 10.233.0.3 {
            force_tcp
        }
        prometheus :9253
        health 169.254.25.10:9254
    }
    ...
    .:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.25.10
        forward . /etc/resolv.conf
        prometheus :9253
    }

nodelocaldnsの説明をみる限りはingressで発生したNXDOMAINな問い合わせはキャッシュされるはずですが、実際には継続してリクエストが送信されています。

DNSクエリの負荷

tcpdumpでport 53 and host node1のようなフィルターで内容を観察すると、1つの行き先のないIngressのExternalName定義あたり、1つのnodelocaldns Podから毎分220件程度のリクエストが観察されています。

全体では16件の接続できないExternalName定義が存在しています。

port番号で該当ホストをlsofコマンドで調査すると、外部のdnsmasqに問い合わせをしているプロセスはnodelocaldnsだけでした。

1つのホストで$ ss -a | grep :domain | wc -lのようなコマンドで確認すると、corednsへの接続も含めてDNS関連で70件程度の接続が常に張られているようになっています。

Ingress側のエラーログを確認すると毎分4000件程度のdns_lookup(): failed to query the DNS server for メッセージが記録されています。

dnsmasqの負荷はそれほどでもありませんが、ログファイルへの書き出しが大量に行われています。

kubectl top podでnodelocaldnsの負荷を確認すると、障害が発生していない状況では次のようになっています。

kuubectl top podの出力抜粋

NAME                                       CPU(cores)   MEMORY(bytes)
...
nodelocaldns-7695x                         109m         11Mi            
nodelocaldns-dkpgc                         153m         11Mi            
nodelocaldns-g4p9h                         946m         16Mi            
nodelocaldns-mq7qr                         106m         11Mi            
nodelocaldns-qw29d                         253m         15Mi            
nodelocaldns-rhwbf                         377m         12Mi

ここまでで負荷が高いであろうことは推測できますが、具体的にループが発生する状況の確認や再現はできませんでした。

最終的な対応

これまではk8sノードも含めてdnsmasqからMACアドレスをベースにして固定IPを払い出していましたが、k8sノードについてはnetplanを利用して固定IPを各ノードに設定するように変更しました。

dnsmasqへのDNSクエリが激減したことで、gateway serverの負荷はわずかですが下がっています。

またdnsmasq固有の設定として、filter-AAAAを有効にしてIPv6の名前解決についてUpstreamにフォワードしない設定を追加しました。

しばらく様子をみていましたが、問題が再現することなく動作しています。

今回の現象はIngressの設定に端を発して、名前解決の件数が増加したことで顕在化したものだろうと思います。

各コンテナがK8sノードの/etc/resolv.confの設定を引き継ぐ点、DHCPサーバーから渡された不要なドメイン名がノードの/etc/resolv.confに含まれてしまう点などを見落していたことが原因だったのだろうと考えています。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up