
Installing metrics-server on a Kubernetes cluster to enable kubectl top and HPA (Horizontal Pod Autoscaler)

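If you build your own Kubernetes cluster with kubeadm or a similar tool and start trying out features, you will probably run into two problems on a fresh cluster. kubectl top [node/pod] returns no results. kubectl autoscale also never scales anything, because it cannot watch Pod metrics such as CPU usage.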

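According to the official HPA documentation, from Kubernetes v1.11 onward you only need to run metrics-server (earlier versions used Heapster, which is now deprecated). When I tried it, though, the metrics still could not be retrieved.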

I could not find a definitive Japanese article on this. I eventually solved it by reading GitHub issues and English blog posts, so I am writing it down here.


TL;DR

Add four lines to the metrics-server Deployment in metrics-server-deployment.yaml so that the container's command runs with extra options, then apply it.


metrics-server-deployment.yaml

 apiVersion: extensions/v1beta1
 kind: Deployment
 metadata:
   name: metrics-server
   namespace: kube-system
   labels:
     k8s-app: metrics-server
 spec:
   selector:
     matchLabels:
       k8s-app: metrics-server
   template:
     metadata:
       name: metrics-server
       labels:
         k8s-app: metrics-server
     spec:
       serviceAccountName: metrics-server
       volumes:
       # mount in tmp so we can safely use from-scratch images and/or read-only containers
       - name: tmp-dir
         emptyDir: {}
       containers:
       - name: metrics-server
         image: k8s.gcr.io/metrics-server-amd64:v0.3.2
         imagePullPolicy: Always
+        command:
+        - /metrics-server
+        - --kubelet-insecure-tls
+        - --kubelet-preferred-address-types=InternalDNS,InternalIP,ExternalDNS,ExternalIP,Hostname
         volumeMounts:
         - name: tmp-dir
           mountPath: /tmp


Environment

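A while ago the kubernetes.io blog published a post on building a Kubernetes environment with Vagrant, Ansible, and kubeadm. I built my environment from code based on that post, with a few changes. (If you want to use it, adjust the machine specs and other settings to your environment.)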

https://github.com/chataro0/k8s_on_vagrant_by_ansible

Once deployment finishes, the cluster looks like this:

$ kubectl get node
NAME         STATUS   ROLES    AGE   VERSION
k8s-master   Ready    master   45d   v1.14.1
node-1       Ready    <none>   45d   v1.14.1
node-2       Ready    <none>   45d   v1.14.1


Before

First, running kubectl top on the fresh cluster naturally returns nothing. (Note that the error message shows kubectl still looking for http:heapster.)

$ kubectl top node
Error from server (NotFound): the server could not find the requested resource (get services http:heapster:)

$ kubectl top pod
Error from server (NotFound): the server could not find the requested resource (get services http:heapster:)


Installation

Clone the metrics-server source from GitHub, or download an archive from the releases page.

(As an aside, the release is v0.3.3, but for some reason the container image it references is v0.3.2.)

$ wget https://github.com/kubernetes-incubator/metrics-server/archive/v0.3.3.tar.gz
$ tar xzf v0.3.3.tar.gz

Next, run the following command to start metrics-server. With the default settings, however...

$ kubectl apply -f metrics-server-0.3.3/deploy/1.8+/
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
serviceaccount/metrics-server created
deployment.extensions/metrics-server created
service/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created

...the values still cannot be retrieved, as shown below, and the logs show errors.

$ kubectl top node
error: metrics not available yet
$ kubectl top pod
W0528 08:32:38.400545 14926 top_pod.go:259] Metrics not available for pod default/busybox-b6f5c9d7c-kfzcr, age: 2h10m48.394664877s
error: Metrics not available for pod default/busybox-b6f5c9d7c-kfzcr, age: 2h10m48.394664877s
$ kubectl get hpa
NAME   REFERENCE        TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
web    Deployment/web   <unknown>/20%   1         10        1          15m
$ kubectl logs -n kube-system metrics-server-548456b4cd-p2w2n
I0528 08:29:38.244495 1 serving.go:273] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
[restful] 2019/05/28 08:29:38 log.go:33: [restful/swagger] listing is available at https://:443/swaggerapi
[restful] 2019/05/28 08:29:38 log.go:33: [restful/swagger] https://:443/swaggerui/ is mapped to folder /swagger-ui/
I0528 08:29:38.879673 1 serve.go:96] Serving securely on [::]:443
E0528 08:30:19.226871 1 reststorage.go:129] unable to fetch node metrics for node "node-2": no metrics known for node
E0528 08:30:19.226889 1 reststorage.go:129] unable to fetch node metrics for node "k8s-master": no metrics known for node
E0528 08:30:19.226893 1 reststorage.go:129] unable to fetch node metrics for node "node-1": no metrics known for node
E0528 08:30:28.811230 1 reststorage.go:148] unable to fetch pod metrics for pod default/busybox-b6f5c9d7c-kfzcr: no metrics known for pod
E0528 08:30:28.811285 1 reststorage.go:148] unable to fetch pod metrics for pod default/web-7b97d49c8c-fknv6: no metrics known for pod
E0528 08:30:39.159993 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:node-1: unable to fetch metrics from Kubelet node-1 (node-1): Get https://node-1:10250/stats/summary/: dial tcp: lookup node-1 on 10.96.0.10:53: server misbehaving, unable to fully scrape metrics from source kubelet_summary:k8s-master: unable to fetch metrics from Kubelet k8s-master (k8s-master): Get https://k8s-master:10250/stats/summary/: dial tcp: lookup k8s-master on 10.96.0.10:53: server misbehaving, unable to fully scrape metrics from source kubelet_summary:node-2: unable to fetch metrics from Kubelet node-2 (node-2): Get https://node-2:10250/stats/summary/: dial tcp: lookup node-2 on 10.96.0.10:53: server misbehaving]
E0528 08:31:38.956402 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:k8s-master: unable to fetch metrics from Kubelet k8s-master (k8s-master): Get https://k8s-master:10250/stats/summary/: dial tcp: lookup k8s-master on 10.96.0.10:53: server misbehaving, unable to fully scrape metrics from source kubelet_summary:node-2: unable to fetch metrics from Kubelet node-2 (node-2): Get https://node-2:10250/stats/summary/: dial tcp: lookup node-2 on 10.96.0.10:53: server misbehaving, unable to fully scrape metrics from source kubelet_summary:node-1: unable to fetch metrics from Kubelet node-1 (node-1): Get https://node-1:10250/stats/summary/: dial tcp: lookup node-1 on 10.96.0.10:53: server misbehaving]
E0528 08:32:28.029247 1 reststorage.go:129] unable to fetch node metrics for node "k8s-master": no metrics known for node
E0528 08:32:28.029419 1 reststorage.go:129] unable to fetch node metrics for node "node-1": no metrics known for node
E0528 08:32:28.029488 1 reststorage.go:129] unable to fetch node metrics for node "node-2": no metrics known for node
E0528 08:32:32.997577 1 reststorage.go:148] unable to fetch pod metrics for pod default/busybox-b6f5c9d7c-kfzcr: no metrics known for pod
E0528 08:32:32.997600 1 reststorage.go:148] unable to fetch pod metrics for pod default/web-7b97d49c8c-fknv6: no metrics known for pod
E0528 08:32:39.136985 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:node-2: unable to fetch metrics from Kubelet node-2 (node-2): Get https://node-2:10250/stats/summary/: dial tcp: lookup node-2 on 10.96.0.10:53: server misbehaving, unable to fully scrape metrics from source kubelet_summary:k8s-master: unable to fetch metrics from Kubelet k8s-master (k8s-master): Get https://k8s-master:10250/stats/summary/: dial tcp: lookup k8s-master on 10.96.0.10:53: server misbehaving, unable to fully scrape metrics from source kubelet_summary:node-1: unable to fetch metrics from Kubelet node-1 (node-1): Get https://node-1:10250/stats/summary/: dial tcp: lookup node-1 on 10.96.0.10:53: server misbehaving]


Investigation

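I searched metrics-server's GitHub issues for the error messages and read the blog posts that came up. From those, the problem appeared to be TLS certificate errors and failing name resolution.

As its Dockerfile shows, the Pod created by metrics-server-deployment runs the /metrics-server command as its entry point with no options. The fix is to add the following two options to it: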
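On the TLS side, --kubelet-insecure-tls makes metrics-server skip verification of the kubelet's serving certificate. By default, kubelets in a kubeadm cluster serve a self-signed certificate. The connection itself is still TLS-encrypted; only the certificate check is dropped. As a rough analogy only, the sketch below shows the same relaxation for a client built with Python's standard ssl module. The helper name kubelet_client_context is hypothetical, and this is not metrics-server's code (metrics-server is written in Go).

```python
import ssl


def kubelet_client_context(insecure: bool) -> ssl.SSLContext:
    """Build a TLS client context, loosely mirroring --kubelet-insecure-tls.

    Hypothetical helper for illustration; not part of metrics-server.
    """
    ctx = ssl.create_default_context()  # verifies certificate chain and hostname
    if insecure:
        # check_hostname must be turned off before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE  # accept self-signed or unknown certificates
    return ctx


strict = kubelet_client_context(insecure=False)
relaxed = kubelet_client_context(insecure=True)
print(strict.verify_mode == ssl.CERT_REQUIRED, strict.check_hostname)   # True True
print(relaxed.verify_mode == ssl.CERT_NONE, relaxed.check_hostname)     # True False
```

Traffic is still encrypted, but the client no longer proves it is talking to the real kubelet. That trade-off is acceptable on a local Vagrant cluster.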


  • --kubelet-preferred-address-types=InternalDNS,InternalIP,ExternalDNS,ExternalIP,Hostname

  • --kubelet-insecure-tls

(As the blog post mentioned, --kubelet-preferred-address-types=InternalIP alone also worked fine, but here I list several types.)
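The address-type option is an ordered preference list. For each node, metrics-server connects to the kubelet at the node address whose type appears earliest in the list. The failing log lines above show it requesting https://node-1:10250/stats/summary/, using the node name node-1. The cluster DNS at 10.96.0.10 could not resolve that name. Putting InternalIP ahead of Hostname avoids the DNS lookup entirely.

The sketch below models that choice. The node addresses (including the IP 192.168.50.11) and the hostname-first ordering are hypothetical examples, not values read from this cluster or from metrics-server's documented defaults.

```python
# Hypothetical status.addresses for node-1 in a kubeadm cluster (IP is made up).
NODE_1_ADDRESSES = [
    {"type": "InternalIP", "address": "192.168.50.11"},
    {"type": "Hostname", "address": "node-1"},
]


def pick_kubelet_address(addresses, preferred_types):
    """Return the address whose type comes first in preferred_types.

    Simplified sketch of an address-type preference list; not metrics-server code.
    """
    for addr_type in preferred_types:
        for entry in addresses:
            if entry["type"] == addr_type:
                return entry["address"]
    raise LookupError("node has none of the preferred address types")


def summary_url(address, port=10250):
    # Same URL shape as the failing requests in the metrics-server log.
    return f"https://{address}:{port}/stats/summary/"


hostname_first = ["Hostname", "InternalDNS", "InternalIP", "ExternalDNS", "ExternalIP"]
internal_first = ["InternalDNS", "InternalIP", "ExternalDNS", "ExternalIP", "Hostname"]

print(summary_url(pick_kubelet_address(NODE_1_ADDRESSES, hostname_first)))
print(summary_url(pick_kubelet_address(NODE_1_ADDRESSES, internal_first)))
```

The first call yields a URL that depends on resolving "node-1". The second yields one that uses the IP directly. InternalDNS is listed first in the article's setting, but the fallthrough reaches InternalIP here because kubeadm nodes typically do not report an InternalDNS address.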


Fix & Apply

Concretely, as in the TL;DR, add deployment.spec.template.spec.containers[0].command to the Deployment.


metrics-server-0.3.3/deploy/1.8+/metrics-server-deployment.yaml

 apiVersion: extensions/v1beta1
 kind: Deployment
 metadata:
   name: metrics-server
   namespace: kube-system
   labels:
     k8s-app: metrics-server
 spec:
   selector:
     matchLabels:
       k8s-app: metrics-server
   template:
     metadata:
       name: metrics-server
       labels:
         k8s-app: metrics-server
     spec:
       serviceAccountName: metrics-server
       volumes:
       # mount in tmp so we can safely use from-scratch images and/or read-only containers
       - name: tmp-dir
         emptyDir: {}
       containers:
       - name: metrics-server
         image: k8s.gcr.io/metrics-server-amd64:v0.3.2
         imagePullPolicy: Always
+        command:
+        - /metrics-server
+        - --kubelet-insecure-tls
+        - --kubelet-preferred-address-types=InternalDNS,InternalIP,ExternalDNS,ExternalIP,Hostname
         volumeMounts:
         - name: tmp-dir
           mountPath: /tmp
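Why does the added command start with /metrics-server? In a Pod spec, command replaces the image's ENTRYPOINT, and the image's CMD is then ignored. By contrast, args replaces only the CMD. The sketch below encodes those documented Kubernetes rules in a hypothetical helper, effective_argv, and applies them to the flags from the YAML. It assumes the image has ENTRYPOINT ["/metrics-server"], as described above, and an empty CMD. The args-only variant is an alternative I have not tested against this image.

```python
def effective_argv(image_entrypoint, image_cmd, command=None, args=None):
    """Return the argv a container runs, per the Kubernetes command/args rules."""
    if command is None and args is None:
        return list(image_entrypoint) + list(image_cmd)  # image defaults
    if args is None:
        return list(command)  # command only: image ENTRYPOINT and CMD both ignored
    if command is None:
        return list(image_entrypoint) + list(args)  # args replace only the image CMD
    return list(command) + list(args)  # both set: image defaults ignored


flags = [
    "--kubelet-insecure-tls",
    "--kubelet-preferred-address-types=InternalDNS,InternalIP,ExternalDNS,ExternalIP,Hostname",
]
entrypoint = ["/metrics-server"]  # per the Dockerfile; image CMD assumed empty

with_command = effective_argv(entrypoint, [], command=["/metrics-server"] + flags)
without_binary = effective_argv(entrypoint, [], command=flags)
with_args = effective_argv(entrypoint, [], args=flags)

print(with_command == with_args)  # True
print(without_binary[0])          # --kubelet-insecure-tls, which is not a program
```

Leaving /metrics-server out of command would make the container try to execute the first flag. Putting the flags under args instead would keep the image's entry point.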

Edit the file, then apply the directory, either as a fresh install or as an update (the example below is an update).

$ kubectl apply -f metrics-server-0.3.3/deploy/1.8+/
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader unchanged
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator unchanged
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader unchanged
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io unchanged
serviceaccount/metrics-server unchanged
deployment.extensions/metrics-server configured
service/metrics-server unchanged
clusterrole.rbac.authorization.k8s.io/system:metrics-server unchanged
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server unchanged


After

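Right after applying, the values still cannot be retrieved (I am not sure why; perhaps not enough metrics have been collected yet). After about a minute, the various metrics become available.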

$ kubectl top node
NAME         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k8s-master   415m         20%    734Mi           82%
node-1       645m         32%    586Mi           65%
node-2       81m          4%     587Mi           65%
$ kubectl top pod
NAME                      CPU(cores)   MEMORY(bytes)
busybox-b6f5c9d7c-kfzcr   1m           0Mi
web-7b97d49c8c-7dd6z      0m           2Mi
web-7b97d49c8c-cdtwg      0m           3Mi
web-7b97d49c8c-fknv6      201m         2Mi
web-7b97d49c8c-jdzxn      0m           2Mi
web-7b97d49c8c-v4rsj      0m           2Mi
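The wait of about a minute lines up with the earlier metrics-server log. There, the scrape attempts (the manager.go:111 lines) are roughly 60 seconds apart, so kubectl top presumably has nothing to show until the first successful scrape. The scrape period is set by metrics-server's --metric-resolution flag. I believe it defaults to 60s in v0.3.x, but check --help for your version. The check below measures the spacing of the three timestamps copied from that log. The helper name intervals_seconds is hypothetical.

```python
from datetime import datetime

# Timestamps of the "unable to fully collect metrics" lines in the log above.
SCRAPE_ATTEMPTS = ["0528 08:30:39.159993", "0528 08:31:38.956402", "0528 08:32:39.136985"]


def intervals_seconds(stamps):
    # The klog prefix is MMDD hh:mm:ss.uuuuuu; the year is not logged.
    times = [datetime.strptime(s, "%m%d %H:%M:%S.%f") for s in stamps]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


print(intervals_seconds(SCRAPE_ATTEMPTS))  # [59.796409, 60.180583]
```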

Next, I set up a Deployment and an HPA. (Unless a CPU limit or request was set under resources in the Deployment, the HPA target stayed at <unknown>.)


deploy.yml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: web
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      run: web
  template:
    metadata:
      labels:
        run: web
    spec:
      containers:
      - image: nginx
        name: web
        resources:
          limits:
            cpu: 200m
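The <unknown> target comes from how the HPA measures CPU. Utilization is a Pod's CPU usage as a percentage of its CPU request; the events below even say "percentage of request". A Pod with no request therefore gives the HPA nothing to divide by. Setting only limits, as in deploy.yml, works because Kubernetes defaults a container's request to its limit when no request is given.

The sketch below computes this for one Pod. The helper names are hypothetical, only plain quantities such as 200m or 1 are parsed, and it simplifies the HPA, which averages across all Pods.

```python
from typing import Optional


def parse_cpu_millicores(quantity: str) -> int:
    """Parse plain CPU quantities like "200m" or "1" (other forms are not handled)."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def effective_cpu_request(requests: dict, limits: dict) -> Optional[int]:
    # With only a limit set, Kubernetes defaults the request to the limit.
    quantity = requests.get("cpu") or limits.get("cpu")
    return parse_cpu_millicores(quantity) if quantity else None


def cpu_utilization_percent(usage_m: int, requests: dict, limits: dict) -> Optional[float]:
    request_m = effective_cpu_request(requests, limits)
    if request_m is None:
        return None  # no request to divide by, so the target cannot be computed
    return 100.0 * usage_m / request_m


# web-7b97d49c8c-fknv6 used 201m in the kubectl top output; deploy.yml sets limits.cpu: 200m.
print(cpu_utilization_percent(201, {}, {"cpu": "200m"}))  # 100.5
print(cpu_utilization_percent(201, {}, {}))               # None
```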


hpa.yml

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  maxReplicas: 10
  minReplicas: 1
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: web
  targetCPUUtilizationPercentage: 20

With these applied, the Deployment autoscaled as expected, and the event log shows the scaling as it happened.

$ kubectl get hpa
NAME   REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
web    Deployment/web   19%/20%   1         10        5          99m
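The HPA documentation gives the core scaling rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). Within a default tolerance of 0.1 around the target, no scaling happens. The sketch below applies that rule, clamped to minReplicas and maxReplicas. It ignores Pod readiness handling, missing metrics, and the rate limits the real controller applies. The helper name desired_replicas and the 30% example are hypothetical; the 19%/20% case is the state shown above.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int, tolerance: float = 0.1) -> int:
    """ceil(current * current_util / target_util), clamped; simplified HPA rule."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to the target: leave the replica count alone
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(5, 19, 20, 1, 10))  # 5: ratio 0.95 is within the tolerance
print(desired_replicas(2, 30, 20, 1, 10))  # 3: hypothetical 2 Pods averaging 30%
```

So 5 replicas at 19% against a 20% target is a stable state, which matches REPLICAS 5 in the output.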

$ kubectl get events | sort -r
LAST SEEN   TYPE     REASON              OBJECT                        MESSAGE
8m53s       Normal   SuccessfulRescale   horizontalpodautoscaler/web   New size: 2; reason: cpu resource utilization (percentage of request) above target
8m53s       Normal   ScalingReplicaSet   deployment/web                Scaled up replica set web-7b97d49c8c to 2
8m52s       Normal   SuccessfulCreate    replicaset/web-7b97d49c8c     Created pod: web-7b97d49c8c-v4rsj
8m52s       Normal   Scheduled           pod/web-7b97d49c8c-v4rsj      Successfully assigned default/web-7b97d49c8c-v4rsj to node-2
8m49s       Normal   Pulling             pod/web-7b97d49c8c-v4rsj      Pulling image "nginx"
8m46s       Normal   Started             pod/web-7b97d49c8c-v4rsj      Started container web
8m46s       Normal   Pulled              pod/web-7b97d49c8c-v4rsj      Successfully pulled image "nginx"
8m46s       Normal   Created             pod/web-7b97d49c8c-v4rsj      Created container web
5m59s       Normal   SuccessfulRescale   horizontalpodautoscaler/web   New size: 3; reason: cpu resource utilization (percentage of request) above target
5m59s       Normal   SuccessfulCreate    replicaset/web-7b97d49c8c     Created pod: web-7b97d49c8c-jdzxn
5m59s       Normal   Scheduled           pod/web-7b97d49c8c-jdzxn      Successfully assigned default/web-7b97d49c8c-jdzxn to node-2
5m59s       Normal   ScalingReplicaSet   deployment/web                Scaled up replica set web-7b97d49c8c to 3
5m56s       Normal   Pulling             pod/web-7b97d49c8c-jdzxn      Pulling image "nginx"
5m53s       Normal   Started             pod/web-7b97d49c8c-jdzxn      Started container web
5m53s       Normal   Pulled              pod/web-7b97d49c8c-jdzxn      Successfully pulled image "nginx"
5m53s       Normal   Created             pod/web-7b97d49c8c-jdzxn      Created container web
4m58s       Normal   SuccessfulRescale   horizontalpodautoscaler/web   New size: 5; reason: cpu resource utilization (percentage of request) above target
4m57s       Normal   SuccessfulCreate    replicaset/web-7b97d49c8c     Created pod: web-7b97d49c8c-cdtwg
4m57s       Normal   SuccessfulCreate    replicaset/web-7b97d49c8c     Created pod: web-7b97d49c8c-7dd6z
4m57s       Normal   Scheduled           pod/web-7b97d49c8c-cdtwg      Successfully assigned default/web-7b97d49c8c-cdtwg to node-1
4m57s       Normal   Scheduled           pod/web-7b97d49c8c-7dd6z      Successfully assigned default/web-7b97d49c8c-7dd6z to node-1
4m57s       Normal   ScalingReplicaSet   deployment/web                Scaled up replica set web-7b97d49c8c to 5
4m55s       Normal   Pulling             pod/web-7b97d49c8c-cdtwg      Pulling image "nginx"
4m54s       Normal   Pulling             pod/web-7b97d49c8c-7dd6z      Pulling image "nginx"
4m52s       Normal   Pulled              pod/web-7b97d49c8c-cdtwg      Successfully pulled image "nginx"
4m52s       Normal   Created             pod/web-7b97d49c8c-cdtwg      Created container web
4m51s       Normal   Started             pod/web-7b97d49c8c-cdtwg      Started container web
4m49s       Normal   Pulled              pod/web-7b97d49c8c-7dd6z      Successfully pulled image "nginx"
4m49s       Normal   Created             pod/web-7b97d49c8c-7dd6z      Created container web
4m48s       Normal   Started             pod/web-7b97d49c8c-7dd6z      Started container web
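sort -r orders the LAST SEEN column as text. That happens to give oldest-first order here only because every age is a single-digit number of minutes. Once ages such as 59s or 10m2s are mixed in, text order breaks. kubectl can also sort on the server side with kubectl get events --sort-by=.metadata.creationTimestamp.

The sketch below sorts by the actual duration instead. The age values are hypothetical, the helper name age_seconds is hypothetical, and only the d/h/m/s units that kubectl prints are handled.

```python
import re

_AGE_PART = re.compile(r"(\d+)([dhms])")
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_seconds(age: str) -> int:
    """Convert a kubectl-style age such as "8m53s" or "2h10m" into seconds."""
    return sum(int(n) * _UNIT_SECONDS[unit] for n, unit in _AGE_PART.findall(age))


ages = ["4m57s", "10m2s", "8m53s", "59s"]
print(sorted(ages, reverse=True))                   # text order: ['8m53s', '59s', '4m57s', '10m2s']
print(sorted(ages, key=age_seconds, reverse=True))  # oldest first: ['10m2s', '8m53s', '4m57s', '59s']
```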