
Cadvisor may be breaking Kubernetes worker nodes
Closed, Resolved (Public)

Description

The cadvisor rollout (T108027) seems to break Kubernetes nodes (the kubelet, specifically). When rolled out to ml-staging2001, the kubelet stopped responding and the node was marked unreachable. The kubelet's systemd unit is now aliased to cadvisor and starts that binary instead:

$ systemctl cat kubelet
# /lib/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet Server
Documentation=https://github.com/kubernetes/kubernetes
Documentation=man:kubelet
After=network.target
After=docker.service
Requires=docker.service
Conflicts=cadvisor.service

[Service]
EnvironmentFile=-/etc/default/%p
WorkingDirectory=/var/lib/kubelet
ExecStart=/usr/bin/kubelet $DAEMON_ARGS
Restart=on-failure

[Install]
WantedBy=multi-user.target
Alias=cadvisor.service

# /etc/systemd/system/cadvisor.service.d/puppet-override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/cadvisor --listen_ip=10.192.0.201 --port=4194 --enable_metrics=accelerator,app,cpu,disk,diskIO,memory,network,oom_event,perf_event

In contrast, on an unaffected machine:

$ systemctl cat kubelet
# /lib/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet Server
Documentation=https://github.com/kubernetes/kubernetes
Documentation=man:kubelet
After=network.target
After=docker.service
Requires=docker.service
Conflicts=cadvisor.service

[Service]
EnvironmentFile=-/etc/default/%p
WorkingDirectory=/var/lib/kubelet
ExecStart=/usr/bin/kubelet $DAEMON_ARGS
Restart=on-failure

[Install]
WantedBy=multi-user.target
Alias=cadvisor.service

Note how it still has Alias=cadvisor.service, but runs a different binary.
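
For context on the mechanism (a sketch; the output below is illustrative rather than captured from these hosts): the Alias=cadvisor.service line makes systemd create cadvisor.service as a symlink to the kubelet unit file when the unit is enabled (e.g. at package install time), so a drop-in under /etc/systemd/system/cadvisor.service.d/ is applied to kubelet.service as well, replacing its ExecStart with the cadvisor one.

$ ls -l /etc/systemd/system/cadvisor.service
... /etc/systemd/system/cadvisor.service -> /lib/systemd/system/kubelet.service
$ ls /etc/systemd/system/cadvisor.service.d/
puppet-override.conf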

Event Timeline

Reporting what we discussed on IRC (me, Filippo, and Tobias):

  • The kubelet systemd unit seems to have an Alias to cadvisor.service.
  • When we apply an override to cadvisor via Puppet, the kubelet gets its ExecStart overridden too (through the alias) and effectively disappears.

The Alias seems to be from this commit:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/kubernetes/+/477ac9742258bf26348b269befb06db828978b98%5E%21/debian/kubernetes-node.kubelet.service

It was added a long time ago; maybe we should remove the Alias? From what I can see in upstream configs etc., I don't find a reason to have it, but let's be sure.

@akosiaris thoughts?

This problem of course affects all hosts running cadvisor and kubelet at the moment, i.e.

# cumin 'P{R:package = cadvisor} and P{R:package = kubernetes-node}' 
5 hosts will be targeted:
dse-k8s-worker[1006-1007].eqiad.wmnet,kubernetes2011.codfw.wmnet,ml-serve1005.eqiad.wmnet,ml-staging2001.codfw.wmnet

My suggested fix is to remove /etc/systemd/system/cadvisor.service if it is a symlink; this way the kubelet will be able to start, and systemd won't reinstall the symlink until kubernetes-node is upgraded (and we'll remove Alias=cadvisor.service from its next version).
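
Concretely, something along these lines per affected host (a sketch of the manual fix, not necessarily the exact commands run):

# [ -L /etc/systemd/system/cadvisor.service ] && rm /etc/systemd/system/cadvisor.service
# systemctl daemon-reload
# systemctl restart kubelet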

Mentioned in SAL (#wikimedia-operations) [2023-05-31T13:35:20Z] <godog> rm cadvisor.service symlink/alias and restart kubelet on affected hosts - T337836

Change 924963 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/debs/kubernetes@v1.23] debian: remove cadvisor from the kubelet's systemd unit

https://gerrit.wikimedia.org/r/924963

My fix proved to be only a partial one: I had to apply a version of https://gerrit.wikimedia.org/r/c/operations/debs/kubernetes/+/924963 in advance on the five affected hosts, which definitely fixed the issue.

Hey, I guess I could also comment on T108027, but this is probably a better place. So, copying from my response in the gerrit change pointed to above:

The reason was that until 1.12 the kubelet had cadvisor integrated into it. One could pass --cadvisor-port and get a fully functional cadvisor web UI. We had experimented with that, but never enabled it fully, as prometheus+grafana were clearly on the way to becoming a better UI and the kubelet was exposing the cadvisor metrics anyway. cadvisor is still integrated into the kubelet and afaik there are no plans to remove that, but the --cadvisor-port argument no longer applies past 1.12.

Per https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.12.md
  The formerly publicly-available cAdvisor web UI that the kubelet started using
  --cadvisor-port has been entirely removed in 1.12. The recommended way to run cAdvisor
  if you still need it, is via a DaemonSet. (#65707, @dims)

 
That being said, what reason do we have to run cadvisor on kubernetes nodes? The cadvisor metrics are already exported by the kubelet; we are duplicating effort here.

All of the metrics listed in T108027 are exposed by the kubelet and have been scraped by prometheus@k8s for years now. We even have tons of dashboards referencing them in grafana. So, while the gerrit change is technically ok, as in "we should be able to run cadvisor and the kubelet simultaneously", the duplication quite possibly isn't worth it.
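
For reference, the kubelet serves those cAdvisor metrics itself on its secure port; something like the following shows them on a node (a hedged sketch: TOKEN is a placeholder for credentials allowed to read the kubelet metrics endpoints, and the exact auth setup depends on the local configuration):

$ curl -sk -H "Authorization: Bearer $TOKEN" https://localhost:10250/metrics/cadvisor | grep -c '^container_'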

Thank you for the context; it is true that the kubelet-exposed metrics would be duplicated. In my mind the main advantage of having the duplication in ops is twofold:

  • Isolation of k8s-specific metrics (i.e. in "base" cadvisor we won't export the container-specific labels attached by k8s; see the illustrative metric lines after this list)
  • Symmetry/consistency: in the same spirit as the k8s-backed dashboards, one of my goals is to have the same for the rest of the fleet; having the metrics in ops too helps with that because there's no need to special-case k8s hosts
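
As a hypothetical illustration of the first point (metric lines invented for the example, not output from our hosts): the kubelet-exposed series carry the Kubernetes labels, while a host-level ("base") cadvisor identifies the same series only by cgroup id/name.

# via the kubelet's /metrics/cadvisor (k8s-specific labels attached)
container_cpu_usage_seconds_total{namespace="...",pod="...",container="..."} ...
# via a host-level cadvisor (no k8s-specific labels)
container_cpu_usage_seconds_total{id="/system.slice/...",name="..."} ...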

hope that makes sense!

Change 924963 merged by Alexandros Kosiaris:

[operations/debs/kubernetes@v1.23] debian: remove cadvisor from the kubelet's systemd unit

https://gerrit.wikimedia.org/r/924963

Hi, yes that makes sense. However, we've never run cadvisor and the kubelet side-by-side, and although we probably won't have any problems, I'd like us to proceed carefully with this, enabling it cluster by cluster and leaving wikikube for the end.

Package built and uploaded to apt. Package tested now on kubernetes1017. I guess the next steps are to install it everywhere, reload systemd, restart the kubelet for good measure, and then start scaling out the cadvisor deployment. Objections?
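
In practice I'm thinking of something of this shape (a sketch only; the real rollout would go through the usual tooling, with batching and version pinning omitted here):

# cumin 'P:kubernetes::node' 'apt-get install -y kubernetes-node'
# cumin 'P:kubernetes::node' 'systemctl daemon-reload && systemctl restart kubelet'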

Mentioned in SAL (#wikimedia-operations) [2023-06-02T09:18:21Z] <akosiaris> update kubernetes-node to 1.23.14-2 on all P:kubernetes::node hosts (88 in total) T337836. Reload systemd for unit changes to take effect

+1 thanks a lot for the work!

That makes sense to me!

Thank you, no objections on my end

With @BTullis we hit this again on kubestage1004 today. I had to re-verify everything here, and apparently, despite the package having been upgraded to no longer contain Alias=cadvisor.service, the symlink wasn't deleted (interestingly, it does get removed on package removal). The reason it was just kubestage1004 and not kubestage1003 as well is presumably the 20% rollout percentage.

Anyway, I've gone ahead and deleted the symlink across the fleet with a cumin call. The following 14 nodes already did not have the symlink:

aux-k8s-worker1002.eqiad.wmnet,dse-k8s-worker[1001,1006-1007].eqiad.wmnet,kubernetes[2011-2012,2014,2019].codfw.wmnet,kubernetes1006.eqiad.wmnet,kubestage1004.eqiad.wmnet,ml-serve[2003,2008].codfw.wmnet,ml-serve1005.eqiad.wmnet,ml-staging2001.codfw.wmnet

I think we are ok on this one now.
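
For completeness, a quick fleet-wide check for leftover symlinks could look something like this (a sketch, not the exact cumin invocation used; hosts printing nothing are clean):

# cumin 'P:kubernetes::node' 'ls -l /etc/systemd/system/cadvisor.service 2>/dev/null || true'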

Change 942420 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: finish cadvisor rollout on k8s-aux

https://gerrit.wikimedia.org/r/942420

Change 942420 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: finish cadvisor rollout on k8s-aux

https://gerrit.wikimedia.org/r/942420

Change 942426 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: complete cadvisor rollout on k8s

https://gerrit.wikimedia.org/r/942426

Change 942426 merged by JMeybohm:

[operations/puppet@production] hieradata: complete cadvisor rollout on k8s

https://gerrit.wikimedia.org/r/942426

JMeybohm claimed this task.

We would like to have the data to make a more informed decision in T277876, so I took the liberty of rolling out the remaining patch after discussing with @fgiunchedi. Manual validation on wikikube, dse and ml nodes looks fine.