
Cadvisor may be breaking Kubernetes worker nodes
Closed, Resolved (Public)

Description

The cadvisor rollout (T108027) seems to break Kubernetes nodes (the kubelet, specifically). When rolled out to ml-staging2001, the kubelet stopped responding and the node was marked unreachable. The kubelet's systemd unit is now aliased to cadvisor and starts that binary instead:

$ systemctl cat kubelet
# /lib/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet Server
Documentation=https://github.com/kubernetes/kubernetes
Documentation=man:kubelet
After=network.target
After=docker.service
Requires=docker.service
Conflicts=cadvisor.service

[Service]
EnvironmentFile=-/etc/default/%p
WorkingDirectory=/var/lib/kubelet
ExecStart=/usr/bin/kubelet $DAEMON_ARGS
Restart=on-failure

[Install]
WantedBy=multi-user.target
Alias=cadvisor.service

# /etc/systemd/system/cadvisor.service.d/puppet-override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/cadvisor --listen_ip=10.192.0.201 --port=4194 --enable_metrics=accelerator,app,cpu,disk,diskIO,memory,network,oom_event,perf_event

In contrast, on an unaffected machine:

$ systemctl cat kubelet
# /lib/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet Server
Documentation=https://github.com/kubernetes/kubernetes
Documentation=man:kubelet
After=network.target
After=docker.service
Requires=docker.service
Conflicts=cadvisor.service

[Service]
EnvironmentFile=-/etc/default/%p
WorkingDirectory=/var/lib/kubelet
ExecStart=/usr/bin/kubelet $DAEMON_ARGS
Restart=on-failure

[Install]
WantedBy=multi-user.target
Alias=cadvisor.service

Note how it still has Alias=cadvisor.service, but runs a different binary.
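
For context on the mechanism (a sketch; the output below is illustrative rather than captured from these hosts): the Alias=cadvisor.service line makes systemd create cadvisor.service as a symlink to the kubelet unit file when the unit is enabled (e.g. at package install time), so a drop-in under /etc/systemd/system/cadvisor.service.d/ is applied to kubelet.service as well, replacing its ExecStart with the cadvisor one.

$ ls -l /etc/systemd/system/cadvisor.service
... /etc/systemd/system/cadvisor.service -> /lib/systemd/system/kubelet.service
$ ls /etc/systemd/system/cadvisor.service.d/
puppet-override.conf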

Event Timeline

Reporting what we discussed on IRC (me, Filippo, and Tobias):

  • The kubelet systemd unit seems to have an Alias to cadvisor.service.
  • When we apply an override to cadvisor via Puppet, the kubelet gets its ExecStart overridden too (through the alias) and effectively disappears.

The Alias seems to be from this commit:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/kubernetes/+/477ac9742258bf26348b269befb06db828978b98%5E%21/debian/kubernetes-node.kubelet.service

It was added a long time ago; maybe we should remove the Alias? From what I can see in upstream configs etc., I don't find a reason to have it, but let's be sure.

@akosiaris thoughts?

This problem of course affects all hosts running cadvisor and kubelet at the moment, i.e.

# cumin 'P{R:package = cadvisor} and P{R:package = kubernetes-node}' 
5 hosts will be targeted:
dse-k8s-worker[1006-1007].eqiad.wmnet,kubernetes2011.codfw.wmnet,ml-serve1005.eqiad.wmnet,ml-staging2001.codfw.wmnet

My suggested fix is to remove /etc/systemd/system/cadvisor.service if it is a symlink; this way the kubelet will be able to start, and systemd won't reinstall the symlink until kubernetes-node is upgraded (and we'll remove Alias=cadvisor.service from its next version).
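
Concretely, something along these lines per affected host (a sketch of the manual fix, not necessarily the exact commands run):

# [ -L /etc/systemd/system/cadvisor.service ] && rm /etc/systemd/system/cadvisor.service
# systemctl daemon-reload
# systemctl restart kubelet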

Mentioned in SAL (#wikimedia-operations) [2023-05-31T13:35:20Z] <godog> rm cadvisor.service symlink/alias and restart kubelet on affected hosts - T337836

Change 924963 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/debs/kubernetes@v1.23] debian: remove cadvisor from the kubelet's systemd unit

https://gerrit.wikimedia.org/r/924963

My fix proved to be only a partial one: I had to apply a version of https://gerrit.wikimedia.org/r/c/operations/debs/kubernetes/+/924963 in advance on the five affected hosts, which definitely fixed the issue.

Hey, I guess I could also comment on T108027, but this is probably a better place. So, copying from my response in the gerrit change pointed to above:

The reason was that until 1.12 the kubelet had cadvisor integrated into it. One could pass --cadvisor-port and get a fully functional cadvisor web UI. We had experimented with that, but never enabled it fully, as prometheus+grafana were clearly on the way to becoming a better UI and the kubelet was exposing the cadvisor metrics anyway. cadvisor is still integrated into the kubelet and afaik there are no plans to remove that, but the --cadvisor-port argument no longer applies past 1.12.

Per https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.12.md
  The formerly publicly-available cAdvisor web UI that the kubelet started using
  --cadvisor-port has been entirely removed in 1.12. The recommended way to run cAdvisor
  if you still need it, is via a DaemonSet. (#65707, @dims)

 
That being said, what reason do we have to run cadvisor on kubernetes nodes? The cadvisor metrics are already exported by the kubelet; we are duplicating effort here.

All of the metrics listed in T108027 are exposed by the kubelet and have been scraped by prometheus@k8s for years now. We even have tons of dashboards referencing them in grafana. So, while the gerrit change is technically ok, as in "we should be able to run cadvisor and the kubelet simultaneously", the duplication quite possibly isn't worth it.
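
For reference, the kubelet serves those cAdvisor metrics itself on its secure port; something like the following shows them on a node (a hedged sketch: TOKEN is a placeholder for credentials allowed to read the kubelet metrics endpoints, and the exact auth setup depends on the local configuration):

$ curl -sk -H "Authorization: Bearer $TOKEN" https://localhost:10250/metrics/cadvisor | grep -c '^container_'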

Thank you for the context; it is true that the kubelet-exposed metrics would be duplicated. In my mind the main advantage of having the duplication in ops is twofold:

  • Isolation of k8s-specific metrics (i.e. in "base" cadvisor we won't export the container-specific labels attached by k8s; see the illustrative metric lines after this list)
  • Symmetry/consistency: in the same spirit as the k8s-backed dashboards, one of my goals is to have the same for the rest of the fleet; having the metrics in ops too helps with that because there's no need to special-case k8s hosts
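
As a hypothetical illustration of the first point (metric lines invented for the example, not output from our hosts): the kubelet-exposed series carry the Kubernetes labels, while a host-level ("base") cadvisor identifies the same series only by cgroup id/name.

# via the kubelet's /metrics/cadvisor (k8s-specific labels attached)
container_cpu_usage_seconds_total{namespace="...",pod="...",container="..."} ...
# via a host-level cadvisor (no k8s-specific labels)
container_cpu_usage_seconds_total{id="/system.slice/...",name="..."} ...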

hope that makes sense!

Change 924963 merged by Alexandros Kosiaris:

[operations/debs/kubernetes@v1.23] debian: remove cadvisor from the kubelet's systemd unit

https://gerrit.wikimedia.org/r/924963

Hi, yes that makes sense. However, we've never run cadvisor and the kubelet side-by-side, and although we probably won't have any problems, I'd like us to proceed carefully with this, enabling it cluster by cluster and leaving wikikube for the end.

Package built and uploaded to apt. Package tested now on kubernetes1017. I guess the next steps are to install it everywhere, reload systemd, restart the kubelet for good measure, and then start scaling out the cadvisor deployment. Objections?
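
In practice I'm thinking of something of this shape (a sketch only; the real rollout would go through the usual tooling, with batching and version pinning omitted here):

# cumin 'P:kubernetes::node' 'apt-get install -y kubernetes-node'
# cumin 'P:kubernetes::node' 'systemctl daemon-reload && systemctl restart kubelet'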

Mentioned in SAL (#wikimedia-operations) [2023-06-02T09:18:21Z] <akosiaris> update kubernetes-node to 1.23.14-2 on all P:kubernetes::node hosts (88 in total) T337836. Reload systemd for unit changes to take effect

+1 thanks a lot for the work!

That makes sense to me!

Thank you, no objections on my end

With @BTullis we hit this again on kubestage1004 today. I had to re-verify everything here, and apparently, despite the package having been upgraded to no longer contain Alias=cadvisor.service, the symlink wasn't deleted (interestingly, it does get removed on package removal). The reason it was just kubestage1004 and not kubestage1003 as well is presumably the 20% rollout percentage.

Anyway, I've gone ahead and deleted the symlink across the fleet with a cumin call. The following 14 nodes already did not have the symlink:

aux-k8s-worker1002.eqiad.wmnet,dse-k8s-worker[1001,1006-1007].eqiad.wmnet,kubernetes[2011-2012,2014,2019].codfw.wmnet,kubernetes1006.eqiad.wmnet,kubestage1004.eqiad.wmnet,ml-serve[2003,2008].codfw.wmnet,ml-serve1005.eqiad.wmnet,ml-staging2001.codfw.wmnet

I think we are ok on this one now.
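
For completeness, a quick fleet-wide check for leftover symlinks could look something like this (a sketch, not the exact cumin invocation used; hosts printing nothing are clean):

# cumin 'P:kubernetes::node' 'ls -l /etc/systemd/system/cadvisor.service 2>/dev/null || true'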

Change 942420 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: finish cadvisor rollout on k8s-aux

https://gerrit.wikimedia.org/r/942420

Change 942420 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: finish cadvisor rollout on k8s-aux

https://gerrit.wikimedia.org/r/942420

Change 942426 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: complete cadvisor rollout on k8s

https://gerrit.wikimedia.org/r/942426

Change 942426 merged by JMeybohm:

[operations/puppet@production] hieradata: complete cadvisor rollout on k8s

https://gerrit.wikimedia.org/r/942426

JMeybohm claimed this task.

We would like to have the data to make a more informed decision in T277876, so I took the liberty of rolling out the remaining patch after discussing with @fgiunchedi. Manual validation on wikikube, dse and ml nodes looks fine.