cp2041 and cp2042 have been upgraded to bullseye; we should validate them before upgrading the whole fleet.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | BCornwall | T263797 Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') |
| Resolved | | ssingh | T321309 Upgrade Traffic hosts to bullseye |
| Resolved | | Vgutierrez | T325557 Review cp2041 and cp2042 running bullseye |
Event Timeline
- Monitoring issue: CPU seconds for haproxy, varnish, and ATS are reported as 0 on bullseye hosts: https://grafana.wikimedia.org/goto/eCGKNUc4k?orgId=1; impacted metric name: container_cpu_system_seconds_total
Thanks for observing this issue! I have a theory that https://github.com/google/cadvisor/pull/2868 might help alleviate this; if true, we should cherry-pick this commit and apply it to our own build of cadvisor's Debian package (which does not exist yet but can be created if required). I base this on the following:
- The errors we are observing are on the recently upgraded bullseye hosts and those two hosts also report this in the cadvisor journalctl log:
Dec 20 23:30:46 cp2042 cadvisor[1065]: W1220 23:30:46.748410 1065 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/prometheus-debian-version-textfile.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 21 07:53:46 cp2042 cadvisor[1065]: W1221 07:53:46.749986 1065 prometheus.go:1824] Couldn't get containers: partial failures: ["/user.slice/user-499.slice/user@499.service/app.slice": containerDataToContainerInfo: unable to find data in memory cache]
Dec 22 03:53:41 cp2042 cadvisor[1065]: W1222 03:53:41.836080 1065 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/systemd-timedated.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 22 14:13:57 cp2042 cadvisor[1065]: W1222 14:13:57.836567 1065 container.go:448] Failed to get RecentStats("/system.slice/systemd-timedated.service") while determining the next housekeeping: unable to find data in memory cache
Dec 04 01:03:00 cp2041 cadvisor[981]: W1204 01:03:00.407409 981 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/track_vcache_fds.service": containerDataToContainerInfo: unable to find data in memory cache], ["/system.slice/prometheus_puppet_agent_stats.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 04 01:26:00 cp2041 cadvisor[981]: W1204 01:26:00.411511 981 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/confd_prometheus_metrics.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 04 01:36:00 cp2041 cadvisor[981]: W1204 01:36:00.079629 981 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/prometheus_varnishd_mmap_count.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 04 01:54:00 cp2041 cadvisor[981]: W1204 01:54:00.078297 981 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/prometheus_varnishd_mmap_count.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 04 02:35:00 cp2041 cadvisor[981]: W1204 02:35:00.411512 981 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/track_vcache_fds.service": containerDataToContainerInfo: unable to find data in memory cache]
A look at the journalctl output of the other hosts (not cp2041 or cp2042) does not turn up this error. (Going by the error, that is expected, but it is good additional confirmation that the mismatch between the buster and bullseye versions might be leading to it.)
- Comparing the version of cadvisor on the current buster hosts (0.35.0+ds1-4) with the version on bullseye (0.38.7+ds1-2+b7), it seems this fix was only backported to 0.39.x, which is why it hasn't made it to 0.38.x yet.
But that's beside the point in a way: the PR above has been referenced from similar issues. Looking at the code itself (https://github.com/google/cadvisor/pull/2868/commits/655773dc5ee9cbf48fe3f629f4e724c75dcd569c):
if err != nil {
    // Ignore the error because of race condition and return best-effort result.
    if err == memory.ErrDataNotFound {
        klog.Warningf("Error getting data for container %s because of race condition", name)
        continue
    }
    return nil, err
}

This seems to match the errors we are getting, and it suggests a possible fix: attempt a best-effort result by handling the memory error and returning whatever is possible. What doesn't make me happy, though, is that I still don't have the smoking gun as to where or what exactly changed between these versions, or in the OS upgrade itself, that causes this issue on bullseye but not buster. That might require an even closer look at cadvisor's code, and I haven't done that yet.
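As a standalone illustration of that best-effort pattern (a sketch only; all names below are hypothetical, not cadvisor's actual types): entries whose cached data has raced away are skipped with a warning instead of failing the whole collection.

```go
package main

import (
	"errors"
	"fmt"
)

// errDataNotFound stands in for cadvisor's memory.ErrDataNotFound.
var errDataNotFound = errors.New("unable to find data in memory cache")

// collectStats gathers per-container stats best-effort: containers whose
// data raced out of the in-memory cache are skipped with a warning rather
// than failing the whole scrape; any other error is still fatal.
func collectStats(names []string, lookup func(string) (float64, error)) (map[string]float64, error) {
	out := make(map[string]float64)
	for _, name := range names {
		v, err := lookup(name)
		if err != nil {
			if errors.Is(err, errDataNotFound) {
				fmt.Printf("W: skipping %s: %v\n", name, err)
				continue
			}
			return nil, err
		}
		out[name] = v
	}
	return out, nil
}

func main() {
	lookup := func(name string) (float64, error) {
		if name == "/system.slice/systemd-timedated.service" {
			return 0, errDataNotFound
		}
		return 1.5, nil
	}
	stats, err := collectStats(
		[]string{"/system.slice/a.service", "/system.slice/systemd-timedated.service"},
		lookup)
	fmt.Println(len(stats), err) // the raced container is simply absent from the result
}
```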
Anyway, that's my summary and I could use a quick review/confirmation before we build 0.38.7-wm1 :)
After an hour of writing the above and digging around some more, I am even less convinced of my own theory about the cause of this issue. The main reasons are that I haven't found what actually changed, and that only one metric seems to be affected: container_cpu_system_seconds_total. Everything else looks fine, so perhaps it's wiser to spend time there.
I should have been clearer: I think it's worthwhile investigating the discrepancy between the cadvisor logs on buster vs. bullseye noted above, but I am even less convinced now that it is responsible for the missing metric.
OK, I think I finally found the issue and also confirmed the fix on traffic-cache-bullseye.traffic.eqiad1.wikimedia.cloud. The TL;DR is that Debian bullseye enables cgroup v2 by default whereas buster didn't, and that breaks cadvisor; the cadvisor fix is https://github.com/google/cadvisor/pull/3030. Various cadvisor issues on GitHub simply suggest upgrading to the latest version without mentioning the underlying problem (example: https://github.com/google/cadvisor/issues/3073), so hopefully this is it; I am writing it down in case it is helpful for someone else searching for this issue.
Continuing from the long comment above (T325557#8495946): after writing that and continuing to look into this, I realized we were missing some other metrics as well (from localhost:4194/metrics). Example:
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-fetcherr.service"} 0 1672857165461
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-hospital.service"} 0 1672857153306
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-slowlog.service"} 0 1672857175977

And from the cadvisor journald output, I noticed another thing:
Jan 04 18:31:58 traffic-cache-bullseye cadvisor[14522]: W0104 18:31:58.436137 14522 manager.go:159] Cannot detect current cgroup on cgroup v2
Which is just a warning in the code in manager/manager.go:
if cgroups.IsCgroup2UnifiedMode() {
    klog.Warningf("Cannot detect current cgroup on cgroup v2")
} else {

But this warning was not present in cadvisor's journald output on the buster hosts, and that made me wonder whether something changed in the cgroup setup. Indeed it did: bullseye ships with cgroup v2 by default, and cadvisor fails to get the per-process metrics there. To confirm this theory, on traffic-cache-bullseye.traffic I modified /etc/default/grub to add systemd.unified_cgroup_hierarchy=0, rebooted, and could then get the metrics that had been failing for me earlier. (Setting systemd.unified_cgroup_hierarchy to 0 enables cgroup v1.)
Since we cannot and should not revert to cgroup v1 in the kernel, thankfully there is a cadvisor fix for cgroup v2, which uses the correct unified cgroup path. We will need to build cadvisor with this patch to fix the issue.
We cannot simply backport this patch, though: it doesn't apply cleanly, and the code in container/common/helpers.go has changed significantly between the version in Debian bullseye and master. So we should either build and use 0.46.0 (the latest), or find a clean way to apply https://github.com/google/cadvisor/commit/a18098e9376b0c06ea1806d6558aabd4dd5909f3 to 0.38.7+ds1-2.
Metrics (default on bullseye, cgroup v2):

container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-fetcherr.service"} 0 1672859971947
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-hospital.service"} 0 1672859955268
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-slowlog.service"} 0 1672859966965
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend.service"} 0 1672859971858

Metrics after setting cgroup to v1 with systemd.unified_cgroup_hierarchy=0 on the same bullseye host:

container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-fetcherr.service"} 0.13 1672860161800
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-hospital.service"} 0.13 1672860161086
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-slowlog.service"} 0.1 1672860161580
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend.service"} 1.73 1672860161679

https://salsa.debian.org/systemd-team/systemd/-/commit/170fb124a32884bd9975ee4ea9e1ffbbc2ee26b4

-        -Ddefault-hierarchy=hybrid \
+        -Ddefault-hierarchy=unified \
systemd.unified_cgroup_hierarchy=0 enables systemd's "hybrid" mode, meaning that both v1 and v2 are enabled. Systemd makes its opinion very clear at https://systemd.io/CGROUP_DELEGATION/:
This mode is a stopgap. Don’t bother with this too much unless you have too much free time. To say this clearly, legacy and hybrid modes have no future. If you develop software today and don’t focus on the unified mode, then you are writing software for yesterday, not tomorrow. They are primarily supported for compatibility reasons and will not receive new features. Sorry.
While Debian may have switched to Unified in recent releases, hybrid is still a viable solution to provide the legacy interface until upstream decides to get their act together. :)
As anecdata, I used to use hybrid some years ago since Docker was still on v1 while utilizing other programs that used v2 and it was simple and reliable to just set systemd.unified_cgroup_hierarchy=0. It might be easiest to just set that rather than build our own package.
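For completeness, the hybrid-mode toggle is a one-line kernel command line change; a sketch only, as the existing GRUB_CMDLINE_LINUX contents vary per host:

```
# /etc/default/grub (append to the existing kernel command line)
GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=0"
# then regenerate the grub config and reboot:
#   update-grub && reboot
```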
Thanks for sharing the link and the further context! systemd's documentation is pretty strongly against using legacy/hybrid mode, though, so I am not sure that is the right path forward versus just updating cadvisor and letting it properly use v2. But it's a good point of discussion; let's see if someone else has a better idea here.
Another data point: It seems that some metrics are lost when moving to v2:
https://github.com/google/cadvisor/issues/3062
On nodes running cgroup v1 the following metrics such as container_cpu_cfs_throttled_* are returned. If cadvisor is run on nodes with cgroup v2 enabled those metrics are not returned.
and
[…] Changing back to cgroup v1 restores also other metrics like container_memory_max_usage_bytes
https://bugzilla.redhat.com/show_bug.cgi?id=1796543 also shows that cpuacct.usage_percpu is lost.
Thanks, I came across the same when researching this issue. I don't think we are using any of these metrics -- I may be wrong -- and also for the ones that are lost, it seems like we can't do anything about those anyway (and cadvisor does take that into account, so no issue there).
Filed a bug report with Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1027994
Update from the maintainer: the package is no longer being maintained in Debian so we will build our own.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1027994#10
On Fri, Jan 6, 2023 at 2:00 AM Sukhbir Singh
<ssingh+debian@wikimedia.org> wrote:
> a new release of the cadvisor package is warranted.

The current version in sid is broken for a long time. And it won't
make it to bookworm. I no longer use cadvisor. So it really needs a new
maintainer.

Shengjing Zhu
Change 880529 had a related patch set uploaded (by Ssingh; author: Ssingh):
[integration/config@master] zuul: configure CI for operations/debs/cadvisor
Change 880530 had a related patch set uploaded (by Ssingh; author: Ssingh):
[operations/debs/cadvisor@master] Release 0.44.0+ds1-1
Change 880529 merged by jenkins-bot:
[integration/config@master] Zuul: [operations/debs/cadvisor] Add debian-glue-unstable CI
Mentioned in SAL (#wikimedia-releng) [2023-01-17T00:21:52Z] <James_F> Zuul: [operations/debs/cadvisor] Add debian-glue-unstable CI for T325557
Change 880530 merged by Ssingh:
[operations/debs/cadvisor@master] Release 0.44.0+ds1-1
Thanks to Faidon's suggestion of building against 0.44.0 and not 0.46.0, we have a working cadvisor 0.44.0 build for bullseye/sid, which has been merged above.
Mentioned in SAL (#wikimedia-operations) [2023-01-17T16:32:32Z] <sukhe> reprepro --ignore=wrongdistribution -C main include bullseye-wikimedia cadvisor_0.44.0+ds1-1_amd64.changes: T325557
Change 881689 had a related patch set uploaded (by Ssingh; author: Ssingh):
[operations/debs/cadvisor@master] Release 0.44.0+ds1-2
Change 881689 merged by Ssingh:
[operations/debs/cadvisor@master] Release 0.44.0+ds1-1~wmf1
Mentioned in SAL (#wikimedia-operations) [2023-01-20T13:00:42Z] <sukhe> reprepro --ignore=wrongdistribution -C main include bullseye-wikimedia cadvisor_0.44.0+ds1-1~wmf1_amd64.changes: T325557
sukhe@cumin2002:~$ sudo cumin 'A:cp and A:bullseye'
3 hosts will be targeted:
cp[2041-2042].codfw.wmnet,cp5032.eqsin.wmnet
All three current bullseye hosts have been upgraded to the version of cadvisor that fixes the collection of metrics.
On cp2041 (example):
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-fetcherr.service"} 10926.512945 1674230561511
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-hospital.service"} 467.054097 1674230561359
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-slowlog.service"} 10728.920982 1674230560076
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend.service"} 629772.592211 1674230561345

Both cp2041 and cp2042 look good to me. I haven't found any reason that would prevent upgrading to bullseye.