Review cp2041 and cp2042 running bullseye
Closed, Resolved, Public

Description

cp2041 and cp2042 have been upgraded to bullseye; we should validate them before upgrading the whole fleet.

Event Timeline

Vgutierrez triaged this task as Medium priority.
Vgutierrez moved this task from Backlog to Actively Servicing on the Traffic board.

Thanks for reporting this issue! I have a theory that https://github.com/google/cadvisor/pull/2868 might help alleviate this; if so, we should cherry-pick that commit into our own build of cadvisor's Debian package (which does not exist yet but can be created if required). I base this on the following:

  1. The errors we are observing are on the recently upgraded bullseye hosts and those two hosts also report this in the cadvisor journalctl log:
Dec 20 23:30:46 cp2042 cadvisor[1065]: W1220 23:30:46.748410    1065 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/prometheus-debian-version-textfile.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 21 07:53:46 cp2042 cadvisor[1065]: W1221 07:53:46.749986    1065 prometheus.go:1824] Couldn't get containers: partial failures: ["/user.slice/user-499.slice/user@499.service/app.slice": containerDataToContainerInfo: unable to find data in memory cache]
Dec 22 03:53:41 cp2042 cadvisor[1065]: W1222 03:53:41.836080    1065 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/systemd-timedated.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 22 14:13:57 cp2042 cadvisor[1065]: W1222 14:13:57.836567    1065 container.go:448] Failed to get RecentStats("/system.slice/systemd-timedated.service") while determining the next housekeeping: unable to find data in memory cache
Dec 04 01:03:00 cp2041 cadvisor[981]: W1204 01:03:00.407409     981 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/track_vcache_fds.service": containerDataToContainerInfo: unable to find data in memory cache], ["/system.slice/prometheus_puppet_agent_stats.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 04 01:26:00 cp2041 cadvisor[981]: W1204 01:26:00.411511     981 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/confd_prometheus_metrics.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 04 01:36:00 cp2041 cadvisor[981]: W1204 01:36:00.079629     981 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/prometheus_varnishd_mmap_count.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 04 01:54:00 cp2041 cadvisor[981]: W1204 01:54:00.078297     981 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/prometheus_varnishd_mmap_count.service": containerDataToContainerInfo: unable to find data in memory cache]
Dec 04 02:35:00 cp2041 cadvisor[981]: W1204 02:35:00.411512     981 prometheus.go:1824] Couldn't get containers: partial failures: ["/system.slice/track_vcache_fds.service": containerDataToContainerInfo: unable to find data in memory cache]

A look at the journalctl output of the other hosts (not cp2041 or cp2042) does not turn up this error. (Going by the error this is expected, but it is a good additional confirmation that the mismatch between the buster and bullseye versions might be behind it.)

  2. On inspecting this error and comparing the version of cadvisor on the current buster hosts (0.35.0+ds1-4) with the version on bullseye (0.38.7+ds1-2+b7), it seems this fix landed in 0.39.x, which is why it hasn't made it to 0.38.x yet.
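As an aside, a quick way to read the installed cadvisor version on each host could look like the sketch below. (This is not from the original investigation; the exact invocation is mine, and it falls back gracefully on systems without dpkg.)

```shell
# Print the installed cadvisor package version, if any.
if command -v dpkg-query >/dev/null 2>&1; then
    dpkg-query -W -f='cadvisor ${Version}\n' cadvisor 2>/dev/null \
        || echo "cadvisor not installed"
else
    echo "dpkg not available on this system"
fi
```

Running this on a buster host vs. a bullseye host would show the 0.35.0 vs. 0.38.7 split directly.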

But that's beside the point in a way; the PR above has been referenced from similar issues. Looking at the code itself (https://github.com/google/cadvisor/pull/2868/commits/655773dc5ee9cbf48fe3f629f4e724c75dcd569c):

		if err != nil {
			// Ignore the error because of race condition and return best-effort result.
			if err == memory.ErrDataNotFound {
				klog.Warningf("Error getting data for container %s because of race condition", name)
				continue
			}
			return nil, err
		}

Which seems to match the errors we are getting, and suggests a possible fix: handle the memory error and return a best-effort result. What doesn't make me happy, though, is that I still don't have the smoking gun as to what exactly changed between these versions, or in the OS upgrade itself, that causes this issue on bullseye but not on buster. Answering that might require an even closer look at cadvisor's code, and I haven't done that yet.

Anyway, that's my summary and I could use a quick review/confirmation before we build 0.38.7-wm1 :)

After an hour of writing the above and digging around some more, I am even less convinced of my own theory about the cause of this issue. The main reasons are that I haven't found an explanation for the change, and that only one metric seems to be affected: container_cpu_system_seconds_total. Everything else looks fine, so it's perhaps wiser to spend time there.

I should have been clearer: I think it's worthwhile investigating the discrepancy between the cadvisor logs on buster vs. bullseye noted above, but I am even less convinced now that it is responsible for the missing metric.

OK, I think I finally found the issue and also confirmed the fix on traffic-cache-bullseye.traffic.eqiad1.wikimedia.cloud. The TL;DR is that Debian bullseye defaults to cgroup v2 whereas buster didn't, and that breaks cadvisor; the cadvisor fix is https://github.com/google/cadvisor/pull/3030. Various issues on the cadvisor GitHub repository simply talk about upgrading to the latest version without naming the underlying problem (example: https://github.com/google/cadvisor/issues/3073), so hopefully this is it; I am writing it down in case it is helpful for someone else searching for this issue.
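For anyone wanting to check which hierarchy a given host is actually running, the filesystem type mounted at /sys/fs/cgroup tells you directly. (A minimal sketch, not from the original debugging; it assumes GNU coreutils stat.)

```shell
# Print the filesystem type at the cgroup mount point:
#   cgroup2fs -> unified cgroup v2 (the bullseye default)
#   tmpfs     -> legacy/hybrid layout with per-controller v1 mounts (buster)
stat -fc %T /sys/fs/cgroup/
```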

Continuing from the long comment above (T325557#8495946): after writing that and continuing to look into this, I realized we were missing some other metrics as well (from localhost:4194/metrics). Example:

container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-fetcherr.service"} 0 1672857165461
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-hospital.service"} 0 1672857153306
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-slowlog.service"} 0 1672857175977

And from the cadvisor journald output, I noticed another thing:

Jan 04 18:31:58 traffic-cache-bullseye cadvisor[14522]: W0104 18:31:58.436137   14522 manager.go:159] Cannot detect current cgroup on cgroup v2

Which is just a warning in the code in manager/manager.go:

if cgroups.IsCgroup2UnifiedMode() {
    klog.Warningf("Cannot detect current cgroup on cgroup v2")
} else {

But this was not present in cadvisor's journald output on the buster hosts, which made me wonder whether something had changed in the cgroup setup, and indeed it had: bullseye ships with cgroup v2 by default, and cadvisor fails to get the per-process metrics. To confirm this theory, on traffic-cache-bullseye.traffic I modified /etc/default/grub to add systemd.unified_cgroup_hierarchy=0 and rebooted, after which I could get the metrics that were failing for me earlier. (Setting systemd.unified_cgroup_hierarchy to 0 brings back the cgroup v1 hierarchy.)
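For the record, whether the override actually took effect can be verified on the running kernel rather than from memory (a small sketch; the grep pattern is mine, not from the original investigation):

```shell
# If systemd.unified_cgroup_hierarchy=0 was added via /etc/default/grub and
# update-grub + a reboot followed, it shows up in /proc/cmdline; otherwise
# the host is on the bullseye default (unified cgroup v2).
grep -o 'systemd\.unified_cgroup_hierarchy=[01]' /proc/cmdline \
    || echo "parameter not set: unified cgroup v2 default"
```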

Since we cannot and should not revert to cgroup v1 in the kernel, it is fortunate that there is a cadvisor fix for cgroup v2, which makes it use the correct unified cgroup path. We will need to build cadvisor with this patch to fix the issue.

We cannot simply backport this patch, though, as it doesn't apply cleanly: the code in container/common/helpers.go has changed significantly between the version in Debian bullseye and master. So we should either build and use 0.46.0 (the latest), or find a clean way to apply https://github.com/google/cadvisor/commit/a18098e9376b0c06ea1806d6558aabd4dd5909f3 to 0.38.7+ds1-2.

Metrics (default on bullseye, cgroup v2):

container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-fetcherr.service"} 0 1672859971947
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-hospital.service"} 0 1672859955268
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-slowlog.service"} 0 1672859966965
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend.service"} 0 1672859971858

Metrics after setting cgroup to v1 with systemd.unified_cgroup_hierarchy=0 on the same bullseye host:

container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-fetcherr.service"} 0.13 1672860161800
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-hospital.service"} 0.13 1672860161086
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-slowlog.service"} 0.1 1672860161580
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend.service"} 1.73 1672860161679

systemd.unified_cgroup_hierarchy=0 enables systemd's "hybrid" mode, meaning that both v1 and v2 are enabled. Systemd makes its opinion very clear at https://systemd.io/CGROUP_DELEGATION/:

This mode is a stopgap. Don’t bother with this too much unless you have too much free time. To say this clearly, legacy and hybrid modes have no future. If you develop software today and don’t focus on the unified mode, then you are writing software for yesterday, not tomorrow. They are primarily supported for compatibility reasons and will not receive new features. Sorry.

While Debian may have switched to Unified in recent releases, hybrid is still a viable solution to provide the legacy interface until upstream decides to get their act together. :)

As anecdata, I used to use hybrid some years ago since Docker was still on v1 while utilizing other programs that used v2 and it was simple and reliable to just set systemd.unified_cgroup_hierarchy=0. It might be easiest to just set that rather than build our own package.

Thanks for sharing the link and the further context! systemd's documentation argues pretty strongly against legacy/hybrid mode, so I am not sure that is the right path forward versus just updating cadvisor and letting it properly use v2. But it's a good point of discussion, so let's see if someone else has a better idea here.

Another data point: It seems that some metrics are lost when moving to v2:

https://github.com/google/cadvisor/issues/3062

On nodes running cgroup v1 the following metrics such as container_cpu_cfs_throttled_* are returned. If cadvisor is run on nodes with cgroup v2 enabled those metrics are not returned.

and

[…] Changing back to cgroup v1 restores also other metrics like container_memory_max_usage_bytes

https://bugzilla.redhat.com/show_bug.cgi?id=1796543 also shows that cpuacct.usage_percpu is lost.

Thanks, I came across the same while researching this issue. I don't think we are using any of these metrics (I may be wrong), and for the ones that are lost, it seems we can't do anything about them anyway (and cadvisor does take that into account, so no issue there).

Update from the maintainer: the package is no longer being maintained in Debian so we will build our own.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1027994#10

On Fri, Jan 6, 2023 at 2:00 AM Sukhbir Singh
<ssingh+debian@wikimedia.org> wrote:

a new release of the cadvisor package is warranted.

The current version in sid is broken for a long time. And it won't
make it to bookworm.

I no longer use cadvisor. So it really needs a new maintainer.

Shengjing Zhu

Change 880529 had a related patch set uploaded (by Ssingh; author: Ssingh):

[integration/config@master] zuul: configure CI for operations/debs/cadvisor

https://gerrit.wikimedia.org/r/880529

Change 880530 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/debs/cadvisor@master] Release 0.44.0+ds1-1

https://gerrit.wikimedia.org/r/880530

Change 880529 merged by jenkins-bot:

[integration/config@master] Zuul: [operations/debs/cadvisor] Add debian-glue-unstable CI

https://gerrit.wikimedia.org/r/880529

Mentioned in SAL (#wikimedia-releng) [2023-01-17T00:21:52Z] <James_F> Zuul: [operations/debs/cadvisor] Add debian-glue-unstable CI for T325557

Change 880530 merged by Ssingh:

[operations/debs/cadvisor@master] Release 0.44.0+ds1-1

https://gerrit.wikimedia.org/r/880530

Thanks to Faidon's suggestion of building against 0.44.0 and not 0.46.0, we have a working cadvisor 0.44.0 build for bullseye/sid, which has been merged above.

Mentioned in SAL (#wikimedia-operations) [2023-01-17T16:32:32Z] <sukhe> reprepro --ignore=wrongdistribution -C main include bullseye-wikimedia cadvisor_0.44.0+ds1-1_amd64.changes: T325557

Change 881689 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/debs/cadvisor@master] Release 0.44.0+ds1-2

https://gerrit.wikimedia.org/r/881689

Change 881689 merged by Ssingh:

[operations/debs/cadvisor@master] Release 0.44.0+ds1-1~wmf1

https://gerrit.wikimedia.org/r/881689

Mentioned in SAL (#wikimedia-operations) [2023-01-20T13:00:42Z] <sukhe> reprepro --ignore=wrongdistribution -C main include bullseye-wikimedia cadvisor_0.44.0+ds1-1~wmf1_amd64.changes: T325557

sukhe@cumin2002:~$ sudo cumin 'A:cp and A:bullseye' 
3 hosts will be targeted:
cp[2041-2042].codfw.wmnet,cp5032.eqsin.wmnet

All three current bullseye hosts have been upgraded to the version of cadvisor that fixes the collection of metrics.

On cp2041 (example):

container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-fetcherr.service"} 10926.512945 1674230561511
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-hospital.service"} 467.054097 1674230561359
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-slowlog.service"} 10728.920982 1674230560076
container_cpu_system_seconds_total{id="/system.slice/varnish-frontend.service"} 629772.592211 1674230561345

Both cp2041 and cp2042 look good to me. I haven't found any reason that would prevent upgrading to bullseye.

Looks like this can be resolved!