
deployment_server bullseye - mw-cgroup.service: Failed
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Create a bullseye VM in cloud VPS and apply the puppet role deployment_server (and the needed Hiera data)

What happens?:

When the puppet agent runs, among other issues, systemd fails to start the service mw-cgroup due to "/sys/fs/cgroup/memory/release_agent: Permission denied"

May 01 22:23:36 deploy-1006 systemd[1]: Starting "Mw-cgroup"...
May 01 22:23:36 deploy-1006 mw-cgroup[735545]: /bin/bash: line 1: /sys/fs/cgroup/memory/release_agent: Permission denied
May 01 22:23:36 deploy-1006 systemd[1]: mw-cgroup.service: Main process exited, code=exited, status=1/FAILURE
May 01 22:23:36 deploy-1006 systemd[1]: mw-cgroup.service: Failed with result 'exit-code'.
May 01 22:23:36 deploy-1006 systemd[1]: Failed to start "Mw-cgroup".

What should have happened instead?:

The service mw-cgroup should start normally (or puppet should not try to start it).

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Debian bullseye

Other information (browser name/version, screenshots, etc.):

https://gerrit.wikimedia.org/r/c/operations/puppet/+/991347/5/modules/mediawiki/manifests/cgroup.pp seems related, but does the grub::bootparam not apply in Cloud VPS? Is this only an issue in Cloud VPS and not in production?

Event Timeline

I now found T325228#9445729 which seems like the exact same issue on snapshot hosts.

Mentioned in SAL (#wikimedia-cloud) [2024-05-01T22:45:10Z] <mutante> rebooting deploy-1006 to see if issue with starting mw-cgroup goes away - .. and it did! - it must have been the grub config from https://gerrit.wikimedia.org/r/c/operations/puppet/+/991347 but needed one reboot - T363957

I rebooted the VM and the issue went away! The grub config from the change above was applied apparently:

root@deploy-1006:/# grep -Eo systemd.unified_cgroup_hierarchy=0 /boot/grub/grub.cfg 
systemd.unified_cgroup_hierarchy=0
Notice: /Stage[main]/Mediawiki::Cgroup/Base::Service_unit[mw-cgroup]/Service[mw-cgroup]/ensure: ensure changed 'stopped' to 'running'

You just have to know that one extra reboot is needed after applying certain roles.
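A quick way to tell whether a host is still waiting for that reboot is to compare grub.cfg with the command line of the running kernel. The following is only a rough sketch: the function name cgroup_param_state and its output labels are made up for illustration, and it assumes the parameter shows up in /proc/cmdline once the host has booted with the new grub config.

```shell
# Sketch: report whether the cgroup boot parameter from grub.cfg is already
# active in the running kernel. The function name and the three output
# labels are hypothetical, chosen for this example only.
PARAM='systemd.unified_cgroup_hierarchy=0'

# Usage: cgroup_param_state [grub_cfg] [cmdline_file]
# Prints one of: active, reboot-needed, not-configured
cgroup_param_state() {
    local grub_cfg="${1:-/boot/grub/grub.cfg}"
    local cmdline="${2:-/proc/cmdline}"

    if grep -qF -- "$PARAM" "$cmdline" 2>/dev/null; then
        # The running kernel was booted with the parameter.
        echo active
    elif grep -qF -- "$PARAM" "$grub_cfg" 2>/dev/null; then
        # Puppet has written the grub config, but the host has not
        # rebooted since, so the parameter is not in effect yet.
        echo reboot-needed
    else
        # The parameter is in neither place.
        echo not-configured
    fi
}
```

Calling cgroup_param_state with no arguments checks the default paths; a result of reboot-needed would correspond to the state deploy-1006 was in before the reboot.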

Thanks for documenting this. I ran into the same thing in deployment prep (T327742), and a reboot also fixed it there.