
deployment_server bullseye - mw-cgroup.service: Failed
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Create a bullseye VM in cloud VPS and apply the puppet role deployment_server (and the needed Hiera data)

What happens?:

When the puppet agent runs, among other issues, systemd fails to start the service mw-cgroup due to "/sys/fs/cgroup/memory/release_agent: Permission denied"

May 01 22:23:36 deploy-1006 systemd[1]: Starting "Mw-cgroup"...
May 01 22:23:36 deploy-1006 mw-cgroup[735545]: /bin/bash: line 1: /sys/fs/cgroup/memory/release_agent: Permission denied
May 01 22:23:36 deploy-1006 systemd[1]: mw-cgroup.service: Main process exited, code=exited, status=1/FAILURE
May 01 22:23:36 deploy-1006 systemd[1]: mw-cgroup.service: Failed with result 'exit-code'.
May 01 22:23:36 deploy-1006 systemd[1]: Failed to start "Mw-cgroup".

What should have happened instead?:

The service mw-cgroup should start normally (or puppet should not try to start it).

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Debian bullseye

Other information (browser name/version, screenshots, etc.):

https://gerrit.wikimedia.org/r/c/operations/puppet/+/991347/5/modules/mediawiki/manifests/cgroup.pp seems related, but does the grub::bootparam not apply in Cloud VPS? Is this only an issue in Cloud VPS and not in production?

Event Timeline

I now found T325228#9445729 which seems like the exact same issue on snapshot hosts.

Mentioned in SAL (#wikimedia-cloud) [2024-05-01T22:45:10Z] <mutante> rebooting deploy-1006 to see if issue with starting mw-cgroup goes away - .. and it did! - it must have been the grub config from https://gerrit.wikimedia.org/r/c/operations/puppet/+/991347 but needed one reboot - T363957

I rebooted the VM and the issue went away! The grub config from the change above had apparently already been applied, and the next puppet run after the reboot started the service:

root@deploy-1006:/# grep -Eo systemd.unified_cgroup_hierarchy=0 /boot/grub/grub.cfg 
systemd.unified_cgroup_hierarchy=0
Notice: /Stage[main]/Mediawiki::Cgroup/Base::Service_unit[mw-cgroup]/Service[mw-cgroup]/ensure: ensure changed 'stopped' to 'running'

You just have to know that one extra reboot is needed after applying certain roles, because a kernel parameter added to the grub config only takes effect at the next boot.
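
A minimal sketch of how one might check whether that reboot is still pending: compare the command line of the running kernel (/proc/cmdline) with the grub config. The helper name has_bootparam is hypothetical and is not part of the puppet change. The grub.cfg path is the one used in the grep above.

```shell
#!/bin/bash
# Hypothetical helper (not from the puppet repo): succeeds if FILE contains
# the kernel parameter PARAM as a whole word (fixed-string match).
has_bootparam() {
    local param="$1" file="$2"
    grep -qwF -- "$param" "$file"
}

PARAM='systemd.unified_cgroup_hierarchy=0'

if has_bootparam "$PARAM" /proc/cmdline; then
    # The running kernel was booted with the parameter.
    echo "active: the running kernel was booted with $PARAM"
elif has_bootparam "$PARAM" /boot/grub/grub.cfg 2>/dev/null; then
    # Puppet wrote the parameter to grub, but it is not in effect yet.
    echo "pending: $PARAM is in grub.cfg but not in /proc/cmdline, reboot needed"
else
    echo "missing: $PARAM is not configured"
fi
```

If this reports "pending" right after the first puppet run, the mw-cgroup failure above is expected to persist until the host is rebooted.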