We have a fairly elaborate cgroup config layout for Trusty that depends on cgrulesd and related things. All of that went away with systemd. It can theoretically still work as a delegated service, but that's not a good idea and is prone to breakage. Systemd-based cgroup configs need to be built that enforce similar restrictions to keep the bastions working well.
Description
Details
Status | Assigned | Task
---|---|---
Resolved | Bstorm | T199271 Upgrade the tools gridengine system
Resolved | Bstorm | T204530 cloudvps: tools and toolsbeta trusty deprecation
Resolved | Bstorm | T200557 Create a stretch and Son of Grid Engine grid in toolsbeta
Resolved | aborrero | T210098 Port cgroup restrictions and definitions to systemd/stretch for bastions
Resolved | aborrero | T215154 Toolforge: sgebastion: systemd resource control not working
Invalid | None | T218720 Add a blkio restriction to bastion cgroups
Event Timeline
This is the part I'm talking about, to be specific on this ticket. Actually using the cgred module breaks things badly and basically required rebuilding the bastion, so we don't want to re-enable any of that commented-out config. That's just what I altered slightly to make the rules "stretch-friendly"; they just don't work as-is. I kept it as a comment for reference while working on it.
@Bstorm is right. I've been doing a bit of investigation. The way we are using cgred is to dynamically add resource control limits to arbitrary processes (utilities, shells, etc). It works by rule matching: if a process matches a rule, it gets added to the corresponding cgroup. For this, there is a daemon running that scans the running process list for matches. Apparently, there is no equivalent to this workflow in the systemd world (and they consider cgred obsolete anyway, mmm).
The closest we may get to that workflow is by using slices, but I'm not sure how we could do the process matching. Still investigating.
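For context, the cgred workflow being replaced matches processes against rules in `cgrules.conf` and moves them into cgroups. A minimal illustrative rules file (hypothetical entries, not the actual Toolforge config) looks like:

```
# /etc/cgrules.conf — consumed by the cgred daemon (cgrulesengd)
# <user>[:<process>]  <controllers>   <destination cgroup>
*:stress              cpu,memory      throttled/
someuser              memory          limited/
```

This is exactly the per-process matching that has no direct counterpart in systemd slices.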
Mentioned in SAL (#wikimedia-cloud) [2018-11-26T13:25:37Z] <arturo> T210098 install systemd239 from stretch-backports and restart VM
Mentioned in SAL (#wikimedia-cloud) [2018-11-26T13:25:59Z] <arturo> T210098 VM=toolsbeta-sgebastion-03
Ok, I have a native systemd solution that may fulfill our use case (kind of) to set resource limits per user.
- install systemd 239 from stretch-backports (since we need this commit https://github.com/poettering/systemd/commit/deb94ad51567fbc51f227fe5ed44f123fe27f01c and 239 is the only version in Debian that includes it)
- create a file /etc/systemd/system/user-.slice.d/limits.conf, with slice configuration that will be applied per user, for example:

```
[Slice]
CPUAccounting=true
CPUQuota=10%
MemoryAccounting=true
MemoryLimit=600M
```

- run `systemctl daemon-reload`
- check new limits:

```
$ sudo systemctl status user-18194.slice
● user-18194.slice - User Slice of UID 18194
   Loaded: loaded
  Drop-In: /lib/systemd/system/user-.slice.d
           └─10-defaults.conf
           /etc/systemd/system/user-.slice.d
           └─limits.conf
   Active: active since Mon 2018-11-26 13:32:36 UTC; 22s ago
    Tasks: 8 (limit: 10383)
   Memory: 20.3M (limit: 600.0M)
      CPU: 230ms
[...]
```
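The steps above, collected into copy-pasteable commands (a sketch; the backports invocation and the 10% / 600M figures are the example values from this comment, not final policy):

```
# systemd 239 from stretch-backports (per-UID user-.slice.d drop-ins need it)
sudo apt-get install -t stretch-backports systemd

# drop-in applied to every user-<UID>.slice
sudo mkdir -p /etc/systemd/system/user-.slice.d
sudo tee /etc/systemd/system/user-.slice.d/limits.conf <<'EOF'
[Slice]
CPUAccounting=true
CPUQuota=10%
MemoryAccounting=true
MemoryLimit=600M
EOF

sudo systemctl daemon-reload

# verify against a logged-in user's slice, e.g. UID 18194
systemctl status user-18194.slice
```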
Now that we know a way for systemd to constrain resources per user, we may need to translate all the old policies we had in modules/toollabs/manifests/bastion.pp to this new scheme.
As you can see, this solution implements per-user global resource control, as opposed to per-user per-process control (i.e., user X running PHP, user Y running Python).
I don't think losing a bit of granularity is a big deal in this case. What is a big deal is that this approach only works in Debian stretch running systemd from stretch-backports.
Change 476003 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: introduce bastion systemd-based resource control
Change 476003 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: introduce bastion systemd-based resource control
Change 476007 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: resource control: fix typo
Change 476007 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: resource control: fix typo
We may need an explicit override for user-0.slice (root), otherwise root gets the same limits applied, which makes running the puppet agent, for example, very painful.
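Such an override would be a more specific drop-in for root's slice; a sketch of what it could look like (illustrative values only, the merged patch may differ — an empty assignment resets a previously set limit):

```
# /etc/systemd/system/user-0.slice.d/limits.conf (hypothetical root override)
[Slice]
CPUQuota=
MemoryLimit=infinity
```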
BTW, the OOM killer really does kick in when a user tries to break the memory limit :-)
```
[Tue Nov 27 12:57:12 2018] stress invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=0
[Tue Nov 27 12:57:12 2018] stress cpuset=/ mems_allowed=0
[Tue Nov 27 12:57:12 2018] CPU: 0 PID: 26425 Comm: stress Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u6
[Tue Nov 27 12:57:12 2018] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.9.3-20161116_142049-atsina 04/01/2014
[Tue Nov 27 12:57:12 2018]  0000000000000000 ffffffffb0331e54 ffffa3bec1c63dd8 ffff98a086611000
[Tue Nov 27 12:57:12 2018]  ffffffffb0205270 0000000000000000 0000000000000000 ffff98a0b68f6000
[Tue Nov 27 12:57:12 2018]  0000000000000000 ffffffffb0ef26b0 ffffffffb01fdadc 00000004ffffffff
[Tue Nov 27 12:57:12 2018] Call Trace:
[Tue Nov 27 12:57:12 2018]  [<ffffffffb0331e54>] ? dump_stack+0x5c/0x78
[Tue Nov 27 12:57:12 2018]  [<ffffffffb0205270>] ? dump_header+0x78/0x1fd
[Tue Nov 27 12:57:12 2018]  [<ffffffffb01fdadc>] ? mem_cgroup_scan_tasks+0xcc/0x100
[Tue Nov 27 12:57:12 2018]  [<ffffffffb0185d8a>] ? oom_kill_process+0x21a/0x3e0
[Tue Nov 27 12:57:12 2018]  [<ffffffffb0186221>] ? out_of_memory+0x111/0x470
[Tue Nov 27 12:57:12 2018]  [<ffffffffb01f8cd9>] ? mem_cgroup_out_of_memory+0x49/0x80
[Tue Nov 27 12:57:12 2018]  [<ffffffffb01fe495>] ? mem_cgroup_oom_synchronize+0x325/0x340
[Tue Nov 27 12:57:12 2018]  [<ffffffffb01f9860>] ? mem_cgroup_css_reset+0xd0/0xd0
[Tue Nov 27 12:57:12 2018]  [<ffffffffb01865af>] ? pagefault_out_of_memory+0x2f/0x80
[Tue Nov 27 12:57:12 2018]  [<ffffffffb00614ad>] ? __do_page_fault+0x4bd/0x4f0
[Tue Nov 27 12:57:12 2018]  [<ffffffffb06113d2>] ? schedule+0x32/0x80
[Tue Nov 27 12:57:12 2018]  [<ffffffffb0617048>] ? async_page_fault+0x28/0x30
[Tue Nov 27 12:57:12 2018] Task in /user.slice/user-18194.slice killed as a result of limit of /user.slice/user-18194.slice
[Tue Nov 27 12:57:12 2018] memory: usage 153600kB, limit 153600kB, failcnt 53755
[Tue Nov 27 12:57:12 2018] memory+swap: usage 675988kB, limit 9007199254740988kB, failcnt 0
[Tue Nov 27 12:57:12 2018] kmem: usage 3632kB, limit 9007199254740988kB, failcnt 0
[Tue Nov 27 12:57:12 2018] Memory cgroup stats for /user.slice/user-18194.slice: cache:0KB rss:149968KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:4KB swap:522388KB inactive_anon:74944KB active_anon:74916KB inactive_file:0KB active_file:0KB unevictable:0KB
[Tue Nov 27 12:57:12 2018] [ pid ]   uid  tgid total_vm   rss nr_ptes nr_pmds swapents oom_score_adj name
[Tue Nov 27 12:57:12 2018] [20461]     0 20461    27614  1769      58       3       21             0 sshd
[Tue Nov 27 12:57:12 2018] [20488] 18194 20488    27614  1089      54       3       50             0 sshd
[Tue Nov 27 12:57:12 2018] [20489] 18194 20489     8836   764      16       3      808             0 bash
[Tue Nov 27 12:57:12 2018] [23111] 18194 23111    17272  1590      37       4      266             0 systemd
[Tue Nov 27 12:57:12 2018] [23113] 18194 23113    63835   516      55       3       85             0 (sd-pam)
[Tue Nov 27 12:57:12 2018] [26423] 18194 26423     1819   200       9       3       22             0 stress
[Tue Nov 27 12:57:12 2018] [26424] 18194 26424     1819     4       8       3       20             0 stress
[Tue Nov 27 12:57:12 2018] [26425] 18194 26425   263964 19239     168       4    62207             0 stress
[Tue Nov 27 12:57:12 2018] [26426] 18194 26426     1819     4       8       3       20             0 stress
[Tue Nov 27 12:57:12 2018] [26427] 18194 26427   263964 18022     176       4    67384             0 stress
[Tue Nov 27 12:57:12 2018] Memory cgroup out of memory: Kill process 26427 (stress) score 505 or sacrifice child
[Tue Nov 27 12:57:12 2018] Killed process 26427 (stress) total-vm:1055856kB, anon-rss:71876kB, file-rss:212kB, shmem-rss:0kB
[Tue Nov 27 12:57:12 2018] oom_reaper: reaped process 26427 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```
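For reference, an OOM like the one in the log can be provoked from a shell inside the limited slice with the `stress` utility (a hypothetical invocation; the exact flags used for this test are not recorded):

```
# allocate more anonymous memory than the slice's MemoryLimit allows
stress --vm 2 --vm-bytes 512M --timeout 30s
```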
Change 476012 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: bastion: split slice config for root/users
Change 476012 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: bastion: split slice config for root/users
Ok, I think we are all set for now. The implementation looks elegant and robust, but we may need to fine-tune the limits as we go live with the new Toolforge.
@Bstorm feel free to reopen if you think we need further work on this.
Change 553122 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Introduce systemd cgroup memory limits for stat1004