
Port cgroup restrictions and definitions to systemd/stretch for bastions
Closed, ResolvedPublic

Description

We have a fairly elaborate cgroup config layout for Trusty that depends on the cgred rules daemon (cgrulesengd) and related tooling. All of that went away with systemd. It can theoretically still work as a delegated service, but that's not a good idea and is prone to breakage. Equivalent systemd-based cgroup configs need to be built to keep the bastions working well.

Event Timeline

Bstorm triaged this task as Medium priority. Nov 21 2018, 5:37 PM
Bstorm created this task.

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/toolforge/bastion.pp#233

To be specific, this is the part I'm talking about on this ticket. Actually using the cgred module breaks things badly and basically required rebuilding the bastion, so we don't want to re-enable any of that commented-out code. It's just what I altered slightly to make the rules "stretch-friendly"; they don't work as-is. I kept it as a comment for reference while working on this.

@Bstorm is right. I've been doing a bit of investigation. The way we use cgred is to dynamically apply resource-control limits to arbitrary processes (utilities, shells, etc.): a daemon scans the running process list, and any process that matches a rule gets moved into the corresponding cgroup. Apparently there is no equivalent for this workflow in the systemd world (and systemd considers cgred obsolete anyway, mmm).
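For reference, this is roughly the kind of rule cgred consumes (a sketch only, not our actual config; the cgroup names here are made up):

# /etc/cgrules.conf -- format: <user>[:<process>] <controllers> <destination>
# any user running php gets moved into a "throttled" cpu/memory cgroup
*:php        cpu,memory   throttled/
# everything else a user runs lands in a default cgroup
*            cpu,memory   users/

cgrulesengd watches for new processes and reclassifies them according to these rules; that after-the-fact classification step is exactly what has no native systemd counterpart.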

The closest we may get to that workflow is by using slices, but I'm not sure how we could do the process matching. Still investigating.
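For context: a slice is a systemd unit that groups processes into a cgroup subtree with resource settings attached. A process can be started inside one explicitly (the slice name here is hypothetical, just to illustrate):

sudo systemd-run --slice=limits.slice --scope stress --cpu 2

But nothing in systemd re-matches already-running processes by name the way cgrulesengd does; processes have to be placed in the right slice when they start.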

Mentioned in SAL (#wikimedia-cloud) [2018-11-26T13:25:37Z] <arturo> T210098 install systemd239 from stretch-backports and restart VM

Mentioned in SAL (#wikimedia-cloud) [2018-11-26T13:25:59Z] <arturo> T210098 VM=toolsbeta-sgebastion-03

Ok, I have a native systemd solution that may fulfill our use case (kind of) to set resource limits per user:
  • create the drop-in file /etc/systemd/system/user-.slice.d/limits.conf with this content:

[Slice]
CPUAccounting=true
CPUQuota=10%
MemoryAccounting=true
MemoryLimit=600M

  • run systemctl daemon-reload
  • check the new limits:
sudo systemctl status user-18194.slice
● user-18194.slice - User Slice of UID 18194
   Loaded: loaded
  Drop-In: /lib/systemd/system/user-.slice.d
           └─10-defaults.conf
           /etc/systemd/system/user-.slice.d
           └─limits.conf
   Active: active since Mon 2018-11-26 13:32:36 UTC; 22s ago
    Tasks: 8 (limit: 10383)
   Memory: 20.3M (limit: 600.0M)
      CPU: 230ms
[...]
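The values can also be confirmed as unit properties (property names as of systemd 239; output reconstructed from the values above):

sudo systemctl show user-18194.slice -p MemoryLimit -p CPUQuotaPerSecUSec
MemoryLimit=629145600
CPUQuotaPerSecUSec=100ms

629145600 bytes is 600M, and CPUQuota=10% is expressed as 100ms of CPU time per second.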

Now that we know a way for systemd to constrain resources per user, we may need to translate all the old policies we had in modules/toollabs/manifests/bastion.pp to this new scheme.
As you can see, this solution implements per-user global resource control, as opposed to per-user per-process control (e.g., user X running PHP, user Y running Python).
I don't think losing a bit of granularity is a big deal in this case. What is a big deal is that this approach only works on Debian stretch running systemd from stretch-backports.
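The "global per user" part is easy to see from inside a session: every process a user spawns shares the same per-UID slice (the controller number varies by host, but the path matches what the OOM log below shows):

$ grep memory /proc/self/cgroup
9:memory:/user.slice/user-18194.slice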

Change 476003 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: introduce bastion systemd-based resource control

https://gerrit.wikimedia.org/r/476003

Change 476003 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: introduce bastion systemd-based resource control

https://gerrit.wikimedia.org/r/476003

Change 476007 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: resource control: fix typo

https://gerrit.wikimedia.org/r/476007

Change 476007 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: resource control: fix typo

https://gerrit.wikimedia.org/r/476007

We may need an explicit override for user-0.slice (root), otherwise root gets the same limits applied, which makes running the puppet agent very painful, for example.
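A sketch of what such an override could look like (file name hypothetical; the real change is in the patch below). An empty CPUQuota= assignment resets the quota, and MemoryLimit=infinity lifts the memory cap:

# /etc/systemd/system/user-0.slice.d/99-root.conf
[Slice]
CPUQuota=
MemoryLimit=infinity

followed by systemctl daemon-reload.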

BTW, the OOM killer really does kick in when a user tries to break the memory limit :-)
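The test was presumably something along these lines (flags guessed from the stress processes in the log below); a few workers allocating memory quickly blow past the slice's limit:

stress --vm 4 --vm-bytes 256M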

[Tue Nov 27 12:57:12 2018] stress invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=0
[Tue Nov 27 12:57:12 2018] stress cpuset=/ mems_allowed=0
[Tue Nov 27 12:57:12 2018] CPU: 0 PID: 26425 Comm: stress Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u6
[Tue Nov 27 12:57:12 2018] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.9.3-20161116_142049-atsina 04/01/2014
[Tue Nov 27 12:57:12 2018]  0000000000000000 ffffffffb0331e54 ffffa3bec1c63dd8 ffff98a086611000
[Tue Nov 27 12:57:12 2018]  ffffffffb0205270 0000000000000000 0000000000000000 ffff98a0b68f6000
[Tue Nov 27 12:57:12 2018]  0000000000000000 ffffffffb0ef26b0 ffffffffb01fdadc 00000004ffffffff
[Tue Nov 27 12:57:12 2018] Call Trace:
[Tue Nov 27 12:57:12 2018]  [<ffffffffb0331e54>] ? dump_stack+0x5c/0x78
[Tue Nov 27 12:57:12 2018]  [<ffffffffb0205270>] ? dump_header+0x78/0x1fd
[Tue Nov 27 12:57:12 2018]  [<ffffffffb01fdadc>] ? mem_cgroup_scan_tasks+0xcc/0x100
[Tue Nov 27 12:57:12 2018]  [<ffffffffb0185d8a>] ? oom_kill_process+0x21a/0x3e0
[Tue Nov 27 12:57:12 2018]  [<ffffffffb0186221>] ? out_of_memory+0x111/0x470
[Tue Nov 27 12:57:12 2018]  [<ffffffffb01f8cd9>] ? mem_cgroup_out_of_memory+0x49/0x80
[Tue Nov 27 12:57:12 2018]  [<ffffffffb01fe495>] ? mem_cgroup_oom_synchronize+0x325/0x340
[Tue Nov 27 12:57:12 2018]  [<ffffffffb01f9860>] ? mem_cgroup_css_reset+0xd0/0xd0
[Tue Nov 27 12:57:12 2018]  [<ffffffffb01865af>] ? pagefault_out_of_memory+0x2f/0x80
[Tue Nov 27 12:57:12 2018]  [<ffffffffb00614ad>] ? __do_page_fault+0x4bd/0x4f0
[Tue Nov 27 12:57:12 2018]  [<ffffffffb06113d2>] ? schedule+0x32/0x80
[Tue Nov 27 12:57:12 2018]  [<ffffffffb0617048>] ? async_page_fault+0x28/0x30
[Tue Nov 27 12:57:12 2018] Task in /user.slice/user-18194.slice killed as a result of limit of /user.slice/user-18194.slice
[Tue Nov 27 12:57:12 2018] memory: usage 153600kB, limit 153600kB, failcnt 53755
[Tue Nov 27 12:57:12 2018] memory+swap: usage 675988kB, limit 9007199254740988kB, failcnt 0
[Tue Nov 27 12:57:12 2018] kmem: usage 3632kB, limit 9007199254740988kB, failcnt 0
[Tue Nov 27 12:57:12 2018] Memory cgroup stats for /user.slice/user-18194.slice: cache:0KB rss:149968KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:4KB swap:522388KB inactive_anon:74944KB active_anon:74916KB inactive_file:0KB active_file:0KB unevictable:0KB
[Tue Nov 27 12:57:12 2018] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[Tue Nov 27 12:57:12 2018] [20461]     0 20461    27614     1769      58       3       21             0 sshd
[Tue Nov 27 12:57:12 2018] [20488] 18194 20488    27614     1089      54       3       50             0 sshd
[Tue Nov 27 12:57:12 2018] [20489] 18194 20489     8836      764      16       3      808             0 bash
[Tue Nov 27 12:57:12 2018] [23111] 18194 23111    17272     1590      37       4      266             0 systemd
[Tue Nov 27 12:57:12 2018] [23113] 18194 23113    63835      516      55       3       85             0 (sd-pam)
[Tue Nov 27 12:57:12 2018] [26423] 18194 26423     1819      200       9       3       22             0 stress
[Tue Nov 27 12:57:12 2018] [26424] 18194 26424     1819        4       8       3       20             0 stress
[Tue Nov 27 12:57:12 2018] [26425] 18194 26425   263964    19239     168       4    62207             0 stress
[Tue Nov 27 12:57:12 2018] [26426] 18194 26426     1819        4       8       3       20             0 stress
[Tue Nov 27 12:57:12 2018] [26427] 18194 26427   263964    18022     176       4    67384             0 stress
[Tue Nov 27 12:57:12 2018] Memory cgroup out of memory: Kill process 26427 (stress) score 505 or sacrifice child
[Tue Nov 27 12:57:12 2018] Killed process 26427 (stress) total-vm:1055856kB, anon-rss:71876kB, file-rss:212kB, shmem-rss:0kB
[Tue Nov 27 12:57:12 2018] oom_reaper: reaped process 26427 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Change 476012 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: bastion: split slice config for root/users

https://gerrit.wikimedia.org/r/476012

Change 476012 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: bastion: split slice config for root/users

https://gerrit.wikimedia.org/r/476012

aborrero claimed this task.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Ok, I think we are all set for now. The implementation looks elegant and robust, but we may need to fine-tune the limits as we go live with the new Toolforge.
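For what it's worth, limits can also be adjusted on a live slice without editing drop-ins by hand (values here are only illustrative):

sudo systemctl set-property user-18194.slice MemoryLimit=800M
sudo systemctl set-property user-18194.slice CPUQuota=20%

set-property applies the change immediately and persists it as a drop-in, so it could be handy for experimenting before baking new values into puppet.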

@Bstorm feel free to reopen if you think we need further work on this.

Change 553122 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Introduce systemd cgroup memory limits for stat1004

https://gerrit.wikimedia.org/r/553122