
Toolforge: sgebastion: systemd resource control not working
Open, High, Public

Description

I just found this:

aborrero@tools-sgebastion-06:~$ sudo systemctl status user-.slice
● user-.slice
   Loaded: error (Reason: Invalid argument)
  Drop-In: /etc/systemd/system/user-.slice.d
           └─puppet-override.conf
   Active: inactive (dead)

Feb 04 11:14:26 tools-sgebastion-06 systemd[1]: [/etc/systemd/system/user-.slice.d/puppet-override.conf:11] Memory limit '0' out of range. Ignoring.
Feb 04 11:14:26 tools-sgebastion-06 systemd[1]: [/etc/systemd/system/user-.slice.d/puppet-override.conf:14] Unknown lvalue 'IPAccounting' in section 'Slice'
Feb 04 11:14:26 tools-sgebastion-06 systemd[1]: user-.slice: Slice name user-.slice is not valid. Refusing.

aborrero@tools-sgebastion-06:~$ apt-cache policy systemd
systemd:
  Installed: 232-25+deb9u8
  Candidate: 232-25+deb9u8
  Version table:
     239-12~bpo9+1 100
        100 http://mirrors.wikimedia.org/debian stretch-backports/main amd64 Packages
 *** 232-25+deb9u8 500
        500 http://security.debian.org stretch/updates/main amd64 Packages
        100 /var/lib/dpkg/status
     232-25+deb9u6 500
        500 http://deb.debian.org/debian stretch/main amd64 Packages

Possible causes:

  • We have the wrong systemd version installed
  • We have a typo in the config
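
To tell these two possibilities apart, it may help to compare the installed version with the one available from backports. The following is a minimal sketch, not part of the original report: `parse_policy` and `upstream_major` are hypothetical helper names, and the code assumes plain `apt-cache policy` output like the one above, with version strings that carry no epoch.

```python
import re


def parse_policy(text):
    """Return {package: {"installed": ..., "candidate": ...}} from `apt-cache policy` output."""
    result = {}
    current = None
    for line in text.splitlines():
        # Package headers start in column 0 and consist of "name:" only.
        header = re.match(r"^(\S+):\s*$", line)
        if header:
            current = header.group(1)
            result[current] = {}
            continue
        # Installed/Candidate lines are indented under the package header.
        field = re.match(r"^\s+(Installed|Candidate):\s*(\S+)", line)
        if field and current:
            result[current][field.group(1).lower()] = field.group(2)
    return result


def upstream_major(version):
    """Leading integer of a Debian version string such as '232-25+deb9u8' (no epoch assumed)."""
    return int(re.match(r"(\d+)", version).group(1))
```

Fed the output above, `parse_policy` reports `232-25+deb9u8` as both installed and candidate for systemd, so `upstream_major` gives 232, while 239 is only offered by stretch-backports at priority 100.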

Event Timeline

aborrero created this task. Mon, Feb 4, 11:26 AM
aborrero triaged this task as High priority.
Restricted Application added a subscriber: Aklapper. Mon, Feb 4, 11:26 AM

Change 487823 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: bastion: introduce apt pinning for systemd

https://gerrit.wikimedia.org/r/487823

Change 487823 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: bastion: introduce apt pinning for systemd

https://gerrit.wikimedia.org/r/487823

Mentioned in SAL (#wikimedia-cloud) [2019-02-04T11:36:32Z] <arturo> T215154 manually install systemd 239 in tools-sgebastion-06

Mentioned in SAL (#wikimedia-cloud) [2019-02-04T11:38:49Z] <arturo> T215154 reboot tools-sgebastion-06 to totally refresh systemd status

Mentioned in SAL (#wikimedia-cloud) [2019-02-04T12:26:18Z] <arturo> T215154 another reboot for tools-sgebastion-06. Puppet is disabled

Something doesn't make sense. I can go over the memory limit and systemd does nothing, even though the limits are set:

root@tools-sgebastion-06:~# systemctl status user-18194.slice
● user-18194.slice
   Loaded: loaded
  Drop-In: /etc/systemd/system/user-.slice.d
           └─puppet-override.conf
   Active: active since Mon 2019-02-04 12:43:51 UTC; 1min 6s ago
    Tasks: 9 (limit: 100)
   Memory: 760.5M (high: 100.0M max: 150.0M swap max: 0B limit: 150.0M)
      CPU: 4.993s
   CGroup: /user.slice/user-18194.slice
           ├─session-9.scope
           │ ├─1307 sshd: aborrero [priv]
           │ ├─1333 sshd: aborrero@pts/1
           │ ├─1334 -bash
           │ ├─2378 stress -c 2 --vm 1 --vm-bytes 1G
           │ ├─2379 stress -c 2 --vm 1 --vm-bytes 1G
           │ ├─2380 stress -c 2 --vm 1 --vm-bytes 1G
           │ └─2381 stress -c 2 --vm 1 --vm-bytes 1G

I'm now running the correct systemd version:

root@tools-sgebastion-06:~# dpkg -s systemd | grep Version
Version: 239-12~bpo9+1
root@tools-sgebastion-06:~# systemd --version
systemd 239
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid

Another weird thing is this:

root@tools-sgebastion-06:~# systemctl status user-.slice
Warning: The unit file, source configuration file or drop-ins of user-.slice changed on disk. Run 'systemctl daemon-reload' to reload units.
● user-.slice
   Loaded: error (Reason: Unit user-.slice failed to loaded properly: Invalid argument.)
  Drop-In: /etc/systemd/system/user-.slice.d
           └─puppet-override.conf
   Active: inactive (dead)

systemctl reports that slice as 'invalid', but there is no further information on what is failing. Apparently daemon-reload does nothing.

Is this one of those template units that need to be enabled for each user separately? I see there's a "user-0.slice" for root.

Mentioned in SAL (#wikimedia-cloud) [2019-02-04T13:19:58Z] <arturo> T215154 another reboot for tools-sgebastion-06

> Is this one of those template units that need to be enabled for each user separately? I see there's a "user-0.slice" for root.

Yes, user-0.slice is for root. We need an explicit config for root because otherwise root would get the same limits as the rest of the users (i.e. the ones set in user-.slice).
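
As an illustration of that layout, here is a minimal sketch that renders two drop-ins: one with limits for every user slice and one explicit override for root. The actual contents of puppet-override.conf are not shown in this task, so everything here is an assumption: the user values simply mirror the `systemctl status` output above (tasks 100, high 100M, max 150M, swap max 0), the root values of `infinity` are a guess at "no limit", and `render_slice_dropin` and the `limits.conf` file name are hypothetical.

```python
def render_slice_dropin(settings):
    """Render a systemd drop-in consisting of a single [Slice] section."""
    lines = ["[Slice]"]
    lines += [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


# Limits for every regular user, e.g. /etc/systemd/system/user-.slice.d/limits.conf
# (hypothetical file name; values mirror the status output in this task).
user_limits = {
    "TasksMax": "100",
    "MemoryHigh": "100M",
    "MemoryMax": "150M",
    "MemorySwapMax": "0",
}

# Explicit override for root, e.g. /etc/systemd/system/user-0.slice.d/limits.conf
# (assumed values: lift the limits instead of inheriting the ones above).
root_limits = {key: "infinity" for key in user_limits}

print(render_slice_dropin(user_limits))
print(render_slice_dropin(root_limits))
```

The design point is only that the more specific `user-0.slice` configuration sits next to the generic `user-.slice` one, so root does not inherit the limits meant for ordinary users.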

I just wrote this: https://wikitech.wikimedia.org/wiki/Systemd_resource_control

The problem has been mostly solved now:

root@tools-sgebastion-06:~# systemctl status user-.slice
Warning: The unit file, source configuration file or drop-ins of user-.slice changed on disk. Run 'systemctl daemon-reload' to reload units.
● user-.slice
   Loaded: error (Reason: Unit user-.slice failed to loaded properly: Invalid argument.)
  Drop-In: /etc/systemd/system/user-.slice.d
           └─puppet-override.conf
   Active: inactive (dead)

^^^ this happens because of a feature/bug in systemctl status, which doesn't like the 'fake' unit name ending in a dash
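
A rough way to express that distinction in code, assuming the rule hinted at by the earlier "Slice name user-.slice is not valid" log line (no empty dash-separated component in a slice name). This is an approximation for illustration, not systemd's own validation, and both function names are hypothetical.

```python
def looks_like_concrete_slice(name):
    """Rough check: a concrete slice name has no empty dash-separated component."""
    if name == "-.slice":
        # The root slice is a special case.
        return True
    if not name.endswith(".slice"):
        return False
    stem = name[: -len(".slice")]
    # "user-" splits into ["user", ""], and the empty part fails the check.
    return all(stem.split("-"))


def user_slice(uid):
    """Name of the per-user slice that can actually be inspected."""
    return f"user-{uid}.slice"
```

Under this approximation `user-.slice` is only the name of a drop-in directory, so the unit to query is the per-user one, for example `user_slice(18194)`.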

root@tools-sgebastion-06:~# systemctl status user-18194.slice
● user-18194.slice
   Loaded: loaded
  Drop-In: /etc/systemd/system/user-.slice.d
           └─puppet-override.conf
   Active: active since Mon 2019-02-04 12:43:51 UTC; 1min 6s ago
    Tasks: 9 (limit: 100)
   Memory: 760.5M (high: 100.0M max: 150.0M swap max: 0B limit: 150.0M)

^^^ this happened because I already had a lot of memory allocated and was trying to set a limit below the current allocation (760.5M in use against a 150M limit). In the logs I saw:
user-18194.slice: Failed to set memory.limit_in_bytes: Device or resource busy

aborrero@tools-sgebastion-06:~$ sudo systemctl status user-.slice
● user-.slice
   Loaded: error (Reason: Invalid argument)
  Drop-In: /etc/systemd/system/user-.slice.d
           └─puppet-override.conf
   Active: inactive (dead)

Feb 04 11:14:26 tools-sgebastion-06 systemd[1]: [/etc/systemd/system/user-.slice.d/puppet-override.conf:11] Memory limit '0' out of range. Ignoring.
Feb 04 11:14:26 tools-sgebastion-06 systemd[1]: [/etc/systemd/system/user-.slice.d/puppet-override.conf:14] Unknown lvalue 'IPAccounting' in section 'Slice'
Feb 04 11:14:26 tools-sgebastion-06 systemd[1]: user-.slice: Slice name user-.slice is not valid. Refusing.

^^^ all these are gone after upgrading to systemd 239.

Change 487847 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: bastion: also apt pin udev

https://gerrit.wikimedia.org/r/487847

Change 487886 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: bastion: split resource control puppet code

https://gerrit.wikimedia.org/r/487886

Change 487886 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: bastion: split resource control puppet code

https://gerrit.wikimedia.org/r/487886

Change 487847 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: bastion: also apt pin udev

https://gerrit.wikimedia.org/r/487847

aborrero closed this task as Resolved. Mon, Feb 4, 5:37 PM
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

I think we are all set.

aborrero reopened this task as Open. Thu, Feb 14, 2:21 PM

Reopening because we moved from tools-sgebastion-06 to tools-sgebastion-07 and the systemd version there isn't right, so we don't have resource control:

aborrero@tools-sgebastion-07:~$ sudo apt-get install systemd
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 systemd : Depends: libsystemd0 (= 239-12~bpo9+1) but 232-25+deb9u8 is to be installed
E: Unable to correct problems, you have held broken packages.
aborrero@tools-sgebastion-07:~$ apt-cache policy udev systemd libsystemd0
udev:
  Installed: 232-25+deb9u8
  Candidate: 239-12~bpo9+1
  Version table:
     239-12~bpo9+1 1001
        100 http://mirrors.wikimedia.org/debian stretch-backports/main amd64 Packages
 *** 232-25+deb9u8 500
        500 http://security.debian.org stretch/updates/main amd64 Packages
        100 /var/lib/dpkg/status
     232-25+deb9u6 500
        500 http://deb.debian.org/debian stretch/main amd64 Packages
systemd:
  Installed: 232-25+deb9u8
  Candidate: 239-12~bpo9+1
  Version table:
     239-12~bpo9+1 1001
        100 http://mirrors.wikimedia.org/debian stretch-backports/main amd64 Packages
 *** 232-25+deb9u8 500
        500 http://security.debian.org stretch/updates/main amd64 Packages
        100 /var/lib/dpkg/status
     232-25+deb9u6 500
        500 http://deb.debian.org/debian stretch/main amd64 Packages
libsystemd0:
  Installed: 232-25+deb9u8
  Candidate: 232-25+deb9u8
  Version table:
     239-12~bpo9+1 100
        100 http://mirrors.wikimedia.org/debian stretch-backports/main amd64 Packages
 *** 232-25+deb9u8 500
        500 http://security.debian.org stretch/updates/main amd64 Packages
        100 /var/lib/dpkg/status
     232-25+deb9u6 500
        500 http://deb.debian.org/debian stretch/main amd64 Packages

This indicates we should extend the apt pinning to libsystemd0 as well.
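
The reasoning can be sketched as a simple consistency check: systemd depends on libsystemd0 at exactly the same version, so every package in the set needs the same candidate. The snippet below is illustrative only; `lagging_packages` is a hypothetical helper and the dictionary is copied by hand from the `apt-cache policy` output above.

```python
def lagging_packages(candidates, reference="systemd"):
    """Return packages whose candidate version differs from the reference package's."""
    wanted = candidates[reference]
    return sorted(pkg for pkg, version in candidates.items() if version != wanted)


# Candidate versions as reported on tools-sgebastion-07.
candidates = {
    "udev": "239-12~bpo9+1",
    "systemd": "239-12~bpo9+1",
    "libsystemd0": "232-25+deb9u8",
}

print(lagging_packages(candidates))  # ['libsystemd0']
```

Any package the check returns is one that the pinning does not yet cover.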

Change 490605 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] systemd: user slice: pinning for libsystemd0 as well

https://gerrit.wikimedia.org/r/490605

Change 490605 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] systemd: user slice: pinning for libsystemd0 as well

https://gerrit.wikimedia.org/r/490605

Mentioned in SAL (#wikimedia-cloud) [2019-02-14T17:35:24Z] <arturo> T215154 tools-sgebastion-07 now running systemd 239 and starts enforcing user limits

aborrero closed this task as Resolved. Sat, Feb 16, 3:12 PM
aborrero reopened this task as Open. Tue, Feb 19, 1:29 PM

Reopening because I saw something weird on the server regarding this, and I don't think it is working as expected. I don't have any clues yet about what's wrong, though.

bd808 added a subscriber: bd808. Wed, Feb 20, 1:48 AM

On the new tools-sgebastion-08.tools.eqiad.wmflabs:

$ apt-cache policy udev systemd libsystemd0
udev:
  Installed: 239-12~bpo9+1
  Candidate: 239-12~bpo9+1
  Version table:
 *** 239-12~bpo9+1 100
        100 http://mirrors.wikimedia.org/debian stretch-backports/main amd64 Packages
        100 /var/lib/dpkg/status
     232-25+deb9u9 500
        500 http://security.debian.org stretch/updates/main amd64 Packages
     232-25+deb9u8 500
        500 http://deb.debian.org/debian stretch/main amd64 Packages
systemd:
  Installed: 239-12~bpo9+1
  Candidate: 239-12~bpo9+1
  Version table:
 *** 239-12~bpo9+1 100
        100 http://mirrors.wikimedia.org/debian stretch-backports/main amd64 Packages
        100 /var/lib/dpkg/status
     232-25+deb9u9 500
        500 http://security.debian.org stretch/updates/main amd64 Packages
     232-25+deb9u8 500
        500 http://deb.debian.org/debian stretch/main amd64 Packages
libsystemd0:
  Installed: 239-12~bpo9+1
  Candidate: 239-12~bpo9+1
  Version table:
 *** 239-12~bpo9+1 1001
        100 http://mirrors.wikimedia.org/debian stretch-backports/main amd64 Packages
        100 /var/lib/dpkg/status
     232-25+deb9u9 500
        500 http://security.debian.org stretch/updates/main amd64 Packages
     232-25+deb9u8 500
        500 http://deb.debian.org/debian stretch/main amd64 Packages
$ sudo systemctl --no-pager status user-.slice
● user-.slice - User Slice of UID
   Loaded: error (Reason: Unit user-.slice failed to loaded properly: Invalid argument.)
  Drop-In: /lib/systemd/system/user-.slice.d
           └─10-defaults.conf
           /etc/systemd/system/user-.slice.d
           └─puppet-override.conf
   Active: inactive (dead)
Warning: The unit file, source configuration file or drop-ins of user-.slice changed on disk. Run 'systemctl daemon-reload' to reload units.