
cloudcephosd: the service unit user@0.service is in failed status
Closed, Resolved · Public

Description

We just got alerts for all cloudcephosd1* hosts:

May 07 09:20:04 cloudcephosd1004 systemd[2410026]: user@0.service: Failed to attach to cgroup /user.slice/user-0.slice/user@0.service: Device or resource busy
May 07 09:20:04 cloudcephosd1004 systemd[2410026]: user@0.service: Failed at step CGROUP spawning /lib/systemd/systemd: Device or resource busy

Restarting the service clears the error, but we don't know what it means, or why it happened on all the servers at the same time.
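For reference, a per-host cleanup can be sketched as follows (a minimal sketch: the helper function name is an assumption; `systemctl reset-failed` is the command that was eventually run fleet-wide via cumin):

```shell
#!/bin/sh
# Hedged sketch: check whether a unit is in failed state and clear it.
# Assumes systemctl is available; the clear_failed helper is illustrative.
UNIT="user@0.service"

clear_failed() {
    # $1: unit name; reset the failed state only when the unit is failed
    if systemctl is-failed --quiet "$1" 2>/dev/null; then
        systemctl reset-failed "$1"
        echo "cleared $1"
    else
        echo "$1 is not in failed state"
    fi
}

clear_failed "$UNIT"
```

Note that `reset-failed` only drops the recorded failure state; since `user@0.service` is started again on the next root login, there is usually no need to restart it by hand.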

Event Timeline

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

aborrero renamed this task from cloudcephosd: user@0.service is in failed status to cloudcephosd: the service unit user@0.service is in failed status.Tue, May 7, 9:46 AM
aborrero edited projects, added Cloud-Services; removed Cloud-VPS.

The session seems to be a root login from cumin2002 (the same on all nodes):

May 07 09:20:04 cloudcephosd1010 systemd-logind[2594]: New session 96009 of user root.
May 07 09:20:04 cloudcephosd1010 systemd[1]: user@0.service: Succeeded.
May 07 09:20:04 cloudcephosd1010 systemd[1]: Starting User Manager for UID 0...
May 07 09:20:04 cloudcephosd1010 systemd[1081045]: user@0.service: Failed to attach to cgroup /user.slice/user-0.slice/user@0.service: Device or resource busy
May 07 09:20:04 cloudcephosd1010 systemd[1081045]: user@0.service: Failed at step CGROUP spawning /lib/systemd/systemd: Device or resource busy
May 07 09:20:04 cloudcephosd1010 systemd[1]: user@0.service: Main process exited, code=exited, status=219/CGROUP
May 07 09:20:04 cloudcephosd1010 systemd[1]: user@0.service: Failed with result 'exit-code'.
May 07 09:20:04 cloudcephosd1010 systemd[1]: Failed to start User Manager for UID 0.
May 07 09:20:04 cloudcephosd1010 systemd[1]: Started Session 96009 of user root.
May 07 09:20:04 cloudcephosd1010 systemd[1]: confd_prometheus_metrics.service: Succeeded.
May 07 09:20:04 cloudcephosd1010 systemd[1]: Finished Export confd Prometheus metrics.
May 07 09:20:04 cloudcephosd1010 systemd[1]: prometheus-node-pinger.service: Succeeded.
May 07 09:20:04 cloudcephosd1010 systemd[1]: Finished Generate prometheus network latency metrics with pings.
May 07 09:20:04 cloudcephosd1010 sshd[1081005]: Starting session: command for root from 2620:0:860:103:10:192:32:49 port 46082 id 0
May 07 09:20:05 cloudcephosd1010 sshd[1081005]: Close session: user root from 2620:0:860:103:10:192:32:49 port 46082 id 0
May 07 09:20:05 cloudcephosd1010 sshd[1081005]: Received disconnect from 2620:0:860:103:10:192:32:49 port 46082:11: disconnected by user
May 07 09:20:05 cloudcephosd1010 sshd[1081005]: Disconnected from user root 2620:0:860:103:10:192:32:49 port 46082
May 07 09:20:05 cloudcephosd1010 sshd[1081005]: pam_unix(sshd:session): session closed for user root
May 07 09:20:05 cloudcephosd1010 systemd[1]: session-96009.scope: Succeeded.
May 07 09:20:05 cloudcephosd1010 systemd-logind[2594]: Session 96009 logged out. Waiting for processes to exit.
May 07 09:20:05 cloudcephosd1010 systemd[1]: Stopping User Runtime Directory /run/user/0...
May 07 09:20:05 cloudcephosd1010 systemd-logind[2594]: Removed session 96009.
root@cloudcephosd1010:~# dig -x 2620:0:860:103:10:192:32:49
...
;; ANSWER SECTION:
9.4.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa. 916 IN PTR cumin2002.codfw.wmnet.
dcaro claimed this task.
dcaro added a subscriber: MoritzMuehlenhoff.

@MoritzMuehlenhoff mentioned on IRC that this is probably related to the glibc upgrade plus known systemd issues under load; it should happen very infrequently.

To fix it, I just ran:

root@cumin2002:/var/log# cumin cloudcephosd1* 'systemctl reset-failed'

to clear the failed units. If it happens again or more often, we can try using something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/475306
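If this does recur, an automated cleanup could look roughly like the following systemd timer pair. This is only a hedged sketch of the general idea, not the contents of the linked Puppet patch; the unit names and schedule are assumptions:

```ini
# /etc/systemd/system/reset-failed-user0.service (hypothetical name)
[Unit]
Description=Clear a stuck failed user@0.service

[Service]
Type=oneshot
ExecStart=/bin/systemctl reset-failed user@0.service

# /etc/systemd/system/reset-failed-user0.timer (hypothetical name)
[Unit]
Description=Periodically clear a stuck failed user@0.service

[Timer]
OnCalendar=hourly

[Install]
WantedBy=timers.target
```

In practice this would be managed through Puppet rather than dropped in by hand, which is presumably what the linked patch does.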