
sssd permanent failure on integration-agent-docker-1029
Open, Needs Triage, Public

Description

I tried to SSH to integration-agent-docker-1029 but it didn't work. So I added myself to the root authorized_keys and investigated.

getent passwd tstarling is empty.

2022-12-04 16:26:46: start time of sssd_be

/var/log/sssd/sssd_nss.log.1:

(2022-12-04 16:26:47): [nss] [sbus_dbus_connect_address] (0x0020): Unable to connect to unix:path=/var/lib/sss/pipes/private/sbus-dp_wikimedia.org [org.freedesktop.DBus.Error.NoServer]: Failed to connect to socket /var/lib/sss/pipes/private/sbus-dp_wikimedia.org: Connection refused
(2022-12-04 16:26:48): [nss] [sbus_reconnect_attempt] (0x0020): Unable to connect to D-Bus
(2022-12-04 16:26:51): [nss] [sbus_dbus_connect_address] (0x0020): Unable to connect to unix:path=/var/lib/sss/pipes/private/sbus-dp_wikimedia.org [org.freedesktop.DBus.Error.NoServer]: Failed to connect to socket /var/lib/sss/pipes/private/sbus-dp_wikimedia.org: Connection refused
(2022-12-04 16:26:51): [nss] [sbus_reconnect_attempt] (0x0020): Unable to connect to D-Bus
(2022-12-04 16:27:01): [nss] [sbus_dbus_connect_address] (0x0020): Unable to connect to unix:path=/var/lib/sss/pipes/private/sbus-dp_wikimedia.org [org.freedesktop.DBus.Error.NoServer]: Failed to connect to socket /var/lib/sss/pipes/private/sbus-dp_wikimedia.org: Connection refused
(2022-12-04 16:27:02): [nss] [sbus_reconnect_attempt] (0x0020): Unable to connect to D-Bus
(2022-12-04 16:27:02): [nss] [sbus_reconnect] (0x0020): Unable to reconnect: maximum retries exceeded.
(2022-12-04 16:27:02): [nss] [sss_dp_on_reconnect] (0x0010): Could not reconnect to wikimedia.org provider.
(2022-12-04 16:30:53): [nss] [cache_req_common_process_dp_reply] (0x0040): CR #205723: Could not get account info [1432158212]: SSSD is offline

The last message, "SSSD is offline", is repeated up to the present (2022-12-12 6:06:00).
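As a rough illustration of how the pattern above could be picked out of the responder log automatically, here is a minimal Python sketch. The function name `find_offline_events` and the marker list are made up for this example; the regular expression assumes the `(timestamp): [component] [function] (level): message` layout seen in the sssd_nss.log excerpt and is not meant to cover every sssd log format (it does not match the nested `[be[wikimedia.org]]` prefix of the backend log).

```python
import re

# Layout assumed from the sssd_nss.log lines quoted above:
# (timestamp): [component] [function] (level): message
LINE_RE = re.compile(
    r"^\((?P<ts>[^)]+)\): \[(?P<component>[^\]]+)\] \[(?P<func>[^\]]+)\] "
    r"\((?P<level>0x[0-9a-fA-F]+)\): (?P<msg>.*)$"
)

# Substrings taken from the excerpt that indicate the responder gave up
# on the data provider. This list is illustrative, not exhaustive.
FAILURE_MARKERS = (
    "maximum retries exceeded",
    "Could not reconnect to",
    "SSSD is offline",
)


def find_offline_events(lines):
    """Return (timestamp, message) pairs for lines containing a failure marker."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and any(marker in match.group("msg") for marker in FAILURE_MARKERS):
            events.append((match.group("ts"), match.group("msg")))
    return events
```

Feeding it the lines of /var/log/sssd/sssd_nss.log (for example via `open(path)`) would return the timestamps at which the responder reported being offline, which could then drive an alert.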

The backend log, sssd_wikimedia.org.log.1, is quiet at the time of the initial failure, until:

(2022-12-04 16:30:51): [be[wikimedia.org]] [server_setup] (0x0040): Starting with debug level = 0x0070

syslog:

Dec  4 16:30:51 integration-agent-docker-1029 kernel: [3544302.900910] docker invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
...
Dec  4 16:30:51 integration-agent-docker-1029 kernel: [3544302.901530] Out of memory: Killed process 3715275 (php) total-vm:23878800kB, anon-rss:22873048kB, file-rss:0kB, shmem-rss:0kB, UID:65534 pgtables:46708kB oom_score_adj:0
...
Dec  4 16:30:51 integration-agent-docker-1029 sssd[613]: Child [664] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.

I restarted sssd, which fixed the problem.

I suggest monitoring for this kind of sssd failure and/or submitting a patch upstream.
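One possible shape for such a monitoring check, sketched in Python: it runs the same `getent passwd` lookup used above and treats an empty result as a failure. The function name `user_resolves` and the injectable `run` parameter are assumptions for this example, not part of any existing check; the username to probe would need to be an account that only SSSD can resolve.

```python
import subprocess


def user_resolves(username, run=subprocess.run):
    """Return True if `getent passwd <username>` prints an entry.

    `run` is injectable so the check can be exercised without a real
    getent binary; by default it is subprocess.run.
    """
    result = run(
        ["getent", "passwd", username],
        capture_output=True,
        text=True,
        check=False,  # a missing user is an expected outcome, not an exception
    )
    # getent exits non-zero and prints nothing when the key is not found.
    return result.returncode == 0 and bool(result.stdout.strip())
```

A wrapper script could call `user_resolves("tstarling")` and exit non-zero when it returns False, so that whatever alerting system is in use flags the host.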

Event Timeline

Someone who is in the cloudinfra project should run something like

sudo cumin -b10 '*' 'getent passwd tstarling || echo help'

I did it in deployment-prep and found another broken host with SSSD in offline mode: deployment-docker-wikifunctions01. I fixed it.

For sssd I found another issue via T349681#9278821: the service unit has a restart counter limit, and once that limit is reached the service is never restarted again. Worse, sssd-nss.socket is itself shut down / marked as failed, so the service never comes back. My fix was to use cumin to run systemctl start sssd-nss.socket. I haven't bothered going the extra mile of fixing the systemd service/socket definitions though :-\
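The manual fix described here could be scripted roughly as follows. This is only a sketch: the helper names `unit_is_failed` and `recover_socket` are invented for the example, it needs root to actually start the unit, and it assumes that a plain `systemctl start` is enough (as reported above) rather than also requiring `systemctl reset-failed`.

```python
import subprocess


def unit_is_failed(unit, run=subprocess.run):
    """Return True if systemd reports the unit as failed."""
    # `systemctl is-failed` exits 0 when the unit is in the failed state.
    result = run(["systemctl", "is-failed", "--quiet", unit], check=False)
    return result.returncode == 0


def recover_socket(unit="sssd-nss.socket", run=subprocess.run):
    """Start the unit if it is marked failed; return True if a start was issued."""
    if not unit_is_failed(unit, run=run):
        return False
    # Same command as the manual fix; check=True raises if the start fails.
    run(["systemctl", "start", unit], check=True)
    return True
```

Running `recover_socket()` across a fleet (for example through cumin) would only touch hosts where the socket is actually in the failed state.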