$ ssh dev.toolforge.org
Connection closed by 185.15.56.57 port 22
Remark by Lucas: I can get into toolforge-dev as root – looks like SSSD is down
$ ssh dev.toolforge.org
Connection closed by 185.15.56.57 port 22
Remark by Lucas: I can get into toolforge-dev as root – looks like SSSD is down
root@tools-bastion-12:~# systemctl --failed UNIT LOAD ACTIVE SUB DESCRIPTION ● sssd-pam.service loaded failed failed SSSD PAM Service responder ● user@redacted.service loaded failed failed User Manager for UID redacted ● sssd-pam-priv.socket loaded failed failed SSSD PAM Service responder private socket ● sssd-pam.socket loaded failed failed SSSD PAM Service responder socket
Looks like it just restarted too often:
root@tools-bastion-12:~# systemctl status sssd-pam sssd-pam{,-priv}.socket × sssd-pam.service - SSSD PAM Service responder Loaded: loaded (/lib/systemd/system/sssd-pam.service; indirect; preset: enabled) Active: failed (Result: exit-code) since Sat 2025-03-22 02:33:00 UTC; 9h ago Duration: 34ms TriggeredBy: × sssd-pam-priv.socket × sssd-pam.socket Docs: man:sssd.conf(5) Process: 1784988 ExecStartPre=/bin/chown root:root /var/log/sssd/sssd_pam.log (code=exited, status=0/SUCCESS) Process: 1784989 ExecStart=/usr/libexec/sssd/sssd_pam ${DEBUG_LOGGER} --socket-activated (code=exited, status=3) Main PID: 1784989 (code=exited, status=3) CPU: 40ms Mär 22 02:33:00 tools-bastion-12 systemd[1]: sssd-pam.service: Scheduled restart job, restart counter is at 5. Mär 22 02:33:00 tools-bastion-12 systemd[1]: Stopped sssd-pam.service - SSSD PAM Service responder. Mär 22 02:33:00 tools-bastion-12 systemd[1]: sssd-pam.service: Start request repeated too quickly. Mär 22 02:33:00 tools-bastion-12 systemd[1]: sssd-pam.service: Failed with result 'exit-code'. Mär 22 02:33:00 tools-bastion-12 systemd[1]: Failed to start sssd-pam.service - SSSD PAM Service responder. × sssd-pam.socket - SSSD PAM Service responder socket Loaded: loaded (/lib/systemd/system/sssd-pam.socket; enabled; preset: enabled) Active: failed (Result: service-start-limit-hit) since Sat 2025-03-22 02:33:00 UTC; 9h ago Duration: 5d 20h 2min 14.255s Triggers: ● sssd-pam.service Docs: man:sssd.conf(5) Listen: /var/lib/sss/pipes/pam (Stream) CPU: 10ms Mär 16 06:30:46 tools-bastion-12 systemd[1]: Starting sssd-pam.socket - SSSD PAM Service responder socket... Mär 16 06:30:46 tools-bastion-12 systemd[1]: Listening on sssd-pam.socket - SSSD PAM Service responder socket. Mär 22 02:33:00 tools-bastion-12 systemd[1]: sssd-pam.socket: Failed with result 'service-start-limit-hit'. Mär 22 02:33:00 tools-bastion-12 systemd[1]: Closed sssd-pam.socket - SSSD PAM Service responder socket. × sssd-pam-priv.socket - SSSD PAM Service responder private socket Loaded: loaded (/lib/systemd/system/sssd-pam-priv.socket; enabled; preset: enabled) Active: failed (Result: service-start-limit-hit) since Sat 2025-03-22 02:33:00 UTC; 9h ago Duration: 5d 20h 2min 14.257s Triggers: ● sssd-pam.service Docs: man:sssd.conf(5) Listen: /var/lib/sss/pipes/private/pam (Stream) CPU: 16ms Mär 16 06:30:46 tools-bastion-12 systemd[1]: Starting sssd-pam-priv.socket - SSSD PAM Service responder private socket... Mär 16 06:30:46 tools-bastion-12 systemd[1]: Listening on sssd-pam-priv.socket - SSSD PAM Service responder private socket. Mär 22 02:33:00 tools-bastion-12 systemd[1]: sssd-pam-priv.socket: Failed with result 'service-start-limit-hit'.
Maybe it’s safe to just restart it again, but I don’t want to risk messing things up and I don’t think toolforge-dev being down is critical (as long as the regular bastion is up), so I’ll leave this to others who know more about the system :)
Seems like it 🤷 journal indicates SSSD was started again 13:11:45 UTC for no reason that I can make out.
As far as I can see, there's been some flakiness with LDAP (this are the sssd_wikimedia.org.log* files):
root@tools-bastion-12:~# grep --no-filename "Can't contact LDAP server" /var/log/sssd/* | sort ... (2025-03-16 0:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-16 1:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-16 15:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 5:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 6:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 7:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 8:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 9:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 10:50:11): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 11:50:05): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 12:50:05): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 13:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 14:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 15:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 16:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-17 17:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-18 1:46:20): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-18 2:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-18 3:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-18 4:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-18 15:45:56): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-18 16:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-18 17:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-18 18:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-19 4:15:47): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-19 4:25:09): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-19 4:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-19 5:15:35): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-19 5:50:09): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-19 18:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-20 8:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-20 21:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-21 9:50:03): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-22 10:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-22 11:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-22 11:50:04): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-22 12:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-23 0:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-23 1:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-23 2:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-23 2:50:05): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-23 15:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-24 3:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-24 4:15:39): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-24 4:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server] (2025-03-24 5:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
the ssd process is supposed to be able to recover relatively quickly from it, it just retries, the sssd process spaws the sssd_be one and monitors it outside systemd, that's the one that's crashing when LDAP fails to connect. The logs:
root@tools-bastion-12:~# systemctl status sssd
● sssd.service - System Security Services Daemon
Loaded: loaded (/lib/systemd/system/sssd.service; enabled; preset: enabled)
Active: active (running) since Sun 2025-03-16 06:30:46 UTC; 1 week 1 day ago
Main PID: 1288209 (sssd)
Tasks: 2 (limit: 19177)
Memory: 85.5M
CPU: 3h 38min 39.632s
CGroup: /system.slice/sssd.service
├─1288209 /usr/sbin/sssd -i --logger=files
└─1961880 /usr/libexec/sssd/sssd_be --domain wikimedia.org --uid 0 --gid 0 --logger=files
Mar 24 04:59:54 tools-bastion-12 sssd[1288209]: Child [1949085] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Mar 24 04:59:54 tools-bastion-12 sssd_be[1949425]: Starting up
Mar 24 05:20:35 tools-bastion-12 sssd[1288209]: Child [1949425] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Mar 24 05:20:35 tools-bastion-12 sssd_be[1950805]: Starting up
Mar 24 05:41:17 tools-bastion-12 sssd[1288209]: Child [1950805] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Mar 24 05:41:17 tools-bastion-12 sssd_be[1951282]: Starting up
Mar 24 05:56:58 tools-bastion-12 sssd[1288209]: Child [1951282] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Mar 24 05:56:58 tools-bastion-12 sssd_be[1952590]: Starting up
Mar 24 09:07:39 tools-bastion-12 sssd[1288209]: Child [1952590] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Mar 24 09:07:39 tools-bastion-12 sssd_be[1961880]: Starting upThere's two weird things there though, one is that on the sssd_wikimedia.org.log files (for the sssd_be subprocess) the connection failed errors are very often at *:50:*, and those don't match the parent's logs terminating the process.
Not sure if this is new though, but definitely there's some interesting patterns.