Page MenuHomePhabricator

dev.toolforge.org unreachable
Closed, DuplicatePublic

Description

$ ssh dev.toolforge.org
Connection closed by 185.15.56.57 port 22

Remark by Lucas: I can get into toolforge-dev as root – looks like SSSD is down

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
root@tools-bastion-12:~# systemctl --failed
  UNIT                 LOAD   ACTIVE SUB    DESCRIPTION
● sssd-pam.service     loaded failed failed SSSD PAM Service responder
● user@redacted.service    loaded failed failed User Manager for UID redacted
● sssd-pam-priv.socket loaded failed failed SSSD PAM Service responder private socket
● sssd-pam.socket      loaded failed failed SSSD PAM Service responder socket

Looks like it just restarted too often:

root@tools-bastion-12:~# systemctl status sssd-pam sssd-pam{,-priv}.socket
× sssd-pam.service - SSSD PAM Service responder
     Loaded: loaded (/lib/systemd/system/sssd-pam.service; indirect; preset: enabled)
     Active: failed (Result: exit-code) since Sat 2025-03-22 02:33:00 UTC; 9h ago
   Duration: 34ms
TriggeredBy: × sssd-pam-priv.socket
             × sssd-pam.socket
       Docs: man:sssd.conf(5)
    Process: 1784988 ExecStartPre=/bin/chown root:root /var/log/sssd/sssd_pam.log (code=exited, status=0/SUCCESS)
    Process: 1784989 ExecStart=/usr/libexec/sssd/sssd_pam ${DEBUG_LOGGER} --socket-activated (code=exited, status=3)
   Main PID: 1784989 (code=exited, status=3)
        CPU: 40ms

Mär 22 02:33:00 tools-bastion-12 systemd[1]: sssd-pam.service: Scheduled restart job, restart counter is at 5.
Mär 22 02:33:00 tools-bastion-12 systemd[1]: Stopped sssd-pam.service - SSSD PAM Service responder.
Mär 22 02:33:00 tools-bastion-12 systemd[1]: sssd-pam.service: Start request repeated too quickly.
Mär 22 02:33:00 tools-bastion-12 systemd[1]: sssd-pam.service: Failed with result 'exit-code'.
Mär 22 02:33:00 tools-bastion-12 systemd[1]: Failed to start sssd-pam.service - SSSD PAM Service responder.

× sssd-pam.socket - SSSD PAM Service responder socket
     Loaded: loaded (/lib/systemd/system/sssd-pam.socket; enabled; preset: enabled)
     Active: failed (Result: service-start-limit-hit) since Sat 2025-03-22 02:33:00 UTC; 9h ago
   Duration: 5d 20h 2min 14.255s
   Triggers: ● sssd-pam.service
       Docs: man:sssd.conf(5)
     Listen: /var/lib/sss/pipes/pam (Stream)
        CPU: 10ms

Mär 16 06:30:46 tools-bastion-12 systemd[1]: Starting sssd-pam.socket - SSSD PAM Service responder socket...
Mär 16 06:30:46 tools-bastion-12 systemd[1]: Listening on sssd-pam.socket - SSSD PAM Service responder socket.
Mär 22 02:33:00 tools-bastion-12 systemd[1]: sssd-pam.socket: Failed with result 'service-start-limit-hit'.
Mär 22 02:33:00 tools-bastion-12 systemd[1]: Closed sssd-pam.socket - SSSD PAM Service responder socket.

× sssd-pam-priv.socket - SSSD PAM Service responder private socket
     Loaded: loaded (/lib/systemd/system/sssd-pam-priv.socket; enabled; preset: enabled)
     Active: failed (Result: service-start-limit-hit) since Sat 2025-03-22 02:33:00 UTC; 9h ago
   Duration: 5d 20h 2min 14.257s
   Triggers: ● sssd-pam.service
       Docs: man:sssd.conf(5)
     Listen: /var/lib/sss/pipes/private/pam (Stream)
        CPU: 16ms

Mär 16 06:30:46 tools-bastion-12 systemd[1]: Starting sssd-pam-priv.socket - SSSD PAM Service responder private socket...
Mär 16 06:30:46 tools-bastion-12 systemd[1]: Listening on sssd-pam-priv.socket - SSSD PAM Service responder private socket.
Mär 22 02:33:00 tools-bastion-12 systemd[1]: sssd-pam-priv.socket: Failed with result 'service-start-limit-hit'.

Maybe it’s safe to just restart it again, but I don’t want to risk messing things up and I don’t think toolforge-dev being down is critical (as long as the regular bastion is up), so I’ll leave this to others who know more about the system :)

Seems like it 🤷 journal indicates SSSD was started again 13:11:45 UTC for no reason that I can make out.

As far as I can see, there's been some flakiness with LDAP (this are the sssd_wikimedia.org.log* files):

root@tools-bastion-12:~# grep --no-filename "Can't contact LDAP server" /var/log/sssd/* | sort
...
(2025-03-16  0:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-16  1:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-16 15:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17  5:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17  6:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17  7:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17  8:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17  9:50:10): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17 10:50:11): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17 11:50:05): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17 12:50:05): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17 13:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17 14:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17 15:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17 16:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-17 17:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-18  1:46:20): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-18  2:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-18  3:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-18  4:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-18 15:45:56): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-18 16:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-18 17:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-18 18:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-19  4:15:47): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-19  4:25:09): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-19  4:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-19  5:15:35): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-19  5:50:09): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-19 18:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-20  8:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-20 21:50:13): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-21  9:50:03): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-22 10:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-22 11:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-22 11:50:04): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-22 12:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-23  0:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-23  1:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-23  2:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-23  2:50:05): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-23 15:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-24  3:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-24  4:15:39): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-24  4:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]
(2025-03-24  5:50:01): [be[wikimedia.org]] [sdap_process_result] (0x0040): ldap_result error: [Can't contact LDAP server]

the ssd process is supposed to be able to recover relatively quickly from it, it just retries, the sssd process spaws the sssd_be one and monitors it outside systemd, that's the one that's crashing when LDAP fails to connect. The logs:

root@tools-bastion-12:~# systemctl status sssd
● sssd.service - System Security Services Daemon
     Loaded: loaded (/lib/systemd/system/sssd.service; enabled; preset: enabled)
     Active: active (running) since Sun 2025-03-16 06:30:46 UTC; 1 week 1 day ago
   Main PID: 1288209 (sssd)
      Tasks: 2 (limit: 19177)
     Memory: 85.5M
        CPU: 3h 38min 39.632s
     CGroup: /system.slice/sssd.service
             ├─1288209 /usr/sbin/sssd -i --logger=files
             └─1961880 /usr/libexec/sssd/sssd_be --domain wikimedia.org --uid 0 --gid 0 --logger=files

Mar 24 04:59:54 tools-bastion-12 sssd[1288209]: Child [1949085] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Mar 24 04:59:54 tools-bastion-12 sssd_be[1949425]: Starting up
Mar 24 05:20:35 tools-bastion-12 sssd[1288209]: Child [1949425] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Mar 24 05:20:35 tools-bastion-12 sssd_be[1950805]: Starting up
Mar 24 05:41:17 tools-bastion-12 sssd[1288209]: Child [1950805] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Mar 24 05:41:17 tools-bastion-12 sssd_be[1951282]: Starting up
Mar 24 05:56:58 tools-bastion-12 sssd[1288209]: Child [1951282] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Mar 24 05:56:58 tools-bastion-12 sssd_be[1952590]: Starting up
Mar 24 09:07:39 tools-bastion-12 sssd[1288209]: Child [1952590] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Mar 24 09:07:39 tools-bastion-12 sssd_be[1961880]: Starting up

There's two weird things there though, one is that on the sssd_wikimedia.org.log files (for the sssd_be subprocess) the connection failed errors are very often at *:50:*, and those don't match the parent's logs terminating the process.
Not sure if this is new though, but definitely there's some interesting patterns.