
Sudo sss directives failing intermittently
Open, Needs Triage | Public | BUG REPORT

Description

We get emails almost daily with the following message:

Subject: *** SECURITY information for tools-sgeexec-0939.tools.eqiad.wmflabs ***

tools-sgeexec-0939.tools.eqiad.wmflabs : Jun  2 01:52:08 : prometheus : problem with defaults entries ; TTY=unknown ; PWD=/var/lib/prometheus ; USER=root ;

This is caused by a failure to retrieve the sudo directives from the sss module, which is configured as a sudoers source:

root@tools-sgeexec-0939:~# grep sudoers /etc/nsswitch.conf
sudoers:        files sss

We should investigate and/or keep track of these errors (they might be periodic LDAP issues).

Event Timeline

dcaro created this task.

As in the parent task, these messages followed a segmentation fault message:

Subject: Cron <prometheus@tools-sgeexec-0939> /usr/local/bin/prometheus-local-crontabs

/usr/local/bin/prometheus-local-crontabs: line 27:  5429 Segmentation fault      /usr/bin/sudo -u root /bin/ls -1 /var/spool/cron/crontabs/

So the segmentation fault might be the trigger for the other messages (or all of them might simply be caused by the LDAP hiccup).
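To make it easier to tell a crash of sudo apart from an ordinary failure, the call could be wrapped so that a SIGSEGV exit is logged with a timestamp. This is only a sketch: `run_and_flag_segfault` is a hypothetical helper, not part of `prometheus-local-crontabs`, and it assumes Linux, where a process killed by SIGSEGV (signal 11) is reported by the shell as exit status 128 + 11 = 139.

```shell
# Hypothetical helper: run a command and log when it died from SIGSEGV.
# Assumes Linux signal numbering (SIGSEGV = 11, so the shell reports 139).
run_and_flag_segfault() {
    "$@"
    status=$?
    if [ "$status" -eq 139 ]; then
        # UTC timestamp so the entry can be matched against journal and sssd logs
        printf '%s segfault running: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2
    fi
    return "$status"
}
```

In the cron script this would be used as `run_and_flag_segfault /usr/bin/sudo -u root /bin/ls -1 /var/spool/cron/crontabs/`; the original exit status is passed through unchanged.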

Change 699216 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] tools: try to alleviate sudo crashing when triggering oom

https://gerrit.wikimedia.org/r/699216

Change 699216 merged by David Caro:

[operations/puppet@production] tools: try to alleviate sudo crashing when triggering oom

https://gerrit.wikimedia.org/r/699216

The patch did prevent an OOM from happening right when sudo ran, but the segmentation fault keeps happening.
These are the journal logs for the latest errors:

Jun 14 06:30:01 tools-sgeexec-0917 CRON[27032]: (prometheus) CMD (/usr/local/bin/prometheus-local-crontabs)
Jun 14 06:30:01 tools-sgeexec-0917 systemd[1]: prometheus_ssh_open_sessions.service: Failed with result 'exit-code'.
Jun 14 06:30:01 tools-sgeexec-0917 systemd[1]: prometheus_ssh_open_sessions.service: Main process exited, code=exited, status=1/FAILURE
Jun 14 06:30:01 tools-sgeexec-0917 systemd[1]: prometheus_ssh_open_sessions.service: Unit entered failed state.
Jun 14 06:30:01 tools-sgeexec-0917 systemd[1]: Started Regular job to collect active shell session information.
Jun 14 06:30:01 tools-sgeexec-0917 systemd[1]: Started Regular job to collect puppet agent stats.
Jun 14 06:30:02 tools-sgeexec-0917 CRON[27031]: pam_unix(cron:session): session closed for user prometheus
Jun 14 06:30:02 tools-sgeexec-0917 sudo[27050]: pam_unix(sudo:session): session closed for user root
Jun 14 06:30:02 tools-sgeexec-0917 sudo[27050]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jun 14 06:30:02 tools-sgeexec-0917 sudo[27050]: prometheus : TTY=unknown ; PWD=/var/lib/prometheus ; USER=root ; COMMAND=/bin/ls -1 /var/spool/cron/crontabs/
Jun 14 06:30:12 tools-sgeexec-0917 sshd[27352]: Connection from 172.16.0.103 port 45816 on 172.16.1.229 port 22
Jun 14 06:30:12 tools-sgeexec-0917 sshd[27352]: Did not receive identification string from 172.16.0.103 port 45816
Jun 14 06:30:12 tools-sgeexec-0917 sshd[27353]: Connection from 172.16.1.8 port 40576 on 172.16.1.229 port 22
Jun 14 06:30:12 tools-sgeexec-0917 sshd[27353]: Did not receive identification string from 172.16.1.8 port 40576
Jun 14 06:31:09 tools-sgeexec-0917 systemd[1]: Started Regular job to collect puppet agent stats.
Jun 14 06:31:12 tools-sgeexec-0917 sshd[28927]: Connection from 172.16.0.103 port 46126 on 172.16.1.229 port 22
Jun 14 06:31:12 tools-sgeexec-0917 sshd[28927]: Did not receive identification string from 172.16.0.103 port 46126
Jun 14 06:31:12 tools-sgeexec-0917 sshd[28928]: Connection from 172.16.1.8 port 40870 on 172.16.1.229 port 22
Jun 14 06:31:12 tools-sgeexec-0917 sshd[28928]: Did not receive identification string from 172.16.1.8 port 40870

The sssd logs show nothing relevant at that time:

root@tools-sgeexec-0917:~# for i in /var/log/sssd/sssd_*log; do echo "####### $i"; grep 'Jun 14' $i; echo; done
####### /var/log/sssd/sssd_nss.log
(Mon Jun 14 08:11:34 2021) [sssd[nss]] [nss_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:12:12 2021) [sssd[nss]] [nss_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:12:43 2021) [sssd[nss]] [nss_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:13:15 2021) [sssd[nss]] [nss_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:13:47 2021) [sssd[nss]] [nss_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 09:11:08 2021) [sssd[nss]] [orderly_shutdown] (0x0010): SIGTERM: killing children

####### /var/log/sssd/sssd_pam.log
(Mon Jun 14 08:11:32 2021) [sssd[pam]] [pam_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:12:03 2021) [sssd[pam]] [pam_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:12:30 2021) [sssd[pam]] [pam_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:13:01 2021) [sssd[pam]] [pam_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:13:31 2021) [sssd[pam]] [pam_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:14:00 2021) [sssd[pam]] [pam_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.

####### /var/log/sssd/sssd_ssh.log
(Mon Jun 14 08:11:32 2021) [sssd[ssh]] [ssh_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:12:03 2021) [sssd[ssh]] [ssh_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:12:30 2021) [sssd[ssh]] [ssh_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:13:01 2021) [sssd[ssh]] [ssh_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:13:31 2021) [sssd[ssh]] [ssh_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:14:00 2021) [sssd[ssh]] [ssh_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.

####### /var/log/sssd/sssd_sudo.log
(Mon Jun 14 08:11:34 2021) [sssd[sudo]] [sudo_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:12:03 2021) [sssd[sudo]] [sudo_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:12:30 2021) [sssd[sudo]] [sudo_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:13:01 2021) [sssd[sudo]] [sudo_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:13:31 2021) [sssd[sudo]] [sudo_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.
(Mon Jun 14 08:14:00 2021) [sssd[sudo]] [sudo_dp_reconnect_init] (0x0010): Could not reconnect to wikimedia.org provider.

####### /var/log/sssd/sssd_wikimedia.org.log
(Mon Jun 14 08:10:47 2021) [sssd[be[wikimedia.org]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Mon Jun 14 09:19:42 2021) [sssd[be[wikimedia.org]]] [orderly_shutdown] (0x0010): SIGTERM: killing children

There are a number of connection issues that might be relevant, but nothing that correlates with the event.
Judging by the journal, though, logins also seem to have been failing at that time.
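The correlation check between the sssd logs and the time of the failure can be sketched as a small filter that keeps only the log lines within a given number of seconds of an event. This is an illustration, not the method used above: `sssd_lines_near` is a hypothetical helper, and it assumes GNU `date` (for `-d`) and the `(Mon Jun 14 08:11:34 2021)` timestamp prefix seen in these logs.

```shell
# Hypothetical helper: print sssd log lines within WINDOW seconds of EVENT.
# Usage: sssd_lines_near "2021-06-14 06:30:02" 300 /var/log/sssd/sssd_sudo.log
# Assumes GNU date and lines starting with "(Mon Jun 14 08:11:34 2021) ...".
sssd_lines_near() {
    event_ts=$(date -d "$1" +%s) || return 1
    window=$2
    while IFS= read -r line; do
        # Skip lines that do not start with a parenthesised timestamp
        case "$line" in
            "("*) ;;
            *) continue ;;
        esac
        stamp=${line#\(}       # drop the leading "("
        stamp=${stamp%%\)*}    # keep everything before the first ")"
        ts=$(date -d "$stamp" +%s 2>/dev/null) || continue
        delta=$(( ts - event_ts ))
        if [ "$delta" -lt 0 ]; then
            delta=$(( -delta ))
        fi
        if [ "$delta" -le "$window" ]; then
            printf '%s\n' "$line"
        fi
    done < "$3"
}
```

Both timestamps are parsed in the local timezone of the host, so the event time should be given as it appears in the journal on that host.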

The prometheus_ssh_open_sessions.service unit also seems to fail quite often; it had been failing continuously for 2 days until this morning.

Adding some debug options to get more info.

Mentioned in SAL (#wikimedia-cloud) [2021-06-14T10:13:05Z] <dcaro> setting ssd to debug mode on tools-sgeexec-0917 (T284130)
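For reference, sssd debug logging is normally raised with the `debug_level` option in `/etc/sssd/sssd.conf`. The fragment below is only a sketch of what such a change could look like: the exact options and level used on tools-sgeexec-0917 are not recorded in this task, the level 6 is an assumed value, and the domain section name is inferred from the `sssd_wikimedia.org.log` file name.

```ini
# /etc/sssd/sssd.conf (fragment, assumed values; restart sssd after editing)

[sudo]
# more verbose logging for the sudo responder (sssd_sudo.log)
debug_level = 6

[domain/wikimedia.org]
# more verbose logging for the LDAP backend (sssd_wikimedia.org.log)
debug_level = 6
```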