Ssh / user issue with integration-agent-docker-1057.integration.eqiad1.wikimedia.cloud
Closed, Resolved, Public

Description

I can no longer ssh to integration-agent-docker-1057.integration.eqiad1.wikimedia.cloud; the attempt fails with "permission denied".

I can still connect to it via Cumin from integration-cumin.integration.eqiad.wmflabs, since its key is added in /etc/ssh/userkeys/root.d, which makes me suspect an issue with LDAP.

When using Cumin to trigger a puppet run with sudo cumin --force 'name:agent-docker-1057' 'puppet agent -tv':

Notice: /Stage[main]/Profile::Ci::Docker/Exec[jenkins user docker membership]/returns: usermod: user 'jenkins-deploy' does not exist                                                                                                        
Error: '/usr/sbin/usermod -aG docker 'jenkins-deploy'' returned 6 instead of one of [0]                               
Error: /Stage[main]/Profile::Ci::Docker/Exec[jenkins user docker membership]/returns: change from 'notrun' to ['0'] failed: '/usr/sbin/usermod -aG docker 'jenkins-deploy'' returned 6 instead of one of [0] (corrective)                   
Error: Could not find user jenkins-deploy                                                                             
Error: /Stage[main]/Profile::Ci::Slave::Labs::Common/File[/srv/jenkins]/owner: change from 2947 to 'jenkins-deploy' failed: Could not find user jenkins-deploy                                                                              
Error: Could not find group wikidev                                                                                   
Error: /Stage[main]/Profile::Ci::Slave::Labs::Common/File[/srv/jenkins]/group: change from 500 to 'wikidev' failed: Could not find group wikidev                                                                                            
Notice: /Stage[main]/Profile::Ci::Slave::Labs::Common/File[/srv/jenkins/cache]: Dependency File[/srv/jenkins] has failures: true                                 
...
Error: Could not find user jenkins-deploy                                                                             
Error: /Stage[main]/Profile::Ci::Slave::Labs::Common/File[/srv/home/jenkins-deploy]/owner: change from 2947 to 'jenkins-deploy' failed: Could not find user jenkins-deploy                                                                  
Error: Could not find group wikidev                                                                                   
Error: /Stage[main]/Profile::Ci::Slave::Labs::Common/File[/srv/home/jenkins-deploy]/group: change from 500 to 'wikidev' failed: Could not find group wikidev

The jenkins-deploy user is defined in LDAP https://ldap.toolforge.org/user/jenkins-deploy with uid 2947.

So I guess the LDAP configuration is broken on that instance somehow or it can't reach LDAP?
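One way to confirm that user and group lookups are failing on the host itself is to query the name service switch directly with getent. A minimal sketch; the helper name nss_resolves is made up for illustration and is not part of any existing tooling:

```shell
# Report whether an entry resolves through NSS (files, sss, ...).
# Usage: nss_resolves <database> <key>, e.g. nss_resolves passwd jenkins-deploy
# (nss_resolves is a hypothetical helper, only getent is a real command.)
nss_resolves() {
    database=$1
    key=$2
    if getent "$database" "$key" >/dev/null 2>&1; then
        echo "$database/$key: found"
    else
        echo "$database/$key: NOT found"
        return 1
    fi
}

# The two entries Puppet complains about in the log above.
nss_resolves passwd jenkins-deploy || true
nss_resolves group wikidev || true
```

Since ssh is broken, the same check could presumably be run through Cumin, for example with 'getent passwd jenkins-deploy' as the command. A "NOT found" for both entries would only show that the lookup fails locally; it does not say whether the cause is the LDAP server or the local lookup service.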

Event Timeline

From the Puppet log file:

Oct 24 12:41:01 integration-agent-docker-1057 puppet-agent[1150814]: Applying configuration version '(ac14b2362a) Majavah - P:wmcs::metricsinfra: update karma config to match alerts.wm.o'                                                   
Oct 24 12:41:09 integration-agent-docker-1057 puppet-agent[1150814]: Applied catalog in 8.35 seconds
Oct 24 13:10:29 integration-agent-docker-1057 puppet-agent[1205732]: Applying configuration version '(3834b49c0d) Majavah - P:wmcs::metricsinfra: fix karma config'                                                                           
Oct 24 13:10:36 integration-agent-docker-1057 puppet-agent[1205732]: Applied catalog in 7.91 seconds

So it broke at some point between 12:41 and 13:10, although the Puppet repository had no change. There was no change to Hiera via Horizon either.

The auth.log narrows it down to 13:01:

Oct 24 13:01:04 integration-agent-docker-1057 sshd[1178975]: Connection from 172.16.4.137 port 45076 on 172.16.0.112 port 22 rdomain ""                                                                                                                                         
Oct 24 13:01:04 integration-agent-docker-1057 sshd[1178975]: Invalid user jenkins-deploy from 172.16.4.137 port 45076                                                                                                                                                           
Oct 24 13:01:04 integration-agent-docker-1057 sshd[1178975]: Connection closed by invalid user jenkins-deploy 172.16.4.137 port 45076 [preauth]

Failed systemd units:

$ systemctl list-units --failed
  UNIT             LOAD   ACTIVE SUB    DESCRIPTION                                              
● sssd-nss.service loaded failed failed SSSD NSS Service responder
● sssd-nss.socket  loaded failed failed SSSD NSS Service responder socket

And from journalctl -u sssd-nss --since "1 day ago":

Oct 24 11:22:31 integration-agent-docker-1057 sssd_nss[1110963]: Shutting down (status = 0)
Oct 24 11:22:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Succeeded.
Oct 24 11:25:01 integration-agent-docker-1057 systemd[1]: Started SSSD NSS Service responder.
Oct 24 11:25:01 integration-agent-docker-1057 sssd_nss[1128372]: Starting up
Oct 24 11:35:01 integration-agent-docker-1057 sssd_nss[1128372]: Shutting down (status = 0)
Oct 24 11:35:01 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Succeeded.
Oct 24 11:35:01 integration-agent-docker-1057 systemd[1]: Started SSSD NSS Service responder.
Oct 24 11:35:01 integration-agent-docker-1057 sssd_nss[1129821]: Starting up
Oct 24 11:52:31 integration-agent-docker-1057 sssd_nss[1129821]: Shutting down (status = 0)
Oct 24 11:52:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Succeeded.
Oct 24 11:55:01 integration-agent-docker-1057 systemd[1]: Started SSSD NSS Service responder.
Oct 24 11:55:01 integration-agent-docker-1057 sssd_nss[1132519]: Starting up
Oct 24 12:48:52 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Main process exited, code=exited, status=70/SOFTWARE
Oct 24 12:48:52 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Failed with result 'exit-code'.
Oct 24 12:48:53 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Scheduled restart job, restart counter is at 1.
Oct 24 12:48:53 integration-agent-docker-1057 systemd[1]: Stopped SSSD NSS Service responder.
Oct 24 12:48:53 integration-agent-docker-1057 systemd[1]: Started SSSD NSS Service responder.
Oct 24 12:49:24 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Main process exited, code=exited, status=70/SOFTWARE
Oct 24 12:49:24 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Failed with result 'exit-code'.
Oct 24 12:49:24 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Scheduled restart job, restart counter is at 2.
Oct 24 12:49:24 integration-agent-docker-1057 systemd[1]: Stopped SSSD NSS Service responder.
Oct 24 12:49:24 integration-agent-docker-1057 systemd[1]: Started SSSD NSS Service responder.
Oct 24 12:49:30 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Oct 24 12:49:30 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Failed with result 'exit-code'.
Oct 24 12:49:30 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Scheduled restart job, restart counter is at 3.
Oct 24 12:49:30 integration-agent-docker-1057 systemd[1]: Stopped SSSD NSS Service responder.
Oct 24 12:49:30 integration-agent-docker-1057 systemd[1]: Started SSSD NSS Service responder.
Oct 24 12:49:30 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Oct 24 12:49:30 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Failed with result 'exit-code'.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Scheduled restart job, restart counter is at 4.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: Stopped SSSD NSS Service responder.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: Started SSSD NSS Service responder.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Failed with result 'exit-code'.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Scheduled restart job, restart counter is at 5.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: Stopped SSSD NSS Service responder.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: Started SSSD NSS Service responder.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Failed with result 'exit-code'.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Scheduled restart job, restart counter is at 6.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: Stopped SSSD NSS Service responder.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: Started SSSD NSS Service responder.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Failed with result 'exit-code'.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Scheduled restart job, restart counter is at 7.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: Stopped SSSD NSS Service responder.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Start request repeated too quickly.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: sssd-nss.service: Failed with result 'exit-code'.
Oct 24 12:49:31 integration-agent-docker-1057 systemd[1]: Failed to start SSSD NSS Service responder.

So the service exited with status 70 twice, then with status 3 five times. After the seventh scheduled restart, systemd hit its start rate limit ("Start request repeated too quickly") and refused to start the service again.
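To avoid counting exit statuses by eye, the journal output can be summarised with standard text tools. A small sketch; the function name summarize_exit_statuses is an illustrative choice, and the pattern assumes the "Main process exited, code=exited, status=N/NAME" wording shown in the journal above:

```shell
# Summarise "Main process exited" lines from journalctl output read on stdin.
# Prints one "<count> <status>" line per distinct exit status.
# (summarize_exit_statuses is a hypothetical helper name.)
summarize_exit_statuses() {
    sed -n 's|.*Main process exited, code=exited, status=\([0-9]*/[A-Z]*\).*|\1|p' \
        | sort \
        | uniq -c \
        | awk '{print $1, $2}'
}

# Intended usage on the affected host:
#   journalctl -u sssd-nss --since "1 day ago" | summarize_exit_statuses
```

For the journal excerpt above this gives "5 3/NOTIMPLEMENTED" and "2 70/SOFTWARE". The names after the slash are systemd's generic labels for those numeric codes, so they do not necessarily describe what sssd_nss meant by them.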

I have fixed it using:

sudo cumin --force 'name:agent-docker-1057' 'systemctl start sssd-nss.socket'

So the chain, I think, is:

  • ssh to the host
  • this triggers PAM, LDAP, sssd or whatever does the user lookup
  • which opens the socket managed by sssd-nss.socket
  • systemd spins up sssd-nss.service
  • the service fails and ends up reaching the restart limit, at which point systemd stops restarting it
  • sssd-nss.socket is shut down / marked failed
  • auth is broken (except from Cumin, which uses a key in /etc/ssh and thus bypasses that system)
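If this chain is right, the recovery could be scripted as "detect the failed units, clear their state, start the socket again". A hedged sketch: the function name ensure_sssd_nss is hypothetical, and the reset-failed step is an addition on top of the 'systemctl start sssd-nss.socket' command that was actually used here:

```shell
# Recover the SSSD NSS responder when systemd has given up on it.
# (ensure_sssd_nss is a hypothetical helper; it needs root on the host.)
ensure_sssd_nss() {
    # is-failed exits 0 when at least one of the listed units is failed.
    if systemctl is-failed --quiet sssd-nss.socket sssd-nss.service; then
        # Assumption: also clear the failed state and the start rate limit
        # counter, which the original fix did not do explicitly.
        systemctl reset-failed sssd-nss.service sssd-nss.socket
        # Same action as the fix applied in this task.
        systemctl start sssd-nss.socket
        echo "sssd-nss: recovered"
    else
        echo "sssd-nss: ok"
    fi
}
```

This only restores socket activation. It does not address why sssd_nss exited in the first place, so the failure could recur.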
hashar claimed this task.
hashar added a subscriber: dcaro.

From /var/log/sssd/sssd_nss.log:

(2023-10-24 11:52:31): [nss] [orderly_shutdown] (0x0040): Shutting down (status = 0)
(2023-10-24 11:55:01): [nss] [server_setup] (0x0040): Starting with debug level = 0x0070
(2023-10-24 12:48:53): [nss] [server_setup] (0x0040): Starting with debug level = 0x0070
(2023-10-24 12:49:24): [nss] [server_setup] (0x0040): Starting with debug level = 0x0070
(2023-10-24 12:49:30): [nss] [sbus_dbus_connect_address] (0x0020): Unable to register to unix:path=/var/lib/sss/pipes/private/sbus-dp_wikimedia.org [org.freedesktop.DBus.Error.NoReply]: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
(2023-10-24 12:49:30): [nss] [sss_dp_init] (0x0010): Failed to connect to backend server.
(2023-10-24 12:49:30): [nss] [sss_process_init] (0x0010): fatal error setting up backend connector
(2023-10-24 12:49:30): [nss] [nss_process_init] (0x0010): sss_process_init() failed

@dcaro mentioned the upstream issue #6219, "sssd don't restart properly after being killed by watchdog": https://github.com/SSSD/sssd/issues/6219

A reboot of the instance would have created a fresh sssd-nss.socket which would have fixed the authentication failure.

Mentioned in SAL (#wikimedia-releng) [2023-10-25T09:30:25Z] <hashar> mediawiki-core-doxygen-docker had intermittent failure due to integration-agent-docker-1057 refusing ssh connection since Oct 24th 13:00 # T349681