From integration-cumin.integration.eqiad.wmflabs, cumin takes a while to execute even the simplest command. The labs project has 35 instances; a quick test with the date command:
$ date; time sudo cumin --debug --force '*' 'date'
Tue Sep 18 10:38:33 UTC 2018
35 hosts will be targeted:
<snip>

real    0m36.435s
user    0m1.056s
sys     0m0.328s
Some hosts reply after just 6 seconds, but most take 30+ seconds. To the best of my knowledge, those are the jessie hosts.
Stopping nslcd resolves the issue; from T204681#4593029:
$ sudo systemctl stop nslcd
$ date; time sudo cumin --debug --force '*' 'date'
Tue Sep 18 10:59:15 UTC 2018
35 hosts will be targeted:
<snip>

real    0m2.889s
user    0m1.024s
sys     0m0.332s
I tracked it down to keyholder's ssh-agent-proxy: upon connection by a user, the proxy looks up the user and group of the peer and retrieves all groups with grp.getgrall(), which requests every group from the labs LDAP and takes a while.
class SshAgentProxyHandler(socketserver.BaseRequestHandler):
    """This class is responsible for handling an individual connection
    to an SshAgentProxyServer."""

    def get_peer_credentials(self, sock):
        """Return the user and group name of the peer of a UNIX socket."""
        ucred = sock.getsockopt(socket.SOL_SOCKET, SO_PEERCRED, s_ucred.size)
        _, uid, gid = s_ucred.unpack(ucred)
        user = pwd.getpwuid(uid).pw_name
        groups = {grp.getgrgid(gid).gr_name}
        groups.update(g.gr_name for g in grp.getgrall() if user in g.gr_mem)
        return user, groups
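For context, the excerpt relies on a couple of module-level names defined elsewhere in keyholder; roughly the following, assuming the Linux struct ucred layout (pid, uid, gid as three native ints):

import socket
import struct

# Linux struct ucred: pid_t, uid_t, gid_t packed as three native ints.
s_ucred = struct.Struct('3i')

# SO_PEERCRED lets a UNIX domain socket server ask the kernel for the
# peer's credentials; fall back to the Linux constant if the socket
# module does not expose it.
SO_PEERCRED = getattr(socket, 'SO_PEERCRED', 17)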
On labs:
$ time python3 -c 'import grp; grp.getgrall()'

real    0m0.939s
user    0m0.040s
sys     0m0.004s
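For comparison, a per-user membership query only asks nslcd for that one user's groups instead of pulling the whole group database. A small timing sketch (not something keyholder does today, just an illustration using os.getgrouplist()), to be run on one of the affected instances; absolute numbers will vary:

import grp
import os
import pwd
import time

def timed(label, fn):
    """Run fn once and print the wall-clock time it took."""
    start = time.monotonic()
    fn()
    print('{:<22s} {:.3f}s'.format(label, time.monotonic() - start))

me = pwd.getpwuid(os.getuid())
# What the proxy does per connection: enumerate every group known to NSS.
timed('grp.getgrall()', grp.getgrall)
# Per-user lookup: only this user's memberships are requested via nslcd.
timed('os.getgrouplist()', lambda: os.getgrouplist(me.pw_name, me.pw_gid))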
The multiple connection requests to keyholder end up saturating the local nslcd, so the ssh-agent-proxy takes a while to relay them: at roughly one second per grp.getgrall() call, the 35 connections account for most of the 36 seconds observed above.
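That effect can be reproduced without cumin. The sketch below fires 35 group enumerations at once (one per targeted instance), each in its own process so the lookups really do hit nslcd concurrently; it is only an illustration of the failure mode, and timings depend on the host:

import concurrent.futures
import grp
import time

def one_lookup(i):
    """Enumerate all groups, as the proxy does for each connection."""
    start = time.monotonic()
    grp.getgrall()
    return i, time.monotonic() - start

if __name__ == '__main__':
    start = time.monotonic()
    # 35 concurrent enumerations, roughly what 35 SSH connections through
    # the keyholder proxy trigger.
    with concurrent.futures.ProcessPoolExecutor(max_workers=35) as pool:
        for i, elapsed in pool.map(one_lookup, range(35)):
            print('lookup {:2d} took {:.2f}s'.format(i, elapsed))
    print('total: {:.2f}s'.format(time.monotonic() - start))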