From integration-cumin.integration.eqiad.wmflabs, cumin takes a while to execute even the simplest command. The labs project has 35 instances; a quick test with the `date` command:
```
$ date; time sudo cumin --debug --force '*' 'date'
Tue Sep 18 10:38:33 UTC 2018
35 hosts will be targeted:
xxx<snip>
===== NODE GROUP =====
(2) integration-slave-docker-[1006,1031].integration.eqiad.wmflabs
----- OUTPUT of 'date' -----
Tue Sep 18 10:39:07 UTC 2018
===== NODE GROUP =====
(16) integration-publishing.integration.eqiad.wmflabs,integration-puppetmaster01.integration.eqiad.wmflabs,integration-slave-docker-[1005,1008-1010,1012-1013,1021,1023-1024,1033].integration.eqiad.wmflabs,integration-slave-jessie-[1001,1004].integration.eqiad.wmflabs,integration-slave-jessie-android.integration.eqiad.wmflabs,jenkinstest.integration.eqiad.wmflabs
----- OUTPUT of 'date' -----
Tue Sep 18 10:39:09 UTC 2018
===== NODE GROUP =====
(11) castor02.integration.eqiad.wmflabs,integration-cumin.integration.eqiad.wmflabs,integration-r-lang-01.integration.eqiad.wmflabs,integration-slave-docker-[1007,1011,1022,1034].integration.eqiad.wmflabs,integration-slave-jessie-[1002-1003].integration.eqiad.wmflabs,saucelabs-01.integration.eqiad.wmflabs,webperformance.integration.eqiad.wmflabs
----- OUTPUT of 'date' -----
Tue Sep 18 10:39:06 UTC 2018
===== NODE GROUP =====
(1) integration-slave-docker-1030.integration.eqiad.wmflabs
----- OUTPUT of 'date' -----
Tue Sep 18 10:38:50 UTC 2018
===== NODE GROUP =====
(5) integration-slave-docker-[1014-1015,1017].integration.eqiad.wmflabs,saucelabs-[02-03].integration.eqiad.wmflabs
----- OUTPUT of 'date' -----
Tue Sep 18 10:38:39 UTC 2018
================
PASS: |█████████████████████████████████████████████████████████████████████████████████████████████| 100% (35/35) [00:35<00:00, 1.01s/hosts]
FAIL: | | 0% (0/35) [00:35<?, ?hosts/s]
100.0% (35/35) success ratio (>= 100.0% threshold) for command: 'date'.
100.0% (35/35) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
real 0m36.435s
user 0m1.056s
sys 0m0.328s
```
Some hosts reply after just 6 seconds, but most take 30+ seconds. To the best of my knowledge the slow ones are the jessie hosts.
Stopping `nslcd` resolves that, from T204681#4593029:
```
$ sudo systemctl stop nslcd
$ date; time sudo cumin --debug --force '*' 'date'
Tue Sep 18 10:59:15 UTC 2018
35 hosts will be targeted:
....<snip>
----- OUTPUT of 'date' -----
Tue Sep 18 10:59:18 UTC 2018
PASS: |█████████████████████████████████████████████████████████████████████████████████████████████| 100% (35/35) [00:01<00:00, 5.58hosts/s]
FAIL: | | 0% (0/35) [00:01<?, ?hosts/s]
100.0% (35/35) success ratio (>= 100.0% threshold) for command: 'date'.
100.0% (35/35) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
real 0m2.889s
user 0m1.024s
sys 0m0.332s
```
I tracked it down to keyholder's ssh-agent-proxy: upon a connection from a user, the proxy looks up the user and group, then retrieves all groups with `grp.getgrall()`. That requests every group from the labs LDAP, which takes a while.
```
lang=python,name=modules/keyholder/files/ssh-agent-proxy
class SshAgentProxyHandler(socketserver.BaseRequestHandler):
    """This class is responsible for handling an individual connection
    to an SshAgentProxyServer."""

    def get_peer_credentials(self, sock):
        """Return the user and group name of the peer of a UNIX socket."""
        ucred = sock.getsockopt(socket.SOL_SOCKET, SO_PEERCRED, s_ucred.size)
        _, uid, gid = s_ucred.unpack(ucred)
        user = pwd.getpwuid(uid).pw_name
        groups = {grp.getgrgid(gid).gr_name}
        groups.update(g.gr_name for g in grp.getgrall() if user in g.gr_mem)
        return user, groups
```
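A possible mitigation would be to resolve only the connecting user's groups instead of enumerating the whole directory. A sketch of that idea (not the deployed fix; `get_peer_groups` is a hypothetical helper name), using `os.getgrouplist()`, available since Python 3.3, which asks NSS only for one user's memberships:

```python
import grp
import os
import pwd


def get_peer_groups(uid, gid):
    """Resolve the user name and group names for a peer's uid/gid.

    os.getgrouplist() wraps getgrouplist(3), which queries NSS for the
    groups of a single user, avoiding the full-directory scan that
    grp.getgrall() triggers on LDAP-backed hosts.
    """
    user = pwd.getpwuid(uid).pw_name
    # getgrouplist() returns the numeric gids the user belongs to,
    # always including the supplied primary gid.
    gids = os.getgrouplist(user, gid)
    return user, {grp.getgrgid(g).gr_name for g in gids}
```

In the handler above, the uid/gid would still come from `SO_PEERCRED`; only the group resolution changes.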
On labs:
```
$ time python3 -c 'import grp; grp.getgrall()'
real 0m0.939s
user 0m0.040s
sys 0m0.004s
```
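The relative cost of a full enumeration versus a per-user lookup can also be measured from Python directly. A rough benchmark sketch (assumes `os.getgrouplist()`, Python 3.3+):

```python
import grp
import os
import pwd
import timeit

user = pwd.getpwuid(os.getuid()).pw_name
gid = os.getgid()

# Full enumeration of every group (what grp.getgrall() does) versus a
# single-user membership lookup via getgrouplist(3).
full_scan = timeit.timeit(grp.getgrall, number=5)
one_user = timeit.timeit(lambda: os.getgrouplist(user, gid), number=5)
print('getgrall x5:     {:.3f}s'.format(full_scan))
print('getgrouplist x5: {:.3f}s'.format(one_user))
```

On an LDAP-backed labs host the first number should dominate; against a purely local /etc/group both are near-instant.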
The many concurrent connection requests to the keyholder (one per targeted host) end up saturating the local nslcd, so the ssh-agent proxy takes a while to relay them.
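Since cumin's burst of connections all resolve the same user, another generic mitigation would be to memoize the lookup for a short window so at most one NSS query hits nslcd per burst. A minimal TTL-cache sketch (`ttl_cache` is a hypothetical helper, not part of keyholder):

```python
import time
from functools import wraps


def ttl_cache(seconds):
    """Cache a function's results for `seconds` per argument tuple, so a
    burst of connections causes at most one expensive lookup per
    distinct (uid, gid) within the window."""
    def decorator(func):
        cache = {}

        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]  # still fresh: reuse the cached value
            value = func(*args)
            cache[args] = (now, value)
            return value
        return wrapper
    return decorator
```

The trade-off is that group membership changes would take up to `seconds` to be noticed by the proxy.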