From integration-cumin.integration.eqiad.wmflabs, cumin takes a while to execute even the simplest command. The labs project has 35 instances; a quick test with the `date` command:
```
$ date; time sudo cumin --debug --force '*' 'date'
Tue Sep 18 10:38:33 UTC 2018
35 hosts will be targeted:
xxx<snip>
===== NODE GROUP =====
(2) integration-slave-docker-[1006,1031].integration.eqiad.wmflabs
----- OUTPUT of 'date' -----
Tue Sep 18 10:39:07 UTC 2018
===== NODE GROUP =====
(16) integration-publishing.integration.eqiad.wmflabs,integration-puppetmaster01.integration.eqiad.wmflabs,integration-slave-docker-[1005,1008-1010,1012-1013,1021,1023-1024,1033].integration.eqiad.wmflabs,integration-slave-jessie-[1001,1004].integration.eqiad.wmflabs,integration-slave-jessie-android.integration.eqiad.wmflabs,jenkinstest.integration.eqiad.wmflabs
----- OUTPUT of 'date' -----
Tue Sep 18 10:39:09 UTC 2018
===== NODE GROUP =====
(11) castor02.integration.eqiad.wmflabs,integration-cumin.integration.eqiad.wmflabs,integration-r-lang-01.integration.eqiad.wmflabs,integration-slave-docker-[1007,1011,1022,1034].integration.eqiad.wmflabs,integration-slave-jessie-[1002-1003].integration.eqiad.wmflabs,saucelabs-01.integration.eqiad.wmflabs,webperformance.integration.eqiad.wmflabs
----- OUTPUT of 'date' -----
Tue Sep 18 10:39:06 UTC 2018
===== NODE GROUP =====
(1) integration-slave-docker-1030.integration.eqiad.wmflabs
----- OUTPUT of 'date' -----
Tue Sep 18 10:38:50 UTC 2018
===== NODE GROUP =====
(5) integration-slave-docker-[1014-1015,1017].integration.eqiad.wmflabs,saucelabs-[02-03].integration.eqiad.wmflabs
----- OUTPUT of 'date' -----
Tue Sep 18 10:38:39 UTC 2018
================
PASS: |█████████████████████████████████████████████████████████████████████████████████████████████| 100% (35/35) [00:35<00:00, 1.01s/hosts]
FAIL: | | 0% (0/35) [00:35<?, ?hosts/s]
100.0% (35/35) success ratio (>= 100.0% threshold) for command: 'date'.
100.0% (35/35) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
real 0m36.435s
user 0m1.056s
sys 0m0.328s
```
Some hosts reply after just 6 seconds, but most take 30+ seconds. To the best of my knowledge the slow ones are the jessie hosts.
Stopping `nslcd` resolves that, from T204681#4593029:
```
$ sudo systemctl stop nslcd
$ date; time sudo cumin --debug --force '*' 'date'
Tue Sep 18 10:59:15 UTC 2018
35 hosts will be targeted:
....<snip>
----- OUTPUT of 'date' -----
Tue Sep 18 10:59:18 UTC 2018
PASS: |█████████████████████████████████████████████████████████████████████████████████████████████| 100% (35/35) [00:01<00:00, 5.58hosts/s]
FAIL: | | 0% (0/35) [00:01<?, ?hosts/s]
100.0% (35/35) success ratio (>= 100.0% threshold) for command: 'date'.
100.0% (35/35) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
real 0m2.889s
user 0m1.024s
sys 0m0.332s
```
I tracked it down to keyholder's ssh-agent-proxy: upon a connection from a user, the proxy looks up the user and group, then retrieves all groups with `grp.getgrall()`. That requests every group from the labs LDAP, which takes a while.
```
lang=python,name=modules/keyholder/files/ssh-agent-proxy
class SshAgentProxyHandler(socketserver.BaseRequestHandler):
    """This class is responsible for handling an individual connection
    to an SshAgentProxyServer."""

    def get_peer_credentials(self, sock):
        """Return the user and group name of the peer of a UNIX socket."""
        ucred = sock.getsockopt(socket.SOL_SOCKET, SO_PEERCRED, s_ucred.size)
        _, uid, gid = s_ucred.unpack(ucred)
        user = pwd.getpwuid(uid).pw_name
        groups = {grp.getgrgid(gid).gr_name}
        groups.update(g.gr_name for g in grp.getgrall() if user in g.gr_mem)
        return user, groups
```
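A possible mitigation would be to resolve only the connecting user's groups instead of enumerating the whole directory. A sketch of that idea (not the deployed fix; `get_peer_groups` is a hypothetical helper name), using `os.getgrouplist()`, available since Python 3.3, which asks NSS only for one user's memberships:

```python
import grp
import os
import pwd


def get_peer_groups(uid, gid):
    """Resolve the user name and group names for a peer's uid/gid.

    os.getgrouplist() wraps getgrouplist(3), which queries NSS for the
    groups of a single user, avoiding the full-directory scan that
    grp.getgrall() triggers on LDAP-backed hosts.
    """
    user = pwd.getpwuid(uid).pw_name
    # getgrouplist() returns the numeric gids the user belongs to,
    # always including the supplied primary gid.
    gids = os.getgrouplist(user, gid)
    return user, {grp.getgrgid(g).gr_name for g in gids}
```

In the handler above, the uid/gid would still come from `SO_PEERCRED`; only the group resolution changes.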
On labs:
```
$ time python3 -c 'import grp; grp.getgrall()'
real 0m0.939s
user 0m0.040s
sys 0m0.004s
```
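The relative cost of a full enumeration versus a per-user lookup can also be measured from Python directly. A rough benchmark sketch (assumes `os.getgrouplist()`, Python 3.3+):

```python
import grp
import os
import pwd
import timeit

user = pwd.getpwuid(os.getuid()).pw_name
gid = os.getgid()

# Full enumeration of every group (what grp.getgrall() does) versus a
# single-user membership lookup via getgrouplist(3).
full_scan = timeit.timeit(grp.getgrall, number=5)
one_user = timeit.timeit(lambda: os.getgrouplist(user, gid), number=5)
print('getgrall x5:     {:.3f}s'.format(full_scan))
print('getgrouplist x5: {:.3f}s'.format(one_user))
```

On an LDAP-backed labs host the first number should dominate; against a purely local /etc/group both are near-instant.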
The many concurrent connection requests to the keyholder (one per targeted host) end up saturating the local nslcd, so the ssh-agent proxy takes a while to relay them.
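Since cumin's burst of connections all resolve the same user, another generic mitigation would be to memoize the lookup for a short window so at most one NSS query hits nslcd per burst. A minimal TTL-cache sketch (`ttl_cache` is a hypothetical helper, not part of keyholder):

```python
import time
from functools import wraps


def ttl_cache(seconds):
    """Cache a function's results for `seconds` per argument tuple, so a
    burst of connections causes at most one expensive lookup per
    distinct (uid, gid) within the window."""
    def decorator(func):
        cache = {}

        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]  # still fresh: reuse the cached value
            value = func(*args)
            cache[args] = (now, value)
            return value
        return wrapper
    return decorator
```

The trade-off is that group membership changes would take up to `seconds` to be noticed by the proxy.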