This example Cumin command targets 131 cloud VMs:
cumin -b30 'O{project:tools}' 'id'
Running it on cloud-cumin-03.cloudinfra.eqiad1.wikimedia.cloud completes in about 7 seconds, while running it on cloudcumin1001.eqiad.wmnet takes about 60 seconds.
(In T323484 we had to increase connect_timeout to 20 seconds in cloudcumin hosts, otherwise it would not even complete.)
While debugging this slowness with @dcaro, we uncovered two main reasons and some possible solutions.
Reasons
SSH connections from cloudcumin to a Cloud VM use the so-called "double jump": they first connect to a Squid proxy at install1004.eqiad.wmnet, then to an SSH bastion host at restricted.bastion.wmcloud.org.
- In the bastion, each connection from cloudcumin starts an instance of ssh-key-ldap-lookup, and that eats a lot of CPU, causing the bastion to slow down (load average climbs to more than 10)
- We tried using SSH multiplexing, which resolves the load problem on the bastion (ssh-key-ldap-lookup is called only once), but the shared TCP connection between cloudcumin and the bastion host is always terminated after a few seconds. We suspect the termination is caused by the Squid proxy, because we don't see this behaviour when using a multiplexed connection between our local machines and the bastion.
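To compare bastion load with and without multiplexing, something like the following could be run on the bastion while Cumin is running. It is only a sketch: the function name `sample_bastion_load` is made up for illustration, and it assumes a Linux host with `/proc/loadavg` and `pgrep` available.

```shell
#!/bin/bash
# Hypothetical helper: print one line with the current time, the three
# load averages, and the number of running ssh-key-ldap-lookup processes.
sample_bastion_load() {
    local load count
    # First three fields of /proc/loadavg are the 1, 5 and 15 minute averages.
    load=$(cut -d' ' -f1-3 /proc/loadavg)
    # pgrep -c prints the match count; it exits non-zero when there are none.
    count=$(pgrep -c -f ssh-key-ldap-lookup) || true
    printf '%s load=[%s] lookups=%s\n' "$(date +%T)" "$load" "${count:-0}"
}

# Example: sample once per second until interrupted.
# while true; do sample_bastion_load; sleep 1; done
```

If the reasoning above holds, the lookup count should spike with every batch when multiplexing is off and stay near one when it is on.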
Possible solutions
- Beefing up the bastion VM with more CPUs and/or optimizing the ssh-key-ldap-lookup script to use fewer resources
- Finding out why the multiplexed connection is terminated after a few seconds
Debugging info
This is the SSH multiplexing configuration that we tried adding to /etc/cumin/ssh_config in cloudcumin1001:
ControlMaster auto
ControlPath ~/.ssh/control-%C
ControlPersist 10m
Running a manual SSH to establish the shared "Control" connection with verbose logging:
root@cloudcumin1001:~# SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -vvv -F /etc/cumin/ssh_config restricted.bastion.wmcloud.org
In another window:
fnegri@cloudcumin1001:~$ sudo cumin -b30 'O{project:tools}' 'id'
Cumin completes a bunch of hosts successfully and very quickly (40 PASS in 6 seconds), then it starts failing. The "Control" connection logs show:
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Control master terminated unexpectedly
Shared connection to restricted.bastion.wmcloud.org closed.
With a similar "Control" connection from my local machine to the same bastion, Cumin succeeds even with -b 30, so I suspect (but might be wrong) that the Squid proxy is terminating the "Control" connection.
Interestingly, when Cumin runs on cloudcumin1001 with -b 3 instead of -b 30, it completes successfully and the "Control" connection is not terminated. So why would Squid terminate the "Control" connection only when 30 sessions are being multiplexed?
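To narrow down when the "Control" connection dies for different -b values, one option is to poll the master with `ssh -O check` while Cumin runs in another window. This is a rough sketch, not something we have run: the function name `watch_master` is made up for illustration, and it assumes the check is run as the same user that owns the control socket, so that `ControlPath ~/.ssh/control-%C` resolves to the same file.

```shell
#!/bin/bash
# Hypothetical helper: run a liveness check repeatedly and report how long
# it kept succeeding before the first failure.
# Usage: watch_master INTERVAL_SECONDS CHECK_COMMAND [ARGS...]
watch_master() {
    local interval=$1
    shift
    local start=$SECONDS
    # Keep polling for as long as the check command exits with status 0.
    while "$@" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "control master gone after $((SECONDS - start))s"
}

# Example (on cloudcumin1001, with the "Control" connection already open):
# watch_master 1 ssh -F /etc/cumin/ssh_config -O check restricted.bastion.wmcloud.org
```

Repeating this with -b 3, -b 10 and -b 30 would show whether the lifetime of the shared connection depends on the number of multiplexed sessions or simply on elapsed time.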