
cloudcumin is slow when targeting many cloud VMs
Closed, Resolved · Public

Description

This example Cumin command targets 131 cloud VMs:

cumin -b30 'O{project:tools}' 'id'

Running it in cloud-cumin-03.cloudinfra.eqiad1.wikimedia.cloud completes in about 7 seconds, while running it in cloudcumin1001.eqiad.wmnet completes in about 60 seconds.

(In T323484 we had to increase connect_timeout to 20 seconds in cloudcumin hosts, otherwise it would not even complete.)

While debugging this slowness with @dcaro, we uncovered two main reasons and some possible solutions.

Reasons

SSH connections from cloudcumin to a cloud VM use the so-called "double jump": they first connect to a Squid proxy at install1004.eqiad.wmnet, then to an SSH bastion host at restricted.bastion.wmcloud.org.
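
For illustration, this double jump roughly corresponds to an ssh_config along these lines. This is only a sketch (the proxy port and host patterns are assumptions), not the actual contents of /etc/cumin/ssh_config:

# Sketch only: proxy port and host patterns are assumptions.
# Hop 1: reach the bastion through the Squid HTTP CONNECT proxy.
Host restricted.bastion.wmcloud.org
    ProxyCommand nc -X connect -x install1004.eqiad.wmnet:8080 %h %p

# Hop 2: reach cloud VMs by jumping through the bastion.
Host *.wikimedia.cloud
    ProxyJump restricted.bastion.wmcloud.org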

  • In the bastion, each connection from cloudcumin starts an instance of ssh-key-ldap-lookup, and that eats a lot of CPU, causing the bastion to slow down (load average climbs to more than 10)
  • We tried using SSH multiplexing, which resolves the load problem on the bastion (ssh-key-ldap-lookup is called only once), but the shared TCP connection between cloudcumin and the bastion host is always terminated after a few seconds. We suspect the termination is caused by the Squid proxy, because we don't see this behaviour when using a multiplexed connection between our local machines and the bastion.
Possible solutions
  • Beefing up the bastion VM with more CPUs and/or optimizing the ssh-key-ldap-lookup script to use fewer resources
  • Finding out why the multiplexed connection is terminated after a few seconds
Debugging info

This is the SSH multiplexing configuration that we tried adding to /etc/cumin/ssh_config in cloudcumin1001:

ControlMaster auto
ControlPath ~/.ssh/control-%C
ControlPersist 10m

Running a manual SSH to establish the shared "Control" connection with verbose logging:

root@cloudcumin1001:~# SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -vvv -F /etc/cumin/ssh_config restricted.bastion.wmcloud.org

In another window:

fnegri@cloudcumin1001:~$ sudo cumin -b30 'O{project:tools}' 'id'

Cumin completes a bunch of hosts successfully and very quickly (40 PASS in 6 seconds), then it starts failing. The "Control" connection logs show:

debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Control master terminated unexpectedly
Shared connection to restricted.bastion.wmcloud.org closed.
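
(As a side note, the state of the shared connection can be queried from another shell with OpenSSH's control commands: -O check reports whether the control master is still alive and -O exit tears it down. Shown here only as a debugging aid, not part of the original session.)

root@cloudcumin1001:~# ssh -O check -F /etc/cumin/ssh_config restricted.bastion.wmcloud.org
root@cloudcumin1001:~# ssh -O exit -F /etc/cumin/ssh_config restricted.bastion.wmcloud.org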

When I run a similar "Control" connection from my local machine to the same bastion, Cumin succeeds even with -b 30, so I suspect (but might be wrong) that the Squid proxy is terminating the "Control" connection.

Interestingly, when running Cumin in cloudcumin1001 with -b 3 instead of -b 30, it completes successfully and the "Control" connection is not terminated. So why would Squid terminate the "Control" connection only when 30 sessions are being multiplexed?

Event Timeline

fnegri triaged this task as Medium priority. Jun 23 2023, 5:55 PM

Beefing up the bastion VM with more CPUs and/or optimizing the ssh-key-ldap-lookup script to use fewer resources

I think we can just add an if user == "root": return style statement (or something more general, for example don't look up keys from LDAP for anyone with an existing keys file) in the script. Root keys should be loaded from the file system and not from LDAP.

It seems to me that the root problem is that the lookup order cannot be configured in sshd_config. From the manpage:

If the AuthorizedKeysCommand does not successfully authorize the user, authorization falls through to the AuthorizedKeysFile.

But there doesn't seem to be a way to specify that the key files should be checked first and then the command.

At least for now we have keys only for the root user:

root@bastion-restricted-eqiad1-02:/etc/ssh/userkeys# tree
.
├── root
└── root.d
    └── cumin

So bailing out immediately in the AuthorizedKeysCommand will surely speed up the process, at least for the LDAP side of it. It will still need to spin up a Python process each time. If we want to avoid that too, we could change the AuthorizedKeysCommand to a bash wrapper that bails out if the user is root and otherwise calls the existing LDAP checker with all the arguments.
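
A minimal sketch of such a wrapper, assuming sshd passes the username as the first argument (the default when AuthorizedKeysCommand is configured without explicit tokens) and that the LDAP script lives at /usr/sbin/ssh-key-ldap-lookup (path assumed):

#!/bin/bash
# Hypothetical AuthorizedKeysCommand wrapper: skip the LDAP lookup for root,
# whose keys are already served from /etc/ssh/userkeys via AuthorizedKeysFile.
if [ "$1" = "root" ]; then
    exit 0
fi
exec /usr/sbin/ssh-key-ldap-lookup "$@"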

I checked and apparently ssh-key-ldap-lookup already bails out when looking up keys for root. One more possible option would be to use a Match block in the sshd config to set AuthorizedKeysCommand to /bin/true to skip the Python startup entirely.
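
As an illustration only (what exactly to match on is an open question, see below), such a block in sshd_config could look like:

# Sketch only: skip the LDAP lookup entirely for root logins,
# leaving the keys in /etc/ssh/userkeys (AuthorizedKeysFile) to do the work.
Match User root
    AuthorizedKeysCommand /bin/true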

The restricted bastion is a g3.cores1.ram2.disk20, so we could just increase its resources and see if it helps. It needs an OS upgrade anyway.

One more possible option would be to use a Match block in the sshd config

What would you match against?

I was also thinking we could set up a separate bastion VM that does not support LDAP at all, and is only used by cloudcumins.

So bailing out immediately in the AuthorizedKeysCommand will surely speed up the process, at least for the LDAP side of it. It will still need to spin up a Python process each time. If we want to avoid that too, we could change the AuthorizedKeysCommand to a bash wrapper that bails out if the user is root and otherwise calls the existing LDAP checker with all the arguments.

Maybe we got lucky: this behaviour changed in OpenSSH 8.1p1: https://bugzilla.mindrot.org/show_bug.cgi?id=3134

In recent OpenSSH versions, the keys file is checked first, and the AuthorizedKeysCommand is only executed if a key is not found there. Hence, upgrading the bastion to Bookworm (which ships openssh-server version 9.2p1) might fix this problem. :fingerscrossed:
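
For context, the two directives involved on the bastion presumably look something like this (paths are assumptions based on the tree output above, not the actual sshd_config):

# With OpenSSH >= 8.1, keys found in these files (e.g. the cumin root key)
# are accepted without ever invoking the command below.
AuthorizedKeysFile /etc/ssh/userkeys/%u /etc/ssh/userkeys/%u.d/cumin
AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup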

fnegri changed the task status from Open to In Progress. Sep 27 2023, 1:53 PM
fnegri claimed this task.

Change 961401 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] Add new cloud restricted bastion

https://gerrit.wikimedia.org/r/961401

Change 961402 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] Remove old cloud restricted bastion

https://gerrit.wikimedia.org/r/961402

Change 961401 merged by FNegri:

[operations/puppet@production] Add new cloud restricted bastion

https://gerrit.wikimedia.org/r/961401

restricted.bastion.wmcloud.org is now pointing to bastion-restricted-eqiad1-3.bastion.eqiad1.wikimedia.cloud and the wiki page has been updated with the new fingerprints.

This has improved things when running Cumin against many hosts: htop shows that the Python process is no longer executed. It's still not as fast as I hoped, though, and the reason seems to be that every single Cumin connection starts a Ruby process in the bastion:

102991 root        20   0     0     0     0 Z   2.0  0.0  0:00.08 ruby -ryaml -e puts YAML.load(STDIN.read)["configuration_version"]

This is used to generate the MOTD with the latest Puppet commit in /etc/update-motd.d/97-last-puppet-run. Ideally we would disable this when using the bastion as a proxy, but I'm not sure if/how this is possible.
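
One hedged way to do it, purely as a sketch: update-motd scripts are run via run-parts, which only executes files with the executable bit set, so dropping that bit on the offending script should be enough (whether/how to express this in Puppet is a separate question):

# On the bastion: stop the Puppet-version script from running at login.
chmod -x /etc/update-motd.d/97-last-puppet-run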

The Ruby command above is just one of the factors. I tried the same command (sudo cumin 'O{project:tools domain:default}' 'true') in 4 different scenarios:

  1. from cloudcumin1001, without any change to the restricted bastion: 5.76hosts/s
  2. from cloudcumin1001, removing the file 97-last-puppet-run from the restricted bastion: 10.99hosts/s
  3. from cloudcumin1001, removing the entire /etc/update-motd.d directory from the restricted bastion: 15.55hosts/s
  4. from cloud-cumin-03, which connects to the target hosts directly without using the bastion: 40.18hosts/s

Maybe increasing the number of CPUs of the restricted bastion is the way to go?

I increased the number of CPUs from 1 to 4 and the speed (with no change to the MOTD config, point 1 in my previous comment) went from 5.76hosts/s to 18.46hosts/s. This seems like a good enough result, and I will finally close this ticket as "Resolved". :)
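
For reference, with the plain OpenStack CLI a flavor change like this would look roughly as follows (the target flavor name is hypothetical, the confirm syntax depends on the client version, and the resize may well have been done through other WMCS tooling instead):

# Sketch only: resize the bastion to a 4-core flavor (flavor name assumed),
# then confirm the resize once the instance is back up.
openstack server resize --flavor g3.cores4.ram8.disk20 bastion-restricted-eqiad1-3
openstack server resize confirm bastion-restricted-eqiad1-3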

Change 961402 merged by FNegri:

[operations/puppet@production] Remove old cloud restricted bastion

https://gerrit.wikimedia.org/r/961402

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T14:13:11Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.remove_instance for instance bastion-restricted-eqiad1-02.bastion.eqiad1.wikimedia.cloud (T340241)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T14:13:17Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance bastion-restricted-eqiad1-02.bastion.eqiad1.wikimedia.cloud (T340241)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T14:16:54Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.remove_instance for instance bastion-restricted-eqiad1-02 (T340241)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-29T14:17:00Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance bastion-restricted-eqiad1-02 (T340241)