Updating Scap on beta cluster hosts with cumin fails
Closed, Resolved (Public)

Description

I'm trying to do a Scap release, and as part of that running cumin on deployment-deploy01, but this fails.

deployment-imagescaler01.deployment-prep.eqiad.wmflabs has a host key not in my SSH known hosts file. I don't know where to find the new key reliably. Also, why has it changed? This worked when I did the previous Scap release a few weeks ago.

Am I just doing something wrong?

Event Timeline

I'm trying to do a Scap release, and as part of that running cumin on deployment-deploy01, but this fails.

Should be running on deployment-cumin02.

Ran sudo cumin -p90 -b20 -s1 'O{project:deployment-prep}' 'hostname' on deployment-cumin02 and got a similar problem:

===== NODE GROUP =====
(2) deployment-imagescaler01.deployment-prep.eqiad.wmflabs,deployment-sentry01.deployment-prep.eqiad.wmflabs
----- OUTPUT of 'hostname' -----
Permission denied (publickey).

Not sure where cumin holds known hosts.

AFAIK, deployment-imagescaler01 hasn't been touched in a good while; however, I get a message about a new hostkey from my local machine as well, FWIW. Unsure if this is related to T244642.

Volans added subscribers: Andrew, Volans.

A few days ago the global WMCS cumin public key was rotated (see [1] and [2]). This per se has nothing to do with the cumin installation in deployment-prep.

But on the two hosts that are failing, puppet has been disabled for a very long time:

The last Puppet run was at Wed Jan 15 16:11:12 UTC 2020 (209186 minutes ago).
The last Puppet run was at Sun Mar 29 21:43:46 UTC 2020 (102303 minutes ago).

This means that the key rotation would not have been applied by puppet on these hosts. Instead, the /etc/ssh/userkeys/root.d/cumin file was modified on June 6th on both hosts; my guess is that it was modified manually.
The normal content of that file in deployment-prep has two entries, one to auth the global WMCS cumin and another to auth the local deployment-prep cumin installation. The current version of that file on both hosts has only one entry, and the local deployment-prep cumin is not authorized.

CC @Andrew to confirm if the above assumptions are correct.

The best way to fix the issue would be to re-enable puppet. If that's not possible, you can look at the content of that file on other hosts in deployment-prep and, for now, fix it manually.
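Before and after a manual fix, a quick check along these lines can confirm whether a host's file has both entries. This is only a minimal sketch: the `count_key_entries` helper is hypothetical, and it assumes the file holds one key per line in authorized_keys style, with blank lines and `#` comment lines ignored.

```sh
# Hypothetical helper: count the key entries in an authorized_keys-style file.
# Assumption: one key per line; blank lines and '#' comment lines are skipped.
count_key_entries() {
    grep -c -v -E '^[[:space:]]*(#|$)' "$1"
}

# Example, to be run on the host being checked (reading the file may need root):
#   count_key_entries /etc/ssh/userkeys/root.d/cumin
```

Based on the description above, a healthy deployment-prep host should report two entries and the two failing hosts only one.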

[1] https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/+/3ba6e7b76e91377d2b298b0246db9128f362b403
[2] https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/+/63f7be2a9afa3715d5ff3f4959941549a1f20266

@Volans is correct that I updated the cumin key via scp. I did that because puppet (which typically would have updated cumin) is broken on that instance:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find resource 'Package[python-thumbor-wikimedia]' in parameter 'require' (file: /etc/puppet/modules/thumbor/manifests/init.pp, line: 97) on node deployment-imagescaler01.deployment-prep.eqiad.wmflabs

Presumably one could add the cumin host key by hand.

I'm not at all clear why any of the above would have changed the ssh host key, though... does scap use cumin? Did tyler when he did whatever he did to get "I get a message about a new hostkey from my local machine as well, FWIW"?

I reported my problem badly and confusingly, sorry.

As part of making a Scap release, I need to test Scap on the beta cluster. One step in that is to upgrade it on all hosts where it's installed, using a command like this:

~~~sh
sudo cumin 'O{project:deployment-prep}' 'command -v scap && echo apt-get install -y --allow-downgrades scap=3.15.0-1+0~20200608164257.92~1.gbpfff362 || echo "no scap"'
~~~

This is the command that failed. Note that Scap doesn't use cumin itself.

I just ran the command again, and again it failed. The cumin output is difficult to capture well from a terminal (long lines, many lines, etc), but the error I see is:

~~~
ssh: Could not resolve hostname deployment-logstash02.deployment-prep.eqiad.wmflabs
~~~

Indeed, that host doesn't seem to be in DNS.

Is DNS wrong? Is cumin badly configured? Am I doing something wrong?

The host key change error message has gone away in the meantime. I'll retitle the task.

LarsWirzenius renamed this task from SSH host key for deployment-imagescaler01.deployment-prep.eqiad.wmflabs changed? to Updating Scap on beta cluster hosts with cumin fails. Jun 18 2020, 10:27 AM

This looks like it's something interesting! The project contains two logstash hosts: deployment-logstash2 and deployment-logstash03; cumin is generalizing that to deployment-logstash[02-03] which is obviously not right.

The short story here is: clustershell (and, hence, cumin) can't cope with different levels of zero-padding in hostnames. No short-term fix is coming for this, so I suggest rebuilding one or both of those hosts with a consistent naming scheme.
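To illustrate why the folded form goes wrong, here is a minimal sketch of expanding a range with one fixed padding width. It is not clustershell's actual code; `expand_padded` is a hypothetical helper that only mimics the effect described above.

```sh
# Hypothetical helper: expand PREFIX[FIRST-LAST] using a single zero-padding
# width for every member, which is the behaviour described in this comment.
expand_padded() {
    prefix=$1
    first=$2
    last=$3
    width=$4
    i=$first
    while [ "$i" -le "$last" ]; do
        # Every number is printed with the same width, so 2 becomes 02.
        printf "%s%0${width}d\n" "$prefix" "$i"
        i=$((i + 1))
    done
}

# deployment-logstash[02-03] expands to names ending in 02 and 03:
expand_padded deployment-logstash 2 3 2
```

The real host is named deployment-logstash2, so the padded name deployment-logstash02 is never found in DNS.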

I tried the following just now, to see if cumin works for me now.

~~~sh
sudo cumin 'O{project:deployment-prep}' hostname
~~~

I got errors:

~~~
ssh: Could not resolve hostname deployment-logstash02.deployment-prep.eqiad.wmflabs: Name or service not known
ssh: Could not resolve hostname deployment-perfapt01.deployment-prep.eqiad.wmflabs: Name or service not known
ssh: Could not resolve hostname deployment-docker-mobileapps01.deployment-prep.eqiad.wmflabs: Name or service not known
~~~

Is this still due to "can't cope with different levels of zero-padding in hostnames", or is there something I can do to work around this?

Context is still that I need to try a few things in the beta cluster to do a scap release. Help would be appreciated.

This is still happening. The "sudo cumin 'O{project:deployment-prep}' hostname" command fails. I need to run another command to test Scap for making a release, but that won't work either if the simpler command doesn't.

When cumin tells me a name or service is not known, sometimes ping finds the host without the leading zero in the name. For example, cumin fails on deployment-logstash02.deployment-prep.eqiad.wmflabs, but ping deployment-logstash2.deployment-prep.eqiad.wmflabs works. However, this is not always true.
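Checking the unpadded variant of each failing name can be scripted. The following is a minimal sketch; `unpadded_name` is a hypothetical helper that strips leading zeros from the number at the end of the first hostname label. As noted, the unpadded name does not always exist, so the result still needs to be looked up.

```sh
# Hypothetical helper: drop leading zeros from the trailing number of the
# first label, e.g. deployment-logstash02.example -> deployment-logstash2.example.
# Names without a zero-padded number are printed unchanged.
unpadded_name() {
    printf '%s\n' "$1" | sed -E 's/^([^.]*[^0-9.])0+([0-9]+)(\.|$)/\1\2\3/'
}

# Example: see whether the unpadded variant resolves (getent prints nothing
# and returns non-zero if it does not):
#   getent hosts "$(unpadded_name deployment-logstash02.deployment-prep.eqiad.wmflabs)"
```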

I tried adding an ssh config to alias the 02 name to just 2, but that didn't work.

I can't figure out from cumin's documentation how to have a selector expression that excludes the hosts by name, either.

Workaround: exclude the problematic hosts:

sudo cumin 'O{project:deployment-prep} and not D{deployment-logstash02.deployment-prep.eqiad.wmflabs,deployment-docker-mobileapps01.deployment-prep.eqiad.wmflabs,deployment-imagescaler01.deployment-prep.eqiad.wmflabs}' true
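Since the list of hosts to exclude may change, the query string can be assembled from a list instead of being edited by hand. This is a minimal sketch; `build_query` is a hypothetical helper, and it only reuses the `O{...} and not D{...}` form shown in the command above.

```sh
# Hypothetical helper: build a cumin query that selects a project and
# excludes the given hosts. Usage: build_query PROJECT [HOST...]
build_query() {
    project=$1
    shift
    if [ "$#" -eq 0 ]; then
        # Nothing to exclude: select the whole project.
        printf 'O{project:%s}\n' "$project"
        return
    fi
    # Join the remaining arguments with commas inside a subshell,
    # so the caller's IFS is left untouched.
    excluded=$(IFS=,; printf '%s' "$*")
    printf 'O{project:%s} and not D{%s}\n' "$project" "$excluded"
}

# Example:
#   sudo cumin "$(build_query deployment-prep \
#       deployment-logstash02.deployment-prep.eqiad.wmflabs \
#       deployment-imagescaler01.deployment-prep.eqiad.wmflabs)" true
```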

Triaging as low, as there's a documented workaround.

LarsWirzenius claimed this task.

The https://wikitech.wikimedia.org/wiki/Cumin page now has this example that I've used successfully:

O{project:deployment-prep} and not D{deployment-logstash02.deployment-prep.eqiad1.wikimedia.cloud,deployment-imagescaler01.deployment-prep.eqiad1.wikimedia.cloud} (exclude specific hosts)

I think this can therefore be closed.