
spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are: Another puppet instance is already running and the waitforlock setting is set to 0; exiting
Open, Medium, Public

Description

Not sure if this is a bug in Spicerack or in our cookbooks, hence tagging both. I just had wmcs.vps.refresh_puppet_certs crash with:

Generating a new Puppet certificate on 1 hosts: toolsbeta-proxy-5.toolsbeta.eqiad1.wikimedia.cloud
PASS |█████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:04<00:00,  4.00s/hosts]
FAIL |                                                                                         |   0% (0/1) [00:04<?, ?hosts/s]
Exception raised while executing cookbook wmcs.vps.create_instance_with_prefix:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 250, in _run
    raw_ret = runner.run()
  File "/srv/deployment/wmcs-cookbooks/wmcs_libs/common.py", line 751, in _wrapped_run
    return object.__getattribute__(self, __name)(*args, **kwargs)
  File "/srv/deployment/wmcs-cookbooks/cookbooks/wmcs/vps/create_instance_with_prefix.py", line 278, in run
    self.create_instance()
  File "/srv/deployment/wmcs-cookbooks/cookbooks/wmcs/vps/create_instance_with_prefix.py", line 384, in create_instance
    refresh_puppet_certs_cookbook.get_runner(
  File "/srv/deployment/wmcs-cookbooks/wmcs_libs/common.py", line 751, in _wrapped_run
    return object.__getattribute__(self, __name)(*args, **kwargs)
  File "/srv/deployment/wmcs-cookbooks/cookbooks/wmcs/vps/refresh_puppet_certs.py", line 153, in run
    _refresh_cert(spicerack=self.spicerack, remote_host=remote_host)
  File "/srv/deployment/wmcs-cookbooks/cookbooks/wmcs/vps/refresh_puppet_certs.py", line 62, in _refresh_cert
    cert_fingerprint = node_to_bootstrap.regenerate_certificate()[fqdn]
  File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 294, in regenerate_certificate
    raise PuppetHostsError(
spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are:
toolsbeta-proxy-5.toolsbeta.eqiad1.wikimedia.cloud: Error: Another puppet instance is already running and the waitforlock setting is set to 0; exiting
END (FAIL) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=99) with prefix 'toolsbeta-proxy'

Event Timeline

Restricted Application added a subscriber: Aklapper.

Yeah, it's clearly a race condition that could be solved in either place (cookbook or Spicerack); no strong opinion.
The problem is that, judging from the current code, Puppet does not seem to return an exit code that would identify this failure, and parsing the output is brittle.
We could add an @retry with a few attempts, or check the Puppet lock file on error.
For context, certificate regeneration is seldom run in production; is it run more frequently in WMCS?
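
To make the retry idea concrete, here is a minimal sketch, not a proposed patch. It assumes the lock conflict is transient and clears within a few attempts. The names `regenerate_with_retry`, `LOCK_MESSAGE`, and the local `PuppetHostsError` class are hypothetical stand-ins so the snippet runs without Spicerack installed; real code would catch `spicerack.puppet.PuppetHostsError` and would probably use whatever `@retry` decorator is already available rather than a hand-written loop. It also matches on the message text from the traceback, which is exactly the brittle parsing mentioned above, so treat that check as a placeholder.

```python
import time
from typing import Callable, Dict


class PuppetHostsError(Exception):
    """Hypothetical stand-in for spicerack.puppet.PuppetHostsError."""


# Substring taken from the error message in the task description.
LOCK_MESSAGE = "Another puppet instance is already running"


def regenerate_with_retry(
    regenerate: Callable[[], Dict[str, str]],
    tries: int = 3,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Call regenerate(), retrying only when the Puppet lock error is reported."""
    for attempt in range(1, tries + 1):
        try:
            return regenerate()
        except PuppetHostsError as error:
            # Re-raise unrelated failures immediately, and the lock error
            # once the last attempt has been used up.
            if LOCK_MESSAGE not in str(error) or attempt == tries:
                raise
            sleep(delay)
    raise ValueError("tries must be at least 1")
```

In the cookbook this would wrap the existing call, for example `regenerate_with_retry(node_to_bootstrap.regenerate_certificate)[fqdn]`. The `tries` and `delay` defaults are arbitrary and would need tuning against how long a concurrent Puppet run typically holds the lock.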

For context, certificate regeneration is seldom run in production; is it run more frequently in WMCS?

Yes. All Cloud VPS instances are initially provisioned to use a central Puppet server. Some projects (including most of those managed by the WMCS team) then have a project Puppet server that can store secrets, access a local PuppetDB instance, etc. Switching from the central server to a project one involves refreshing the certificates, because the different servers have different CAs.

Volans triaged this task as Medium priority.