Exception raised while executing cookbook sre.hosts.downtime
Closed, Resolved, Public

Description

While running wmf-auto-reimage-host on multiple hosts today, an exception was raised twice when the script tried to set Icinga downtimes. It happened on two separate runs, one host each: wtp2002 and wtp2003.

The reinstall itself was going fine, but the downtime step suddenly failed with spicerack.remote.RemoteError: No hosts provided.

16:13:12 | wtp2003.codfw.wmnet | Polling until a Puppet sign request appears
16:13:16 | wtp2003.codfw.wmnet | Signed Puppet cert
16:13:18 | wtp2003.codfw.wmnet | Validated host
16:13:18 | wtp2003.codfw.wmnet | Scheduled delayed downtime on Icinga
16:13:18 | wtp2003.codfw.wmnet | Started first puppet run (sit back, relax, and enjoy the wait)
START - Cookbook sre.hosts.downtime
Exception raised while executing cookbook sre.hosts.downtime:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 410, in _run
    ret = self.module.run(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/downtime.py", line 56, in run
    remote_hosts = spicerack.remote().query(args.query)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 323, in query
    return RemoteHosts(self._config, hosts, dry_run=self._dry_run)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 373, in __init__
    raise RemoteError('No hosts provided')
spicerack.remote.RemoteError: No hosts provided
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)

This resulted in Icinga alerts on IRC until I manually ran sre.hosts.downtime which worked fine.

Using wmf-auto-reimage for multiple hosts at once, which I was doing at the same time, did not show this issue.

The relevant log files are on cumin1001:

/var/log/wmf-auto-reimage/202007291532_dzahn_8935_wtp2002_codfw_wmnet_cumin.out
/var/log/wmf-auto-reimage/202007291532_dzahn_8935_wtp2002_codfw_wmnet.log

/var/log/wmf-auto-reimage/202007291533_dzahn_10043_wtp2003_codfw_wmnet_cumin.out
/var/log/wmf-auto-reimage/202007291533_dzahn_10043_wtp2003_codfw_wmnet.log

At the time of writing this the cookbooks have not ended yet and are still running.

Event Timeline

Dzahn updated the task description.

From the cookbook logs:

2020-07-29 16:15:18,960 dzahn 15835 [DEBUG puppetdb.py:320 in _execute] Queried puppetdb for '["or", ["=", "certname", "wtp2003.codfw.wmnet"]]', got '0' results.

It means that puppetdb had not yet registered the new host. I've seen this happen once today with @JMeybohm.
So either the catalog compilation for wtp servers is particularly slow, or we might have some slowdown in puppetdb. From a quick look at the grafana dashboard the queue is always small but the p99 of the latency is high, so maybe we have some cases of slow recording.

As a quick fix we could increase the sleep for the subprocess that runs the cookbook to 5 minutes. A better fix would be to add some logic to poll puppetdb until the catalog is there.
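The polling idea could be sketched roughly as below. This is only an illustration, not the actual cookbook or spicerack code: wait_for_host, HostNotRegisteredError and the query_hosts callable are hypothetical names, and query_hosts is assumed to return an empty sequence while puppetdb does not know the host yet.

```python
import time
from typing import Callable, Sequence


class HostNotRegisteredError(Exception):
    """Hypothetical error: the host never showed up before the deadline."""


def wait_for_host(
    query_hosts: Callable[[str], Sequence[str]],
    fqdn: str,
    timeout: float = 300.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Sequence[str]:
    """Poll query_hosts(fqdn) until it returns a non-empty result.

    query_hosts is an assumed adapter around whatever lookup is used
    (for example a puppetdb query); it must return an empty sequence
    when the host is not registered yet.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        hosts = query_hosts(fqdn)
        if hosts:
            # The host is known: the downtime can be scheduled now.
            return hosts
        if clock() + interval > deadline:
            # Another sleep would overshoot the deadline, so give up.
            raise HostNotRegisteredError(
                f"{fqdn} not found after {attempts} attempts"
            )
        sleep(interval)
```

Compared with a fixed 5 minute sleep, this returns as soon as the host is registered and fails with a clear error when it never appears. Since the traceback shows the query raising RemoteError on an empty result, an adapter passed as query_hosts would presumably have to catch that and return an empty list instead.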

so either catalog compilation for wtp servers is particularly slow

This seems likely to be the case. On the first run it does a git pull from deployment_server.

In some cases yesterday I had to run puppet multiple times because of some race condition (the downtime issue did not happen then, but those were also wtp* hosts).

No, I meant the catalog compilation on the puppetmaster, after which the catalog is sent to puppetdb. It's unrelated to how long the first puppet run takes on the actual host.

Ah, ok!

It happened again when running wtp2004 (separate screen window, separate script).

spicerack.remote.RemoteError: No hosts provided

Then I manually ran sre.hosts.downtime again and, unlike earlier, this now fails as well.

START - Cookbook sre.hosts.downtime
Downtiming 1 hosts and all their services for 3:00:00: wtp2004.codfw.wmnet
Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: wtp2004.codfw.wmnet
Exception raised while executing cookbook sre.hosts.downtime:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 410, in _run
    ret = self.module.run(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/downtime.py", line 67, in run
    icinga.downtime_hosts(remote_hosts.hosts, reason, duration=duration)
  File "/usr/lib/python3/dist-packages/spicerack/icinga.py", line 222, in downtime_hosts
    self._icinga_host.run_sync(*commands)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 476, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 646, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)

Edit: ACK, that is because the host is currently not in Icinga.

herron triaged this task as Medium priority. Jul 31 2020, 7:00 PM

The reimage scripts have been converted to the sre.hosts.reimage cookbook and no longer have the race condition that was present here, hence resolving. The new procedure is outlined at https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage

Volans claimed this task.