
sre.hosts.reimage failed with "No commands provided" after completion of Puppet run
Closed, ResolvedPublic

Description

While reimaging dns4004 today (bullseye, if that matters!), after the successful Puppet run check towards the end, the cookbook failed with:

        File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
	  raw_ret = runner.run()
	File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 564, in run
	  self._check_icinga()
	File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 401, in _check_icinga
	  self.icinga_host.wait_for_optimal()
	File "/usr/lib/python3/dist-packages/spicerack/icinga.py", line 667, in wait_for_optimal
	  self.recheck_failed_services()
	File "/usr/lib/python3/dist-packages/spicerack/icinga.py", line 511, in recheck_failed_services
	  self._icinga_host.run_sync(*commands, print_output=False, print_progress_bars=False)
	File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 520, in run_sync
	  return self._execute(
	File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 717, in _execute
	  ret = worker.execute()
	File "/usr/lib/python3/dist-packages/cumin/transports/clustershell.py", line 65, in execute
	  raise WorkerError('No commands provided.')

I took a peek at /var/log/spicerack/sre/hosts/reimage/202302221438_sukhe_371147_dns4004.out on cumin2002 but I can't really see anything that stands out.

Should I just try running it again or is this related to some recent changes? Thanks!

Event Timeline

We end up here, resulting in an empty list of commands, which is then passed to _icinga_host.run_sync, which fails because there are no commands. The question now is: should there be commands, or should we just return?

def recheck_failed_services(self) -> None:
    """Force recheck of all failed associated with a set of hosts."""
    status = self.get_status()
    if status.optimal:
        return
    commands = [
        self._get_command_string("SCHEDULE_FORCED_SVC_CHECK", hostname, service_name, str(int(time.time())))
        for hostname, failed in status.failed_services.items()
        for service_name in failed
    ]
    self._icinga_host.run_sync(*commands, print_output=False, print_progress_bars=False)
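
To make the failure mode concrete, here is a minimal, self-contained reproduction; the WorkerError class and run_sync function below are illustrative stand-ins for the cumin/spicerack objects, not the real API:

class WorkerError(Exception):
    """Stand-in mirroring cumin.transports.WorkerError for illustration."""


def run_sync(*commands: str) -> None:
    """Stand-in for run_sync(): cumin's ClusterShell worker refuses an empty command list."""
    if not commands:
        raise WorkerError("No commands provided.")


# The host reported a non-optimal status, yet no *services* are failed,
# so the comprehension below yields nothing.
failed_services: dict[str, list[str]] = {}
commands = [
    f"SCHEDULE_FORCED_SVC_CHECK;{host};{service}"
    for host, failed in failed_services.items()
    for service in failed
]
run_sync(*commands)  # raises WorkerError: No commands provided.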

The timing of this was funny, as we had just released a change to the icinga module, but AFAICT this has nothing to do with it.

So my current understanding is that the call to icinga-status on alert1001 returned 1:

2023-02-22 17:24:20,140 sukhe 371147 [DEBUG clustershell.py:783 in ev_hup] node=alert1001.wikimedia.org, rc=1, command='/usr/local/bin/icinga-status -j "dns4004"'

And that's expected for a non-optimal state. Now, when spicerack gets that output, it alerts if there is no output, if the output cannot be parsed as JSON, and if any host is missing from the JSON.
Hence I think the output was sane.
This status is used inside recheck_failed_services(), which generates the commands to issue to force a recheck of the failed services; in this case it returned an empty list.
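
Roughly, the validation just described could look like this (a hedged sketch; the function name and error messages are illustrative, not spicerack's actual code):

import json

def parse_icinga_status(output: str, expected_hosts: set[str]) -> dict:
    """Illustrative sketch of the checks described above."""
    if not output:
        raise ValueError("icinga-status returned no output")
    try:
        status = json.loads(output)
    except json.JSONDecodeError as e:
        raise ValueError(f"unable to parse icinga-status output as JSON: {output}") from e
    missing = expected_hosts - status.keys()
    if missing:
        raise ValueError(f"hosts missing from icinga-status output: {missing}")
    return status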

Why? That's the question to answer...

We ended up in a case in which, for some reason that I can't explain by looking at the Icinga logs, icinga-status returned a non-optimal status for the host while there were no failed services.
AFAICT that could happen if the host itself is not UP (but I don't see evidence of this in the Icinga logs).

So my suggestions are:

  1. Force a recheck of the host too (via SCHEDULE_FORCED_HOST_CHECK) in recheck_all_services(), unconditionally.
  2. Force a recheck of the host too in recheck_failed_services() if the status is not optimal.
  3. Add a guard in recheck_failed_services() for the case where commands is empty: log.debug(status) so that we can check the status later on, and then return, because there are no commands to execute (see the sketch after this list).
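
A minimal sketch of what suggestions 2 and 3 could look like inside recheck_failed_services(), mirroring the method quoted above (SCHEDULE_FORCED_HOST_CHECK comes from the suggestion; the failed_hosts accessor and the exact _get_command_string arguments are assumptions, not the merged patch):

def recheck_failed_services(self) -> None:
    """Force recheck of all failed services associated with a set of hosts."""
    status = self.get_status()
    if status.optimal:
        return

    now = str(int(time.time()))
    commands = [
        self._get_command_string("SCHEDULE_FORCED_SVC_CHECK", hostname, service_name, now)
        for hostname, failed in status.failed_services.items()
        for service_name in failed
    ]
    # Suggestion 2: the status is not optimal, so also force a recheck of the hosts
    # themselves (failed_hosts is a hypothetical accessor for hosts that are not UP).
    commands.extend(
        self._get_command_string("SCHEDULE_FORCED_HOST_CHECK", hostname, now)
        for hostname in status.failed_hosts
    )
    # Suggestion 3: if there is still nothing to run, log the status for later
    # debugging and return instead of calling run_sync() with no commands.
    if not commands:
        logger.debug("Non-optimal status but no failed services/hosts to recheck: %s", status)
        return

    self._icinga_host.run_sync(*commands, print_output=False, print_progress_bars=False)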

Change 891494 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/software/spicerack@master] Icinga: Handle edge case where status is not optimal

https://gerrit.wikimedia.org/r/891494

@ssingh did you manually complete all the steps missed by the reimage because of this failure?

Chatting with Luca, I learned that over 5 reimages he got 2 failures too, so this looks more like a race condition, but with a higher frequency than I thought.
Re-looking at the code change, I think we missed an issue in ServiceStatus.failed: it doesn't account for the other values besides OK and CRITICAL, and hence doesn't consider a service in WARNING or UNKNOWN as failed. Also, the default value of current_state should be UNKNOWN, not OK.
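
In other words, the fix would make the status record treat any non-OK state as failed and default to UNKNOWN; a minimal sketch (a simplified stand-in, not the real spicerack.icinga type, which has more fields):

from dataclasses import dataclass

@dataclass
class ServiceStatus:
    """Simplified stand-in for spicerack's per-service status record."""
    name: str
    current_state: str = "UNKNOWN"  # default to UNKNOWN rather than OK

    @property
    def failed(self) -> bool:
        # Anything that is not OK (CRITICAL, WARNING, UNKNOWN, ...) counts as
        # failed, instead of only CRITICAL.
        return self.current_state != "OK"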

I'd keep the proposed change above anyway, as it seems sane and more logically correct.

Change 891498 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/software/spicerack@master] Icinga: Service should also be marked as failed on warnings.

https://gerrit.wikimedia.org/r/891498

Change 891498 merged by jenkins-bot:

[operations/software/spicerack@master] Icinga: Service should also be marked as failed on warnings.

https://gerrit.wikimedia.org/r/891498

Volans triaged this task as High priority. Feb 23 2023, 10:19 AM

Thanks for the quick response to this task, everyone!

@ssingh did you manually complete all the steps missed by the reimage because of this failure?

I did not run any manual steps other than checking that the Icinga status was OK and that the downtime was removed automatically (which it was); I guess the only thing left was the Netbox data step? In that case, if the Puppet run finished successfully and it was a reimage (such as this), what steps should be run manually? I assumed that since none of the data changed, I could leave it as is, but I guess I was wrong!

@ssingh just run the Netbox script https://netbox.wikimedia.org/extras/scripts/interface_automation.ImportPuppetDB/ against the reimaged host (checking the commit changes checkbox) and you should be good to go.

FYI the fix is about to be released to prod.


Thank you and done! And thanks for the quick fix!

Volans claimed this task.

This is now deployed; you can resume your reimages. Sorry for the trouble.

Change 891494 merged by jenkins-bot:

[operations/software/spicerack@master] Icinga: Handle edge case where status is not optimal

https://gerrit.wikimedia.org/r/891494

Confirming that I did a new reimage and it completed successfully. Thanks to everyone who worked on resolving this so quickly.