
cumin tries to downtime Icinga even with --no-downtime
Closed, Invalid · Public

Description

I tried to reinstall a host with cumin.

The host was icinga1001, and Icinga servers currently do not appear as monitored hosts on the other Icinga servers.

So the first attempt to reinstall (without the --no-downtime option) failed: right at the beginning it told me it could not schedule the Icinga downtime. (T202782#4698844, 201810262131_dzahn_122415_icinga1001_wikimedia_org.log)

OK, so I added --no-downtime and tried again, and this time the reinstall started. (T202782#4690228, 201810262133_dzahn_122885_icinga1001_wikimedia_org.log)

But the bug is that later in the process I still got:

21:49:05 | icinga1001.wikimedia.org | Unable to run wmf-downtime-host: Failed to icinga_downtime
ERROR:wmf-downtime-host:Unable to run wmf-downtime-host
Traceback (most recent call last):
  File "/usr/local/sbin/wmf-downtime-host", line 67, in main
    lib.icinga_downtime(args.host, user, args.phab_task_id, title='wmf-downtime-host')
  File "/usr/local/lib/python3.5/dist-packages/wmf_auto_reimage_lib.py", line 536, in icinga_downtime
    run_cumin('icinga_downtime', icinga_host, [command])
  File "/usr/local/lib/python3.5/dist-packages/wmf_auto_reimage_lib.py", line 469, in run_cumin
    raise RuntimeError('Failed to {label}'.format(label=label))
RuntimeError: Failed to icinga_downtime

It should not attempt this when --no-downtime is set, and I was only able to get to this point because I passed that option.
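For illustration, the behavior I expected from the flag can be sketched as below. This is a minimal hypothetical sketch, not the actual wmf-auto-reimage code: the flag name is taken from this report, while the step names and the `reimage` helper are invented for the example. The expectation was that --no-downtime gates both the pre- and post-reimage downtime calls.

```python
import argparse

def parse_args(argv):
    parser = argparse.ArgumentParser(prog="reimage-sketch")
    # Flag name from the report; when set, downtime scheduling is skipped.
    parser.add_argument("--no-downtime", action="store_true",
                        help="do not schedule an Icinga downtime for the host")
    parser.add_argument("host")
    return parser.parse_args(argv)

def reimage(args):
    """Return the list of steps that would run (hypothetical step names)."""
    steps = []
    if not args.no_downtime:
        steps.append("icinga_downtime(before)")
    steps.append("reinstall")
    # The report is about this second call: it ran even with --no-downtime.
    if not args.no_downtime:
        steps.append("icinga_downtime(after)")
    return steps
```

With --no-downtime set, only the reinstall step would remain in this model.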

Event Timeline

Dzahn created this task.Oct 26 2018, 10:07 PM
Restricted Application added a subscriber: Aklapper.Oct 26 2018, 10:07 PM
Dzahn triaged this task as Low priority.Oct 26 2018, 10:11 PM

Priority low because the install process continued anyway; contrary to what I first thought, it didn't fail entirely but continued after a warning.

Volans added a subscriber: Volans.Oct 26 2018, 10:42 PM

See the --new option

Actually, --new might not work either, as the host is already in PuppetDB; sorry for the wrong suggestion.
Anyway, this is somewhat unrelated to the reimage script: the real issue is that we don't monitor the other Icinga hosts from the active one. So it's really a corner case, and I'm not sure it should be fixed by hardcoding this weirdness into the reimage script.

Isn't the issue that, despite passing --no-downtime, it still tries to set a downtime?

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2018-10-27T00:00:06Z] <mutante> icinga1001 - using wmf-auto-reimage to reinstall gets stuck at initial puppet run after reboot - Still waiting for Puppet after 105.0 minutes - aborting on cumin, loggin in directly and manually running puppet (T202782 T208100)

> Isn't the issue that, despite passing --no-downtime, it still tries to set a downtime?

No. The --no-downtime option skips the downtime of the host before the reimage. The failed downtime you saw is the one after the reimage, which exists to avoid spam in the IRC channel from the newly re-imaged host. That host should have been added back to Icinga (given a forced Puppet run on the active Icinga host), but that didn't happen here because of this weird situation where the Icinga hosts don't check each other (which, by the way, we should fix).
Moreover, the Icinga downtime after the reimage is done on a best-effort basis in a subprocess, so even if it fails it doesn't affect the reimage process; at most it can cause some spam on IRC from the alarms going off.
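The best-effort pattern described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual wmf_auto_reimage_lib code: the function name, the `downtime_cmd` parameter, and the placeholder `echo` command are all invented for the example. The point is that any failure is logged and swallowed, so the caller's reimage flow continues.

```python
import logging
import subprocess

logger = logging.getLogger("reimage-sketch")

def best_effort_downtime(host, downtime_cmd=("echo", "icinga-downtime")):
    """Try to schedule an Icinga downtime for `host` (hypothetical helper).

    Failures are logged but never propagated, so a failed downtime
    cannot abort the surrounding reimage process.
    """
    command = list(downtime_cmd) + [host]
    try:
        # The real tooling runs the downtime through Cumin; `echo` is a
        # stand-in placeholder command for this sketch.
        subprocess.run(command, check=True, timeout=60)
        return True
    except (OSError, subprocess.SubprocessError) as e:
        logger.error("Unable to downtime %s: %s", host, e)
        return False
```

A failing command (non-zero exit, timeout, or missing binary) simply returns False instead of raising, which matches the "at most some IRC spam" outcome described above.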

Dzahn closed this task as Invalid.Oct 29 2018, 7:24 PM

ACK! Alright, yeah, this makes sense. Thank you. And the reason that Puppet didn't finish successfully on the first run was T208108, which is fixed now.

Closing as invalid.