
cumin tries to downtime Icinga even with --no-downtime
Closed, Invalid · Public

Description

I tried to reinstall a host with cumin.

The host was icinga1001, and Icinga servers currently do not appear as monitored hosts on the other Icinga servers.

So the first attempt to reinstall (without the --no-downtime option) failed: right at the beginning it told me it could not schedule the Icinga downtime. (T202782#4698844, 201810262131_dzahn_122415_icinga1001_wikimedia_org.log)

OK, so I added --no-downtime and tried again, and this time the reinstall started. (T202782#4690228, 201810262133_dzahn_122885_icinga1001_wikimedia_org.log)

But the bug is that later in the process I still got:

21:49:05 | icinga1001.wikimedia.org | Unable to run wmf-downtime-host: Failed to icinga_downtime
ERROR:wmf-downtime-host:Unable to run wmf-downtime-host
Traceback (most recent call last):
  File "/usr/local/sbin/wmf-downtime-host", line 67, in main
    lib.icinga_downtime(args.host, user, args.phab_task_id, title='wmf-downtime-host')
  File "/usr/local/lib/python3.5/dist-packages/wmf_auto_reimage_lib.py", line 536, in icinga_downtime
    run_cumin('icinga_downtime', icinga_host, [command])
  File "/usr/local/lib/python3.5/dist-packages/wmf_auto_reimage_lib.py", line 469, in run_cumin
    raise RuntimeError('Failed to {label}'.format(label=label))
RuntimeError: Failed to icinga_downtime

It should not attempt this when --no-downtime is set, and I was only able to get to this point because I passed that option.
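For illustration, the behavior I expected from the flag can be sketched as below. This is a minimal hypothetical sketch, not the actual wmf-auto-reimage code: the flag name is taken from this report, while the step names and the `reimage` helper are invented for the example. The expectation was that --no-downtime gates both the pre- and post-reimage downtime calls.

```python
import argparse

def parse_args(argv):
    parser = argparse.ArgumentParser(prog="reimage-sketch")
    # Flag name from the report; when set, downtime scheduling is skipped.
    parser.add_argument("--no-downtime", action="store_true",
                        help="do not schedule an Icinga downtime for the host")
    parser.add_argument("host")
    return parser.parse_args(argv)

def reimage(args):
    """Return the list of steps that would run (hypothetical step names)."""
    steps = []
    if not args.no_downtime:
        steps.append("icinga_downtime(before)")
    steps.append("reinstall")
    # The report is about this second call: it ran even with --no-downtime.
    if not args.no_downtime:
        steps.append("icinga_downtime(after)")
    return steps
```

With --no-downtime set, only the reinstall step would remain in this model.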

Event Timeline

Dzahn created this task.Oct 26 2018, 10:07 PM
Restricted Application added a subscriber: Aklapper.Oct 26 2018, 10:07 PM
Dzahn triaged this task as Low priority.Oct 26 2018, 10:11 PM

Priority low because the install process continued anyway; contrary to what I first thought, it didn't fail entirely but continued after a warning.

Volans added a subscriber: Volans.Oct 26 2018, 10:42 PM

See the --new option

Actually, --new might not work either, as the host is already in PuppetDB; sorry for the wrong suggestion.
Anyway, this is somewhat unrelated to the reimage script: the real issue is that we don't monitor the other Icinga hosts from the active one. So it's really a corner case, and I'm not sure it should be fixed by hardcoding this weirdness into the reimage script.

Isn't the issue that, despite passing --no-downtime, it still tries to set a downtime?

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2018-10-27T00:00:06Z] <mutante> icinga1001 - using wmf-auto-reimage to reinstall gets stuck at initial puppet run after reboot - Still waiting for Puppet after 105.0 minutes - aborting on cumin, loggin in directly and manually running puppet (T202782 T208100)

> Isn't the issue that, despite passing --no-downtime, it still tries to set a downtime?

No. The --no-downtime option skips the downtime of the host before the reimage. The failed downtime you saw is the one after the reimage, which exists to avoid spam in the IRC channel from the newly re-imaged host. That host should have been added back to Icinga (given a forced Puppet run on the active Icinga host), but that didn't happen here because of this weird situation where the Icinga hosts don't check each other (which, by the way, we should fix).
Moreover, the Icinga downtime after the reimage is done on a best-effort basis in a subprocess, so even if it fails it doesn't affect the reimage process; at most it can cause some spam on IRC from the alarms going off.
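The best-effort pattern described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual wmf_auto_reimage_lib code: the function name, the `downtime_cmd` parameter, and the placeholder `echo` command are all invented for the example. The point is that any failure is logged and swallowed, so the caller's reimage flow continues.

```python
import logging
import subprocess

logger = logging.getLogger("reimage-sketch")

def best_effort_downtime(host, downtime_cmd=("echo", "icinga-downtime")):
    """Try to schedule an Icinga downtime for `host` (hypothetical helper).

    Failures are logged but never propagated, so a failed downtime
    cannot abort the surrounding reimage process.
    """
    command = list(downtime_cmd) + [host]
    try:
        # The real tooling runs the downtime through Cumin; `echo` is a
        # stand-in placeholder command for this sketch.
        subprocess.run(command, check=True, timeout=60)
        return True
    except (OSError, subprocess.SubprocessError) as e:
        logger.error("Unable to downtime %s: %s", host, e)
        return False
```

A failing command (non-zero exit, timeout, or missing binary) simply returns False instead of raising, which matches the "at most some IRC spam" outcome described above.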

Dzahn closed this task as Invalid.Oct 29 2018, 7:24 PM

ACK! Alright, yeah, this makes sense. Thank you. And the reason that Puppet didn't finish successfully on the first run was T208108, which is fixed now.

Closing as invalid.