
RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services?
Closed, ResolvedPublic

Description

We (Release Engineering folks including myself, Antoine, Ahmon, Tyler) have had a couple of recent situations where we brought down a service we manage for updates - Gerrit and GitLab specifically come to mind - and no one working on the task had access to schedule a downtime for the corresponding alerts, leading to unnecessary alert noise.

I went looking and found this: https://wikitech.wikimedia.org/wiki/Icinga#Scheduling_downtimes_with_a_shell_command

Is there something here we can / should file an access request for? If the realistic answer is "just coordinate with an SRE" that's fine, but I'd like to document it better for our upgrade processes.

Thanks!

cc: @Dzahn as the person I think I've most often bugged about this in the past.

Event Timeline

There are ~ 5 ways to achieve this:

a) Only with Icinga configuration - the strict way - We create a contact group for releng with the right members and use that contact group with the relevant puppetized Icinga checks (contact_group parameter).

Then contacts will have privileges to run commands (this means scheduling downtimes, disabling notifications etc) for "their" services via the Icinga web UI but not globally.

b) Only with Icinga configuration - the global way - We add releng contacts to the global Icinga privileges to run commands for _any_ service and host, like we do for SRE people.

c) With sudo and shell admin groups - we add the releng shell admin group to alert* (Icinga) hosts and grant sudo privileges to schedule downtimes from the shell

d) With cookbooks without root - we get cookbooks for non-root users (There is a ticket for that)

e) With cookbooks with sudo - we allow releng shell admins on cumin* and let them run specifically the icinga downtime cookbook as root
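
For context on what "scheduling a downtime from the shell" (option c) involves at the lowest level: a minimal sketch, assuming the Nagios-style external command format that Icinga 1.x accepts. The function name, host name, and service description below are hypothetical, and the actual tooling on the alert* hosts may wrap this differently.

```python
import time


def svc_downtime_command(host, service, duration_s, author, comment, now=None):
    """Build one SCHEDULE_SVC_DOWNTIME line in Nagios external-command format.

    Hypothetical helper for illustration only; it builds the string and does
    not write to any command file.
    """
    start = int(time.time()) if now is None else int(now)
    end = start + int(duration_s)
    fields = [
        "SCHEDULE_SVC_DOWNTIME",
        host,
        service,
        str(start),            # downtime start (Unix time)
        str(end),              # downtime end (Unix time)
        "1",                   # fixed downtime (1) rather than flexible (0)
        "0",                   # trigger_id: not triggered by another downtime
        str(int(duration_s)),  # duration in seconds
        author,
        comment,
    ]
    return "[{}] {}".format(start, ";".join(fields))


# Hypothetical host and service names, one hour of downtime.
print(svc_downtime_command("gerrit-host.example", "Gerrit HTTPS", 3600,
                           "releng", "Gerrit upgrade"))
```

Fields are separated by semicolons, so a comment containing a semicolon would need to be sanitized before use.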

In addition to the options @Dzahn mentioned, SRE Observability may want to weigh in, as I know they have been working on a similar request with alertmanager, which may be the future of alerting and multitenancy (but I could be very wrong). See T281454. Adding some key people for awareness: @lmata @fgiunchedi

For the short term, for speed of deployment, a) or b) are probably more realistic.

Thank you @Dzahn and @jcrespo, I agree short term a) or b) sound good to me and likely the way to go. Perhaps b) at this point since with alertmanager we're going towards that direction anyways.

I don't have a strong opinion tbh on which of c) through e) is better if downtime scheduling from the shell is desired.

Unless there are objections, let's go with (b). Do you need command-line access, or is the web interface fine, @brennen?

fgiunchedi triaged this task as Medium priority. Aug 30 2021, 8:09 AM

Web interface should be just fine. Thanks!

Change 715735 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: add dancy,thcipriani,hashar to icinga authorized service/host

https://gerrit.wikimedia.org/r/715735

Change 715735 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: add dancy,thcipriani,hashar to icinga authorized service/host

https://gerrit.wikimedia.org/r/715735

fgiunchedi claimed this task.

Optimistically resolving! Feel free to reopen.

Confirmed working for a couple of us, thanks again.