[wmcs][alerting] Allow volunteer admins silencing alerts from cloudvps/toolforge/paws/quarry
Open, Stalled, High, Public

Description

From:

We need to allow volunteer admins (toolforge root, etc.) to silence alerts for services that WMCS gets paged for, when they are doing maintenance and similar controlled actions.

This task is to find a working implementation with the o11y team.

Some ideas:

  1. Creating an "alert puller" that imports alerts directly from the metricsinfra alertmanager, so that any alert silenced on metricsinfra (the VM) is also silenced on the prod alertmanager.

  2. Creating a service that allows silencing the "cloud" alerts from within the cloud realm.

  3. Paging directly from within metricsinfra, adding the API keys or similar there (i.e. having a fully duplicated stack).

Current working idea:

20221121

Some notes from a discussion between @fgiunchedi and @dcaro re: the above solutions:
The simplest approach off the bat seems to @fgiunchedi to be approach #3, IOW metricsinfra AM can send pages and the two AMs are effectively siloed/isolated.

There are however a few considerations in order and things to figure out:

  • There can be only one Splunk On-Call API key for the Prometheus integration. In other words, production and metricsinfra AM would share the same key. We need to investigate what this key can effectively do. Filippo's understanding is that the key should be able to create/resolve incidents only. Therefore, if something happens to the key (e.g. a leak), the blast radius is small and rotation is simple.
  • We'd like to keep the "single pane of glass", i.e. look at production and metricsinfra alerts from a single Karma UI (production's). This is possible in the sense that Karma supports reading/writing to multiple alertmanagers.

The above will enable paging alerts for metricsinfra AM (i.e. https://prometheus-alerts.wmcloud.org/) and allow WMCS folks to look at said alerts from a single place (https://alerts.w.o) while still allowing volunteer admins to manage/silence their alerts.

Details

Other Assignee
fgiunchedi

Event Timeline

dcaro triaged this task as High priority. Oct 17 2022, 2:48 PM
dcaro created this task.

A quick recap to make sure I understand the problem statement:

  • a wmcs service (e.g. cloudvps, toolforge, paws) undergoes maintenance not initiated by wmcs team but by the volunteer admins
  • wmcs team gets paged for legitimate alerts due to the maintenance
  • hilarity and confusion ensue
  • a way to pre-emptively silence paging alerts by volunteer admins is needed

Is that a fair representation of the problem @dcaro ?

A few followup questions, do we have a way to group said volunteer admins e.g. in ldap? any particular group?

Is that a fair representation of the problem @dcaro ?

Yes :)

do we have a way to group said volunteer admins e.g. in ldap?

Yes, there's ldap groups and similar, but depends on the projects.

For Paws and Quarry it's simple, as it's just a matter of belonging to the group project-paws or project-quarry.

For toolforge it is a bit more complex, as there's "being part of toolforge cloud vps" (any toolforge user) and "being toolforge root" (details here https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#What_makes_a_root/Giving_root_access).
I think though we can just use 'tools.admin' (it's not exactly the same, but any root should be part of that group too).

I think though we can just use 'tools.admin' (it's not exactly the same, but any root should be part of that group too).

Yeah. Just keep in mind that it's under ou=servicegroups, not ou=groups.

Also I think that currently at least almost all of us active volunteer admins (me, Lucas, TNT) are also in cn=nda, which includes powers to silence prod AM alerts (but not Icinga).
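To make the group layout discussed above concrete, here is a minimal sketch of a per-project permission check. The base DN, the `SILENCER_GROUPS` mapping and the `can_silence` helper are all hypothetical names for illustration; only the group names and the `ou=groups` / `ou=servicegroups` split come from this thread.

```python
from typing import Iterable

# Hypothetical base DN; the real one is not given in this thread.
BASE_DN = "dc=example,dc=org"

# LDAP group that would grant silencing rights for each project.
SILENCER_GROUPS = {
    "paws": f"cn=project-paws,ou=groups,{BASE_DN}",
    "quarry": f"cn=project-quarry,ou=groups,{BASE_DN}",
    # tools.admin lives under ou=servicegroups, not ou=groups.
    "toolforge": f"cn=tools.admin,ou=servicegroups,{BASE_DN}",
}


def can_silence(project: str, user_group_dns: Iterable[str]) -> bool:
    """Return True if one of the user's group DNs grants rights on `project`."""
    group = SILENCER_GROUPS.get(project)
    if group is None:
        return False
    # Compare case-insensitively, since DNs are commonly treated that way.
    return group.lower() in {dn.lower() for dn in user_group_dns}
```

A toolforge root whose memberships include the `tools.admin` service group would pass `can_silence("toolforge", ...)`, while a plain toolforge user would not.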

which includes powers to silence prod AM alerts (but not Icinga).

That is actually awesome! You use the UI for that or there's an api/cli you can use? (thinking on how to integrate with cookbooks for example)

Unfortunately, it does not include all the admins for paws/quarry. Not sure if we want to require an NDA in order to be a quarry/paws/toolforge admin.

which includes powers to silence prod AM alerts (but not Icinga).

That is actually awesome! You use the UI for that or there's an api/cli you can use? (thinking on how to integrate with cookbooks for example)

The UI, which is behind CAS so not really scriptable.

I was working on https://gitlab.wikimedia.org/repos/cloud/wmcs/amimporter on Friday (and spilled over a bit into the weekend, I wanted to play with socks proxies and asyncio xd).

It's almost ready (missing some more testing, and maybe adding/changing labels to control what becomes a page), but that will allow us to:

  • Run that script in prod (so we can monitor/page when it's down)
  • It adds/removes some selected alerts
  • It adds/removes silences for them if they are silenced in the original alertmanager
  • Leaves control of what to page on to the WMCS team
  • Makes all the alerts that WMCS needs to act on go to the same alertmanager
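The behaviour listed above can be sketched as a pure "plan" step that compares the two sides. This is not the amimporter code, just an illustration under assumptions: `Alert`, `SyncPlan` and `plan_sync` are hypothetical names, alerts are identified by a fingerprint string, and `dest` holds only alerts that were previously imported.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Alert:
    fingerprint: str       # stable identifier of the alert
    name: str              # alert name, used to select what gets imported
    silenced: bool = False


@dataclass
class SyncPlan:
    add_alerts: List[str] = field(default_factory=list)
    remove_alerts: List[str] = field(default_factory=list)
    add_silences: List[str] = field(default_factory=list)
    remove_silences: List[str] = field(default_factory=list)


def plan_sync(source: Iterable[Alert], dest: Iterable[Alert],
              selected_names: Set[str]) -> SyncPlan:
    """Compute what to change on the destination to mirror the source."""
    wanted = {a.fingerprint: a for a in source if a.name in selected_names}
    current = {a.fingerprint: a for a in dest}
    plan = SyncPlan()
    for fp, alert in wanted.items():
        if fp not in current:
            plan.add_alerts.append(fp)
        dest_silenced = fp in current and current[fp].silenced
        if alert.silenced and not dest_silenced:
            plan.add_silences.append(fp)
        elif not alert.silenced and dest_silenced:
            plan.remove_silences.append(fp)
    # Previously imported alerts that are gone (or deselected) upstream.
    plan.remove_alerts = [fp for fp in current if fp not in wanted]
    return plan
```

Note that the sync is one-way (source to destination), matching the limitation described below. A real tool would also need to track which silences it created itself, so that it never removes a silence someone added by hand on the destination.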

Something that it does not do:

  • Sync silences back to the original alertmanager (so comments on the destination don't show up in the original), though that could be added.

I've been testing it using a socks proxy and a local alertmanager running on docker, but we would need to open the two /api/v1/silences and /alerts endpoints for 'GET' on the metricsinfra alertmanager to be able to scrape them.
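As a rough illustration of the read-only access this needs, the sketch below builds the two URLs and unwraps the response. It assumes the `/alerts` endpoint sits under the same `/api/v1` prefix and that responses use a `{"status": "success", "data": [...]}` envelope; both are assumptions to check against the deployed alertmanager version. `endpoint_urls`, `unwrap` and `fetch` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Dict, List, Union


def endpoint_urls(base_url: str) -> Dict[str, str]:
    """Build the two read-only URLs the importer would need."""
    base = base_url.rstrip("/")
    return {
        "silences": f"{base}/api/v1/silences",
        "alerts": f"{base}/api/v1/alerts",
    }


def unwrap(payload: Union[str, bytes]) -> List[dict]:
    """Extract the item list from the assumed response envelope."""
    doc = json.loads(payload)
    if doc.get("status") != "success":
        raise ValueError(f"unexpected response status: {doc.get('status')!r}")
    return doc.get("data", [])


def fetch(url: str, timeout: float = 10.0) -> List[dict]:
    """Issue a plain GET; no write access to the source is required."""
    request = urllib.request.Request(url, method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return unwrap(response.read())
```

Since only `GET` is used, the metricsinfra side can expose these two paths without granting any ability to create or delete silences.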

That would effectively decouple the two instances: specific project owners could silence their own alerts, while the WMCS team would still see the alerts on the prod alertmanager and get paged by them if needed.

Let me know if you have any concerns/questions/ideas

btw, this is related to T285055 (though neither blocks the other); it's closed already :)

@fgiunchedi I have done a last round of fixes for the importer, I'm running it locally and will run it for a while, but your reviews here would be appreciated:
https://gitlab.wikimedia.org/repos/cloud/wmcs/amimporter/-/merge_requests/1

It needs access to the destination alertmanager api, where should I run it from? cloudmetrics? (do those hosts have access to alertmanager api?)

As an update to this, I started working on a way to avoid having the cloud alertmanager selected as one of the default alertmanagers when creating a new silence:
https://github.com/prymitive/karma/pull/5086

This unblocks allowing basic auth to be used between prod karma and metricsinfra alertmanager.

dcaro updated the task description.
dcaro changed the task status from Open to In Progress. Oct 12 2023, 9:55 AM
dcaro moved this task from To refine to Doing on the User-dcaro board.

Change 965475 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] metricsinfra.alertmanager: add victorops and paging route

https://gerrit.wikimedia.org/r/965475

dcaro changed the task status from In Progress to Stalled. Oct 20 2023, 8:10 AM
dcaro moved this task from Doing to Blocked on the User-dcaro board.