We need to allow volunteer admins (toolforge root, etc.) to silence alerts for services that WMCS gets paged for, when doing maintenance and similar controlled actions.
This task is to find a working implementation with the o11y team.
Some ideas:
1. Create an "alert puller" that imports alerts directly from the metricsinfra alertmanager, so that any alert silenced on metricsinfra (the VM) can also be silenced on the prod alertmanager.
2. Create a service that allows silencing the "cloud" alerts from within the cloud realm.
3. Do the alerting from within metricsinfra itself, adding the API keys or similar there (i.e. having a fully duplicated stack).
Current working idea:
- alerts.wikimedia.org pulls alerts from prometheus-alerts.wmcloud.org
- the alerts on wmcloud have a team: wmcs tag to ease the filtering in alerts.wikimedia.org
- the wikimedia karma has only the wikimedia alertmanager as default for silences (needs karma >0.133, waiting on T333615: Upgrade alert* hosts to Bookworm)
- the wikimedia karma gets access to wmcloud alertmanager to do silences using basic auth
- start working on T323713: [wmcs][alerting] Integrate metricsinfra alertmanager with victorops
- enjoy and rejoice
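As an illustration of the basic-auth silencing step above, here is a minimal sketch of the request a client would make against the Alertmanager v2 API (`POST /api/v2/silences`). The function name, the credentials, the example matchers, and the assumption that the wmcloud alertmanager API is served at the root of prometheus-alerts.wmcloud.org are all hypothetical; the request is only built, not sent.

```python
import base64
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence_request(base_url, user, password, matchers, hours,
                          author, comment, now=None):
    """Build (but do not send) a silence-creation request.

    `matchers` maps label names to exact values, e.g. {"team": "wmcs"}.
    All argument names here are illustrative, not part of any existing tool.
    """
    now = now or datetime.now(timezone.utc)
    payload = {
        # One exact-match matcher per label; sorted for a stable payload.
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in sorted(matchers.items())
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }
    # HTTP basic auth: base64("user:password") in the Authorization header.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v2/silences",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would actually create the silence; in the setup described above karma would be the one issuing this call, so the sketch is mainly useful for testing the credentials by hand.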
20221121
Some notes from a discussion between @fgiunchedi and @dcaro re: the above solutions:
To @fgiunchedi the simplest approach at first glance is approach #3, i.e. the metricsinfra AM sends pages itself and the two AMs are effectively siloed/isolated.
There are, however, a few considerations and things to figure out:
- There can be only one Splunk On-Call service API key for the Prometheus integration. In other words, the production and metricsinfra AMs would share the same key. We need to investigate what this key can effectively do. Filippo's understanding is that the key can only create/resolve incidents. Therefore, if something happens to the key (e.g. a leak), the blast radius is small and rotation is simple.
- Understand what the integration service API keys can do.
- In this case the "service API keys" are used to talk to the "rest endpoint" (a different API from the public API), which only lets you create/resolve incidents. https://help.victorops.com/knowledge-base/rest-endpoint-integration-guide/
- We'd like to keep the "single pane of glass", i.e. look at production and metricsinfra alerts from a single Karma UI (production's). This is possible in the sense that Karma supports reading/writing to multiple alertmanagers.
- Investigate how we'd have production karma talk to (authenticated) metricsinfra AM.
- It seems we can authenticate with either certificates or basic auth; the latter seems to be the easiest for now (https://github.com/prymitive/karma/blob/main/docs/CONFIGURATION.md#alertmanagers)
The above will enable paging alerts for metricsinfra AM (i.e. https://prometheus-alerts.wmcloud.org/) and allow WMCS folks to look at said alerts from a single place (https://alerts.w.o) while still allowing volunteer admins to manage/silence their alerts.
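To make the "create/resolve incidents only" point about the service API key more concrete, here is a rough sketch of what a call to the REST endpoint looks like. The URL layout and the `message_type`, `entity_id` and `state_message` fields are written from memory of the integration guide linked above and should be checked against it; the function name, keys and entity names are hypothetical placeholders, and the request is only built, not sent.

```python
import json
import urllib.request

# Assumed base URL of the REST endpoint integration; verify against the guide.
REST_ENDPOINT = "https://alert.victorops.com/integrations/generic/20131114/alert"

# CRITICAL opens an incident, RECOVERY resolves it.
ALLOWED_TYPES = {"CRITICAL", "RECOVERY"}


def build_incident_request(api_key, routing_key, entity_id,
                           message_type, state_message):
    """Build (but do not send) a request that creates or resolves an incident."""
    if message_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported message_type: {message_type}")
    payload = {
        "message_type": message_type,
        # entity_id ties a RECOVERY to the CRITICAL that opened the incident.
        "entity_id": entity_id,
        "state_message": state_message,
    }
    return urllib.request.Request(
        url=f"{REST_ENDPOINT}/{api_key}/{routing_key}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Note that the key is part of the URL, so anyone holding it can open or resolve incidents on the shared integration; that is the whole extent of the blast radius discussed above, assuming the guide's description of the endpoint holds.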