Paging for alerts from the cloud realm
Closed, Resolved, Public

Description

The WMCS team wants to be paged for some issues found by prometheus instances in the cloud realm. Right now the alertmanager instance in the metricsinfra project can't send pages out. That AM instance is shared by multiple projects, some of which are not maintained by the WMCS team. Options include:

  • Give the shared alertmanager instance VictorOps API keys and let it send pages. Simple, but I guess a bit risky since this instance is shared with some projects that don't have alerting support?
  • Make a separate alertmanager instance in the cloud realm, and send pages via it. A bit more secure, but this still puts VictorOps keys in the cloud realm. Silencing UX might be a bit annoying?
  • Import information about the paging alerts into some prometheus instance in the production realm, and then alert on that from the production alertmanagers
  • Let the trusted prometheus instances in cloud talk to prod alertmanagers?
  • Something else?

Event Timeline

Nice :)

I think getting the alerts to the prod alertmanagers is the more convenient option, mainly so that all the paging alerts can be handled from the same place.

Letting trusted prometheus instances talk to prod seems the easiest, as that way we don't need to implement anything non-standard, but pulling the alerts into prod instead seems a bit safer (cloud still has no way to initiate communication into prod).

Prometheus federation might be useful for getting the relevant series from metricsinfra to prod (https://prometheus.io/docs/prometheus/latest/federation/).
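As a rough sketch of what pulling via federation could look like: the production side would scrape the `/federate` endpoint with one `match[]` parameter per series selector. The host name and the `severity="page"` label below are assumptions for illustration, not the actual metricsinfra setup.

```python
from urllib.parse import urlencode


def federation_url(base_url: str, selectors: list[str]) -> str:
    """Build a Prometheus /federate URL for the given series selectors."""
    # The federation endpoint takes one match[] parameter per selector.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical host and label: only the synthetic ALERTS series for
# paging alerts is selected, so nothing else crosses the realm boundary.
url = federation_url(
    "https://metricsinfra.example.invalid",
    ['ALERTS{severity="page"}'],
)
print(url)
```

Restricting the selector to `ALERTS` keeps the pull narrow: prod only learns which alerts are firing, and cloud never opens a connection into prod.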

Change 829746 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] p::metricsinfra:haproxy: Allow exposing federation endpoints

https://gerrit.wikimedia.org/r/829746

Change 829756 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] p::wmcs:prometheus: Add cloudvps federation job

https://gerrit.wikimedia.org/r/829756

@fgiunchedi This will work for the time being, though we will want to allow volunteers (e.g. toolforge roots) to silence some of these alerts, for example when doing maintenance on toolforge.

Some things that come to mind might be:

  • Creating an "alert puller" to import alerts directly from the metricsinfra alertmanager, that way any alert silenced on metricsinfra (the VM) could be silenced also on the prod alertmanager.
  • Creating a service that allows silencing the "cloud" alerts from within cloud realm.
  • Doing the alerting from within metricsinfra, adding the API keys or similar in there (having a fully duplicated stack).

WDYT? I can open a task to discuss this further (volunteers being able to silence alerts for volunteer-managed projects).
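For reference, a minimal sketch of what a silencing service for "cloud" alerts would have to send: a silence posted to the Alertmanager v2 API (`POST /api/v2/silences`). The `project` label, the host, and the helper names are assumptions for illustration; the request is only built here, not sent.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request


def build_silence(project, hours, author, comment, now=None):
    """Build an Alertmanager v2 silence body for one project's alerts."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # Assumes alerts carry a "project" label (hypothetical).
        "matchers": [{"name": "project", "value": project, "isRegex": False}],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


def silence_request(alertmanager_url, silence):
    """Wrap the silence in a POST request to /api/v2/silences."""
    return Request(
        f"{alertmanager_url.rstrip('/')}/api/v2/silences",
        data=json.dumps(silence).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


silence = build_silence("tools", 2, "toolforge-root", "planned maintenance")
req = silence_request("https://alertmanager.example.invalid", silence)
print(req.get_method(), req.full_url)
```

A service like this could restrict which `project` values a given volunteer may silence, which is the part the plain Alertmanager API does not provide.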

Yes, thank you, let's follow up/discuss in a separate task!

Change 829746 abandoned by David Caro:

[operations/puppet@production] p::metricsinfra:haproxy: Allow exposing federation endpoints

Reason:

We found a nicer way of doing this :) (see the task)

https://gerrit.wikimedia.org/r/829746

Change 829756 abandoned by David Caro:

[operations/puppet@production] p::wmcs:prometheus: Add cloudvps federation job

Reason:

We found a nicer way of doing this :)

https://gerrit.wikimedia.org/r/829756

lmata added a project: Observability-Alerting.
lmata moved this task from Inbox to Radar on the Observability-Alerting board.
dcaro claimed this task.

We decided to use the same keys for metricsinfra (see T323713)

dcaro changed the task status from Duplicate to Resolved. Apr 11 2023, 9:46 AM
dcaro moved this task from Backlog to Done on the cloud-services-team (FY2022/2023-Q3) board.