
hook up prometheus @ cloudmetrics* to an alertmanager
Closed, Resolved · Public

Description

We have a pair of Prometheus instances ("labs", which could have a better name) running on cloudmetrics* hardware, collecting metrics from various OpenStack services running in the production realm.

Is it fine if we hook them up to the production Alertmanager (and possibly Thanos) instances, or do we need to host our own Alertmanager?

Event Timeline

Sending alerts to the production Alertmanager sounds good to me; everything needed should already be in Puppet in terms of information/configuration/etc. (ditto for Thanos). Please loop me in on the reviews when the time comes.
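For reference, on the Prometheus side this hookup amounts to listing the Alertmanager targets in the alerting block of prometheus.yml. A minimal sketch of the shape the generated configuration takes; the host name and port are assumptions, not the actual Puppet-managed values:

```
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Hypothetical Alertmanager host:port; the real values come from Puppet.
            - 'alert1001.wikimedia.org:9093'
```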

Change 765561 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Hook up cloudmetrics prometheus to alertmanager

https://gerrit.wikimedia.org/r/765561

Change 765561 merged by David Caro:

[operations/puppet@production] Hook up cloudmetrics prometheus to alertmanager

https://gerrit.wikimedia.org/r/765561

Change 765567 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::prometheus: deploy alert rule from ops/alerts.git

https://gerrit.wikimedia.org/r/765567

Change 765567 merged by David Caro:

[operations/puppet@production] P:wmcs::prometheus: deploy alert rule from ops/alerts.git

https://gerrit.wikimedia.org/r/765567
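Rules deployed from ops/alerts.git are standard Prometheus alerting rule files. As a rough illustration of their shape (the alert name, expression, and labels below are made up, not an actual rule from the repository):

```
groups:
  - name: example
    rules:
      - alert: InstanceDown
        # Fires when a scrape target has been unreachable for 5 minutes.
        expr: up == 0
        for: 5m
        labels:
          severity: warning
          team: wmcs   # the team label is what Alertmanager routes on
        annotations:
          summary: '{{ $labels.instance }} is failing scrapes'
```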

Change 767790 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] alertmanager: add basic wmcs routing rules

https://gerrit.wikimedia.org/r/767790

Change 767790 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] alertmanager: add basic wmcs routing rules

https://gerrit.wikimedia.org/r/767790
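Routing by team on the Alertmanager side means matching the team label in the route tree and pointing it at a WMCS receiver. A minimal sketch under the assumption of a team=wmcs matcher; the receiver name and notification address are hypothetical, not the merged configuration:

```
route:
  receiver: default          # fallback for anything not matched below
  routes:
    - match:
        team: wmcs           # alerts labelled team=wmcs go to the wmcs receiver
      receiver: wmcs
receivers:
  - name: default
  - name: wmcs
    email_configs:
      - to: 'cloud-admins@example.org'   # hypothetical address
```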

Most of the alerts that ops/alerts.git deploys to all instances don't seem to be useful to us, but should be harmless if left there. However, there are a few (most notably the scrape-failed and MySQL-exporter-down alerts) that could also trigger on the cloudmetrics hosts but would carry the wrong team labels, and so would alert the wrong people.

I'm not sure how to approach those. The easiest solution would be to add filtertags comments to those alert files so they aren't deployed on the cloudmetrics hosts. That then raises the question of whether we should explicitly require all alert files to specify which hosts they are deployed to. On the other hand, the alert rules themselves would be useful; it's only the team label we want to change, and a way to avoid nearly copy-pasted alert definitions would be nice.

@fgiunchedi thoughts?

Thanks @Majavah for bringing this up. One "hammer" approach I can think of is to keep deploying all alerts as we do now, but force team=wmcs via alert relabeling on the cloud Prometheus. This is quite broad since it applies to everything, but that seems to be the desired semantics(?)

That otherwise sounds like a good solution, but we might not want to deploy certain alerts there (I don't have an example right now, but I can imagine there will be some alert in alerts.git that fires on our data that we don't want to be alerted on).

Correct, that's the general relabeling mechanism; specifically, Prometheus can relabel outbound alerts with https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alert_relabel_configs
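Concretely, forcing team=wmcs on every outbound alert is a single relabel rule in prometheus.yml. A minimal sketch; the label value comes from this discussion, and the syntax is the stock alert_relabel_configs form from the page linked above:

```
alerting:
  alert_relabel_configs:
    # Overwrite the team label on every alert this Prometheus sends,
    # regardless of what the rule file originally set.
    - action: replace
      target_label: team
      replacement: wmcs
```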

In T302493#7759668, @Majavah wrote:

That otherwise sounds like a good solution, but we might not want to deploy certain alerts there (I don't have an example right now, but I can imagine there will be some alert in alerts.git that fires on our data that we don't want to be alerted on).

At the moment the granularity for deploying alerts.git alerts is per file, though we can certainly think of something when/if the use case you are describing comes up.

Change 771384 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::prometheus: set team: wmcs on all alerts

https://gerrit.wikimedia.org/r/771384

Change 771384 merged by Filippo Giunchedi:

[operations/puppet@production] P:wmcs::prometheus: set team: wmcs on all alerts

https://gerrit.wikimedia.org/r/771384