Prometheus alerting support on Toolforge
Closed, Resolved, Public

Description

Toolforge could use better tooling for monitoring and alerting than Toolschecker currently provides. Based on a very quick look, here are a few requests for better alerting that could be solved with a simple config change if we had a way to send out alerts directly from tools-prometheus:

I imagine setting up a Prometheus Alertmanager isn't a challenge, but sending out pages and notifications can be somewhat tricky (authentication for acking, VictorOps keys living in the cloud realm, etc.).

Event Timeline

We currently have a separate project, metricsinfra, that runs a Prometheus instance with Alertmanager. It was thrown together as a proof of concept in T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken, and work on it has not continued since. You can view its alerts at https://vpsalertmanager.toolforge.org

The problems:

  • It's not flexible: all monitored projects are normally monitored in exactly the same way.
  • It emails cloud-admin-feed (I think), so basically WMCS gets email alerts and that's it. It should email the project admins.
  • It maintains only short-term metrics.

Some work to improve that is tracked in T266050: Build Prometheus service for use by all Cloud VPS projects and their instances. Toolforge has special needs in all this, but ideally this work should sync up with that work/plan. If you would like to tackle some of it, you may need admin rights on metricsinfra. Just solving some parts of that ticket with additional config flexibility in puppet would probably resolve this ticket.

The alerts go to https://lists.wikimedia.org/postorius/lists/cloud-admin-feed.lists.wikimedia.org/ if you want to try to subscribe to what is there. It's very generic and does not currently address crash-looping pods, email or systemd.

I think they never ended up on the list because it had a ban set for ^(?!.*(wikimedia|wikipedia)\.org$) and the mails were apparently coming from root at wmflabs dot org. I changed the regex to ^(?!.*(wikimedia|wikipedia|wmflabs|wmcloud)\.org$).

I used the bulk subscribe feature to add root@wmflabs.org as a user who doesn't get emails but can post to the list without moderation so that we can get alerts with it.
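
For anyone wanting to sanity-check the ban regex change described above, here is a minimal sketch in Python. It assumes the list software applies ban patterns with match-from-start semantics similar to Python's `re.match`; the helper `is_banned` and the sample addresses other than root@wmflabs.org are hypothetical and only for illustration.

```python
import re

# Ban patterns quoted in the comment above. Assumption: a sender is banned
# when the pattern matches the address from its start, as re.match does.
OLD_BAN = re.compile(r"^(?!.*(wikimedia|wikipedia)\.org$)")
NEW_BAN = re.compile(r"^(?!.*(wikimedia|wikipedia|wmflabs|wmcloud)\.org$)")


def is_banned(pattern, address):
    """Return True if the ban pattern matches the sender address."""
    return pattern.match(address) is not None


# Hypothetical sample senders, apart from root@wmflabs.org from the thread.
samples = [
    "root@wmflabs.org",
    "someone@wikimedia.org",
    "alerts@wmcloud.org",
    "user@example.com",
]

for address in samples:
    # Print whether each address is banned under the old and new patterns.
    print(address, is_banned(OLD_BAN, address), is_banned(NEW_BAN, address))
```

The negative lookahead bans every address that does not end in one of the listed domains, so the old pattern would have rejected mail from root@wmflabs.org, while the new one lets wmflabs.org and wmcloud.org senders through.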

Change 890490 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::prometheus: deploy alert rules from GitLab

https://gerrit.wikimedia.org/r/890490

Change 890490 merged by David Caro:

[operations/puppet@production] P:toolforge::prometheus: deploy alert rules from GitLab

https://gerrit.wikimedia.org/r/890490