Toolforge could use better monitoring and alerting tooling than Toolschecker currently provides. Based on a very quick look, here are a few requests for better alerting that could be solved with a simple config change if we had a way to send alerts directly from tools-prometheus:
- T215155: Toolforge: systemd monitoring node_systemd_unit_state{state="failed"} == 1
- T280741: Add a toolschecker to test the tools email relay
- T282738: Alert on CrashLoopBackOff in Toolforge infrastructure pods
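
To illustrate the kind of "simple config change" meant here, below is a rough sketch of Prometheus alerting rules for the first and third requests. The systemd expression is taken from T215155 as written. Everything else is an assumption for illustration: the group name, the alert names, the `for` durations, the severity labels, and the use of the kube-state-metrics metric `kube_pod_container_status_waiting_reason` (which only works if kube-state-metrics is being scraped by tools-prometheus).

```yaml
groups:
  - name: toolforge_infra            # hypothetical group name
    rules:
      # T215155: fire when any systemd unit is in the failed state.
      - alert: SystemdUnitFailed     # hypothetical alert name
        expr: node_systemd_unit_state{state="failed"} == 1
        for: 5m                      # assumed grace period, tune as needed
        labels:
          severity: warning          # assumed label used for routing
        annotations:
          summary: "systemd unit {{ $labels.name }} failed on {{ $labels.instance }}"

      # T282738: fire when a container is stuck in CrashLoopBackOff.
      # Assumes kube-state-metrics is scraped; a namespace filter for the
      # infrastructure pods would still need to be added to the expression.
      - alert: PodCrashLoopBackOff   # hypothetical alert name
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        for: 10m                     # assumed grace period
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff"
```

Rules like these only evaluate the condition; actually delivering the resulting alerts still depends on having an Alertmanager and a notification route in place.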
I imagine setting up a Prometheus Alertmanager isn't a challenge, but sending out pages and notifications can be somewhat tricky (authentication for acking, VictorOps keys living in the cloud realm, etc.).