We need to do something like that with the Prometheus stats in general. I'm leaning toward setting up Alertmanager and sending emails, since that's all in Toolforge until we figure out a Shinken replacement here (see T236547: "shinken" Cloud VPS project jessie deprecation).

Even then, it might be worth it to have that.

Done as a proof of concept for metricsinfra per-project alert rules:

(08:18) configadmin@wu5emp5wblz.svc.trove.eqiad1.wikimedia.cloud:[prometheusconfig]> select * from alerts where project_id = (select id from projects where openstack_id = 'tools')\G
*************************** 1. row ***************************
         id: 1
 project_id: 12
       name: GridQueueProblem
       expr: sge_queueproblems{project="tools",state="e"}
   duration: 30m
   severity: warn
annotations: {"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} is in state {{ $labels.state }}"}
1 row in set (0.003 sec)

Those now send emails to cloud-admin-feed and IRC alerts to #wikimedia-cloud-feed.
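
For anyone following along, here is a rough sketch of how a row like the one above could be turned into a Prometheus alerting rule. The column-to-field mapping (name to `alert`, duration to `for`, severity to a label) is my assumption about what metricsinfra does, not its actual code, and the helper names `row_to_rule` and `rule_group` are made up for illustration. Output is printed as JSON only to keep the sketch free of third-party dependencies.

```python
import json


def row_to_rule(row):
    """Map one `alerts` row to a Prometheus alerting-rule dict (assumed mapping)."""
    return {
        "alert": row["name"],
        "expr": row["expr"],
        "for": row["duration"],
        "labels": {"severity": row["severity"]},
        # The annotations column holds a JSON object, as in the row shown above.
        "annotations": json.loads(row["annotations"]),
    }


def rule_group(project, rows):
    """Wrap the rules for one project in a rule-group structure (hypothetical group name)."""
    return {
        "groups": [
            {"name": f"{project}_alerts", "rules": [row_to_rule(r) for r in rows]}
        ]
    }


# The row from the query output above, as a plain dict.
row = {
    "id": 1,
    "project_id": 12,
    "name": "GridQueueProblem",
    "expr": 'sge_queueproblems{project="tools",state="e"}',
    "duration": "30m",
    "severity": "warn",
    "annotations": '{"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} '
                   'is in state {{ $labels.state }}"}',
}

if __name__ == "__main__":
    print(json.dumps(rule_group("tools", [row]), indent=2))
```

The `{{ $labels.* }}` templates are passed through untouched, so they would be expanded by Prometheus at alert time rather than by this script.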

Restricted Application edited projects, added User-Majavah; removed Toolforge. (Aug 7 2021, 8:20 AM)

Add monitoring for SGE queue status (Closed, Resolved; Public)
