
Add monitoring for SGE queue status
Closed, Resolved, Public

Description

We need to notice queues in alarm/error states before they actually cause issues. The lack of monitoring means we missed

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description.
valhallasw added subscribers: valhallasw, Aklapper.
valhallasw added a subscriber: yuvipanda.

@yuvipanda, could you take a look at this?

We need to do something like that with the Prometheus stats in general. I'm leaning toward setting up Alertmanager and sending emails, since that's all in Toolforge, until we figure out the shinken replacement (see T236547: "shinken" Cloud VPS project jessie deprecation).

Even then, it might be worth it to have that.

taavi claimed this task.
taavi subscribed.

Done as a proof of concept for metricsinfra per-project alert rules:

(08:18) configadmin@wu5emp5wblz.svc.trove.eqiad1.wikimedia.cloud:[prometheusconfig]> select * from alerts where project_id = (select id from projects where openstack_id = 'tools')\G
*************************** 1. row ***************************
         id: 1
 project_id: 12
       name: GridQueueProblem
       expr: sge_queueproblems{project="tools",state="e"}
   duration: 30m
   severity: warn
annotations: {"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} is in state {{ $labels.state }}"}
1 row in set (0.003 sec)

Those now send emails to cloud-admin-feed and IRC alerts to #wikimedia-cloud-feed.
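
For readers who want to see how a row like the one above relates to a Prometheus alerting rule, here is a minimal sketch. It assumes the `alerts` columns map one-to-one onto the standard rule fields (`alert`, `expr`, `for`, `labels`, `annotations`); the helpers `row_to_rule` and `rule_group` and the group name are hypothetical and are not taken from the metricsinfra code.

```python
import json

# Row copied from the query output above; storing it as a dict is illustrative.
ROW = {
    "id": 1,
    "project_id": 12,
    "name": "GridQueueProblem",
    "expr": 'sge_queueproblems{project="tools",state="e"}',
    "duration": "30m",
    "severity": "warn",
    "annotations": (
        '{"summary": "Grid queue {{ $labels.queue }}@{{ $labels.host }} '
        'is in state {{ $labels.state }}"}'
    ),
}


def row_to_rule(row):
    """Map one alerts row to a Prometheus alerting-rule dict (hypothetical helper)."""
    return {
        "alert": row["name"],
        "expr": row["expr"],
        # Prometheus only fires the alert once expr has matched for this long.
        "for": row["duration"],
        "labels": {"severity": row["severity"]},
        # The annotations column holds a JSON object; templates are left untouched.
        "annotations": json.loads(row["annotations"]),
    }


def rule_group(project, rows):
    """Wrap rules in the top-level 'groups' structure of a rule file.

    The group name is an assumption made for this sketch.
    """
    return {
        "groups": [
            {"name": f"{project}_alerts", "rules": [row_to_rule(r) for r in rows]}
        ]
    }


if __name__ == "__main__":
    # Printed as JSON for inspection; Prometheus rule files are normally YAML.
    print(json.dumps(rule_group("tools", [ROW]), indent=2))
```

With this mapping, the alert would fire only after a queue has reported state `e` for 30 minutes, and the `{{ $labels.* }}` templates in the summary are expanded by Prometheus when the alert fires.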

Restricted Application edited projects, added User-Majavah; removed Toolforge. (Aug 7 2021, 8:20 AM)