Make alerting channel fire useful alerts
Open, In Progress, High (Public)

Description

What/why:
Our current #aw-alerts Slack channel fires notifications that we do not pay much heed to, because our logs have been our largest source of informative data so far. The alerts fire when certain errors or warnings reach a threshold, but that does not always mean there is an incident, which is one of the reasons we don't listen to the notifications.

Our recent Evaluator outages have shown us that alerting on more urgent events may be necessary. It may be worth muting, disabling, or removing the current alerts and replacing them with an alert that indicates a possible incident, such as an Evaluator outage. This will also give us a clear separation between that channel and the dashboards we monitor daily by way of Chores.

How:

  • Ensure Orch is backwards compatible with the new Eval error codes
  • Get the Evaluator HTTP status codes live and firing
  • Make sure it is working and live again on our Grafana board
  • Create new alerting rules
  • Wire up the new rules in Grafana Alerting (which fires to #aw-alerts)
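
To illustrate the difference between "errors crossed a threshold" and "possible incident", here is a minimal Python sketch of the kind of condition an outage-style rule could evaluate over Evaluator HTTP status codes. It is not the actual Grafana rule or query. The names `WindowCounts`, `possible_outage`, and `should_alert`, and all threshold values, are hypothetical placeholders and do not come from this task.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Evaluator responses seen in one evaluation window, by HTTP status class."""
    ok: int            # 2xx responses
    client_error: int  # 4xx responses
    server_error: int  # 5xx responses

    @property
    def total(self) -> int:
        return self.ok + self.client_error + self.server_error


def possible_outage(window: WindowCounts,
                    min_requests: int = 20,
                    max_server_error_ratio: float = 0.5) -> bool:
    """Return True when a window looks like an outage rather than routine errors.

    Both thresholds are hypothetical placeholders, not values from the task.
    """
    if window.total < min_requests:
        # Too little traffic to judge, so stay quiet instead of alerting on noise.
        return False
    return window.server_error / window.total >= max_server_error_ratio


def should_alert(windows: list[WindowCounts], consecutive: int = 3) -> bool:
    """Fire only if the most recent `consecutive` windows all look like an outage."""
    if len(windows) < consecutive:
        return False
    return all(possible_outage(w) for w in windows[-consecutive:])
```

Requiring several consecutive bad windows is one way to keep a short burst of errors from reaching #aw-alerts. In Grafana Alerting, a rule's pending period serves a similar purpose.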

Event Timeline

ecarg updated the task description.
ecarg changed the task status from Open to In Progress. Nov 6 2025, 7:58 PM
ecarg updated the task description.

Change #1204582 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2025-11-05-063501 to 2025-11-12-122736

https://gerrit.wikimedia.org/r/1204582

Change #1204582 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2025-11-05-063501 to 2025-11-12-122736

https://gerrit.wikimedia.org/r/1204582

Still testing the new queries in alerting, but the setup is complete.

@DSantamaria thanks for the reminder, will do

Change #1211872 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2025-11-12-122736 to 2025-11-17-175029

https://gerrit.wikimedia.org/r/1211872