
Alert Management Review and Improvement for ServiceOps
Open, Low, Public

Description

This is an umbrella task for reviewing the alerts related to MediaWiki and other ServiceOps areas within operations/alerts, identifying which ones need attention, and determining what needs to be updated.

Additionally, we can gather alert statistics from Logstash, which could help inform adjustments to alerting thresholds.
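
As a starting point, something like the following could pull per-alert counts out of the Elasticsearch backend behind Logstash. This is a minimal sketch: the endpoint, index pattern, and field names are assumptions and would need to match the real schema.

```
# Sketch: count alert events per alert name over the last 30 days.
import requests

ES_URL = "https://logstash.example.org:9200"   # hypothetical endpoint
INDEX = "logstash-alerts-*"                    # hypothetical index pattern

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-30d"}}},
    "aggs": {
        "by_alert": {"terms": {"field": "alertname.keyword", "size": 50}}
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()

# Alerts ordered by how often they fired; frequent firers are
# candidates for threshold adjustments.
for bucket in resp.json()["aggregations"]["by_alert"]["buckets"]:
    print(f'{bucket["doc_count"]:6d}  {bucket["key"]}')
```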

Potential areas

  • Grafana
    • Deprecated dashboards
    • "Noisy" dashboards (i.e., too many panels, little information)
    • Introduction of more comprehensive dashboards for oncallers
  • Documentation
    • Runbooks
    • Troubleshooting/cheatsheet updates
  • Incident Response
    • Review of past incidents to assess (see the sketch after this list):
      • What alerts fired and what didn't
      • What information is missing that would help an oncaller identify the area of the issue
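
For the incident-review bullet above, one way to list which alerts fired during a given incident window is to query the Prometheus ALERTS metric. A minimal sketch, assuming a Prometheus HTTP API endpoint (the URL and the alert names in the example are hypothetical):

```
# Sketch: which alerts fired during an incident window, vs. expected.
import requests

PROM_URL = "https://prometheus.example.org"  # hypothetical endpoint

def alerts_firing(start: str, end: str) -> set[str]:
    """Return alert names that fired between two RFC3339 timestamps."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={
            "query": 'ALERTS{alertstate="firing"}',
            "start": start,
            "end": end,
            "step": "60s",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return {s["metric"]["alertname"] for s in resp.json()["data"]["result"]}

fired = alerts_firing("2025-11-13T12:00:00Z", "2025-11-13T14:00:00Z")
expected = {"MathoidBlackboxProbeDown", "ATSBackendErrorsHigh"}  # hypothetical names
print("fired but unexpected:", fired - expected)
print("expected but silent: ", expected - fired)
```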

Event Timeline

jijiki renamed this task from Update MediaWiki and ServiceOps alerts to Alert Management Review and Improvement for ServiceOps. Nov 13 2025, 2:07 PM

@jijiki I'd recommend reframing this task as a problem statement, maybe with the title 'high volume of unactionable alerts for oncall'?

And in the description or as comments, adding more color to size the problem: anecdotal examples from recent incidents, and what we see as the most common causes of 'unactionable' alerts (wrong threshold, confusing name/description in the alert, missing runbook, irrelevant alerts).

@hnowlan do you have ideas on how to measure that?
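
One possible way to put a number on "unactionable" is to rank alerts by how much time they spent firing. A minimal sketch against the Prometheus HTTP API; the URL is an assumption, and firing-sample counts are only a rough proxy for noisiness:

```
# Sketch: rank alerts by total time spent firing over 30 days.
import requests

PROM_URL = "https://prometheus.example.org"  # hypothetical endpoint

# Each sample of ALERTS{alertstate="firing"} represents one evaluation
# interval in which the alert was active, so the count approximates
# total firing time per alert name.
query = (
    'sort_desc(sum by (alertname) '
    '(count_over_time(ALERTS{alertstate="firing"}[30d])))'
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    print(f'{float(series["value"][1]):10.0f}  {series["metric"]["alertname"]}')
```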

For example, today there was a page for Mathoid, with two alerts: one for the blackbox [[ https://grafana.wikimedia.org/goto/P-h7Dq4Dg?orgId=1 | test ]] and another for OpenAPI/Swagger endpoints being unhealthy. Then we got the ATS alert on top of that, since the real cause was a scraper. Additionally, honestly, I wouldn't have remembered off the top of my head that its URL path contains media/math/, which would have made pattern matching so much easier and faster.

I will rework the description; I agree that it is a bit broad, and we can discuss priorities later.

After Denisse's presentation last week, should we try to use their tool (and make feature requests if we find something important missing)?