This is an umbrella task for reviewing the alerts in operations/alerts that relate to MediaWiki and other new ServiceOps areas, and for determining which of them need to be updated.
We can also gather statistics on the alerts recorded in Logstash, which could help inform adjustments to alerting thresholds.
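As a rough illustration of the kind of statistic that could inform threshold changes, the sketch below counts how often each alert fired in a set of exported alert events. The record layout (`alertname`, `status`), the sample alert names, and the helper names `count_firings` and `noisiest` are assumptions made for this example; they do not reflect the actual Logstash schema or any existing tooling.

```python
from collections import Counter

# Hypothetical alert events, e.g. exported from a log search.
# Field names and alert names are illustrative assumptions only.
SAMPLE_EVENTS = [
    {"alertname": "ExampleLatencyHigh", "status": "firing"},
    {"alertname": "ExampleLatencyHigh", "status": "firing"},
    {"alertname": "ExampleLatencyHigh", "status": "resolved"},
    {"alertname": "ExampleLatencyHigh", "status": "firing"},
    {"alertname": "ExampleErrorRate", "status": "firing"},
    {"alertname": "ExampleWorkersSaturated", "status": "firing"},
    {"alertname": "ExampleWorkersSaturated", "status": "firing"},
]


def count_firings(events):
    """Count 'firing' events per alert name, ignoring other statuses."""
    return Counter(
        e["alertname"] for e in events if e.get("status") == "firing"
    )


def noisiest(events, n=3):
    """Return the n alerts that fired most often as (name, count) pairs."""
    return count_firings(events).most_common(n)


if __name__ == "__main__":
    for name, fired in noisiest(SAMPLE_EVENTS):
        print(f"{name}: fired {fired} times")
```

Alerts that dominate such a count are candidates for a threshold review, while alerts that never appear may be worth checking against past incidents.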
Potential areas
- Grafana
  - Deprecated dashboards
  - "Noisy" dashboards (i.e., too many panels, little information)
  - Introduction of more comprehensive dashboards for oncallers
- Documentation
  - Runbooks
  - Troubleshooting/cheatsheet updates
- Incident Response
  - Review of past incidents to assess:
    - Which alerts fired and which didn't
    - What information is missing that would help an oncaller identify the area of the issue