[25Q2] Add alert monitoring for Wikifunctions services
Closed, Resolved · Public

Description

Background
We want to introduce alert monitoring for our services so that we not only know when something goes wrong or is about to go wrong, but also keep a good pulse on their performance.

Approach

  • Use log severity levels to determine how or whether to alert
    • Tools: Prometheus Alertmanager
  • Determine the channels for alerting
  • Verify that alerts are actually delivered to these channels
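
As a rough illustration of the severity-based approach, here is a minimal sketch in Python. It is not the team's configuration. The severity names, thresholds, and channel names ("page", "chat", "email") are all assumptions made for the example. In an actual Prometheus Alertmanager setup this kind of decision would normally live in the routing configuration, matching on a label such as `severity`, rather than in application code.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Log severity levels, ordered from least to most severe (assumed scale)."""
    DEBUG = 10
    INFO = 20
    WARNING = 30
    ERROR = 40
    CRITICAL = 50


# Hypothetical routing table: (minimum severity, channel), most severe first.
# The channel names are placeholders, not channels named in this task.
ROUTES = [
    (Severity.CRITICAL, "page"),
    (Severity.ERROR, "chat"),
    (Severity.WARNING, "email"),
]


def route_alert(severity: Severity) -> Optional[str]:
    """Return the channel to alert on, or None if the severity is too low to alert."""
    for threshold, channel in ROUTES:
        if severity >= threshold:
            return channel
    return None
```

With this table, `route_alert(Severity.ERROR)` returns `"chat"`, while `route_alert(Severity.INFO)` returns `None`, meaning no alert is sent. Keeping the table ordered from most to least severe means the first match is always the most urgent applicable channel.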

Acceptance Criteria/Success Metrics

  • An alert lets us detect that an outage is about to happen, before it occurs
  • If there is an outage, we can resolve it more quickly than we did for the last P0

Stretch Goal

  • Alerting based on keywords
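
One possible shape for keyword-based alerting, sketched in Python under stated assumptions: the keyword list and the log line format below are hypothetical and only illustrate the idea of flagging log lines that contain known trouble phrases.

```python
import re
from typing import Iterable, List, Sequence

# Hypothetical keywords; the real list would come from past incidents.
ALERT_KEYWORDS = ("timeout", "out of memory", "connection refused")


def matching_lines(lines: Iterable[str],
                   keywords: Sequence[str] = ALERT_KEYWORDS) -> List[str]:
    """Return the log lines that contain any keyword, ignoring case."""
    if not keywords:
        # An empty pattern would match every line, so bail out early.
        return []
    # Escape each keyword so it is matched literally, then OR them together.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]
```

For example, given the lines `"INFO request ok"` and `"ERROR Timeout calling evaluator"`, only the second is returned. Plain substring matching like this is prone to false positives, so it would likely need to be combined with severity before it triggers a real alert.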

Event Timeline

Jdforrester-WMF renamed this task from "[25Q2] WF Services Alert Monitoring" to "[25Q2] Add alert monitoring for Wikifunctions services" (Oct 8 2024, 1:53 PM).