[SPIKE] Determine how to trigger alarms for API issues
Closed, Invalid | Public | 5 Estimated Story Points

Description

We need better mechanisms for knowing that something went wrong as we invest in API experimentation and evolution. Configuring automated alarms will ensure that we are notified about any issues early, enabling us to respond before there are broader impacts to our users or the community as a whole.

Conditions of acceptance

  • Make a recommendation for how the team should be alerted about issues (Slack, email, Phab)
  • Demonstrate effectiveness through a proof of concept. It does not need to be tied to real alerting capabilities, but it can send a message to the recommended channel(s).
  • Specific types of alerts that may arise:
    • Number of specific log entries per unit of time
    • % of requests resulting in an error
    • [Stretch/nice to have] Latency spikes --> This is already largely covered by SRE monitoring. Make a recommendation on whether we need additional coverage for API performance monitoring specifically. If so, how should we do it to ensure a good level of sensitivity? For example, compare the average over x period of time against the historical average.
  • Alert thresholds are configurable
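
To make the alert types above concrete, here is a minimal Python sketch of how configurable thresholds might be evaluated over one time window. It is only an illustration for the proof of concept, not an existing implementation and not the Alertmanager API: the names `Thresholds`, `WindowStats`, and `evaluate_window`, and all default values, are hypothetical. Delivery to the recommended channel is left out on purpose.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Thresholds:
    """Configurable alert thresholds (all defaults are placeholder assumptions)."""
    max_error_ratio: float = 0.05   # share of requests allowed to fail
    max_log_entries: int = 100      # matching log entries allowed per window
    latency_factor: float = 2.0     # allowed multiple of the historical mean


@dataclass(frozen=True)
class WindowStats:
    """Aggregated measurements for a single time window (hypothetical shape)."""
    total_requests: int
    error_requests: int
    matching_log_entries: int
    mean_latency_ms: float


def evaluate_window(
    stats: WindowStats,
    thresholds: Thresholds,
    historical_mean_latency_ms: Optional[float] = None,
) -> List[str]:
    """Return one human-readable message per threshold that was exceeded."""
    alerts: List[str] = []

    # Percentage of requests resulting in an error.
    if stats.total_requests > 0:
        ratio = stats.error_requests / stats.total_requests
        if ratio > thresholds.max_error_ratio:
            alerts.append(
                f"error ratio {ratio:.1%} exceeds {thresholds.max_error_ratio:.1%}"
            )

    # Number of specific log entries per unit of time.
    if stats.matching_log_entries > thresholds.max_log_entries:
        alerts.append(
            f"log entries {stats.matching_log_entries} exceed "
            f"{thresholds.max_log_entries} per window"
        )

    # Latency spike: window average compared against the historical average.
    if historical_mean_latency_ms is not None and historical_mean_latency_ms > 0:
        limit = historical_mean_latency_ms * thresholds.latency_factor
        if stats.mean_latency_ms > limit:
            alerts.append(
                f"mean latency {stats.mean_latency_ms:.0f} ms exceeds {limit:.0f} ms"
            )

    return alerts
```

A caller could build `WindowStats` from whatever metrics source is chosen, pass a `Thresholds` instance loaded from configuration, and forward each returned message to the recommended channel. Keeping the evaluation separate from delivery means the thresholds can be tuned without touching the notification code.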

Implementation details

Helpful docs: https://wikitech.wikimedia.org/wiki/Alertmanager

Event Timeline

HCoplin-WMF updated the task description.
HCoplin-WMF set the point value for this task to 5.