[5.5.3 Milestone] Improve API Monitoring & Alarms
Open, Needs Triage, Public

Description

We currently have a few dashboards for monitoring API activity. However, they do not highlight the most impactful and relevant metrics. The charts are difficult to filter and interpret, and some data is missing.

Additionally, although some monitoring dashboards are in place, we currently lack any automated alarms for spikes in error responses. As a result, we miss partial outages and fail to catch accidental breaking changes. Evidence of this surfaced in late 2024, when RESTBase rerouting resulted in a missed header parameter that went unnoticed until a member of the community raised it. Alarms will enable us to detect issues earlier and, ultimately, be more responsive.

Conditions of acceptance

  • Update the dashboards to include useful missing data, such as:
    • HTTP response codes
    • % of total requests that result in errors over time
    • Average latency per endpoint
  • Make dashboards more usable, with better filtering and visualizations.
  • Raise automated alarms for scenarios where error rates exceed expected thresholds.
    • Create Slack and/or Phabricator integration to alert the team of issues.
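
As a rough illustration of the alarm condition above, the following minimal sketch computes the percentage of error responses in a window of requests and compares it to a threshold. Everything in it is an assumption for discussion, not an agreed design: the function names, the 5% threshold, the minimum request count, and the choice to count both 4xx and 5xx responses as errors are all hypothetical.

```python
from typing import Iterable, Sequence


def error_rate(status_codes: Iterable[int],
               error_classes: Sequence[int] = (4, 5)) -> float:
    """Return the percentage of responses whose status class is an error.

    error_classes holds the leading digit of the status codes treated as
    errors; counting both 4xx and 5xx is an assumption, not a decision.
    """
    codes = list(status_codes)
    if not codes:
        return 0.0
    errors = sum(1 for code in codes if code // 100 in error_classes)
    return 100.0 * errors / len(codes)


def should_alarm(status_codes: Iterable[int],
                 threshold_pct: float = 5.0,
                 min_requests: int = 100) -> bool:
    """Decide whether a window of responses should raise an alarm.

    threshold_pct and min_requests are placeholder values. The minimum
    request count avoids alarming on very low-traffic windows, where a
    single failure would dominate the percentage.
    """
    codes = list(status_codes)
    if len(codes) < min_requests:
        return False
    return error_rate(codes) > threshold_pct
```

In practice this logic would more likely live in the alerting rules of the monitoring stack than in application code; the sketch is only meant to make the threshold condition concrete.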

Implementation details

Existing Action API dashboards:

Existing REST API dashboards: