
Improve error charts on authentication-metrics dashboard
Open, Needs Triage, Public

Description

The error charts in the bottom block of https://grafana.wikimedia.org/d/000000004/authentication-metrics are messy and make it hard to assess whether something bad is happening. We should clean them up.

  • Error rates are very high (e.g. 50% for account creation). We should differentiate between error codes that likely correspond to common user errors (mostly spambots failing the captcha), uncommon user errors (e.g. invalid_returnUrlToken, which is technically user-triggerable but should be very rare), and server-side errors (though it is not clear that any of those appear in these charts).
  • Error rate charts are noisy; smoothing them and overlaying last week's data would help.
  • For error detail charts, make sure the unit is independent of the viewed time range (e.g. errors per minute) and is easy to understand.
  • Also sort the legend by error frequency and display the totals.
  • Fix T137582: Weird error statuses in createaccount metrics

Once that's done, we might want to check whether the high rate of some errors is a cause for concern.
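
To make the cleanup items in the list above more concrete, here is a minimal Python sketch of the underlying arithmetic: grouping error codes, normalizing counts to errors per minute, sorting a legend by total, and smoothing a series. It is not dashboard configuration and does not use the real metric names. The code "captcha_fail", the category labels, and all function names are assumptions made for illustration; only invalid_returnUrlToken comes from this task. Since it is unclear whether any server-side errors are recorded, codes that are not listed fall into an "unclassified" bucket for manual review.

```python
# Hypothetical grouping of error codes. Only "invalid_returnUrlToken" appears
# in the task text; "captcha_fail" is an assumed placeholder name.
COMMON_USER_ERRORS = {"captcha_fail"}
UNCOMMON_USER_ERRORS = {"invalid_returnUrlToken"}


def classify(error_code: str) -> str:
    """Map an error code to a coarse category for separate charts."""
    if error_code in COMMON_USER_ERRORS:
        return "common_user"
    if error_code in UNCOMMON_USER_ERRORS:
        return "uncommon_user"
    # Unknown codes are left for manual review rather than guessed at.
    return "unclassified"


def per_minute(count: int, window_seconds: float) -> float:
    """Convert a raw count over a time window into errors per minute,
    so the unit does not depend on the viewed time range."""
    return count * 60.0 / window_seconds


def legend(counts: dict[str, int], window_seconds: float) -> list[tuple[str, int, float]]:
    """Return (code, total, errors per minute) rows, most frequent first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(code, total, per_minute(total, window_seconds)) for code, total in ordered]


def moving_average(values: list[float], window: int) -> list[float]:
    """Trailing moving average; early points average over fewer samples."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the sample that just left the trailing window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

For example, 120 errors over a one-hour range and 20 errors over a ten-minute range both come out as 2 errors per minute, which is the property the detail charts need. Overlaying last week's data would amount to plotting the same smoothed series shifted by seven days.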