
High failure rate of account creation should trigger an alarm / page people
Closed, Resolved (Public)

Description

Pushing wmf.19 last week broke all account creation for almost 18 hours. The error rates are definitely collected, as shown on T145839#2643508:

https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=14&fullscreen

accountcreation_error.png (446×917 px, 58 KB)

A single monitoring probe against that metric would have cut the outage short. We only found it the day after, via a daily browser test that failed.

The easy first task is to add an Icinga probe that checks the above metric and raises an alarm. I have no idea, though, where the notification should be sent and to whom.

Related incidents:

Event Timeline

We might want separate api and non-api metrics since they have different traffic levels and often an error only affects one of them.

The graph from https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=14&fullscreen distinguishes api from non-api via the Grafana property $entrypoint, which currently has one of three values:

  • web
  • api
  • centrallogin

Under MediaWiki.authmanager. we have several buckets:

  • accountcreation
  • autocreate
  • autologin
  • captcha
  • login
  • logout

Some we probably do not care about, such as failure to log in over the api because of a missing token.

If we can come up with a list of metrics / errors to look at, that would be nice. At a minimum, I guess the ones on the Grafana board would be a good start, namely:

MediaWiki.authmanager.login.api.failure.*
MediaWiki.authmanager.login.web.failure.*

MediaWiki.authmanager.accountcreation.api.failure.*
MediaWiki.authmanager.accountcreation.web.failure.*

Note that centrallogin apparently does not have much traffic, so a single error brings it up to a 100% failure rate.

MediaWiki.authmanager.accountcreation.centrallogin.failure.*
MediaWiki.authmanager.login.centrallogin.failure.*
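To illustrate the low-traffic problem above, here is a minimal sketch of a failure-rate check that refuses to judge when there are too few attempts. The function names, the 50% threshold and the 20-attempt minimum are all made up for illustration; none of them come from the dashboard or from any existing check.

```python
def failure_rate(successes, failures, min_attempts=20):
    """Return failures / attempts, or None when traffic is too low to judge.

    min_attempts is an assumed cut-off, not a value taken from production.
    """
    attempts = successes + failures
    if attempts < min_attempts:
        return None
    return failures / attempts


def should_alert(successes, failures, threshold=0.5, min_attempts=20):
    """Alert only when there is enough traffic AND the failure rate is high.

    threshold is a hypothetical value; a real one would need tuning per entrypoint.
    """
    rate = failure_rate(successes, failures, min_attempts)
    return rate is not None and rate >= threshold
```

With a guard like this, a lone centrallogin failure (0 successes, 1 failure) would not fire, while 90 failures out of 100 attempts on web would.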

Might want to then add captcha?

centrallogin is not interesting, it can be added to web or just ignored. (It was meant to count centrallogin-related logins on the loginwiki; what it actually does is count direct logins there, and those almost never happen.)

For the API, success might be more interesting than failure. A misbehaving bot or a spambot that gets caught on the captcha can drive up failure rates by battering the API; if successes drop, something is wrong. (Of course that something might be Labs going down, or the most active bot stopping. So maybe API would be more noisy than it is worth, after all.)

captcha failures count as login failures so no need to measure it separately.

Note that failure means the authentication code ran successfully but could not log in / register the user (e.g. a wrong password was entered). If there is an error and the code is interrupted, that's not logged in any way. There might be other ways in which authentication is prevented without any failure or error (e.g. if the form cannot be submitted due to an error in the JS validator, or the login link in the personal toolbar goes missing). So the success count is going to be a more important metric than the failure count or the ratio of the two.

Also it would be nice to have an "authentication-related exception count", i.e. an exception counter that is only increased when the exception was thrown on Special:UserLogin or something similar. That is probably blocked on T142313.
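A check on the success count could look roughly like the sketch below: compare recent per-interval success counts against a baseline window and flag a large drop. The function name, the 20% ratio and the minimum baseline of 10 are assumptions for illustration only, and how the counts would be fetched from Graphite is left out entirely.

```python
def success_dropped(recent, baseline, min_baseline=10, ratio=0.2):
    """Return True when recent successes fall far below the baseline average.

    recent / baseline: lists of success counts per interval.
    min_baseline and ratio are hypothetical values, not tuned ones.
    """
    if not recent or not baseline:
        return False  # no data: say nothing rather than guess
    baseline_avg = sum(baseline) / len(baseline)
    if baseline_avg < min_baseline:
        return False  # too little normal traffic for a drop to mean anything
    recent_avg = sum(recent) / len(recent)
    return recent_avg < ratio * baseline_avg
```

This would catch the case where account creations go to zero without any failure being logged, which a failure-rate check alone would miss.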

I have filed that one as part of an incident follow-up task, but Release-Engineering-Team is not working on it.

akosiaris subscribed.

Removing SRE, I don't think anyone from the team(s) is working on this.

Change 901233 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] graphite::alerts: add alert on mediawiki account creation failures

https://gerrit.wikimedia.org/r/901233

Change 901233 merged by Giuseppe Lavagetto:

[operations/puppet@production] graphite::alerts: add alert on mediawiki account creation failures

https://gerrit.wikimedia.org/r/901233