
High failure rate of account creation should trigger an alarm / page people
Open, High, Public

Description

Pushing wmf.19 last week broke all account creation for almost 18 hours. The error rates are already being collected, as shown in T145839#2643508:

https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=14&fullscreen

A single monitoring probe against that metric would definitely have prevented the fiasco. We only found it the day after via a daily browser test that failed.

Ref:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20160915-MediaWiki

The easy first task is to add an Icinga probe that checks the above metric and raises an alarm. I have no idea, though, where the notification should be sent and to whom.
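
As a rough sketch of the decision logic such a probe could use: Icinga runs Nagios-style check plugins, which report their state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name, the thresholds, and the idea of passing in already-fetched success and failure counts are assumptions for illustration, not an existing check.

```python
# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_failure_rate(successes, failures, warn=0.5, crit=0.9):
    """Return (exit_code, message) for one time window of counts.

    warn and crit are hypothetical failure-rate thresholds (0..1).
    """
    total = successes + failures
    if total == 0:
        # No attempts recorded: the rate is undefined, so report UNKNOWN.
        return UNKNOWN, "UNKNOWN: no account creation attempts recorded"
    rate = failures / total
    message = "account creation failure rate %.0f%% (%d/%d)" % (
        rate * 100, failures, total)
    if rate >= crit:
        return CRITICAL, "CRITICAL: " + message
    if rate >= warn:
        return WARNING, "WARNING: " + message
    return OK, "OK: " + message
```

A real plugin would fetch the counts from the metrics backend, print the message, and exit with the returned code; who gets notified is still the open question.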

Event Timeline

hashar created this task. Sep 19 2016, 8:49 PM
Tgr added a subscriber: Tgr. Sep 19 2016, 10:36 PM

We might want separate api and non-api metrics since they have different traffic levels and often an error only affects one of them.

The graph at https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=14&fullscreen distinguishes API from non-API traffic via the Grafana variable $entrypoint, which currently takes one of three values:

  • web
  • api
  • centrallogin

Under MediaWiki.authmanager. we have several buckets:

  • accountcreation
  • autocreate
  • autologin
  • captcha
  • login
  • logout

Some we probably do not care about, such as failures to log in over the API because of a missing token.

It would be nice to come up with a list of metrics / errors to look at. At a minimum, I guess the ones on the Grafana board would be a good start, namely:

MediaWiki.authmanager.login.api.failure.*
MediaWiki.authmanager.login.web.failure.*

MediaWiki.authmanager.accountcreation.api.failure.*
MediaWiki.authmanager.accountcreation.web.failure.*

Note that centrallogin apparently does not get much traffic, so a single error brings it up to a 100% failure rate.

MediaWiki.authmanager.accountcreation.centrallogin.failure.*
MediaWiki.authmanager.login.centrallogin.failure.*
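
To illustrate the low-traffic problem, here is a minimal sketch of a failure-rate calculation that refuses to judge a window with too few events. The function name and the `min_events` floor are hypothetical; the counts are made-up illustration values, not real data.

```python
def failure_rate(successes, failures, min_events=20):
    """Return the failure rate, or None when the window has too few events.

    min_events is a hypothetical floor; below it a rate is mostly noise.
    """
    total = successes + failures
    if total < min_events:
        return None
    return failures / total


# One failed attempt on a near-idle entry point: too little data to judge.
print(failure_rate(successes=0, failures=1))      # None
# The same single failure among normal traffic is negligible.
print(failure_rate(successes=199, failures=1))    # 0.005
```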

Might want to then add captcha?

Tgr added a comment. Sep 20 2016, 8:43 AM

centrallogin is not interesting, it can be added to web or just ignored. (It was meant to count centrallogin-related logins on the loginwiki; what it actually does is count direct logins there, and those almost never happen.)

For the API, success might be more interesting than failure. A misbehaving bot or a spambot that gets caught on the captcha can drive up failure rates by battering the API; if successes drop, something is wrong. (Of course that something might be Labs going down, or the most active bot stopping. So maybe the API would be noisier than it is worth, after all.)

Captcha failures count as login failures, so there is no need to measure them separately.

Tgr added a comment. Sep 20 2016, 6:22 PM

Note that failure means the authentication code ran successfully but could not log in / register the user (e.g. a wrong password was entered). If there is an error and the code is interrupted, that's not logged in any way. There might be other ways in which authentication is prevented without any failure or error (e.g. if the form cannot be submitted due to an error in the JS validator, or the login link in the personal toolbar goes missing). So the success count is going to be a more important metric than the failure count or the ratio of the two.
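
One way a success-count check could work, sketched under assumptions: compare the current window's success count against the median of earlier comparable windows and flag a large drop. The function name, the `min_fraction` threshold, and the choice of a median baseline are all hypothetical.

```python
from statistics import median


def successes_dropped(current, history, min_fraction=0.25):
    """True when the current success count is far below the recent norm.

    history holds success counts for earlier comparable windows;
    min_fraction is a hypothetical threshold relative to their median.
    """
    if not history:
        return False  # nothing to compare against yet
    baseline = median(history)
    return current < baseline * min_fraction
```

Unlike a failure-rate check, this also fires when attempts never reach the authentication code at all, at the cost of false alarms whenever legitimate traffic drops for unrelated reasons.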

Also, it would be nice to have an "authentication-related exception count", i.e. an exception counter that is only increased when the exception was thrown on Special:UserLogin or something similar. That is probably blocked on T142313.

hashar removed a subscriber: hashar. Jul 11 2019, 12:33 PM