High failure rate of account creation should trigger an alarm / page people
Closed, Resolved, Public

Assigned To

Authored By

	hashar
	Sep 19 2016, 8:49 PM

Description

Pushing wmf.19 last week broke all account creation for almost 18 hours. The error rates are definitely collected, as shown on T145839#2643508:

https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=14&fullscreen

accountcreation_error.png (446×917 px, 58 KB)

A single monitoring probe against that metric would definitely have prevented the fiasco. We only found it the day after via a daily browser test that failed.

The easy first task is to add an Icinga probe that checks the above metric and raises an alarm. I have no idea, though, where the notification should be sent or to whom.
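A minimal sketch of what such a probe could compute, assuming the success and failure counts for a time window have already been fetched from the metrics store (the fetching itself is left out). The function names and the `warn`/`crit` thresholds are hypothetical; only the exit codes follow the standard Nagios/Icinga plugin convention.

```python
from typing import Optional, Sequence, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def failure_ratio(successes: Sequence[Optional[float]],
                  failures: Sequence[Optional[float]]) -> Optional[float]:
    """Return failures / (successes + failures), skipping missing datapoints."""
    ok = sum(v for v in successes if v is not None)
    bad = sum(v for v in failures if v is not None)
    total = ok + bad
    return None if total == 0 else bad / total


def check_account_creation(successes: Sequence[Optional[float]],
                           failures: Sequence[Optional[float]],
                           warn: float = 0.5,   # hypothetical threshold
                           crit: float = 0.8    # hypothetical threshold
                           ) -> Tuple[int, str]:
    """Map the failure ratio of a window to a plugin exit code and message."""
    ratio = failure_ratio(successes, failures)
    if ratio is None:
        return UNKNOWN, "UNKNOWN: no account creation attempts in window"
    if ratio >= crit:
        return CRITICAL, f"CRITICAL: {ratio:.0%} of account creations failed"
    if ratio >= warn:
        return WARNING, f"WARNING: {ratio:.0%} of account creations failed"
    return OK, f"OK: {ratio:.0%} of account creations failed"
```

A wrapper script would print the message and exit with the returned code; who gets notified is a separate question that this sketch does not answer.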

Related incidents:

Details

	Subject	Repo	Branch	Lines +/-
	graphite::alerts: add alert on mediawiki account creation failures	operations/puppet	production	+15 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		akosiaris	T140942 Tracking: Monitoring and alerts for "business" metrics
		Resolved		Joe	T146090 High failure rate of account creation should trigger an alarm / page people

Event Timeline

hashar created this task. Sep 19 2016, 8:49 PM

Restricted Application added a subscriber: Aklapper. Sep 19 2016, 8:49 PM

hashar mentioned this in T140942: Tracking: Monitoring and alerts for "business" metrics. Sep 19 2016, 8:50 PM

We might want separate api and non-api metrics since they have different traffic levels and often an error only affects one of them.

greg moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. Sep 19 2016, 10:44 PM

The graph at https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=14&fullscreen distinguishes api from non-api traffic via the Grafana property $entrypoint, which currently has one of three values:

web
api
centrallogin

Under MediaWiki.authmanager. we have several buckets:

accountcreation
autocreate
autologin
captcha
login
logout

Some we probably do not care about, such as failures to log in over the API because of a missing token.

If we can come up with a list of metrics / errors to look at, that would be nice. At a minimum, I guess the ones on the Grafana board would be a good start, namely:

MediaWiki.authmanager.login.api.failure.*
MediaWiki.authmanager.login.web.failure.*

MediaWiki.authmanager.accountcreation.api.failure.*
MediaWiki.authmanager.accountcreation.web.failure.*
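As an illustration, the four wildcard paths above could be generated per bucket and entrypoint, so that api and non-api traffic get separate checks. This is only a sketch: `failure_targets` is a hypothetical helper, and wrapping each path in Graphite's `sumSeries()` assumes the metrics are queried through Graphite's render functions.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

# Buckets and entrypoints taken from the metric list above.
BUCKETS = ("login", "accountcreation")
ENTRYPOINTS = ("web", "api")


def failure_targets(buckets: Sequence[str] = BUCKETS,
                    entrypoints: Sequence[str] = ENTRYPOINTS
                    ) -> Dict[Tuple[str, str], str]:
    """Build one summed failure target per (bucket, entrypoint) pair."""
    return {
        (bucket, entry): f"sumSeries(MediaWiki.authmanager.{bucket}.{entry}.failure.*)"
        for bucket, entry in product(buckets, entrypoints)
    }


for key, target in failure_targets().items():
    print(key, target)
```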

Note that centrallogin apparently does not have much traffic, so a single error brings it up to a 100% failure rate.

MediaWiki.authmanager.accountcreation.centrallogin.failure.*
MediaWiki.authmanager.login.centrallogin.failure.*
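One way to keep a low-traffic entrypoint from flapping to 100% on a single error is to refuse to compute a rate below a minimum number of attempts. A minimal sketch, where `guarded_failure_rate` and the `min_attempts` default are hypothetical:

```python
from typing import Optional


def guarded_failure_rate(successes: int, failures: int,
                         min_attempts: int = 20) -> Optional[float]:
    """Return the failure rate, or None when there is too little traffic to judge."""
    attempts = successes + failures
    if attempts < min_attempts:
        # Too few samples: one failed login would otherwise read as 100%.
        return None
    return failures / attempts
```

A check built on this would treat `None` as "not enough data" rather than as an alert condition.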

Might want to then add captcha?

centrallogin is not interesting, it can be added to web or just ignored. (It was meant to count centrallogin-related logins on the loginwiki; what it actually does is count direct logins there, and those almost never happen.)

For the API success might be more interesting than failure. A misbehaving bot or a spambot that gets caught on the captcha can drive up failure rates by battering the API; if successes drop, something is wrong. (Of course that something might be Labs going down, or the most active bot stopping. So maybe API would be more noisy than worth it, after all.)

captcha failures count as login failures so no need to measure it separately.

Note that failure means the authentication code ran successfully but could not log in / register the user (e.g. a wrong password was entered). If there is an error and the code is interrupted, that's not logged in any way. There might be other ways in which authentication is prevented without any failure or error (e.g. if the form cannot be submitted due to an error in the JS validator, or the login link in the personal toolbar goes missing). So the success count is going to be a more important metric than the failure count or the rate of the two.
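A check on the success count could compare the current window against a baseline of earlier windows and fire when successes collapse. The sketch below uses the median of past windows as the baseline; the function name, the choice of median, and the `min_fraction` default are all assumptions for illustration.

```python
from statistics import median
from typing import Sequence


def success_dropped(current: float, history: Sequence[float],
                    min_fraction: float = 0.3) -> bool:
    """True when the current success count falls below a fraction of the baseline."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = median(history)
    if baseline <= 0:
        return False  # no traffic is normal for this window
    return current < min_fraction * baseline
```

Because it only looks at successes, this would also catch cases where the form cannot be submitted at all, at the cost of firing when a large legitimate client simply stops.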

Also it would be nice to have an "authentication-related exception count", i.e. an exception counter that is only incremented when the exception is thrown on Special:UserLogin or something similar. That is probably blocked on T142313.

hashar mentioned this in T146461: No ORES jobs are running since deployment of 1.28.0-wmf.20. Sep 23 2016, 9:53 AM

greg moved this task from INBOX to Watching / External on the Release-Engineering-Team board. May 20 2017, 12:17 PM

greg edited projects, added Release-Engineering-Team (Watching / External); removed Release-Engineering-Team.

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:49 PM

• Phabricator_maintenance added a project: Release-Engineering-Team-TODO. Jun 12 2019, 11:44 PM

• Phabricator_maintenance moved this task from Should be empty (use Release-Engineering-Team) to Watching/External on the Release-Engineering-Team-TODO board. Jun 12 2019, 11:48 PM

• Phabricator_maintenance removed a project: Release-Engineering-Team (Watching / External). Jun 12 2019, 11:49 PM

greg added a project: Release-Engineering-Team. Jun 21 2019, 10:35 PM

greg moved this task from INBOX to Deployment services on the Release-Engineering-Team board. Jul 9 2019, 5:48 PM

greg edited projects, added Release-Engineering-Team (Deployment services); removed Release-Engineering-Team.

hashar unsubscribed. Jul 11 2019, 12:33 PM

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM

fgiunchedi moved this task from Inbox to Radar on the observability board. Jul 20 2020, 1:15 PM

thcipriani removed a project: Release-Engineering-Team (Deployment services). Apr 20 2021, 1:10 AM

thcipriani edited projects, added Release-Engineering-Team (Radar); removed Release-Engineering-Team-TODO. Apr 20 2021, 3:33 AM

thcipriani moved this task from Limbo to Watching/External on the Release-Engineering-Team (Radar) board. Apr 20 2021, 3:34 AM

Krinkle updated the task description. Sep 28 2021, 9:26 PM

I filed this one as part of an incident follow-up, but Release-Engineering-Team is not working on it.

Removing SRE, I don't think anyone from the team(s) is working on this.

Joe claimed this task. Mar 20 2023, 11:08 AM

Joe added projects: serviceops, SRE-Sprint-Week-Sustainability-March2023.

Joe moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.

Joe moved this task from Backlog to Doing on the SRE-Sprint-Week-Sustainability-March2023 board.

Change 901233 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] graphite::alerts: add alert on mediawiki account creation failures

https://gerrit.wikimedia.org/r/901233

gerritbot added a project: Patch-For-Review. Mar 20 2023, 3:18 PM

Change 901233 merged by Giuseppe Lavagetto:

[operations/puppet@production] graphite::alerts: add alert on mediawiki account creation failures

https://gerrit.wikimedia.org/r/901233

Joe moved this task from Doing to Done on the SRE-Sprint-Week-Sustainability-March2023 board. Mar 21 2023, 10:24 AM

Maintenance_bot removed a project: Patch-For-Review. Mar 21 2023, 10:32 AM

Joe closed this task as Resolved. Mar 21 2023, 10:36 AM
