Create dashboard to track key authentication metrics before, during and after AuthManager rollout
Closed, Resolved, Public

Description

The AuthManager project will be changing various implementation aspects of the user login and account creation flow. It will be very important to ensure that these changes are not causing regressions in the normal rate of successful logins and account creations across the Wikimedia projects.

Metrics to track:

  • Successful logins via Special:UserLogin
  • Successful logins via API action=login
  • Failed logins via Special:UserLogin
  • Failed logins via API action=login
  • Account creations via Special:Userlogin/signup
  • Account creations via API action=createaccount

Dashboard at https://grafana.wikimedia.org/#/dashboard/db/authentication-metrics

Event Timeline

bd808 raised the priority of this task to Normal.Mar 5 2015, 9:48 PM
bd808 updated the task description.
bd808 added subscribers: Anomie, csteipp, Legoktm and 2 others.

Adding captchas would be good:

  • Captcha presentation
  • Captcha solve
  • Captcha fail

We already log the last two, just not the first, iirc.

Nuria added a subscriber: Nuria.Mar 25 2015, 12:37 AM

@bd808

How are you planning on tracking this data? Server-side event logging? Something else? I am asking because you probably want to have this instrumented and have a baseline before you start changing the user login.

bd808 added a subscriber: Tgr.Mar 25 2015, 12:56 AM

@Nuria I think these will all be server side events, and yes we will want the logging in place before we start rolling out updates to the existing workflows to get a baseline measurement. Hopefully the team can get started on adding any missing instrumentation very early in Q4. Luckily @Tgr will be joining us; he has past experience with getting things instrumented properly that will come in very handy.

Nuria added a comment.Mar 25 2015, 1:09 AM

@bd808 Excellent, sounds like we are set here. One last comment: when we think about reporting stats, let's make sure we report "relative" measures, like "the rate of failed logins versus successful logins" (<failed logins> / <successful logins>). Rates should not fluctuate if we change sampling rates or get a spike of users logging in due to, for example, some enwiki-wide event.
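
To illustrate why a relative measure is robust, here is a minimal Python sketch. It assumes idealised uniform sampling, where every count scales by the same factor; the function names and the counts are hypothetical and not taken from any Wikimedia data.

```python
def failure_ratio(failed: int, succeeded: int) -> float:
    """Return failed / succeeded, the relative measure suggested above."""
    if succeeded <= 0:
        raise ValueError("need at least one successful login to form the ratio")
    return failed / succeeded


def scale(count: int, factor: float) -> int:
    """Idealised sampling or traffic change: the count scales by `factor`."""
    return round(count * factor)


# Hypothetical counts for one time window (not real numbers).
failed, succeeded = 300, 1000

full = failure_ratio(failed, succeeded)
# Same window observed with a 10% sampling rate.
sampled = failure_ratio(scale(failed, 0.1), scale(succeeded, 0.1))
# Same behaviour during a 3x traffic spike.
spike = failure_ratio(scale(failed, 3), scale(succeeded, 3))

print(full, sampled, spike)
```

The absolute counts differ by a factor of 30 between the sampled and the spike case, while the ratio stays the same. With real (random) sampling the ratio would only be approximately stable, and noisier at low volumes.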

Tgr claimed this task.Apr 3 2015, 7:50 PM
bd808 added a subscriber: ori.Apr 7 2015, 9:26 PM

After a longish discussion on IRC, @ori suggested that the most direct path to instrumenting the code for these events is to use hooks and the WikimediaEvents extension. See https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blob/master/WikimediaEventsHooks.php#L50-104 as an example.

Tgr added a comment.Apr 16 2015, 5:38 PM
  • Successful logins via Special:UserLogin - LoginAuthenticateAudit hook (there is also UserLoginComplete but it overcounts a bit as it includes visiting the login page through a returnto link while logged in)
  • Successful logins via API action=login - as above
  • Failed logins via Special:UserLogin - LoginAuthenticateAudit hook is called for failed password authentication, but not in a number of other cases (e.g. wrong token, failed captcha)
  • Failed logins via API action=login - as above
  • Account creations via Special:Userlogin/signup - AddNewAccount hook for success, nothing for failure
  • Account creations via API action=createaccount - AddNewAccount hook for success, AddNewAccountApiResult for both
  • Captcha presentation (I'll assume this is only about login/registration captchas) - this is itself done in hooks so as far as I can see there is no way for another hook to spy on it
  • Captcha solve - as above
  • Captcha fail - as above

So it seems there are no hooks for a number of cases. Given the already messy state of MediaWiki hooks (cf. Requests for comment/Inventory hooks, assess need), I would rather not pile on any more. Probably better to do this on top of T95356.

Change 205864 had a related patch set uploaded (by Gergő Tisza):
Track key authentication metrics

https://gerrit.wikimedia.org/r/205864

Change 205865 had a related patch set uploaded (by Gergő Tisza):
Fire event on captcha display/success/failure.

https://gerrit.wikimedia.org/r/205865

Change 205869 had a related patch set uploaded (by Gergő Tisza):
Log auth-related wfTrack events to statsd

https://gerrit.wikimedia.org/r/205869

Do the proposed metrics help in finding out whether people are being successfully logged in on all wikis, or have to login separately many times?

We're still dealing with the fallout from SUL2, I must say I dread the next iteration. :(

bd808 added a comment.Jul 18 2015, 7:08 PM

Do the proposed metrics help in finding out whether people are being successfully logged in on all wikis, or have to login separately many times?

The metrics we are concerned with are primarily about authentication attempts and their outcome (success/failure) as described in T91701#1212974. This would be indirectly related to CentralAuth but not directly attempting to measure if the SUL loginwiki interactions are successful in setting cross wiki cookies or not. The AuthStack replacement of AuthPlugin is a general MediaWiki feature change and not directly a CentralAuth or SUL related project.

This would be indirectly related to CentralAuth but not directly attempting to measure if the SUL loginwiki interactions are successful in setting cross wiki cookies or not.

Ok, but why and what's the expected meaning of those numbers? For instance, if autologin fails I'm forced to login manually and "Successful logins via Special:UserLogin" goes up, but that's a bad thing (wasted user time, frustrated users).

I can't identify a single metric among the proposed ones which would make us able to tell "ok, things are going well", other than utter catastrophes where an enormous amount of manual logins fail.

Tgr added a comment.Jul 18 2015, 9:27 PM

I can't identify a single metric among the proposed ones which would make us able to tell "ok, things are going well", other than utter catastrophes where an enormous amount of manual logins fail.

It doesn't have to be enormous, just significantly larger than the current rate of failures.

If things are going well, none of the metrics should change, given that this is not a user-facing change.
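
One way to make "significantly larger than the current rate of failures" concrete is a two-proportion z-test comparing a baseline window with a post-rollout window. The Python sketch below is only an illustration under that assumption: the thread does not say which test (if any) was used, and the function names, the threshold of 3.0, and the counts are all hypothetical.

```python
import math


def failure_rate_z(base_failed: int, base_total: int,
                   cur_failed: int, cur_total: int) -> float:
    """z-score of the change in failure rate between two windows.

    Positive values mean the current failure rate is higher than baseline.
    """
    p_base = base_failed / base_total
    p_cur = cur_failed / cur_total
    # Pooled failure rate under the hypothesis that nothing changed.
    pooled = (base_failed + cur_failed) / (base_total + cur_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cur_total))
    if std_err == 0:
        return 0.0
    return (p_cur - p_base) / std_err


def looks_like_regression(base_failed: int, base_total: int,
                          cur_failed: int, cur_total: int,
                          z_threshold: float = 3.0) -> bool:
    """Flag only increases in the failure rate beyond the chosen threshold."""
    return failure_rate_z(base_failed, base_total, cur_failed, cur_total) > z_threshold


# Hypothetical counts: 10% failures before, 15% after, 10,000 attempts each.
print(looks_like_regression(1000, 10000, 1500, 10000))
# Unchanged failure rate.
print(looks_like_regression(1000, 10000, 1000, 10000))
```

Because the change is not meant to be user-facing, the expected outcome is that no metric trips such a check; a test like this only guards against the regression case.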

Tgr added a comment.Jul 18 2015, 9:34 PM

Do the proposed metrics help in finding out whether people are being successfully logged in on all wikis, or have to login separately many times?

Not as far as I can see. Maybe we should count loginwiki events and other wiki events under different keys.

Change 226666 had a related patch set uploaded (by Gergő Tisza):
[WIP] Handler to count log events with a 'track-topic' key

https://gerrit.wikimedia.org/r/226666

Change 226666 abandoned by Gergő Tisza:
[WIP] Handler to count log events with a 'track-topic' key

Reason:
In hindsight this is rather pointless - there is no advantage to adding a magic key to a log context over just calling stats->increment at the same place, and this cannot easily cover nontrivial key generation the way https://gerrit.wikimedia.org/r/#/c/205869/3/WikimediaEventsHooks.php does.

https://gerrit.wikimedia.org/r/226666

Change 226951 had a related patch set uploaded (by Gergő Tisza):
Log event on captcha display/success/failure.

https://gerrit.wikimedia.org/r/226951

Change 226955 had a related patch set uploaded (by Gergő Tisza):
Track key authentication metrics

https://gerrit.wikimedia.org/r/226955

Change 226956 had a related patch set uploaded (by Gergő Tisza):
Count log events in the authmanager channel

https://gerrit.wikimedia.org/r/226956

Change 226951 had a related patch set uploaded (by Gergő Tisza):
Log event on captcha display/success/failure.
https://gerrit.wikimedia.org/r/226951

What's the advantage over captcha.log?

Tgr added a comment.Jul 25 2015, 6:36 PM

What's the advantage over captcha.log?

Don't know, I'm not familiar with that one. The idea here is to pipe the log channel into statsd (instead of a file) so we can see changes in volume.
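
The idea of counting log events instead of writing them to a file can be sketched with Python's standard `logging` module standing in for MediaWiki's logging. This is not the code from the changes above: the `CountingHandler` class, the `authmetrics` key prefix, the `event` field, and the logger name are all hypothetical. The only external fact relied on is the StatsD counter line format `<key>:<value>|c`.

```python
import logging
from collections import Counter


class CountingHandler(logging.Handler):
    """Count log records per event key instead of writing them anywhere."""

    def __init__(self, prefix: str = "authmetrics") -> None:
        super().__init__()
        self.prefix = prefix
        self.counts: Counter = Counter()

    def emit(self, record: logging.LogRecord) -> None:
        # `event` is a hypothetical field passed through `extra=`.
        event = getattr(record, "event", "unknown")
        self.counts[f"{self.prefix}.{event}"] += 1

    def statsd_lines(self) -> list:
        """Render the accumulated counts as StatsD counter lines."""
        return [f"{key}:{n}|c" for key, n in sorted(self.counts.items())]


# Hypothetical channel name; propagation is disabled so nothing is printed.
logger = logging.getLogger("authmanager-sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = CountingHandler()
logger.addHandler(handler)

logger.info("captcha shown", extra={"event": "captcha.display"})
logger.info("captcha shown", extra={"event": "captcha.display"})
logger.info("captcha solved", extra={"event": "captcha.success"})

for line in handler.statsd_lines():
    print(line)
```

A real deployment would send these lines to a StatsD daemon over UDP rather than print them; the point here is only that the log message itself is discarded and just the volume per key is kept.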

Change 205864 abandoned by Gergő Tisza:
Track key authentication metrics

Reason:
wfTrack has been abandoned

https://gerrit.wikimedia.org/r/205864

Change 205865 abandoned by Gergő Tisza:
Fire event on captcha display/success/failure.

Reason:
wfTrack has been abandoned

https://gerrit.wikimedia.org/r/205865

Change 205869 abandoned by Gergő Tisza:
Log auth-related wfTrack events to statsd

Reason:
wfTrack has been abandoned

https://gerrit.wikimedia.org/r/205869

Change 226955 merged by jenkins-bot:
Track key authentication metrics

https://gerrit.wikimedia.org/r/226955

Change 226951 merged by jenkins-bot:
Log event on captcha display/success/failure.

https://gerrit.wikimedia.org/r/226951

Change 227630 had a related patch set uploaded (by Gergő Tisza):
Add configuration for authmetrics logging

https://gerrit.wikimedia.org/r/227630

Nemo_bis removed a subscriber: Nemo_bis.Jul 29 2015, 9:04 AM

Change 226956 merged by jenkins-bot:
Count log events in the authmanager channel

https://gerrit.wikimedia.org/r/226956

Change 227630 merged by jenkins-bot:
Add configuration for authmetrics logging

https://gerrit.wikimedia.org/r/227630

Change 229618 had a related patch set uploaded (by Gergő Tisza):
Enable authmetrics logging on group0 wikis

https://gerrit.wikimedia.org/r/229618

Change 229618 merged by jenkins-bot:
Enable authmetrics logging on group0 wikis

https://gerrit.wikimedia.org/r/229618

Change 230034 had a related patch set uploaded (by Gergő Tisza):
Log human-readable login status

https://gerrit.wikimedia.org/r/230034

Change 230034 merged by jenkins-bot:
Log human-readable login status

https://gerrit.wikimedia.org/r/230034

Is there an estimate on when this will be set up for all wikis?

Tgr added a comment.Sep 16 2015, 9:38 PM

Uh, as soon as I don't forget about it... thanks for reminding :)

Change 238978 had a related patch set uploaded (by Gergő Tisza):
Enable authmetrics logging everywhere

https://gerrit.wikimedia.org/r/238978

Change 238978 merged by jenkins-bot:
Enable authmetrics logging everywhere

https://gerrit.wikimedia.org/r/238978

Tgr added a comment.Sep 18 2015, 12:24 AM

Now deployed everywhere. (The error ratio board is broken due to grafana #2484.)

Thanks Gergo!

Tgr added a comment.Sep 18 2015, 9:48 PM

Some interesting observations, completely unrelated to the task at hand:

  • there are about 100x more captcha displays than captcha submit attempts (whether successful or not). Spambots? (They are almost all via web though.)
  • the ratio of successful and failed captcha attempts is about 1:1. That's rather poor. (Although again failures might be inflated by spambots.)
  • about 30% of logins result in a "user does not exist" error. That is surprisingly high.
  • API account creations have a huge failure ratio (about 10:1 - that's very high even when taking into account that it involves two "artificial" failures, for token and captcha); most of it seems attributable to session failures.
bd808 updated the task description.Oct 1 2015, 11:29 PM
In T91701#1652055, @Tgr wrote:

Now deployed everywhere. (The error ratio board is broken due to grafana #2484.)

@Tgr should we make a task to track getting Grafana updated (bug is supposedly fixed upstream) and then close this task as done?

Tgr added a comment.Oct 1 2015, 11:54 PM

That would be cool although a big change (T108546#1627937). But I can probably just transclude the metrics to get two long and unreadable but correct ones. This task was left open for maybe adding latency to the dashboard, which depends on the api.log task and is probably saner to track separately.

bd808 added a comment.Oct 2 2015, 4:22 PM
In T91701#1695829, @Tgr wrote:

That would be cool although a big change (T108546#1627937). But I can probably just transclude the metrics to get two long and unreadable but correct ones. This task was left open for maybe adding latency to the dashboard, which depends on the api.log task and is probably saner to track separately.

I made a backport of the upstream patch to our pre-2.x version: P2144

@ori said he would do the needful to apply it to our Grafana.

Tgr added a comment.Oct 2 2015, 10:51 PM

The task for upgrading to Grafana 2 is T104738.

bd808 added a comment.Oct 2 2015, 11:30 PM

I made a backport of the upstream patch to our pre-2.x version: P2144
@ori said he would do the needful to apply it to our Grafana.

The patch is applied and it has fixed the error rates panel of the dashboard.

You may have to force-reload to get the updated JavaScript if you have had Grafana open since before the patch was applied, or if Varnish has not yet expired the old JS from its cache.

Is there anything left to do here?

Tgr closed this task as Resolved.Oct 27 2015, 11:49 PM
Tgr added a comment.Jan 12 2016, 3:26 AM

Beta version: http://grafana.wmflabs.org/dashboard/db/authentication-metrics (graphs are fragile, reloading a couple times helps)

After deploying the extension updates (but not core SessionManager), no visible change.

Tgr added a comment.Jan 12 2016, 3:29 AM

Although "no visible change" seems to mean zero success rate. But manual account creation is rare on beta, so maybe that's legit?