Page MenuHomePhabricator

Improve (or identify) monitoring for CentralAuth autologins on Wikimedia wikis
Closed, ResolvedPublic

Description

Currently CentralAuth autologin (where users who manually logged in or registered on one wiki get seamlessly logged in on another wiki) is a nice to have; if it fails (which is not uncommon), the user can just enter their password and log in the usual way. For temporary accounts autologin will be the only way to access their account on another wiki, though. If it doesn't work, they will be unable to affect other wikis (e.g. upload files to Commons, or link pages to Wikidata). Of course they can just register a normal account, but we should at least be aware to what extent temporary accounts are crippled.

We should log autologin failures in some way (or identify existing logging). There isn't really a way to differentiate between failures where the user has no active session on the login wiki and ones where that session exists but the browser prevents access to it (as the only difference is the existence of a cookie which the browser doesn't give access to) but we should log cases when it seems the user should have a central session (e.g. recently created account) but autologin fails.

This would also help with monitoring the effects of T326281: Attempt top-level central autologin when visiting the login page (to allow autologin when the browser blocks third-party cookies).

Related Objects

Event Timeline

There are some authevents log events for central autologin, although not present on the authentication-metrics or authentications dashboards; that could be a start.

We should also check if all cross-wiki features that work without the user navigating to the other wiki (such as VisualEditor drag-and-drop file upload, or linking pages to Wikidata) use or fall back to the centralauthtoken API (which is a bit inconvenient but isn't affected by browser limitations).

@Tgr Any thoughts on if this should stay in Growth's current sprint or if this might be owned by the MediaWiki Core team in the future?

pmiazga updated Other Assignee, added: pmiazga.

There is also EventLogging data via Schema:CentralAuth apparently.

Autocreation error logging is broken since rMW734f0c23e377: update authevents logging status context to use string representation directly. We should probably fix that.

Broken how? I found some log entries generated from that code and reporting autocreation errors: https://logstash.wikimedia.org/goto/60cb2963281f561f437934c1ae0eb78b

Or do you mean that the format is too creative?

image.png (217×1 px, 29 KB)

It's unhelpful, but I think I wanted to say that monitoring is broken. That's done by AuthManagerStatsdHandler in WikimediaEvents, and it really expects a Status for the status parameter. The Monolog Logstash handler OTOH cannot handle a Status (for no good reason, IMO; if the parameter is stringifiable, it should just stringify it). There is more context in the code review for that patch.

Urbanecm_WMF subscribed.

Moving this out of the Growth boards (and altering subtasking), as CentralAuth is not a part of Growth responsibilities. If Growth is needed here, please let us know.

Change 965517 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] Fix logging Status objects to 'authevents' channel

https://gerrit.wikimedia.org/r/965517

Change 965540 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@master] Fix logging Status objects to 'authevents' channel

https://gerrit.wikimedia.org/r/965540

Change 965541 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/WikimediaEvents@master] AuthManagerStatsdHandler: Remove support for Status object

https://gerrit.wikimedia.org/r/965541

It's unhelpful, but I think I wanted to say that monitoring is broken. That's done by AuthManagerStatsdHandler in WikimediaEvents, and it really expects a Status for the status parameter. The Monolog Logstash handler OTOH cannot handle a Status (for no good reason, IMO; if the parameter is stringifiable, it should just stringify it). There is more context in the code review for that patch.

The patches above take care of this (although I decided to remove the uses of Status objects rather than fix them: figuring out how to log them seems unpleasant, and most of the authentication code doesn't use them in the first place).

There are some authevents log events for central autologin, although not present on the authentication-metrics or authentications dashboards; that could be a start.

I had a look, and I think we could just add a bit more logging to SpecialCentralAutoLogin.php (currently it only logs autologin successes, but not failures), then add it to those dashboards. Does that sound good enough to resolve this task to you?

We should also check if all cross-wiki features that work without the user navigating to the other wiki (such as VisualEditor drag-and-drop file upload, or linking pages to Wikidata) use or fall back to the centralauthtoken API (which is a bit inconvenient but isn't affected by browser limitations).

centralauthtoken API is handled transparently by mw.ForeignApi (ext.centralauth.ForeignApi.js), which is probably used by everything that does cross-wiki API requests (because it's quite difficult to do them correctly otherwise, so no one even tried, as far as I know).

There is also EventLogging data via Schema:CentralAuth apparently.

This seems quite limited and it's using the legacy way of setting up schemas. I don't know where it even logs to. Can we just remove it? I don't really want to have multiple kinds of logging, and if we later decide we need EventLogging too, I'd rather set it up from scratch. See also T282131, which calls out this schema.

I had a look, and I think we could just add a bit more logging to SpecialCentralAutoLogin.php (currently it only logs autologin successes, but not failures), then add it to those dashboards. Does that sound good enough to resolve this task to you?

I think we should:

  • Count successful and failed central logins (Special:CentralLogin) via statsd. "Successful" is a weak measure here as the real success criteria is being able to set cookies and we can't really verify that, but still worth something.
  • Count successful and failed autologins. This requires differentiating between type=1x1 edge logins and type=1x1 autologins (used for non-JS clients; see the <noscript> part in PageDisplayHookHandler::onBeforePageDisplay()). We should also differentiate between top-level and subresource autologin.
  • Separately, count successful and failed edge logins.
  • On top of all that, log (at least with level=debug) every step of Special:CentralLogin and Special:CentralAutoLogin and also whenever they are triggered (LoginCompleteHookHandler::doCentralLoginRedirect(), CentralAuthHooks::getDomainAutoLoginHtml(), CentralAuthHooks::getEdgeLoginHTML(), PageDisplayHookHandler::onBeforePageDisplay(),

SpecialPageBeforeExecuteHookHandler::onSpecialPageBeforeExecute(), LoginCompleteHookHandler::onTempUserCreatedRedirect()). Include details about the user (username if we know it, browser UA). This would be very useful when looking into login problems reported by other people - we can just tell them to use the WikimediaDebug extension, and look up the data. It can also be useful for temporarily setting level=info to review the current state of browser support. This is not monitoring, so could be a separate task.

centralauthtoken API is handled transparently by mw.ForeignApi (ext.centralauth.ForeignApi.js), which is probably used by everything that does cross-wiki API requests (because it's quite difficult to do them correctly otherwise, so no one even tried, as far as I know).

Difficult to do them correctly but easy to do them incorrectly. In the past, I have definitely seen code just making cross-wiki requests, relying on session cookies and hoping for the best. Maybe it's all gone by now (I guess in any case hard to find since code-wise it wouldn't differ from a local AJAX request).

This seems quite limited and it's using the legacy way of setting up schemas. I don't know where it even logs to. Can we just remove it? I don't really want to have multiple kinds of logging, and if we later decide we need EventLogging too, I'd rather set it up from scratch. See also T282131, which calls out this schema.

Sure, let's remove it. (cc @phuedx) EventLogging data is pretty painful to use anyway. But I would still log to Logstash in those places, per above.

BTW there's also some pre-existing monitoring in SpecialCentralLogin::showError() which sends statd data for centralauth.centrallogin_errors.<error-message-key> stats. I think I'd rather use Logstash for that kind of thing. We did the same messagekey-based-statsd thing for the authevents channel, and never used it much. Seeing changes in volume are sometimes useful for security-wise interesting error types (like bad password or bad captcha) but central login doesn't really have anything like that. And Logstash still lets one visualise volumes with a little (but not that much) more effort.

Sure, let's remove it. (cc @phuedx) EventLogging data is pretty painful to use anyway. But I would still log to Logstash in those places, per above.

I know what I'm doing this Friday…

I've got plenty of bandwidth for code review right now so if you need help testing/landing those/future patches, then LMK!

BTW there's also some pre-existing monitoring in SpecialCentralLogin::showError() which sends statd data for centralauth.centrallogin_errors.<error-message-key> stats. I think I'd rather use Logstash for that kind of thing. We did the same messagekey-based-statsd thing for the authevents channel, and never used it much. Seeing changes in volume are sometimes useful for security-wise interesting error types (like bad password or bad captcha) but central login doesn't really have anything like that. And Logstash still lets one visualise volumes with a little (but not that much) more effort.

Are you saying we should remove it? (I'd be on board with that)

This data is used for some alerts: https://gerrit.wikimedia.org/g/operations/puppet/+/c678553ed26e63dbad2a0a0924f96962b6447a72/modules/profile/manifests/graphite/alerts.pp#47 – no idea who or where receives them.

That also links to a dashboard: https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 – which is probably redundant to https://grafana.wikimedia.org/d/000000004/authentication-metrics?orgId=1&viewPanel=13. No idea how to find if anything else uses the data.

Are you saying we should remove it? (I'd be on board with that)

I think we should either log SpecialCentral[Auto]Login steps to Logstash at the debug level (so people running into issues can use X-Wikimedia-Debug to provide a trace) and also log the final success/failure to statsd (so we can see trends over time), or we should log to Logastash at the info level, in which case we can also use that for trends/alerts (a little more painfully, OTOH with less code to maintain).

This data is used for some alerts:

That was added in rOPUP4dd4e50b6fbc: graphite::alerts: add alerting on session loss by @Joe so maybe he knows if it is still needed.

Change 968382 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@master] [WIP] Monitoring and debug logging for central logins

https://gerrit.wikimedia.org/r/968382

Change 968382 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Monitoring and debug logging for central logins

https://gerrit.wikimedia.org/r/968382

I added panels for autologin success/failure and autologin error types to https://grafana.wikimedia.org/d/000000004/authentication-metrics. (edit: and edge login)

image.png (4×3 px, 2 MB)

@Tgr @Krinkle Please have a look and check if I did that right, since I haven't done it before and I'm not sure what I'm doing.

I think this is the last step for this task?

Thanks, that looks great!

I think we should exclude "Not centrally logged in" from the success/failure stats (like we do for another not-really-error for login), it's normal behavior.

I think this is the last step for this task?

There's this bit:

There isn't really a way to differentiate between failures where the user has no active session on the login wiki and ones where that session exists but the browser prevents access to it (as the only difference is the existence of a cookie which the browser doesn't give access to) but we should log cases when it seems the user should have a central session (e.g. recently created account) but autologin fails.

Not sure how feasible it is, but it would be nice to get a feeling of how often browsers break central login / autologin / edge login. Currently "Not centrally logged in" covers both that and the case where the user is actually not centrally logged in.

Also, maybe worth adding an error breakdown for central login?

There's this bit:

There isn't really a way to differentiate between failures where the user has no active session on the login wiki and ones where that session exists but the browser prevents access to it (as the only difference is the existence of a cookie which the browser doesn't give access to) but we should log cases when it seems the user should have a central session (e.g. recently created account) but autologin fails.

Not sure how feasible it is, but it would be nice to get a feeling of how often browsers break central login / autologin / edge login. Currently "Not centrally logged in" covers both that and the case where the user is actually not centrally logged in.

You'll need to suggest some ideas for actually doing this, since I don't have any good ones. How do we know if they recently created an account if they're logged-out and we don't know their username? We could try matching IPs, but that seems icky, and I'm not sure if it would work that well anyway.

I think we should exclude "Not centrally logged in" from the success/failure stats (like we do for another not-really-error for login), it's normal behavior.

I'm not sure – it's not an error, but it is a failure. We should be able to notice if their number suddenly rises, for exactly the reason you mention: noticing any large-scale changes to browser behaviors. Maybe it should be another line on the chart, separated out from the rest of failures?

Also, maybe worth adding an error breakdown for central login?

I thought it would be mostly the same as login, I can add one though.


By the way, why do we have captcha metrics on the same dashboard? They seem mostly unrelated to me.

Also I've just noticed that the different charts show different units – some have rate per second, and some per minute. The "Login" chart, I'm pretty sure, has successes per minute and failures per second.

You'll need to suggest some ideas for actually doing this, since I don't have any good ones. How do we know if they recently created an account if they're logged-out and we don't know their username? We could try matching IPs, but that seems icky, and I'm not sure if it would work that well anyway.

I think the easy target here is top-level autologin which always ends on the starting wiki. (And central login since that's only done right after login.) You could also just do the boring stuff and add an URL parameter that just says "we expect this request to succeed".

I think we should exclude "Not centrally logged in" from the success/failure stats (like we do for another not-really-error for login), it's normal behavior.

I'm not sure – it's not an error, but it is a failure. We should be able to notice if their number suddenly rises, for exactly the reason you mention: noticing any large-scale changes to browser behaviors. Maybe it should be another line on the chart, separated out from the rest of failures?

An autologin is attempted every time an anonymous user visits a wiki (with a 1/day throttling), so it's influenced a lot by traffic patterns. I'd include it in the error breakdown but not include it in the success/failure chart, but YMMV.

(We should probably also exclude "normal" failures from the error rates chart, it's not very informative right now.)

Also, maybe worth adding an error breakdown for central login?

I thought it would be mostly the same as login, I can add one though.

Central login is a separate code path from both login and autologin/edge login.

By the way, why do we have captcha metrics on the same dashboard? They seem mostly unrelated to me.

They are for the signup captcha specifically. (Well, signup and login, but we almost never show a captcha on login.) The code path is authentication-related (it's an AuthManager plugin), and the rates are useful for understanding authentication-related issues (mainly when there is a huge signup spike due to some spambot).

Also I've just noticed that the different charts show different units – some have rate per second, and some per minute. The "Login" chart, I'm pretty sure, has successes per minute and failures per second.

Hm, good spot. Not sure what's up with that. Maybe someone started converting them to minute-based, and stopped halfway?

You'll need to suggest some ideas for actually doing this, since I don't have any good ones. How do we know if they recently created an account if they're logged-out and we don't know their username? We could try matching IPs, but that seems icky, and I'm not sure if it would work that well anyway.

I think the easy target here is top-level autologin which always ends on the starting wiki. (And central login since that's only done right after login.) You could also just do the boring stuff and add an URL parameter that just says "we expect this request to succeed".

I still don't get how you want to do this. How do we know that we expect this request to succeed?

I think we should exclude "Not centrally logged in" from the success/failure stats (like we do for another not-really-error for login), it's normal behavior.

I'm not sure – it's not an error, but it is a failure. We should be able to notice if their number suddenly rises, for exactly the reason you mention: noticing any large-scale changes to browser behaviors. Maybe it should be another line on the chart, separated out from the rest of failures?

An autologin is attempted every time an anonymous user visits a wiki (with a 1/day throttling), so it's influenced a lot by traffic patterns. I'd include it in the error breakdown but not include it in the success/failure chart, but YMMV.

(We should probably also exclude "normal" failures from the error rates chart, it's not very informative right now.)

Done, it's a separate line now.

Also, maybe worth adding an error breakdown for central login?

I thought it would be mostly the same as login, I can add one though.

Central login is a separate code path from both login and autologin/edge login.

Done.

Also I've just noticed that the different charts show different units – some have rate per second, and some per minute. The "Login" chart, I'm pretty sure, has successes per minute and failures per second.

Hm, good spot. Not sure what's up with that. Maybe someone started converting them to minute-based, and stopped halfway?

Done, all charts in the AuthManager section now show the rate per minute.

I think this really is resolved now. If there's anything else that needs to be done, it should happen in separate tasks.