Improve (or identify) monitoring for CentralAuth autologins on Wikimedia wikis
Closed, ResolvedPublic

Assigned To

Authored By

	Tgr
	Jan 16 2023, 5:16 AM

Description

Currently CentralAuth autologin (where users who manually logged in or registered on one wiki get seamlessly logged in on another wiki) is a nice-to-have; if it fails (which is not uncommon), the user can just enter their password and log in the usual way. For temporary accounts, though, autologin will be the only way to access their account on another wiki. If it doesn't work, they will be unable to affect other wikis (e.g. upload files to Commons, or link pages to Wikidata). Of course they can just register a normal account, but we should at least be aware of the extent to which temporary accounts are crippled.

We should log autologin failures in some way (or identify existing logging). There isn't really a way to differentiate between failures where the user has no active session on the login wiki and ones where that session exists but the browser prevents access to it (as the only difference is the existence of a cookie which the browser doesn't give access to) but we should log cases when it seems the user should have a central session (e.g. recently created account) but autologin fails.

This would also help with monitoring the effects of T326281: Attempt top-level central autologin when visiting the login page (to allow autologin when the browser blocks third-party cookies).
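A minimal sketch of the logging idea in the description, in Python rather than the extension's own code. The `expected_central_session` flag and the outcome labels are hypothetical; only the `authevents` channel name comes from this task. It shows that a failure is only worth a warning when a central session was expected.

```python
import logging
from dataclasses import dataclass

# Channel name as mentioned in this task; the logger setup itself is illustrative.
logger = logging.getLogger("authevents")


@dataclass
class AutologinAttempt:
    succeeded: bool
    # Hypothetical hint, e.g. the account was created moments ago on another wiki.
    expected_central_session: bool


def classify(attempt: AutologinAttempt) -> str:
    """Map an attempt to one of three illustrative outcome labels."""
    if attempt.succeeded:
        return "success"
    if attempt.expected_central_session:
        # The interesting case: a session should exist but was not usable.
        return "unexpected_failure"
    # Indistinguishable from "browser blocked the cookie" without extra hints.
    return "no_session"


def record(attempt: AutologinAttempt) -> str:
    outcome = classify(attempt)
    if outcome == "unexpected_failure":
        logger.warning("autologin failed although a central session was expected")
    return outcome
```

How the flag would be derived is exactly the open question raised later in the thread; the sketch only assumes that some caller can supply it.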

Details

	Subject	Repo	Branch	Lines +/-
	Monitoring and debug logging for central logins	mediawiki/extensions/CentralAuth	master	+100 -20


Related Objects

Status	Assigned	Task
In Progress	• Niharika	T324492 Temporary accounts - MVP
Open	None	T326816 Update features for IP Masking
Open	None	T348206 Improve logging, monitoring and test coverage for MediaWiki Platform team authentication extensions
Resolved	matmarex	T327046 Improve (or identify) monitoring for CentralAuth autologins on Wikimedia wikis
Resolved	matmarex	T275085 Autocreate authevents log entries look odd
Resolved	matmarex	T349005 Decommission Schema:CentralAuth

Event Timeline

Tgr created this task.Jan 16 2023, 5:16 AM

Restricted Application added a subscriber: Aklapper.Jan 16 2023, 5:16 AM

Tgr mentioned this in T326281: Attempt top-level central autologin when visiting the login page (to allow autologin when the browser blocks third-party cookies).Jan 16 2023, 5:17 AM

There are some authevents log events for central autologin, although not present on the authentication-metrics or authentications dashboards; that could be a start.

We should also check if all cross-wiki features that work without the user navigating to the other wiki (such as VisualEditor drag-and-drop file upload, or linking pages to Wikidata) use or fall back to the centralauthtoken API (which is a bit inconvenient but isn't affected by browser limitations).

MShilova_WMF triaged this task as Medium priority.Jan 16 2023, 7:56 PM

MShilova_WMF edited projects, added Growth-Team (Sprint 0 (Growth Team)); removed Growth-Team.

MShilova_WMF moved this task from Incoming to Top Product Priorities on the Growth-Team (Sprint 0 (Growth Team)) board.

MShilova_WMF added a parent task: T326877: [Epic] Update Growth Team-owned products that may be affected by IP Masking.Jan 19 2023, 5:05 PM

• Niharika moved this task from Backlog to Needs Other Product Teams on the Temporary accounts board.Jan 25 2023, 7:04 PM

KStoller-WMF moved this task from Top Product Priorities to Ready for Development on the Growth-Team (Sprint 0 (Growth Team)) board.Jan 30 2023, 3:01 AM

@Tgr Any thoughts on if this should stay in Growth's current sprint or if this might be owned by the MediaWiki Core team in the future?

pmiazga assigned this task to Hokwelum.Jul 18 2023, 4:55 PM

pmiazga updated Other Assignee, added: pmiazga.

larissagaulia reassigned this task from Hokwelum to matmarex.Aug 28 2023, 1:45 PM

larissagaulia added a subscriber: Hokwelum.

Restricted Application added a project: MediaWiki-Platform-Team.Aug 28 2023, 1:45 PM

There is also EventLogging data via Schema:CentralAuth apparently.

Urbanecm_WMF added a project: IP-Masking-Growth-Team.Aug 29 2023, 2:07 PM

Autocreation error logging is broken since rMW734f0c23e377: update authevents logging status context to use string representation directly. We should probably fix that.

larissagaulia moved this task from Inbox, needs triage to Current Sprint on the MediaWiki-Platform-Team board.Sep 18 2023, 1:24 PM

pmiazga updated Other Assignee, removed: pmiazga.Sep 20 2023, 2:42 PM

In T327046#9154369, @Tgr wrote:

Autocreation error logging is broken since rMW734f0c23e377: update authevents logging status context to use string representation directly. We should probably fix that.

Broken how? I found some log entries generated from that code and reporting autocreation errors: https://logstash.wikimedia.org/goto/60cb2963281f561f437934c1ae0eb78b

Or do you mean that the format is too creative?

It's unhelpful, but I think I wanted to say that monitoring is broken. That's done by AuthManagerStatsdHandler in WikimediaEvents, and it really expects a Status for the status parameter. The Monolog Logstash handler OTOH cannot handle a Status (for no good reason, IMO; if the parameter is stringifiable, it should just stringify it). There is more context in the code review for that patch.
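The "if the parameter is stringifiable, it should just stringify it" idea can be illustrated with a small Python stand-in. This is not Monolog's or MediaWiki's API; `FakeStatus` and `normalize_context` are hypothetical names used only to show the behaviour being asked for.

```python
class FakeStatus:
    """Hypothetical stand-in for a Status-like object that can be turned into a string."""

    def __init__(self, ok: bool, message: str = ""):
        self.ok = ok
        self.message = message

    def __str__(self) -> str:
        return "OK" if self.ok else f"Error: {self.message}"


def normalize_context(context: dict) -> dict:
    """Return a copy of a log context where non-primitive values are stringified.

    Primitive values pass through unchanged, so a handler that only
    understands scalars can still accept the whole context.
    """
    primitives = (str, int, float, bool, type(None))
    return {
        key: value if isinstance(value, primitives) else str(value)
        for key, value in context.items()
    }
```

For example, `normalize_context({"status": FakeStatus(False, "badtoken")})` yields `{"status": "Error: badtoken"}`. A metrics handler that needs the structured object would have to read it before this step, which is the tension described above.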

Tgr mentioned this in T347857: invalid returnUrlToken on MediaWiki.org after logging out.Oct 3 2023, 12:00 AM

Moving this out of the Growth boards (and altering subtasking), as CentralAuth is not a part of Growth responsibilities. If Growth is needed here, please let us know.

Change 965517 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] Fix logging Status objects to 'authevents' channel

https://gerrit.wikimedia.org/r/965517

gerritbot added a project: Patch-For-Review.Oct 12 2023, 1:47 PM

Change 965540 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@master] Fix logging Status objects to 'authevents' channel

https://gerrit.wikimedia.org/r/965540

Change 965541 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/WikimediaEvents@master] AuthManagerStatsdHandler: Remove support for Status object

https://gerrit.wikimedia.org/r/965541

In T327046#9212929, @Tgr wrote:

It's unhelpful, but I think I wanted to say that monitoring is broken. That's done by AuthManagerStatsdHandler in WikimediaEvents, and it really expects a Status for the status parameter. The Monolog Logstash handler OTOH cannot handle a Status (for no good reason, IMO; if the parameter is stringifiable, it should just stringify it). There is more context in the code review for that patch.

The patches above take care of this (although I decided to remove the uses of Status objects rather than fix them: figuring out how to log them seems unpleasant, and most of the authentication code doesn't use them in the first place).

In T327046#8527000, @Tgr wrote:

There are some authevents log events for central autologin, although not present on the authentication-metrics or authentications dashboards; that could be a start.

I had a look, and I think we could just add a bit more logging to SpecialCentralAutoLogin.php (currently it only logs autologin successes, but not failures), then add it to those dashboards. Does that sound good enough to resolve this task to you?

We should also check if all cross-wiki features that work without the user navigating to the other wiki (such as VisualEditor drag-and-drop file upload, or linking pages to Wikidata) use or fall back to the centralauthtoken API (which is a bit inconvenient but isn't affected by browser limitations).

centralauthtoken API is handled transparently by mw.ForeignApi (ext.centralauth.ForeignApi.js), which is probably used by everything that does cross-wiki API requests (because it's quite difficult to do them correctly otherwise, so no one even tried, as far as I know).

In T327046#9124480, @Tgr wrote:

There is also EventLogging data via Schema:CentralAuth apparently.

This seems quite limited and it's using the legacy way of setting up schemas. I don't know where it even logs to. Can we just remove it? I don't really want to have multiple kinds of logging, and if we later decide we need EventLogging too, I'd rather set it up from scratch. See also T282131, which calls out this schema.

In T327046#9249785, @matmarex wrote:

I had a look, and I think we could just add a bit more logging to SpecialCentralAutoLogin.php (currently it only logs autologin successes, but not failures), then add it to those dashboards. Does that sound good enough to resolve this task to you?

I think we should:

Count successful and failed central logins (Special:CentralLogin) via statsd. "Successful" is a weak measure here as the real success criterion is being able to set cookies and we can't really verify that, but still worth something.
Count successful and failed autologins. This requires differentiating between type=1x1 edge logins and type=1x1 autologins (used for non-JS clients; see the <noscript> part in PageDisplayHookHandler::onBeforePageDisplay()). We should also differentiate between top-level and subresource autologin.
Separately, count successful and failed edge logins.
On top of all that, log (at least with level=debug) every step of Special:CentralLogin and Special:CentralAutoLogin and also whenever they are triggered (LoginCompleteHookHandler::doCentralLoginRedirect(), CentralAuthHooks::getDomainAutoLoginHtml(), CentralAuthHooks::getEdgeLoginHTML(), PageDisplayHookHandler::onBeforePageDisplay(), SpecialPageBeforeExecuteHookHandler::onSpecialPageBeforeExecute(), LoginCompleteHookHandler::onTempUserCreatedRedirect()). Include details about the user (username if we know it, browser UA). This would be very useful when looking into login problems reported by other people - we can just tell them to use the WikimediaDebug extension, and look up the data. It can also be useful for temporarily setting level=info to review the current state of browser support. This is not monitoring, so could be a separate task.
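The counting scheme in the list above could look roughly like this. It is a Python sketch with hypothetical metric key names (the keys actually used by CentralAuth are not stated in this task); the point is that central login, autologin and edge login are counted separately, and autologin is further split into top-level and subresource.

```python
from collections import Counter
from typing import Optional

# In-memory stand-in for a statsd client; illustrative only.
stats: Counter = Counter()

KINDS = ("centrallogin", "autologin", "edgelogin")
RESULTS = ("success", "failure")


def metric_key(kind: str, result: str, top_level: Optional[bool] = None) -> str:
    """Build a hypothetical metric key such as 'autologin.toplevel.failure'."""
    if kind not in KINDS:
        raise ValueError(f"unknown login kind: {kind}")
    if result not in RESULTS:
        raise ValueError(f"unknown result: {result}")
    parts = [kind]
    if kind == "autologin":
        # Autologin is the only kind split by how the request was made.
        if top_level is None:
            raise ValueError("autologin needs top_level to be set")
        parts.append("toplevel" if top_level else "subresource")
    parts.append(result)
    return ".".join(parts)


def count(kind: str, result: str, top_level: Optional[bool] = None) -> None:
    stats[metric_key(kind, result, top_level)] += 1
```

Keeping the split in the key rather than in a free-form label means a dashboard can chart each flow on its own line without parsing log messages.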

centralauthtoken API is handled transparently by mw.ForeignApi (ext.centralauth.ForeignApi.js), which is probably used by everything that does cross-wiki API requests (because it's quite difficult to do them correctly otherwise, so no one even tried, as far as I know).

Difficult to do them correctly but easy to do them incorrectly. In the past, I have definitely seen code just making cross-wiki requests, relying on session cookies and hoping for the best. Maybe it's all gone by now (I guess in any case hard to find since code-wise it wouldn't differ from a local AJAX request).

This seems quite limited and it's using the legacy way of setting up schemas. I don't know where it even logs to. Can we just remove it? I don't really want to have multiple kinds of logging, and if we later decide we need EventLogging too, I'd rather set it up from scratch. See also T282131, which calls out this schema.

Sure, let's remove it. (cc @phuedx) EventLogging data is pretty painful to use anyway. But I would still log to Logstash in those places, per above.

BTW there's also some pre-existing monitoring in SpecialCentralLogin::showError() which sends statsd data for centralauth.centrallogin_errors.<error-message-key> stats. I think I'd rather use Logstash for that kind of thing. We did the same messagekey-based-statsd thing for the authevents channel, and never used it much. Seeing changes in volume is sometimes useful for security-wise interesting error types (like bad password or bad captcha) but central login doesn't really have anything like that. And Logstash still lets one visualise volumes with a little (but not that much) more effort.

In T327046#9252219, @Tgr wrote:

Sure, let's remove it. (cc @phuedx) EventLogging data is pretty painful to use anyway. But I would still log to Logstash in those places, per above.

I know what I'm doing this Friday…

I've got plenty of bandwidth for code review right now so if you need help testing/landing those/future patches, then LMK!

matmarex added a parent task: T348206: Improve logging, monitoring and test coverage for MediaWiki Platform team authentication extensions.Oct 16 2023, 2:39 PM

matmarex added a subtask: T275085: Autocreate authevents log entries look odd.Oct 16 2023, 3:37 PM

matmarex mentioned this in T349005: Decommission Schema:CentralAuth.Oct 16 2023, 3:43 PM

matmarex added a subtask: T349005: Decommission Schema:CentralAuth.

In T327046#9252499, @Tgr wrote:

BTW there's also some pre-existing monitoring in SpecialCentralLogin::showError() which sends statsd data for centralauth.centrallogin_errors.<error-message-key> stats. I think I'd rather use Logstash for that kind of thing. We did the same messagekey-based-statsd thing for the authevents channel, and never used it much. Seeing changes in volume is sometimes useful for security-wise interesting error types (like bad password or bad captcha) but central login doesn't really have anything like that. And Logstash still lets one visualise volumes with a little (but not that much) more effort.

Are you saying we should remove it? (I'd be on board with that)

This data is used for some alerts: https://gerrit.wikimedia.org/g/operations/puppet/+/c678553ed26e63dbad2a0a0924f96962b6447a72/modules/profile/manifests/graphite/alerts.pp#47 – no idea who or where receives them.

That also links to a dashboard: https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 – which is probably redundant to https://grafana.wikimedia.org/d/000000004/authentication-metrics?orgId=1&viewPanel=13. No idea how to find if anything else uses the data.

matmarex closed subtask T349005: Decommission Schema:CentralAuth as Resolved.Oct 19 2023, 11:34 AM

matmarex closed subtask T275085: Autocreate authevents log entries look odd as Resolved.Oct 21 2023, 5:40 PM

In T327046#9262816, @matmarex wrote:

Are you saying we should remove it? (I'd be on board with that)

I think we should either log SpecialCentral[Auto]Login steps to Logstash at the debug level (so people running into issues can use X-Wikimedia-Debug to provide a trace) and also log the final success/failure to statsd (so we can see trends over time), or we should log to Logstash at the info level, in which case we can also use that for trends/alerts (a little more painfully, OTOH with less code to maintain).
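The first option (debug-level step traces plus one counted outcome) might be sketched as follows. This is Python with the standard `logging` module and an in-memory counter standing in for statsd; `LoginTrace` and its method names are hypothetical.

```python
import logging
from collections import Counter


class LoginTrace:
    """Debug-level trace of one login flow plus a single success/failure count."""

    def __init__(self, flow: str, logger: logging.Logger, stats: Counter):
        self.flow = flow
        self.logger = logger
        self.stats = stats

    def step(self, name: str, **details) -> None:
        # Emitted only when the logger is set to debug, e.g. for a traced request.
        self.logger.debug("%s: %s %s", self.flow, name, details)

    def finish(self, success: bool) -> None:
        result = "success" if success else "failure"
        self.logger.debug("%s: finished with %s", self.flow, result)
        # Only the final outcome is counted, so trends need no log parsing.
        self.stats[f"{self.flow}.{result}"] += 1
```

The second option would drop the counter and raise the final message to info level, trading a bit of query effort for less code.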

This data is used for some alerts:

That was added in rOPUP4dd4e50b6fbc: graphite::alerts: add alerting on session loss by @Joe so maybe he knows if it is still needed.

Tgr mentioned this in T348206: Improve logging, monitoring and test coverage for MediaWiki Platform team authentication extensions.Oct 23 2023, 3:09 AM

Change 968382 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@master] [WIP] Monitoring and debug logging for central logins

https://gerrit.wikimedia.org/r/968382

matmarex mentioned this in T349745: Remove unused CentralAuth code identified by new monitoring.Oct 25 2023, 6:11 PM

Filed T350094: Enable verbose logging without installing the WikimediaDebug extension about making it easier to use this for diagnosing problems.

Change 968382 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Monitoring and debug logging for central logins

https://gerrit.wikimedia.org/r/968382

ReleaseTaggerBot added a project: MW-1.42-notes (1.42.0-wmf.3; 2023-10-31).Oct 31 2023, 1:00 AM

I added panels for autologin success/failure and autologin error types to https://grafana.wikimedia.org/d/000000004/authentication-metrics. (edit: and edge login)

@Tgr @Krinkle Please have a look and check if I did that right, since I haven't done it before and I'm not sure what I'm doing.

I think this is the last step for this task?

matmarex attached a referenced file: F41436153: image.png.Nov 3 2023, 3:04 AM

Thanks, that looks great!

I think we should exclude "Not centrally logged in" from the success/failure stats (like we do for another not-really-error for login), it's normal behavior.

In T327046#9303968, @matmarex wrote:

I think this is the last step for this task?

There's this bit:

There isn't really a way to differentiate between failures where the user has no active session on the login wiki and ones where that session exists but the browser prevents access to it (as the only difference is the existence of a cookie which the browser doesn't give access to) but we should log cases when it seems the user should have a central session (e.g. recently created account) but autologin fails.

Not sure how feasible it is, but it would be nice to get a feeling of how often browsers break central login / autologin / edge login. Currently "Not centrally logged in" covers both that and the case where the user is actually not centrally logged in.

Also, maybe worth adding an error breakdown for central login?

In T327046#9304064, @Tgr wrote:

There's this bit:

There isn't really a way to differentiate between failures where the user has no active session on the login wiki and ones where that session exists but the browser prevents access to it (as the only difference is the existence of a cookie which the browser doesn't give access to) but we should log cases when it seems the user should have a central session (e.g. recently created account) but autologin fails.

Not sure how feasible it is, but it would be nice to get a feeling of how often browsers break central login / autologin / edge login. Currently "Not centrally logged in" covers both that and the case where the user is actually not centrally logged in.

You'll need to suggest some ideas for actually doing this, since I don't have any good ones. How do we know if they recently created an account if they're logged-out and we don't know their username? We could try matching IPs, but that seems icky, and I'm not sure if it would work that well anyway.

In T327046#9304061, @Tgr wrote:

I think we should exclude "Not centrally logged in" from the success/failure stats (like we do for another not-really-error for login), it's normal behavior.

I'm not sure – it's not an error, but it is a failure. We should be able to notice if their number suddenly rises, for exactly the reason you mention: noticing any large-scale changes to browser behaviors. Maybe it should be another line on the chart, separated out from the rest of failures?

Also, maybe worth adding an error breakdown for central login?

I thought it would be mostly the same as login, I can add one though.

By the way, why do we have captcha metrics on the same dashboard? They seem mostly unrelated to me.

Also I've just noticed that the different charts show different units – some have rate per second, and some per minute. The "Login" chart, I'm pretty sure, has successes per minute and failures per second.
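A small worked example of why mixed units mislead, as a Python sketch with made-up numbers: a series plotted per second has to be multiplied by 60 before it can be compared with one plotted per minute.

```python
def per_second_to_per_minute(rate_per_second: float) -> float:
    """Convert an event rate from events/second to events/minute."""
    return rate_per_second * 60


# Made-up numbers: on the chart, 120 looks far larger than 2.
successes_per_minute = 120.0
failures_per_second = 2.0

# Once both use the same unit, the two rates turn out to be equal.
failures_per_minute = per_second_to_per_minute(failures_per_second)
```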

In T327046#9306324, @matmarex wrote:

You'll need to suggest some ideas for actually doing this, since I don't have any good ones. How do we know if they recently created an account if they're logged-out and we don't know their username? We could try matching IPs, but that seems icky, and I'm not sure if it would work that well anyway.

I think the easy target here is top-level autologin which always ends on the starting wiki. (And central login since that's only done right after login.) You could also just do the boring stuff and add a URL parameter that just says "we expect this request to succeed".
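The URL parameter suggestion could be sketched like this in Python. The parameter name `expectSuccess` and the example URL are hypothetical, and the sketch leaves open how the caller decides that success is expected.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter name; nothing in this task specifies one.
EXPECT_PARAM = "expectSuccess"


def with_expectation(url: str, expected: bool) -> str:
    """Return url with the expectation flag set (or removed), keeping other parameters."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != EXPECT_PARAM
    ]
    if expected:
        query.append((EXPECT_PARAM, "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def is_expected(url: str) -> bool:
    """Read the flag back on the receiving side."""
    return dict(parse_qsl(urlsplit(url).query)).get(EXPECT_PARAM) == "1"
```

The receiving endpoint could then log a failure at a higher level only when `is_expected()` is true.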

In T327046#9304061, @Tgr wrote:

I think we should exclude "Not centrally logged in" from the success/failure stats (like we do for another not-really-error for login), it's normal behavior.

I'm not sure – it's not an error, but it is a failure. We should be able to notice if their number suddenly rises, for exactly the reason you mention: noticing any large-scale changes to browser behaviors. Maybe it should be another line on the chart, separated out from the rest of failures?

An autologin is attempted every time an anonymous user visits a wiki (with a 1/day throttling), so it's influenced a lot by traffic patterns. I'd include it in the error breakdown but not include it in the success/failure chart, but YMMV.

(We should probably also exclude "normal" failures from the error rates chart, it's not very informative right now.)
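A minimal sketch of the charting choice proposed here, in Python with made-up counts: "Not centrally logged in" stays in the per-error breakdown but is left out of the success/failure ratio. The `badtoken` label is a hypothetical error type.

```python
# Outcomes treated as normal behaviour rather than failures (label from this thread).
NORMAL_OUTCOMES = {"Not centrally logged in"}


def failure_ratio(counts: dict) -> float:
    """Share of failures among attempts, ignoring normal non-error outcomes."""
    successes = counts.get("success", 0)
    failures = sum(
        n
        for outcome, n in counts.items()
        if outcome != "success" and outcome not in NORMAL_OUTCOMES
    )
    total = successes + failures
    return failures / total if total else 0.0


# Made-up numbers: the normal outcome dominates because it tracks anonymous traffic.
example = {"success": 90, "Not centrally logged in": 1000, "badtoken": 10}
```

With these counts the ratio is 10 / (90 + 10); counting the 1000 normal outcomes as failures would make it mostly a reflection of traffic volume.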

Also, maybe worth adding an error breakdown for central login?

I thought it would be mostly the same as login, I can add one though.

Central login is a separate code path from both login and autologin/edge login.

By the way, why do we have captcha metrics on the same dashboard? They seem mostly unrelated to me.

They are for the signup captcha specifically. (Well, signup and login, but we almost never show a captcha on login.) The code path is authentication-related (it's an AuthManager plugin), and the rates are useful for understanding authentication-related issues (mainly when there is a huge signup spike due to some spambot).

In T327046#9306349, @matmarex wrote:

Also I've just noticed that the different charts show different units – some have rate per second, and some per minute. The "Login" chart, I'm pretty sure, has successes per minute and failures per second.

Hm, good spot. Not sure what's up with that. Maybe someone started converting them to minute-based, and stopped halfway?

matmarex removed a project: Patch-For-Review.Nov 16 2023, 4:27 PM

In T327046#9306949, @Tgr wrote:

In T327046#9306324, @matmarex wrote:

You'll need to suggest some ideas for actually doing this, since I don't have any good ones. How do we know if they recently created an account if they're logged-out and we don't know their username? We could try matching IPs, but that seems icky, and I'm not sure if it would work that well anyway.

I think the easy target here is top-level autologin which always ends on the starting wiki. (And central login since that's only done right after login.) You could also just do the boring stuff and add a URL parameter that just says "we expect this request to succeed".

I still don't get how you want to do this. How do we know that we expect this request to succeed?

In T327046#9304061, @Tgr wrote:

I think we should exclude "Not centrally logged in" from the success/failure stats (like we do for another not-really-error for login), it's normal behavior.

I'm not sure – it's not an error, but it is a failure. We should be able to notice if their number suddenly rises, for exactly the reason you mention: noticing any large-scale changes to browser behaviors. Maybe it should be another line on the chart, separated out from the rest of failures?

An autologin is attempted every time an anonymous user visits a wiki (with a 1/day throttling), so it's influenced a lot by traffic patterns. I'd include it in the error breakdown but not include it in the success/failure chart, but YMMV.

(We should probably also exclude "normal" failures from the error rates chart, it's not very informative right now.)

Done, it's a separate line now.

Also, maybe worth adding an error breakdown for central login?

I thought it would be mostly the same as login, I can add one though.

Central login is a separate code path from both login and autologin/edge login.

Done.

In T327046#9306349, @matmarex wrote:

Also I've just noticed that the different charts show different units – some have rate per second, and some per minute. The "Login" chart, I'm pretty sure, has successes per minute and failures per second.

Hm, good spot. Not sure what's up with that. Maybe someone started converting them to minute-based, and stopped halfway?

Done, all charts in the AuthManager section now show the rate per minute.

I think this really is resolved now. If there's anything else that needs to be done, it should happen in separate tasks.

matmarex mentioned this in T351948: Remove redundant metrics "MediaWiki.centralauth.centrallogin_errors.*".Nov 25 2023, 3:26 AM

matmarex mentioned this in T68828: CentralAuth: Audit autologin procedure for performance and code quality.Jan 17 2024, 3:53 AM

	F41435620: image.png
	Nov 3 2023, 1:30 AM

	F37861386: image.png
	Sep 30 2023, 3:22 AM
