Page MenuHomePhabricator

Invalidate all users sessions
Closed, ResolvedPublic

Description

We should invalidate all user sessions post T124409, for users who might've ended up leaving themselves logged in on shared computers. (Also T125283 although we are still not sure if that's real or not.)

Related Objects

Event Timeline

Reedy assigned this task to Anomie.
Reedy raised the priority of this task from to Unbreak Now!.
Reedy updated the task description. (Show Details)
Reedy added projects: WMF-General-or-Unknown, SRE.
Reedy added subscribers: Xaosflux, Hazard-SJ, Krenair and 22 others.
Legoktm subscribed.

Per IRC discussion we're going to just run the resetGlobalUserTokens.php script to force a reset of all user tokens, which will log everyone out.

Has this happened? I noticed I've gotten logged out everywhere sometime over the weekend, not sure if that was just a coincidence.

Has this happened? I noticed I've gotten logged out everywhere sometime over the weekend, not sure if that was just a coincidence.

@Legoktm was running it, yeah... Dunno if it's done yet

It's still running :/

Opened T124861 about that.

@Legoktm: Update, please?

Anyone, update please.

Sorry missed the ping. Yes, it's still going :(

This comment was removed by Redrose64.

It may have double-run - I have been force-logged out twice since this task was raised, the second one was today

These patches by @Anomie have been merged and are included in 1.27.0-wmf.12 which should hit all wikis in the next few hours. The major benefit of these patches is that they will give us a nearly instant switch to throw to invalidate all active sessions rather than the relying on the current system which takes days to run.

Invalidating all user sessions is not something we plan to do for fun certainly, but there have been in the past (and likely will be in the future) software and configuration issues such as T124409 that make forcing everyone to re-authenticate necessary. Being able to do so with a quick configuration change is a huge security win.

Terbium is scheduled for a reboot tonight (3am Pacific, 11am UTC):
https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160205T1100

Should we postpone that reboot due to this still running job?

Now that wmf.12 is everywhere, an alternative would be to just kill the job and set $wgAuthenticationTokenVersion to a non-null value in our config to make sure all sessions are invalidated. Unless we think we might need to roll back to wmf.10 again, anyway.

It probably doesn't matter if we do it in CommonSettings.php or in the private settings since that's the "public" part of the hmac, but ask @csteipp if he has an opinion.

Unless we think we might need to roll back to wmf.10 again, anyway.

To act in an abundance of caution, I would say wait until wmf.10 is not the target to rollback to (iow: once we get to Tuesday afternoon next week with wmf.13 on group0, so wmf.12 is the "previous" branch).

It probably doesn't matter if we do it in CommonSettings.php or in the private settings since that's the "public" part of the hmac, but ask @csteipp if he has an opinion.

I was assuming that would be public.

Now that wmf.12 is everywhere, an alternative would be to just kill the job and set $wgAuthenticationTokenVersion to a non-null value in our config to make sure all sessions are invalidated.

This would log everyone out again, right? Is it possible to check how far the script is from completing the work?

About 30 minutes ago it was at gu_id 10467742. There are 45463166 total users.

These patches by @Anomie have been merged and are included in 1.27.0-wmf.12 which should hit all wikis in the next few hours. The major benefit of these patches is that they will give us a nearly instant switch to throw to invalidate all active sessions rather than the relying on the current system which takes days to run.

Invalidating all user sessions is not something we plan to do for fun certainly, but there have been in the past (and likely will be in the future) software and configuration issues such as T124409 that make forcing everyone to re-authenticate necessary. Being able to do so with a quick configuration change is a huge security win.

Using that the first time is somewhat insecure (if the attacker knows the old token, and the version is public, they can calculate the new token). So we should either set it now (and accept that 25% of the users are going to be logged out twice) or make the version non-public (or throw a non-public constant salt into the mix).

Adding @MoritzMuehlenhoff regarding the reboot

Ok, I've moved this to Monday (same time) on the Deployments page. Will send a mail to ops@ shortly.

Ok, I've moved this to Monday (same time) on the Deployments page. Will send a mail to ops@ shortly.

@Legoktm/@Tgr: is that enough time? Or should this be stopped now (to limit the number of people double logged out) and we use @Anomie's new system on Tuesday?

Per SAL, the script was started at 2016-01-31 05:35, and Legoktm says at 2016-02-04 11:04 it was at 10467742 users out of 45463166 (23%). That's 10467742 users in 101.5 hours so the full process can be expected to take 441 hours. If the script finishes without further interruptions, it would do so about two weeks from now.

So, IMO:

  • stop this now
  • set wgAuthenticationTokenVersion on Monday (so that we can be sure any reports about being logged out on Tuesday are genuine bugs)
  • maybe come up with a new script for resetting user tokens that is faster and run it (over the weekend, or after the rollout is finished).

About the last point: the two bugs that made us run the script leaked user tokens; wgAuthenticationTokenVersion does not protect against leaked tokens. It does prevent leaking tokens once it is set, and taking advantage of leaked tokens will be harder and require technical skills once it is set (the attacker would have to be lucky enough to have obtained a token cookie due to one of those two bugs, and then perform the same transformation on the cookie that Anomie's patch does). Not sure if covering that minimal attack surface justifies the disruption of logging everyone out twice. Maybe @csteipp has an opinion on that.

Per SAL, the script was started at 2016-01-31 05:35, and Legoktm says at 2016-02-04 11:04 it was at 10467742 users out of 45463166 (23%). That's 10467742 users in 101.5 hours so the full process can be expected to take 441 hours. If the script finishes without further interruptions, it would do so about two weeks from now.

So, IMO:

  • stop this now

I am very pro this pending clarification on your last question for Chris.

@Tgr, you lost me a little in this, but let me try to respond.

About the last point: the two bugs that made us run the script leaked user tokens; wgAuthenticationTokenVersion does not protect against leaked tokens. It does prevent leaking tokens once it is set, and taking advantage of leaked tokens will be harder and require technical skills once it is set (the attacker would have to be lucky enough to have obtained a token cookie due to one of those two bugs, and then perform the same transformation on the cookie that Anomie's patch does). Not sure if covering that minimal attack surface justifies the disruption of logging everyone out twice. Maybe @csteipp has an opinion on that.

We're currently resetting tokens because we know that some users had their sessions confused, and potentially had their session token readable by other user. It's our duty to protect those users, so we need to invalidate all sessions that might have leaked, regardless of how we do the reset. So yes, I think it is "worth it" to log everyone out, unless we have a reasonable idea of who all of our potential victims are and we can just reset those sessions (but, aiui, we don't have that).

I think eventually we should either set a wgAuthenticationTokenVersion, or handle T50698 another way that doesn't involve logging everyone out. But that's just hardening, and there's no rush to do it now, so I don't think that should be a driver for doing the reset with wgAuthenticationTokenVersion.

I think it is "worth it" to log everyone out

Sure. My point is, setting $wgAuthenticationTokenVersion *will* log everyone out, but you can log back in by manipulating your token cookie the right way. Would you consider that good enough? If not, we should reset the tokens in the database; stopping, rewriting and restarting resetGlobalUserTokens.php is probably the better way of doing that.

(FWIW $wgAuthenticationTokenVersion doesn't help with T50698 - if you obtain DB access, you will be able to abuse the tokens even if the token version is changed. Also, if we don't set $wgAuthenticationTokenVersion now, we will be in the same trouble the next time we would need it - that initially setting it does not undo a token cookie disclosure if the attacker is clever enough to figure out how to manipulate the cookie. Since we have already logged people out multiple times in the last few days, we should set it now and get it over with - setting it at a different time would be more disruptive.)

Sure. My point is, setting $wgAuthenticationTokenVersion *will* log everyone out, but you can log back in by manipulating your token cookie the right way. Would you consider that good enough? If not, we should reset the tokens in the database; stopping, rewriting and restarting resetGlobalUserTokens.php is probably the better way of doing that.

If the plan is to use a *public* $wgAuthenticationTokenVersion, then no, that will not fulfill what we're trying to do here. So in that case, no, we shouldn't do it now. If we're using a long, random, private $wgAuthenticationTokenVersion, then that's probably sufficient in light of the risk in this particular situation.

Per IRC discussion, we stopped the script and will set a secret random $wgAuthenticationTokenVersion value.

The plan is to set $wgAuthenticationTokenVersion tonight. The time was chosed over IRC based on four factors: 1) someone from ops should be around, just in case something goes wrong and cleanup is needed, 2) it should happen when the least people are editing (stats), to minimize lost edits, accidentally recorded IP addresses and similar disruption, 3) it should not happen during the SessionManager deployment when it can mask real errors that log users out, 4) it's a security response so sooner is better than later.

Some notes:

  • $wgAuthenticationTokenVersion will be set in PrivateSettings.php to a long random string
  • this was already done a day ago on beta; lesson is that HHVM in web mode apparently does not pick up changes to symlinked files (T126306) so we'll have to edit the file, touch the symlink, then sync-file both.
  • Can be rolled back smoothly: if the setting is removed, logged-out users will be logged-in again. (Except those who have logged in again manually, of course.)
  • apart from confusing users, possible fallout is that bots which do not implement assert=user will continue to happily edit anonymously.

Sent a notification to same places as T126069#2004474.

Wondering if T126322: Failing wikitech logins is related or whether timing is just a coincidence?

Wondering if T126322: Failing wikitech logins is related or whether timing is just a coincidence?

OpenStackNovaController::authenticate return code: 401? doubt it's related

  • apart from confusing users, possible fallout is that bots which do not implement assert=user will continue to happily edit anonymously.

Well, bots that don't use assert=user and don't detect being logged out in some other way (e.g. receiving an edit token of "+\"). But this isn't terribly uncommon since sessions and token-cookies can and do expire for other reasons.

  1. it should happen when the least people are editing (stats), to minimize lost edits, accidentally recorded IP addresses and similar disruption

The error messages about "loss of session data" are really lame. Could we merge and deploy the patch for T125334: "loss of session data" in error messages is meaningless to the user earlier today to try to avoid these issues further?

Wondering if T126322: Failing wikitech logins is related or whether timing is just a coincidence?

If something went really bad (e.g. PHP saw the setting in some contexts but not others) that would result in a "you have successfully authenticated but looks like you have cookies disabled" type of message. I can't think of any way in which a session-related setting could cause the login code to think that the authentication was unsuccessful.

@Billinghurst reports that autologin does not work for him (he needs to log in separately on every wiki).

I guess there is not much point in trying to track down issues with the old CentralAuth code now.

The error messages about "loss of session data" are really lame. Could we merge and deploy the patch for T125334: "loss of session data" in error messages is meaningless to the user earlier today to try to avoid these issues further?

SWATted. Can you follow up later on Siebrand's issue?

...didn't make it yet into wmf.13 though.

Might be relevant: A user in the Hebrew Wikipedia wrote that he became logged out of his desktop account, but remained logged-in in the mobile app.

Might be relevant: A user in the Hebrew Wikipedia wrote that he became logged out of his desktop account, but remained logged-in in the mobile app.

The mobile app knows your password so it probably just transparently logged him back again.