Page MenuHomePhabricator

Remove local wiki password hash when CentralAuth has attached account
Closed, ResolvedPublicPRODUCTION ERROR

Description

Having local wikis store password hashes and tokens of accounts that authenticate against CentralAuth is an unnecessary liability, if that wiki has their user table made public.

Obviously, if the account is detached, we need the local hash rewritten.

The hash is include under two circumstances:

  • When a user attaches an account to CentralAuth, the local wiki's password hash remains.
  • If a user logs into a wiki where they don't have an account (global or local), using their CentralAuth credentials, the password hash is stored in the local wiki's database.

We should be able to remove the local hash on login, and could probably provide a maintenance script too, although preventing a possible race condition with the account being detached is problematic.


Version: master
Severity: normal

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 2:13 AM
bzimport set Reference to bz55420.
bzimport added a subscriber: Unknown Object (MLST).
Tgr added a subtask: Restricted Task.Nov 17 2016, 5:04 AM
Tgr changed the status of subtask Restricted Task from Open to Stalled.

@Tgr So since SUL is finalized and there should be no more unattached accounts (from what I understand), can we set loginOnly to the local password primary authentication thingy with no ill effects?

is the approximate list of unattached accounts, from T141160#2489124. (Would not include a local account that's not tracked at all in the centralauth table, although hopefully there is no such thing.) I think loginOnly can be enabled anyway after https://gerrit.wikimedia.org/r/#/c/321602 (which prevents unattached users from locking themselves out by setting up 2FA (well, assuming that 2FA will use the AuthManager data change mechanism at some point)).

Note that there is a third circumstance where a local PW hash is created/changed: global password change (only on the wiki where the password change is performed). loginOnly should remove that as well.

Bumppppppp. We should do this, the patch landed ages ago :)

Change 409638 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Enable loginOnly mode for local auth provider on beta

https://gerrit.wikimedia.org/r/409638

Change 409638 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable loginOnly mode for local auth provider on beta

https://gerrit.wikimedia.org/r/409638

Mentioned in SAL (#wikimedia-operations) [2018-02-22T00:25:11Z] <tgr@tin> Synchronized wmf-config/CommonSettings-labs.php: T57420 enable loginOnly flag in beta (duration: 01m 12s)

Deployed to beta, verified that local password is cleared on password change, and left empty on autocreation.

Next step would be to deploy to production, then have a script nuke local passwords for all attached accounts (and maybe make some kind of backup before).

Is there anything else we might want to test first?

Change 416331 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Enable loginOnly mode for local auth provider on group 0

https://gerrit.wikimedia.org/r/416331

Change 416331 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable loginOnly mode for local auth provider on group 0

https://gerrit.wikimedia.org/r/416331

Mentioned in SAL (#wikimedia-operations) [2018-03-06T00:54:16Z] <tgr@tin> Synchronized wmf-config: T57420 Enable loginOnly mode for local auth provider on group 0 (duration: 01m 00s)

Change 416630 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Enable loginOnly mode for local auth provider on group 1

https://gerrit.wikimedia.org/r/416630

Change 416631 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Enable loginOnly mode for local auth provider on group 2

https://gerrit.wikimedia.org/r/416631

Change 416630 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable loginOnly mode for local auth provider on group 1

https://gerrit.wikimedia.org/r/416630

Mentioned in SAL (#wikimedia-operations) [2018-03-07T18:26:37Z] <tgr@tin> Synchronized wmf-config/InitialiseSettings.php: T57420 Enable loginOnly mode for local auth provider on group 1 (duration: 01m 20s)

Change 416631 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable loginOnly mode for local auth provider on group 2

https://gerrit.wikimedia.org/r/416631

Mentioned in SAL (#wikimedia-operations) [2018-03-08T00:16:45Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:416631|Enable loginOnly mode for local auth provider on group 2]] T57420 (duration: 01m 16s)

Tgr closed subtask Restricted Task as Declined.Mar 22 2018, 5:52 PM

Change 421376 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] Add maintenance script for deleting local passwords

https://gerrit.wikimedia.org/r/421376

Change 421377 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/CentralAuth@master] Maintenance script to delete the local passwords of central accounts

https://gerrit.wikimedia.org/r/421377

Change 421376 merged by jenkins-bot:
[mediawiki/core@master] Add maintenance script for deleting local passwords

https://gerrit.wikimedia.org/r/421376

Change 421377 merged by jenkins-bot:
[mediawiki/extensions/CentralAuth@master] Maintenance script to delete the local passwords of central accounts

https://gerrit.wikimedia.org/r/421377

It seems this is running right now (learned from viewing logs). It would be nice to having gotten a ping (we will never block maintenace, but we ask to communicate it on the deployments page as it can interfere with other maintenance, or in this case, we could compress the tables after data being purged).

I got asked at the SRE meeting to ask the "owner" of who was running this or who code it to create an incident report to clarify what happened and what actionables can be done to avoid it in the future. We (DBAs) can help with the attending part and timeline, but given it was mostly a code issue, I would prefer if we just helped someone with better code/mediawiki knowledge to create a proper report with a better analysis of what went bad.

I wrote the code and @Ladsgroup ran it. The code issue was a basic oversight - I forgot to add the standard "wait for replication" call. That did not cause problems on other wikis because normally this script would go through all user records but only change a very small number of them (the ones that are not attached to a CentralAuth account, usually either because they are some kind of system user or due to bugs), yet somehow on Commons it updated the records of 1617150 users (before getting killed, but judging from the affected usernames, it finished or got very close to finishing) - that's over 20% of total users. That still needs investigation.

In terms of future actionables, other than being more careful, not sure. The code is fixed now; this was a bad mistake and I should have known better, but I don't see any easy way to add some kind of automated test.

@Tgr "In terms of future actionables, other than being more careful, not sure" That would look to me more than acceptable as an incident report summary in my humble opinion, I am just passing what I was required to make sure is done by my manager (this created unscheduled read only time, larger than a datacenter switch, so I think he is right on being able to explain that (without blaming anyone)- Please copy your last comment into an incident report on wikitech and I will fill in the "what was done" at the lagging time by initial responders (volans, joe, me). If there is issues because security, please tell me so I can tell that. Maybe this is not that remarkable, but maybe we have further issues later and it is great to have some background of past issues.

On the opinion side, I think the "ping the DBAs"/adding it to Deployments "week of" would be a worthwhile actionable, as I asked for on a comment on Friday.

I !loged it in SAL and did ran the script in smaller scale before going full production: T201009: Run deleteLocalPasswords.php in WMF prod (Central Auth wikis only!) after 1.32.0-wmf.16 is everywhere
I also gave a heads up in DBA channel but it wasn't enough and next time I will add it to the deployment calendar. It didn't suppose to take this long.

That did not cause problems on other wikis because normally this script would go through all user records but only change a very small number of them

Uhh that's actually backwards. It will change all records except the unattached ones. In that case the number is too low? Anyhow, needs investigation.

@jcrespo sure, there was an outage, there should be an incident report, I'm not disagreeing, I'm just saying this was one of those one-off human stupidity errors that can't be prevented (but hopefully don't happen often :). Although on second thought maybe it can - filed T205893: Automatically trigger waitForReplication after a sufficiently high number of rows has been written about that.

Anyhow incident documentation is at https://wikitech.wikimedia.org/wiki/Incident_documentation/20181001-DeleteLocalPasswords , please expand the timeline.

I've added the following:

  • A more complete timeline from our (operations) perspective
  • A more accurate description so "outage" was specifically described
  • Indication that dewiki was also affected, just not so much that it didn't page, but was shown on graphs and eqiad alerts
  • Suggestion of actionables (some may be redundant or answer to questions on conclusions, feel free to improve that/change as you want)

Since T201009 is resolved, we can call this done.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:12 PM