
User authentication security issue (Oct 1, 2020)
Closed, Resolved · Public

Description

This is a public placeholder task for the Oct 1 security issue related to user authentication. Investigation is ongoing, details will be shared later. So far we have not seen any indication that the issue was intentional or widespread. Out of an abundance of caution, we have logged out all users (at 21:48 UTC).

Event Timeline

Tgr created this task.Oct 1 2020, 10:00 PM
Restricted Application added a subscriber: Aklapper.Oct 1 2020, 10:00 PM
JJMC89 added a subscriber: JJMC89.Oct 1 2020, 10:30 PM

Were people really logged out at 7:43 UTC? Judging from comments on en-wiki, the logout was at 21:43 UTC.

Urbanecm updated the task description.Oct 1 2020, 11:26 PM

Were people really logged out at 7:43 UTC? Judging from comments on en-wiki, the logout was at 21:43 UTC.

Updated the time.

Meirae added a subscriber: Meirae.Oct 2 2020, 12:13 AM
Apap04 added a subscriber: Apap04.Oct 2 2020, 12:19 AM

I use 2FA for Wikimedia sites and logged in again successfully. Is it recommended to change the password anyway?

Majavah added a subscriber: Majavah.Oct 2 2020, 5:02 AM
Kaartic added a subscriber: Kaartic.Oct 2 2020, 5:53 AM

I use 2FA for Wikimedia sites and logged in again successfully. Is it recommended to change the password anyway?

See https://lists.wikimedia.org/pipermail/wikitech-l/2020-October/093922.html which does not say that people should change passwords - thanks.

Base added a subscriber: Base.Oct 2 2020, 2:59 PM

Was the release mentioned rolled back before the logout was done?

I mean, if the problem happens when people log in, then to force *everyone* to re-login is to call for more instances of the problem to happen…

I see T264363, but it is too technical for me to follow.

The logout occurred on all Wikimedia projects.

Was the release mentioned rolled back before the logout was done?

Yes.

I mean, if the problem happens when people log in, then to force *everyone* to re-login is to call for more instances of the problem to happen…

I see T264363, but it is too technical for me to follow.

We are not sure if T264363 is the issue - it may be unrelated.

This comment was removed by Urbanecm.

Isn't this the third or fourth time everyone has been forcibly logged out in the past year? How is this acceptable? Why does this keep happening and who's taking responsibility to ensure that it stops happening?

The undeniable truth of software engineering and computer security is that bugs happen. While techniques like automated testing and code review are used to try to prevent bugs from entering production, and deployment stages are used to minimize the impact of new bugs, those techniques are not capable of preventing every bug (and thus, every security issue) from hitting production servers. This is because all code is written by humans and humans are imperfect.

Trying to eliminate all security problems is a noble goal, but it is also a futile one. Instead, it is much more important to ensure that there is an effective response to those security problems once they are discovered. The forced logouts are an example of that plan in action. The only ways to prevent them would be to expect perfection from human developers, to stop developing and improving MediaWiki, or to stop adequately responding to security incidents. Those are all worse options.

We don't yet know what caused this specific incident. Only once that stage of the investigation is complete can we begin to discuss how similar issues might be prevented in the future. All indications at the moment, however, are that it is unrelated to the previous session caching issues in June.

hashar changed the task status from Open to Stalled.Oct 5 2020, 9:31 AM
hashar triaged this task as Unbreak Now! priority.
hashar added a subtask: Restricted Task.
hashar added a subscriber: hashar.

This is the public-facing task for the user authentication issue. It is blocking the train and is thus Unbreak Now! priority. I am marking it stalled pending resolution of the issue, which is tracked internally in the private task T264369.

The undeniable truth of software engineering and computer security is that bugs happen. While techniques like automated testing and code review are used to try to prevent bugs from entering production, and deployment stages are used to minimize the impact of new bugs, those techniques are not capable of preventing every bug (and thus, every security issue) from hitting production servers. This is because all code is written by humans and humans are imperfect.

Trying to eliminate all security problems is a noble goal, but it is also a futile one. Instead, it is much more important to ensure that there is an effective response to those security problems once they are discovered. The forced logouts are an example of that plan in action. The only ways to prevent them would be to expect perfection from human developers, to stop developing and improving MediaWiki, or to stop adequately responding to security incidents. Those are all worse options.

These are very nice words, but none of them answer the questions I asked. When the "emergency door" is being used regularly, with widespread user-facing impact, it warrants a serious and thorough investigation into why we're repeatedly having issues that require such drastic preventative action.

I woke up on Friday morning to a text from a Wikipedia administrator who couldn't remember his password. Because he was on a school IP address that had been blocked, he was also offered no option to reset his password via e-mail. Wikimedia wikis already face significant challenges attracting and retaining contributors; forcibly logging all of them out is disruptive and a big deal.

We don't yet know what caused this specific incident. Only once that stage of the investigation is complete can we begin to discuss how similar issues might be prevented in the future. All indications at the moment, however, are that it is unrelated to the previous session caching issues in June.

Who specifically is leading this investigation you reference?

hashar added a comment.Oct 6 2020, 8:49 PM

Who specifically is leading this investigation you reference?

Hi, the issue is related to user authentication, which is a highly sensitive matter and unfortunately cannot be discussed publicly. Rest assured it is being investigated by multiple people, since it involves multiple layers of the stack.

Your other questions are more general and would be better asked in another venue (a mailing list or another task). Thanks!

dduvall lowered the priority of this task from Unbreak Now! to High.Oct 13 2020, 7:57 PM
dduvall added a subscriber: dduvall.

Lowering priority as new logging is in place and this is no longer a blocker of 1.36.0-wmf.11—the latter has been re-deployed for all wikis.

ema closed subtask Restricted Task as Invalid.Nov 24 2020, 9:33 AM
Tgr changed the status of subtask Restricted Task from Invalid to Resolved.Nov 24 2020, 10:00 AM
sbassett closed this task as Resolved.Dec 16 2020, 3:59 PM
sbassett claimed this task.
sbassett lowered the priority of this task from High to Low.
sbassett added a subscriber: sbassett.

Resolving for now per T264369#6644444.

sbassett removed sbassett as the assignee of this task.Dec 16 2020, 3:59 PM
Aklapper renamed this task from User authentication security issue (Oct 1) to User authentication security issue (Oct 1, 2020).Dec 16 2020, 4:02 PM

The task description states "Investigation is ongoing, details will be shared later."

I see nothing in https://wikitech.wikimedia.org/wiki/Incident_documentation yet. If this issue has been sufficiently mitigated so that the relevant tasks can be marked resolved, now would be the time to begin publicly sharing information about the issue.

The task description states "Investigation is ongoing, details will be shared later."

I see nothing in https://wikitech.wikimedia.org/wiki/Incident_documentation yet. If this issue has been sufficiently mitigated so that the relevant tasks can be marked resolved, now would be the time to begin publicly sharing information about the issue.

Yes, indeed, that is usually the process we follow for security issues. The public task (this one, T264370) is usually just a placeholder, while the real work and investigation are captured in a separate private task (since it may hold personal information, pose a threat to the infrastructure, or leak countermeasures to the attacker(s)).

The same applies to the incident documentation. Alongside the private Phabricator task, we also open a private document, which is nicer to format; such a document has been created and polished. Unfortunately, for the same reasons as the private task, we cannot share it.

What I can say is that the actual root cause has not been identified. We went through a lot of history and could not find any other occurrence. As a result of the investigation, we have enabled massive logging on the server-side infrastructure which, if the issue occurs again, would give us ample detail and should let us pinpoint the actual root cause.

We haven't found the cause. It was a one-time occurrence, and we have added massive logging to the infrastructure to help analysis if it occurs again. There is unfortunately not much more to say.