Session failures preventing edits, login, logout, etc: "invalid CSRF token"
Closed, Resolved, Public

Description

Hi. I'm unable to save an edit on simple wikipedia:

Sorry! We could not process your edit due to a loss of session data.

You might have been logged out. Please verify that you're still logged in and try again. If it still does not work, try logging out and logging back in, and check that your browser allows cookies from this site.

When attempting to log out, the log out button triggers the notification "Invalid CSRF token".
Navigating to https://simple.wikipedia.org/wiki/Special:UserLogout manually lets me hit submit, but each time I do so it just refreshes the page and doesn't log me out.

Event Timeline

Proc added a subscriber: Proc. Thu, Jun 11, 6:50 PM
Izno added a subscriber: Izno. Thu, Jun 11, 6:51 PM

Oddly enough, I was able to edit logged out in an incognito window.

Shawn added a subscriber: Shawn. Thu, Jun 11, 6:53 PM
Proc removed a subscriber: Proc. Thu, Jun 11, 6:53 PM
debt added a subscriber: debt. Thu, Jun 11, 6:56 PM

Office wiki is also affected

That's odd: otrs-wiki is affected, but it doesn't use global accounts/centralauth...

Just checked, I can't edit Wikidata or log into QuickStatements.

Seems to be working now.

Solved for fr.wikipedia.

Does appear to be fixed, likely caused by Thursday™

Working for me on en.wikipedia. Issue started at roughly 18:36:49 per recent changes.

Effeietsanders added a comment. Edited Thu, Jun 11, 7:08 PM

I can edit again, but my first attempt to protect a page was still rejected; resubmitting the form worked around the issue. (nlwiki)

Meirae added a subscriber: Meirae. Thu, Jun 11, 7:08 PM

Still happening (loss of session data error on attempted save) on enwiki as of this time stamp.

Ed6767 added a subscriber: Ed6767. Thu, Jun 11, 7:08 PM

For me, it's now 50/50 whether my tokens are valid across all Wikimedia wikis. My issues started about 15 minutes after the first report.

Ymblanter added a subscriber: Ymblanter. Edited Thu, Jun 11, 7:11 PM

I had the issue for 10 minutes, seems to be gone by now (English Wikipedia, Wikidata, and authentication here)

Working in es.wikipedia for me.

debt added a comment. Thu, Jun 11, 7:15 PM

office wiki still affected

Working also in es.wikivoyage for me.

Aklapper renamed this task from Session failures preventing edits, login, logout, etc to Session failures preventing edits, login, logout, etc: "invalid CSRF token". Thu, Jun 11, 7:20 PM
debt added a comment. Thu, Jun 11, 7:28 PM

office wiki working! :) thanks!

Naleksuh removed a subscriber: Naleksuh. Thu, Jun 11, 7:28 PM
Agusbou2015 closed this task as Resolved. Thu, Jun 11, 7:34 PM
Carn added a subscriber: Carn. Thu, Jun 11, 7:34 PM

On Commons, there is a CSRF token error on loading (not consistently), and there are login errors on ruwiki.

Bans are not working, and fast edit cancel works only from time to time.

Agusbou2015 reopened this task as Open. Thu, Jun 11, 7:34 PM

This might be because you are trying to log in again after you have already logged in once. Try refreshing the page and clearing your cache.

@Agusbou2015: Please do not resolve any tasks for no reason. This is up to developers. Thanks a lot.

GPSLeo added a subscriber: GPSLeo. Thu, Jun 11, 7:44 PM
Can_I_Log_In added a subscriber: Can_I_Log_In. Edited Thu, Jun 11, 7:46 PM

I was affected on enwiki. I thought it was because of this CSS which I then removed. I thought it was my browser cookies. Turns out, it's the system.

tstarling lowered the priority of this task from Unbreak Now! to High. Thu, Jun 11, 11:20 PM

To summarise from the private incident documentation, requests to sessionstore increased from 15k req/s to 20k req/s, causing kask on kubernetes[1001,1003,1005] to go into a loop of being killed by the oom-killer. The problem was resolved after @akosiaris increased the number of pods and the memory limit for kask. Public incident documentation should appear here on wikitech at some point. I'm not sure if it is necessary to keep the task open while followups and documentation are done. Reducing the priority anyway, since the problem is resolved and apparently nobody is working on it anymore.

akosiaris closed this task as Resolved. Fri, Jun 12, 9:16 AM
akosiaris claimed this task.

Yes, absolutely agreed. The trigger was indeed insufficient capacity in sessionstore to handle a sudden increase in requests of at least 33% (~15k to ~20k, if not more). We've gone ahead and added capacity to the service and will follow up by adding more capacity to the entire cluster as well as the dedicated sessionstore nodes. So, I'll be bold and resolve this; feel free to reopen, though.

There are a number of actionables that will result from the incident doc[1], but those should be tracked in their own tasks.

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes
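For reference, the "at least 33%" figure follows directly from the request rates quoted in this task. A minimal sketch of the arithmetic, assuming the rounded values of ~15k and ~20k req/s (the helper name `relative_increase` is only illustrative):

```python
def relative_increase(before: float, after: float) -> float:
    """Fractional growth from `before` to `after`."""
    return (after - before) / before


# Rounded figures from this task: ~15k req/s before, ~20k req/s during the incident.
growth = relative_increase(15_000, 20_000)
print(f"{growth:.1%}")  # 33.3%
```

Since the real peak may have been higher than 20k req/s, this is a lower bound, not a measurement.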

Tgr added a subscriber: Tgr. Fri, Jun 12, 11:09 AM

Should something be done about users who thought they had logged out but actually hadn't?

Should something be done about users who thought they had logged out but actually hadn't?

I guess this is late so probably no, but how would we be identifying those? What action would we be taking? I am guessing forcibly logging them out?

Tgr added a comment. Tue, Jun 16, 2:01 PM

I guess this is late so probably no, but how would we be identifying those? What action would we be taking? I am guessing forcibly logging them out?

That's what we did for similar issues in the past, yeah. If there's some log (e.g. in Logstash) showing who attempted to log out, then act based on that; otherwise, decide whether the impact was severe enough to log everyone out.
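A minimal sketch of the log-based approach, assuming log records have already been exported as dictionaries. The field names (`user`, `action`, `timestamp`) and the `"logout"` action value are hypothetical and do not reflect the actual Logstash schema; the incident window has to be supplied by whoever runs it.

```python
from datetime import datetime


def users_who_attempted_logout(records, start, end):
    """Return the set of users with a logout attempt inside [start, end].

    `records` is an iterable of dicts with hypothetical keys
    'user', 'action' and 'timestamp' (a datetime).
    """
    return {
        r["user"]
        for r in records
        if r["action"] == "logout" and start <= r["timestamp"] <= end
    }


# Placeholder data and a placeholder window, for illustration only.
sample = [
    {"user": "Alice", "action": "logout", "timestamp": datetime(2020, 6, 11, 18, 40)},
    {"user": "Bob", "action": "edit", "timestamp": datetime(2020, 6, 11, 18, 45)},
    {"user": "Carol", "action": "logout", "timestamp": datetime(2020, 6, 11, 22, 0)},
]
window_start = datetime(2020, 6, 11, 18, 30)
window_end = datetime(2020, 6, 11, 20, 0)
affected = users_who_attempted_logout(sample, window_start, window_end)
```

Only the sessions of the resulting users would then need to be invalidated, which avoids a global logout.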

As an external observer, I'm fearful of "log everyone out". This will cause a spike in traffic to the authentication infrastructure as everybody logs back in again, which, of course, was the root cause of this problem to begin with. Is there some way to do a rolling logout, killing, say, 1% of the sessions per hour?
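A minimal sketch of what such a rolling logout could look like, assuming each session has a stable string ID. This is not how MediaWiki or sessionstore handle invalidation; `bucket_for` and `should_invalidate` are hypothetical helpers. Each session is hashed into one of 100 buckets, and one more bucket is invalidated per elapsed hour, so roughly 1% of sessions are affected each hour.

```python
import hashlib

BUCKETS = 100  # each bucket holds roughly 1% of sessions


def bucket_for(session_id: str) -> int:
    """Map a session ID to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def should_invalidate(session_id: str, hours_elapsed: int) -> bool:
    """True once the rollout has reached this session's bucket.

    After `hours_elapsed` full hours, buckets 0..hours_elapsed-1
    are invalidated, so all sessions are covered after 100 hours.
    """
    return bucket_for(session_id) < hours_elapsed
```

Because the bucket depends only on the session ID, a session invalidated in one hour stays invalidated in every later hour, and the login load is spread over about four days instead of arriving at once.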

Adithyak1997 added a comment. Edited Tue, Jun 16, 2:25 PM

In T255179#6228151 RoySmith wrote

log everyone out

Why should this actually be done? I think this is only needed for those who still face issues now. Also, since nobody has recently reported issues here, and given the message posted by akosiaris regarding insufficient capacity, I think the problem stands resolved.

Tgr added a comment. Tue, Jun 16, 2:50 PM

Why should this actually be done?

To protect the integrity and avoid exposing the personal data of users who have used Wikimedia sites on public computers or on someone else's computer and thought they logged out but actually didn't.

As an external observer, I'm fearful of "log everyone out". This will cause a spike in traffic to the authentication infrastructure as everybody logs back in again, which, of course, was the root cause of this problem to begin with. Is there some way to do a rolling logout, killing, say, 1% of the sessions per hour?

We have added sufficient capacity to the sessionstore infrastructure, so I am not afraid of that event repeating in this manner even if we logged everyone out. However, the nuisance caused to all users (including all those who were not active during the incident) is something I am worried about, so I'd prefer that we not do a global logout.