Session failures ("invalid CSRF token") preventing edits, login, logout, etc due to kask outage
Closed, Resolved (Public)

Description

Hi. I'm unable to save an edit on simple wikipedia:

Sorry! We could not process your edit due to a loss of session data.

You might have been logged out. Please verify that you're still logged in and try again. If it still does not work, try logging out and logging back in, and check that your browser allows cookies from this site.

When attempting to log out, the log out button triggers the notification "Invalid CSRF token".
Navigating to https://simple.wikipedia.org/wiki/Special:UserLogout manually lets me hit submit, but each time I do so it just refreshes the page and doesn't log me out.

Event Timeline

Oddly enough, I was able to edit logged out in an incognito window.

That's odd: otrs-wiki is affected, but it doesn't use global accounts/centralauth...

Just checked, I can't edit Wikidata or log into QuickStatements.

Does appear to be fixed, likely caused by Thursday™

Working for me on en.wikipedia. Issue started at roughly 18:36:49 per recent changes.

I can edit again on nlwiki. Protecting a page still failed on the first attempt, but resubmitting the form worked around the issue.

Still happening (loss of session data error on attempted save) on enwiki as of this time stamp.

For me, it's now 50/50 whether my tokens are valid across all Wikimedia wikis. My issues started about 15 minutes after the first report.

I had the issue for 10 minutes, seems to be gone by now (English Wikipedia, Wikidata, and authentication here)

Working also in es.wikivoyage for me.

Aklapper renamed this task from Session failures preventing edits, login, logout, etc to Session failures preventing edits, login, logout, etc: "invalid CSRF token". Jun 11 2020, 7:20 PM

office wiki working! :) thanks!

On Commons I get a CSRF token error on loading (not consistently), and I get login errors on ruwiki.

[Attached image: 20200611_223158.png (324×660 px, 40 KB)]
Bans are not working, and fast edit cancel works only from time to time.

This might be because you are trying to log in after you have already logged in once. Try refreshing the page and clearing your cache.

@Agusbou2015: Please do not resolve any tasks for no reason. This is up to developers. Thanks a lot.

I was affected on enwiki. I thought it was because of this CSS which I then removed. I thought it was my browser cookies. Turns out, it's the system.

tstarling lowered the priority of this task from Unbreak Now! to High. Jun 11 2020, 11:20 PM
tstarling added subscribers: akosiaris, tstarling.

To summarise from the private incident documentation, requests to sessionstore increased from 15k req/s to 20k req/s, causing kask on kubernetes[1001,1003,1005] to go into a loop of being killed by the oom-killer. The problem was resolved after @akosiaris increased the number of pods and the memory limit for kask. Public incident documentation should appear here on wikitech at some point. I'm not sure if it is necessary to keep the task open while followups and documentation are done. Reducing the priority anyway, since the problem is resolved and apparently nobody is working on it anymore.

akosiaris claimed this task.

Yes, absolutely agreed. The trigger was indeed insufficient capacity in sessionstore to handle a sudden increase in requests of at least 33% (~15k to ~20k req/s, if not more). We've gone ahead and added capacity to the service and will follow up by adding more capacity to the entire cluster as well as to the dedicated sessionstore nodes. So I'll be bold and resolve this; feel free to reopen, though.

There are a number of actionables that will result from the incident doc [1], but those should be tracked in their own tasks.

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes

Should something be done about users who thought they have logged out but actually didn't?

I guess this is late, so probably no, but how would we identify those users? What action would we take? I am guessing forcibly logging them out?

That's what we did for similar issues in the past, yeah. If there's some log (e.g. in Logstash) showing who attempted to log out, then log out those users based on that; otherwise, decide whether the impact was severe enough to log everyone out.

As an external observer, I'm fearful of "log everyone out". This will cause a spike in traffic to the authentication infrastructure as everybody logs back in again. Which, of course, was the root cause of this problem to begin with. Is there some way to do a rolling logout, killing say, 1% of the sessions per hour?
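To make the rolling logout idea concrete, here is a minimal sketch, assuming sessions can be listed by ID and invalidated individually. The names `bucket_for` and `sessions_to_expire` are hypothetical and are not MediaWiki or kask APIs; the actual invalidation step is left out.

```python
import hashlib

NUM_BUCKETS = 100  # one bucket is roughly 1% of sessions


def bucket_for(session_id: str) -> int:
    """Map a session ID to a stable bucket in [0, NUM_BUCKETS)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def sessions_to_expire(session_ids, hours_elapsed: int):
    """Return the sessions that are due after `hours_elapsed` full hours.

    One more bucket becomes due each hour, so after 100 hours
    every session is due. (Hypothetical helper for illustration.)
    """
    due = min(max(hours_elapsed, 0), NUM_BUCKETS)
    return [sid for sid in session_ids if bucket_for(sid) < due]
```

Hashing the session ID, rather than picking sessions at random, keeps the schedule deterministic: a job that runs every hour needs no record of which sessions it has already handled.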

In T255179#6228151, RoySmith wrote:

log everyone out

Why should this actually be done? I think only those who still face issues need this. Also, since nobody has recently mentioned here that they face issues, and given the message posted by akosiaris regarding insufficient capacity, I think the problem stands resolved.

Why should this actually be done?

To protect the integrity and avoid exposing the personal data of users who have used Wikimedia sites on public computers or on someone else's computer and thought they logged out but actually didn't.

As an external observer, I'm fearful of "log everyone out". This will cause a spike in traffic to the authentication infrastructure as everybody logs back in again. Which, of course, was the root cause of this problem to begin with. Is there some way to do a rolling logout, killing say, 1% of the sessions per hour?

We have added sufficient capacity to the sessionstore infrastructure, so I am not afraid of that event repeating in this manner even if we logged everyone out. However, the nuisance caused to all users (including all those who were not active during the incident) is something I am worried about, so I'd prefer that we not do a global logout.

I was approached by two users of ukwiki this week about a persistent "invalid CSRF token" error.
Could it be that the issue described in this task is still happening?

No. See T258121: Logging in to a wiki sometimes fails with 'sessionfailure' error (coinciding with SameSite rollout) for the ongoing issue.

Tgr renamed this task from Session failures preventing edits, login, logout, etc: "invalid CSRF token" to Session failures ("invalid CSRF token") preventing edits, login, logout, etc due to kask outage. Jul 17 2020, 10:40 AM
Tgr updated the task description.