Session failures ("invalid CSRF token") preventing edits, login, logout, etc due to kask outage
Closed, ResolvedPublic
Authored By

	DannyS712
	Jun 11 2020, 6:44 PM

Description

Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes

Hi. I'm unable to save an edit on simple wikipedia:

Sorry! We could not process your edit due to a loss of session data.

You might have been logged out. Please verify that you're still logged in and try again. If it still does not work, try logging out and logging back in, and check that your browser allows cookies from this site.

When attempting to log out, the log out button triggers a notification Invalid CSRF token.
Navigating to https://simple.wikipedia.org/wiki/Special:UserLogout manually lets me hit submit, but each time I do so it just refreshes the page and doesn't log me out

Related Objects

		Status	Subtype	Assigned	Task
		Resolved	Release	jeena	T254173 1.35.0-wmf.36 deployment blockers
		Resolved		akosiaris	T255179 Session failures ("invalid CSRF token") preventing edits, login, logout, etc due to kask outage

Event Timeline


Reedy added a subscriber: PKM. Jun 11 2020, 6:51 PM

JJMC89 removed a subscriber: pywikibot-bugs-list. Jun 11 2020, 6:51 PM

Xaosflux subscribed. Jun 11 2020, 6:51 PM

Oddly enough, I was able to edit logged out in an incognito window.

Shawn subscribed. Jun 11 2020, 6:53 PM

Proc unsubscribed. Jun 11 2020, 6:53 PM

Reedy merged a task: T255181: Cannot log in due to a "precaution against session hijacking". Jun 11 2020, 6:53 PM

Reedy merged a task: T255182: Error saving edit on Office Wiki.

Reedy added a subscriber: GTrang.

Reedy added a subscriber: WDoranWMF.

Urbanecm merged a task: T255182: Error saving edit on Office Wiki. Jun 11 2020, 6:54 PM

Sakretsu subscribed. Jun 11 2020, 6:54 PM

kostajh subscribed. Jun 11 2020, 6:56 PM

https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=now-1h&to=now

Office wiki is also affected

Hispano76 subscribed. Jun 11 2020, 6:57 PM

Reedy merged a task: T255184: unable to log out or edit: 'invalid CSRF token'. Jun 11 2020, 6:59 PM

Reedy added a subscriber: Effeietsanders.

WDoranWMF mentioned this in T255182: Error saving edit on Office Wiki. Jun 11 2020, 7:00 PM

WDoranWMF added a project: Platform Engineering.

That's odd: otrs-wiki is affected, but it doesn't use global accounts/centralauth...

Raymond subscribed. Jun 11 2020, 7:01 PM

Just checked, I can't edit Wikidata or log into QuickStatements.

Adithyak1997 subscribed. Jun 11 2020, 7:02 PM

Vachovec1 subscribed. Jun 11 2020, 7:02 PM

Seems to be working now.

Solved for fr.wikipedia.

Agusbou2015 subscribed. Jun 11 2020, 7:04 PM

Bugreporter added a project: Wikimedia-Incident. Jun 11 2020, 7:04 PM

M2k_dewiki subscribed. Jun 11 2020, 7:05 PM

Does appear to be fixed, likely caused by Thursday™

ABLouis subscribed. Jun 11 2020, 7:05 PM

Working for me on en.wikipedia. Issue started at roughly 18:36:49 per recent changes.

I can edit again on nlwiki. My first attempt to protect a page was still rejected, but resubmitting the form worked around the issue.

Meirae subscribed. Jun 11 2020, 7:08 PM

Still happening (loss of session data error on attempted save) on enwiki as of this timestamp.

For me, it's now 50/50 whether my tokens are valid across all Wikimedia wikis. My issues started about 15 minutes after the first report.

Mholloway subscribed. Jun 11 2020, 7:10 PM

I had the issue for 10 minutes, seems to be gone by now (English Wikipedia, Wikidata, and authentication here)

Ladsgroup subscribed. Jun 11 2020, 7:13 PM

Working in es.wikipedia for me.

office wiki still affected

Working also in es.wikivoyage for me.

NahidSultan subscribed. Jun 11 2020, 7:19 PM

Aklapper renamed this task from Session failures preventing edits, login, logout, etc to Session failures preventing edits, login, logout, etc: "invalid CSRF token". Jun 11 2020, 7:20 PM

Flori4nK subscribed. Jun 11 2020, 7:20 PM

XanonymusX subscribed. Jun 11 2020, 7:23 PM

Lemure_Saltante subscribed. Jun 11 2020, 7:28 PM

office wiki working! :) thanks!

Naleksuh unsubscribed. Jun 11 2020, 7:28 PM

Daimona subscribed. Jun 11 2020, 7:33 PM

Agusbou2015 closed this task as Resolved. Jun 11 2020, 7:34 PM

On Commons I get a CSRF token error on loading (not consistently), and there are login errors on ruwiki.

Bans are not working, and fast edit cancel works only from time to time.

Agusbou2015 reopened this task as Open. Jun 11 2020, 7:34 PM

This might be because you are trying to log in after you have already logged in once. Try refreshing the page and clearing your cache.

@Agusbou2015: Please do not resolve any tasks for no reason. This is up to developers. Thanks a lot.

GPSLeo subscribed. Jun 11 2020, 7:44 PM

I was affected on enwiki. I thought it was because of this CSS which I then removed. I thought it was my browser cookies. Turns out, it's the system.

In T255179#6216559, @BPirkle wrote:

https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=now-1h&to=now

https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=1591900083237&to=1591903561732

To summarise from the private incident documentation, requests to sessionstore increased from 15k req/s to 20k req/s, causing kask on kubernetes[1001,1003,1005] to go into a loop of being killed by the oom-killer. The problem was resolved after @akosiaris increased the number of pods and the memory limit for kask. Public incident documentation should appear here on wikitech at some point. I'm not sure if it is necessary to keep the task open while followups and documentation are done. Reducing the priority anyway, since the problem is resolved and apparently nobody is working on it anymore.

Antanana subscribed. Jun 12 2020, 5:58 AM

Aklapper mentioned this in T255234: Session hijacking: please provide instructions how to proceed. Jun 12 2020, 7:29 AM

Addshore subscribed. Jun 12 2020, 8:27 AM

MarcoAurelio subscribed. Jun 12 2020, 9:02 AM

Yes, absolutely agreed. The trigger was indeed insufficient capacity in sessionstore to handle a sudden increase in requests of at least 33% (~15k to ~20k, if not more). We've gone ahead and added capacity to the service and will follow up with adding more capacity to the entire cluster as well as the dedicated sessionstore nodes. So, I'll be bold and resolve this, feel free to reopen though.

There are a number of actionables that will result from the incident doc[1] but those should be tracked in their own tasks.

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes
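
The 33% figure follows directly from the request rates quoted in this task. As a rough sketch only, the same arithmetic can be extended to a pod-count estimate; the per-pod throughput and the headroom factor below are hypothetical inputs for illustration, not measured values for kask or sessionstore.

```python
import math


def percent_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def pods_needed(peak_rps: float, per_pod_rps: float, headroom: float = 0.3) -> int:
    """Pods required to serve `peak_rps` with a fractional safety margin.

    `per_pod_rps` and `headroom` are hypothetical planning inputs.
    """
    return math.ceil(peak_rps * (1.0 + headroom) / per_pod_rps)


# Request rates from this task: ~15k req/s rising to ~20k req/s.
print(f"{percent_increase(15_000, 20_000):.1f}% increase")

# Hypothetical: each pod sustains 2,500 req/s, and we want 30% headroom.
print(pods_needed(20_000, per_pod_rps=2_500, headroom=0.3), "pods")
```

The point of the headroom term is that sizing for the observed peak alone would leave the service exposed to the next increase of similar size.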

Should something be done about users who thought they had logged out but actually hadn't?

RLazarus subscribed. Jun 12 2020, 3:08 PM

In T255179#6218789, @Tgr wrote:

Should something be done about users who thought they had logged out but actually hadn't?

I guess this is late so probably no, but how would we be identifying those? What action would we be taking? I am guessing forcibly logging them out?

In T255179#6227727, @akosiaris wrote:

I guess this is late so probably no, but how would we be identifying those? What action would we be taking? I am guessing forcibly logging them out?

That's what we did for similar issues in the past, yeah. If there's some log (e.g. in Logstash) to find out who attempted logout then based on that, otherwise decide if the impact was severe enough to log everyone out.
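
The log-based approach could look roughly like the sketch below. Everything about the record shape is an assumption: the field names `user`, `action` and `timestamp`, the `"logout"` action value, and the time window are hypothetical and do not describe the actual Logstash schema or the exact incident window.

```python
from datetime import datetime, timezone


def users_who_attempted_logout(records, start, end):
    """Return the set of users with a logout attempt in [start, end].

    `records` is an iterable of dicts with hypothetical keys
    "user", "action" and "timestamp" (timezone-aware datetime).
    """
    return {
        r["user"]
        for r in records
        if r["action"] == "logout" and start <= r["timestamp"] <= end
    }


# Hypothetical window and sample records, for illustration only.
start = datetime(2020, 6, 11, 18, 30, tzinfo=timezone.utc)
end = datetime(2020, 6, 11, 20, 0, tzinfo=timezone.utc)
records = [
    {"user": "Alice", "action": "logout",
     "timestamp": datetime(2020, 6, 11, 18, 45, tzinfo=timezone.utc)},
    {"user": "Bob", "action": "edit",
     "timestamp": datetime(2020, 6, 11, 18, 50, tzinfo=timezone.utc)},
    {"user": "Carol", "action": "logout",
     "timestamp": datetime(2020, 6, 11, 21, 0, tzinfo=timezone.utc)},
]
print(sorted(users_who_attempted_logout(records, start, end)))
```

Only the users returned by such a filter would need their sessions invalidated, which avoids a global logout if the log data is complete enough to trust.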

As an external observer, I'm fearful of "log everyone out". This will cause a spike in traffic to the authentication infrastructure as everybody logs back in again. Which, of course, was the root cause of this problem to begin with. Is there some way to do a rolling logout, killing, say, 1% of the sessions per hour?
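
One way a rolling logout like this could be sketched is to hash each session ID into one of 100 stable buckets and invalidate one bucket per hour. This is an illustration of the idea only, not how MediaWiki or sessionstore handles sessions; the function names and the session ID format are hypothetical.

```python
import hashlib


def logout_bucket(session_id: str, buckets: int = 100) -> int:
    """Map a session ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def sessions_to_expire(session_ids, hour: int, buckets: int = 100):
    """Sessions whose bucket comes due in the given hour (0-based)."""
    due = hour % buckets
    return [s for s in session_ids if logout_bucket(s, buckets) == due]


# Hypothetical session IDs, for illustration only.
sessions = [f"session-{i}" for i in range(1000)]
first_hour = sessions_to_expire(sessions, hour=0)
print(f"{len(first_hour)} of {len(sessions)} sessions expire in hour 0")
```

With 100 buckets, roughly 1% of sessions fall due each hour and the full logout takes 100 hours, so the re-login traffic is spread over about four days instead of arriving at once.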

In T255179#6228123, @RoySmith wrote:

log everyone out

Why should this actually be done? I think only those who are facing issues now would need this. Also, since nobody has recently reported issues here, and given the message posted by akosiaris regarding insufficient capacity, I think the problem stands resolved.

In T255179#6228151, @Adithyak1997 wrote:

Why should this actually be done?

To protect the integrity and avoid exposing the personal data of users who have used Wikimedia sites on public computers or on someone else's computer and thought they logged out but actually didn't.

Jonesey95 unsubscribed. Jun 16 2020, 3:38 PM

In T255179#6228123, @RoySmith wrote:

As an external observer, I'm fearful of "log everyone out". This will cause a spike in traffic to the authentication infrastructure as everybody logs back in again. Which, of course, was the root cause of this problem to begin with. Is there some way to do a rolling logout, killing, say, 1% of the sessions per hour?

We have added sufficient capacity to the sessionstore infrastructure, so I am not afraid of that event repeating in this manner even if we logged everyone out. The nuisance caused to all users (including all those that were not active during the incident), however, is something I am worried about, so I'd prefer if we did not do a global logout.

I was approached by two users of ukwiki this week about a persistent "invalid CSRF token" error.
Could it be that the issue, described in this task, continues to happen?

In T255179#6314620, @Ata wrote:

Could it be that the issue, described in this task, continues to happen?

No. See T258121: Logging in to a wiki sometimes fails with 'sessionfailure' error (coinciding with SameSite rollout) for the ongoing issue.

Tgr renamed this task from Session failures preventing edits, login, logout, etc: "invalid CSRF token" to Session failures ("invalid CSRF token") preventing edits, login, logout, etc due to kask outage. Jul 17 2020, 10:40 AM

Tgr updated the task description.
