
Repeated loss of session data on edit attempt
Closed, Resolved · Public · BUG REPORT

Description

Incident Report Document: https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_MediaWiki

If you are seeing issues, there is no need to comment here. People are working on it.

Steps to replicate the issue (include links if applicable):

  • Attempt to make an edit

What happens?:

Sorry! We could not process your edit due to a loss of session data.

You might have been logged out. Please verify that you're still logged in and try again. If it still does not work, try logging out and logging back in, and check that your browser allows cookies from this site.

What should have happened instead?:

The edit should have saved normally, without a loss-of-session error.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

  • Issue persists across multiple projects (enwiki, testwiki), and persists when retrying or when refreshing and retrying

IRC log from #wikimedia-operations during the incident:
<TheresNoTime> Successful wiki edits have just started to drop, users reported repeated "loss of session data" persisting across a refresh
<p858snake> from -tech <phuzion> Anybody else seeing issues with logging in right now?
* Daimona (~Daimona@wikipedia/Daimona-Eaytoy) has joined
<urandom> TheresNoTime: I just restarted a session storage server
<urandom> that should not cause a problem, but the timing is undeniable :/
<phuzion> urandom: Yeah, I'm getting this error: "There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form. You may receive this message if you are blocking cookies."
<jouncebot> RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230124T2100).
<jouncebot> nray, jan_drewniak, and cirno: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
<urandom> the restart caused a spike of 500s 
<urandom> 500s from the storage service
<TheresNoTime> put together T327815 very quick for timing
<stashbot> T327815: Repeated loss of session data on edit attempt - https://phabricator.wikimedia.org/T327815
* Oshwah has quit (Ping timeout: 256 seconds)
<logmsgbot> !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host sessionstore1001.eqiad.wmnet
<TheresNoTime> I can deploy, & holding the deploy per T327815
<urandom> I think we need to restart kask
<wikibugs> (PS2) Bking: dse-k8s: add rdf-streaming-updater namespace [puppet] - https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836)
<TheresNoTime> !log holding UTC late backport window for outage, T327815
<urandom> I am reading wikitech to figure out how to do this, so if anyone knows without reading, help would be appreciated :)
<stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
<wikibugs> (CR) Bking: dse-k8s: add rdf-streaming-updater namespace (1 comment) [puppet] - https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: Bking)
* sirenbot has changed the topic to: Status: Up (see https://www.wikimediastatus.net/) | Server Admin Log: https://w.wiki/6ah | This channel is logged at https://wm-bot.wmcloud.org/logs/%23wikimedia-operations/ | SRE Clinic Duty: claime | SREs on call: brett
* jgleeson is now known as jgleeson|brb
<icinga-wm> PROBLEM - MediaWiki edit session loss on graphite1005 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13
<urandom> here we go (https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_restart)
<taavi> urandom: do you need help? should someone klaxon for help?
<icinga-wm> PROBLEM - MediaWiki centralauth errors on graphite1005 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
* giraffe (~giraffe@wikimedia/giraffer) has joined
<logmsgbot> !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: sync
<logmsgbot> !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: sync
<stashbot> eevans@deploy1002: Failed to log message to wiki. Somebody should check the error logs.
* mikehawk10 (~mikehawk1@wikipedia/Mikehawk10) has joined
<logmsgbot> !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
<stashbot> jclark@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
<logmsgbot> !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1011.eqiad.wmnet with OS bullseye
<stashbot> jclark@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
<wikibugs> SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host druid1011.eqiad.wmnet with OS bullseye...
<logmsgbot> !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host druid1009.eqiad.wmnet with OS bullseye
<stashbot> jclark@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
<wikibugs> SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host druid1009.eqiad.wmnet with OS bull...
<urandom> Ok, that didn't seem to work
<urandom> brett: you around?
* ShakespeareFan00 (~Shakespea@88.97.96.89) has joined
<ShakespeareFan00> Hi.
<RhinosF1> urandom: if you are unsure, please klaxon
<ShakespeareFan00> I can't seem to login
<RhinosF1> ShakespeareFan00: aware of login issues
<p858snake> ShakespeareFan00: known issue
<phuzion> ShakespeareFan00: Known issue.
* ChanServ gives channel operator status to TheresNoTime
* TheresNoTime has changed the topic to: Status: Known session issues, T327815 (see https://www.wikimediastatus.net/) | Server Admin Log: https://w.wiki/6ah | This channel is logged at https://wm-bot.wmcloud.org/logs/%23wikimedia-operations/ | SRE Clinic Duty: claime | SREs on call: brett
<mikehawk10> it's https://phabricator.wikimedia.org/T327815
* Oshwah (~Oshwah@wikipedia/oshwah) has joined
<taavi> ShakespeareFan00: (and everyone else) please report issues in #wikimedia-tech in the future, that way we can focus on solving the issue here
<ShakespeareFan00> It's complaining about not getting an 'active' token
<jinxer-wm> (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
<RhinosF1> ShakespeareFan00: -tech
<urandom> Looks like taavi already did
<taavi> urandom: I did a manual page to get some more eyes on it
<urandom> taavi: I saw that, thanks
* brett present
<taavi> so what exactly is broken? I can't seem to find any useful error messages from logstash
<James_F> Loss of session data is normally a server-down issue, right?
<taavi> yeah. urandom was apparently rebooting sessionstore1001 just a bit earlier which probably caused something to break
<James_F> But we didn't have a network issue FWICT.
<urandom> I restarted sessionstore1001, no impact was expected, but we're getting lots of 500s
<James_F> Hmm.
<taavi> sessionstore uses cassandra, right? is the cassandra cluster happy?
* MatmaRex has quit (Quit: Miranda NG)
<taavi> meta stuff: we need an IC, and also a wikimediastatus.net update
<brett> I'll do it
* ShakespeareFan00 (~Shakespea@88.97.96.89) has left
<brett> Who's looking into cassandra?
<urandom> I have, it's find
<James_F> https://grafana.wikimedia.org/d/-K8NgsUnz/home?orgId=1 shows a big drop in HTTP 5xx error responses starting at ~12:48 UTC, I assume that's unrelated?
<urandom> I have, it's fine
<urandom> I suspect it's Kask
<urandom> I restarted, but that didn't seem to fix it
* jclark-ctr (sid380693@id-380693.uxbridge.irccloud.com) has joined
<urandom> brett: do you know how to depool eqiad?
<taavi> did you restart both DCs or just eqiad?
<urandom> just eqiad
<urandom> codfw seems ok
<urandom> and the node I rebooted was eqiad
<James_F> https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&from=now-1h&to=now has a discontinuity at 20:54 on the edit rate, which is more likely a symptom.
<urandom> wait wait
<jinxer-wm> (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
<taavi> sessionstore logs say 'Error writing to storage (Cannot achieve consistency level LOCAL_QUORUM)'
<James_F> Loads of save failures, as people report.
<urandom> taavi: how recent is that?
<brett> urandom: I could but I'm IC right now, shouldn't I be doing the meta?
<logmsgbot> !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6001.drmrs.wmnet with OS bullseye
<urandom> I actually think it's righted itself
<wikibugs> SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6001.drmrs.wmnet with OS bullseye
<taavi> on a random pod I can see last one was at 21:10Z
<James_F> Yeah, edit rate has recovered, error rate has dropped.
* jgleeson|brb is now known as jgleeson
<giraffe> urandom: yeah people on discord are reporting it coming back 
<taavi> also seeing new entries for 'http: TLS handshake error from 10.2.2.11:48400', 10.2.2.11 is apertium.svc.eqiad.wmnet
<urandom> taavi: good, yeah, that was the last one I saw too
<James_F> Start 20:55, ~end 21:15 dropping from 21:11.
<James_F> Is that from the restart? Or earlier?
<urandom> I think it's the restart
<cirno> Hi, sorry I'm late
<James_F> So it looked OK in the logs but a restart made it actually work?
<taavi> SAL says reboot cookbook started at 20:50 and ended at 21:01
<urandom> no no, the restart precipitated all of this
<James_F> Oh, so it's just recovery from the restart that made things work?
<urandom> taavi: that's not quite right, it came up before that but the cookbook was timing out on the SSL icinga alert (which is not critical)
<wikibugs> (PS1) Ottomata: WIP - install pyflink deps with pip [docker-images/production-images] - https://gerrit.wikimedia.org/r/883278 (https://phabricator.wikimedia.org/T327494)
<James_F> OK, so do we want to declare the incident resolved?
<urandom> James_F: the node restart set everything in motion, the node came back up but the service spiraled anyway.  I kicked off a restart of the service, and while the recovery took longer than I expected, I think that was what fixed it
* James_F nods.
<urandom> the errors are consistent with (a still) buggy connection pool, is while I have reasonable confidence that that is what fixed it
<urandom> s/while/why/
<brett> urandom: For clarity, which service?
<urandom> kask/sessionstore
<cdanis> buggy connection pool or underprovisioning were my first ideas
* TheresNoTime has changed the topic to: Status: Up (see https://www.wikimediastatus.net/) | Server Admin Log: https://w.wiki/6ah | This channel is logged at https://wm-bot.wmcloud.org/logs/%23wikimedia-operations/ | SRE Clinic Duty: claime | SREs on call: brett
* TheresNoTime removes channel operator status from TheresNoTime
<cdanis> sessionstore network traffic basically doubled for the duration of the outage, I'm not sure if that was because of rapid retries from clients or what
<jinxer-wm> (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
<urandom> My theory is: Host goes down and connections to it are terminated (connection objects removed from the pool), the node comes back up but the connections are not reestablished (because reasons).  But Cassandra sees the returned node as a viable candidate (just one that is now unreachable by the service)
<urandom> cdanis: I was thinking retries, yeah
<TheresNoTime> I intend to start the deployment window shortly (ping nray as first in the queue) — are we happy with that idea, or should I continue to hold?
<James_F> Follow-ups would be a safer (drain/provision) restart procedure for kask and a less aggressive retries rate?
<icinga-wm> RECOVERY - MediaWiki centralauth errors on graphite1005 is OK: OK: Less than 30.00% above the threshold [0.5] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1
* giraffe (~giraffe@wikimedia/giraffer) has left (lounge.theresnogit.com - a TheresNoTime offering)
<James_F> TheresNoTime: That should be fine.
<urandom> James_F: actually, any safer procedure would be a work-around (though in the near term that is worth exploring); the connection pooling needs to be fixed ultimately. A higher node count would help a TON too.
<James_F> Oh, also I should get that Beta config thing out.
<logmsgbot> !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1009.eqiad.wmnet with reason: host reimage
<TheresNoTime> nray: around for https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/882727/ ?
<nray> @TheresNoTime Yes I am!
<wikibugs> (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.19) - https://gerrit.wikimedia.org/r/882727 (https://phabricator.wikimedia.org/T327460) (owner: Nray)
<James_F> urandom: Right. brett did you start a doc?
<James_F> We should capture these thoughts. :-)
<brett> James_F: Not yet, writing a post-incident status page atm. Have copy/pastes in a vim buffer
<James_F> Ack.
<brett> Title: "15-minute outage for editing users"
<cdanis> brett: I'd say "Editing broken for 15 minutes" but yours is good too :)
<James_F> I've started https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_MediaWiki
<brett> > Wikipedia's session store suffered an outage for about 15 minutes. This caused users to be unable to log in or edit pages. Services are back online and review is in progress.
<brett> All good?
<James_F> LGTM.
<logmsgbot> !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1009.eqiad.wmnet with reason: host reimage
<cdanis> brett: LGTM thanks!
<urandom> brett: LGTM
<icinga-wm> RECOVERY - MediaWiki edit session loss on graphite1005 is OK: OK: Less than 30.00% above the threshold [10.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13
<urbanecm> should we resolve T327815?
<stashbot> T327815: Repeated loss of session data on edit attempt - https://phabricator.wikimedia.org/T327815
<James_F> Yes.
<icinga-wm> PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
<James_F> Done.
<brett> James_F: Filling in the incident doc now
<icinga-wm> RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
<James_F> brett: <3
* urbanecm closed it as well. 
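
For context on the diagnosis above: urandom's theory is that the rebooted host's connections were dropped from the service's connection pool and never re-established, while the host itself was still offered as a query candidate, so writes kept failing until the service was restarted. The "Cannot achieve consistency level LOCAL_QUORUM" errors fit that picture, since LOCAL_QUORUM needs floor(RF/2)+1 acknowledgements from replicas in the local datacenter (2 of 3 if the keyspace uses the common RF=3; the actual replication factor is not stated in this task), so a small cluster has little headroom when the client loses effective connectivity to a node. The toy sketch below illustrates the suspected failure mode and the kind of re-dial guard a fix would add; it is not Kask's or gocql's actual code, and every identifier in it is invented for this report.

    // Toy model of the suspected failure mode; all names are hypothetical
    // and do not correspond to Kask's or gocql's real API.
    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    type hostPool struct {
        mu    sync.Mutex
        conns map[string]bool // host -> has a live connection
        up    map[string]bool // host -> offered as a query candidate
    }

    func newHostPool(hosts []string) *hostPool {
        p := &hostPool{conns: map[string]bool{}, up: map[string]bool{}}
        for _, h := range hosts {
            p.conns[h], p.up[h] = true, true
        }
        return p
    }

    // markDown simulates the node reboot: its connections are torn down
    // and it stops being a candidate.
    func (p *hostPool) markDown(h string) {
        p.mu.Lock()
        defer p.mu.Unlock()
        p.conns[h], p.up[h] = false, false
    }

    // markUp simulates the node rejoining: it becomes a candidate again,
    // but nothing here re-dials it -- the suspected bug.
    func (p *hostPool) markUp(h string) {
        p.mu.Lock()
        defer p.mu.Unlock()
        p.up[h] = true
    }

    // reconnect is the guard a fix (or a periodic health check) would add:
    // re-dial any candidate host that has no live connection.
    func (p *hostPool) reconnect() {
        p.mu.Lock()
        defer p.mu.Unlock()
        for h, candidate := range p.up {
            if candidate && !p.conns[h] {
                p.conns[h] = true // stand-in for actually re-dialling the host
            }
        }
    }

    // query fails when the chosen host is a candidate without a live
    // connection -- the state that surfaced as 500s and quorum errors.
    func (p *hostPool) query(h string) error {
        p.mu.Lock()
        defer p.mu.Unlock()
        if p.up[h] && !p.conns[h] {
            return errors.New("candidate host has no live connection")
        }
        return nil
    }

    func main() {
        pool := newHostPool([]string{"sessionstore1001", "sessionstore1002"})
        pool.markDown("sessionstore1001")
        pool.markUp("sessionstore1001")             // node rebooted and rejoined
        fmt.Println(pool.query("sessionstore1001")) // error: the outage state
        pool.reconnect()                            // what restarting kask effectively forced
        fmt.Println(pool.query("sessionstore1001")) // <nil>: recovered
    }

Restarting the whole service rebuilds every pool from scratch, which is consistent with urandom's read that the kask/sessionstore restart, though slow to take effect, is what fixed it; a longer-term fix would make the pool (or a periodic health check) re-dial hosts that come back, as discussed in the follow-ups above.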


Event Timeline

Anecdotal reports from several users of concurrent issues logging in.

On Wikidata, there were lots of CSRF invalid-token errors during these minutes.

Mentioned in SAL (#wikimedia-operations) [2023-01-24T21:03:29Z] <TheresNoTime> holding UTC late backport window for outage, T327815

I attempted to save some edits on enwiki and got a token error of some sort (I didn't think to save or copy it anywhere). The error suggested I try logging out, so I did.

I was greeted with this error when attempting to log back in:

"There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form. You may receive this message if you are blocking cookies."

This was reproducible in both Firefox and Chrome.

A little while later, the error changed to:

"No active login attempt is in progress for your session."

If you are seeing session-related errors, there is no need to add drive-by comments quoting the message or confirming that you are affected.

SRE are aware and will provide an update as needed.

TheresNoTime lowered the priority of this task from Unbreak Now! to Needs Triage. Jan 24 2023, 9:25 PM
Jdforrester-WMF claimed this task.
Jdforrester-WMF subscribed.

Starting incident report at https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_MediaWiki – this should now be over.

Urbanecm subscribed.

This should now be resolved. Feel free to reopen should this still happen to you.