
Reports of a high number of edits being rejected due to loss of session data
Closed, Resolved · Public

Description

NOTE: Since the creation of this report, statistics on session loss are available to monitor progress.

Recently (i.e. over the past few weeks), it has happened far more often than usual that I have to save an edit twice to get it through, because the error

"session_fail_preview": "<strong>Sorry! We could not process your edit due to a loss of session data.</strong>\nPlease try again.\nIf it still does not work, try [[Special:UserLogout|logging out]] and logging back in.",

has become far more frequent. I haven't changed my editing patterns; for instance, I am not waiting any longer between loading action=edit and submitting than I used to, so the issue is server-side.

I have seen multiple reports of this around, on IRC and on at least two wikis. There were many caching-related changes recently, so that is the most obvious suspect.

Aside from the annoyance and productivity loss, the most obvious damage is that an unknown number of edits are lost forever (when editors do not notice that the edit was not saved).

Per @Whatamidoing-WMF's note to the operations list:

There's been a significant uptick in the number of complaints about people losing session data during the last few days. Some editors report that it's happening for a majority of sessions. The discussion at en.wp is here: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#.22Loss_of_session_data.22_error_on_Save_page

Related Objects

Mentioned In
T112446: Upload wizard fails with api-error-badtoken, cannot resubmit
T108985: Monitor MediaWiki sessions
T88635: ObjectCacheSessionHandler should avoid pointless writes in write()
rMW0c82a7a039f5: Revert I4afaecd8: Avoiding writing sessions for no reason
rMWe48fec5a8ab5: Revert I4afaecd8: Avoiding writing sessions for no reason
rOMWCd2813e1b8ae7: Revert "Set $wgAjaxEditStash to false, on suspicion of being implicated in…
rOMWCc3ee63d33229: Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199
rMW160f69871cea: Debug logging for T102199
rMWc72b7c435f00: Debug logging for T102199 (take 2)
rMWeb281630ce36: Debug logging for T102199
rOMWC9efc62ba3c87: Add a debug log channel for bug T102199
T106986: High number of (session) redis connection failures
T104326: Numerous sv.wp users get repeatedly logged out again
T106066: Don't show "Nonce already used" error on memcache failure
T104795: Frequent session data lost
rMW646fdc978c01: Added pre-emptive session renewal to avoid "random" submission errors
T101224: Stale pages after saved edit
T103236: Forces users to relogin every couple hundred edits ("loss of session data" error)
rMW1e7076c6e166: Instrument edit failures
rMWcf7df757f2b4: Instrument edit failures
Mentioned Here
rMW532ef7851c69: Avoiding writing sessions for no reason
rMW646fdc978c01: Added pre-emptive session renewal to avoid "random" submission errors
T107635: investigate ethernet errors: asw2-a5-eqiad port xe-0/0/36
T106986: High number of (session) redis connection failures
T106066: Don't show "Nonce already used" error on memcache failure
rMW833bdbab37cd: Fixed $flags bit operation precedence fail in User::loadFromDatabase()
rMW5399fba68b99: Use less fuzzy User::getDBTouched() in ApiStashEdit::getStashKey()
rMW5b2670b31b91: Made User::touch no longer call load()
T102928: when saving after editing for a long time, try to get a token automatically instead of showing session_fail_preview

Event Timeline


I wonder whether it has anything to do with @aaron's change for T102928.

"It" what? That was after this bug, in an attempt to improve things.

This bug. If something became worse in the last few days, it may or may not be related.

I agree with Amire80, this is getting significantly worse. I made 13 edits and performed 13 checkusers between 0056 and 0415 hours UTC today (July 8). Of those, all but two of the edits failed on the first attempt, and at least 5 of the checkusers failed on the first attempt (usually when doing a second check but switching checking parameters). I am concerned that new users in particular may become frustrated and abandon attempts to edit.

Oh. Probably related: at https://edit-analysis.wmflabs.org/compare/, "failure rates by type" shows that "bad-token" errors doubled from 2015-06-04 to 2015-06-06. There has been no improvement in recent days.

Yeah, 3% is pretty high for this issue (historically it's been ~1–2% for VisualEditor users), but from our end it's unclear what's caused it; our investigations have not found anything useful, and in particular I've not been able to reproduce it.

I haven't seen this bug when I used the "Preview" button in one of the open browser tabs before clicking the "Save" button in the other tabs.

→ The Preview button is a workaround.

For statistics: filter the edits with "save" → fail → "preview" → "save" → success.

This just happened to me (16:15 CET) on https://nl.wikipedia.org/wiki/Wikipedia:Te_beoordelen_pagina%27s/Toegevoegd_20150714#Toegevoegd_14.2F07:_Deel_1. I edited the text and immediately pressed save without previewing. After getting the error message, I pressed save again and the text got saved.

Glaisher raised the priority of this task from High to Unbreak Now! (Jul 14 2015, 5:43 PM)

There are probably lots of other unreported occurrences of this, with newbies (and regular users) not knowing what happened and thus losing constructive edits... Raising priority.

Memcached is still used for this stuff, right? Is this trend in mc_evictions on the memcached servers normal? The eviction rate seems to pick up around the end of May.

Update: no, $wgSessionCacheType is configured to use Redis (which, unlike memcached, is not contacted through nutcracker):

if ( $wmgUseClusterSession ) {
	require( getRealmSpecificFilename( "$wmfConfigDir/session.php" ) );

	$wgObjectCaches['sessions'] = array(
		'class' => 'RedisBagOStuff',
		'servers' => $sessionRedis[$wmfDatacenter],
		'password' => $wmgRedisPassword,
		'loggroup' => 'redis',
	);
}
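
For reference, a minimal sketch of how that cache ends up being selected for sessions (assumed wiring; the exact lines in wmf-config may differ):

// $wgSessionCacheType is a core MediaWiki setting; the value 'sessions'
// refers to the $wgObjectCaches entry defined above.
$wgSessionCacheType = 'sessions';

// MediaWiki then resolves that key to the RedisBagOStuff instance,
// roughly equivalent to:
$sessionCache = ObjectCache::getInstance( $wgSessionCacheType );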

Hmm https://ganglia.wikimedia.org/latest/?r=custom&cs=05%2F27%2F2015+00%3A00&ce=07%2F15%2F2015+00%3A00&c=Redis+eqiad

Yes, I think this issue was introduced with version 1.26wmf6 (or 1.26wmf5?).

Possibly fixed by restart of some misbehaving memcache servers (T106066#1458967).

We still have problems on it.wikiversity; most of the time I need to save twice.

It is better now, but the issue is not solved:
It just happened in [[de:Veröffentlichungen_von_WikiLeaks]] and [[de:UN-Klimakonferenz_in_Kopenhagen_2009]], but not in [[de:Klimaskeptizismus]] a few seconds earlier.

"Preview" is no longer required; a second "Save" is enough now. The frequency of the issue has declined.

Unsure if this is related to T106986, so if anyone is still experiencing this, please let us know.

Change 227920 had a related patch set uploaded (by Ori.livneh):
Debug logging for T102199

https://gerrit.wikimedia.org/r/227920

Change 227921 had a related patch set uploaded (by Ori.livneh):
Debug logging for T102199

https://gerrit.wikimedia.org/r/227921

Change 227923 had a related patch set uploaded (by Ori.livneh):
Debug logging for T102199

https://gerrit.wikimedia.org/r/227923

Unless I'm misreading them, the statistics already confirm the severity of the issue: Wikipedias have about 4 edits/second (cf. editswiki which became en.wiki-only) and the logging counts 5 session failures per second. No improvement recently.

Change 228141 had a related patch set uploaded (by Ori.livneh):
Add a debug log channel for bug T102199

https://gerrit.wikimedia.org/r/228141

Change 228142 had a related patch set uploaded (by Ori.livneh):
Debug logging for T102199 (take 2)

https://gerrit.wikimedia.org/r/228142

Change 228142 merged by Ori.livneh:
Debug logging for T102199 (take 2)

https://gerrit.wikimedia.org/r/228142

Change 228141 merged by jenkins-bot:
Add a debug log channel for bug T102199

https://gerrit.wikimedia.org/r/228141

Distribution of session loss errors by rack of server which logged the error:

Rack    % of Errors    % of Traffic
A6      7.26%          7.34%
A7      18.30%         16.77%
B6      15.30%         16.77%
B7      12.78%         12.05%
B8      4.57%          4.72%
C6      18.45%         16.98%
D5      23.34%         25.37%

Based on log data from 2015-07-30 21:06:41 UTC - 2015-07-31 04:52:37 UTC, representing 1,301 errors.

[palladium:~] $ sudo salt --out=raw -b25% -t60 -G 'deployment_target:scap/scap' cmd.run "grep -c 'error [^0]' /var/log/nutcracker/nutcracker.log"
Server    Errors
mw1001    18
mw1002    16
mw1003    23
mw1004    14
mw1007    12
mw1008    29
mw1009    6
mw1011    3
mw1012    6
mw1013    20
mw1014    12
mw1017    0
mw1019    7
mw1020    17
mw1022    11
mw1024    9
mw1025    12
mw1028    5
mw1030    12
mw1031    0
mw1033    9
mw1036    19
mw1037    20
mw1038    5
mw1039    18
mw1040    6
mw1041    11
mw1042    14
mw1044    21
mw1045    13
mw1046    24
mw1048    9
mw1049    15
mw1050    11
mw1052    4
mw1053    27
mw1054    14
mw1055    18
mw1056    24
mw1057    8
mw1059    20
mw1061    6
mw1062    25
mw1063    21
mw1065    14
mw1066    15
mw1067    12
mw1068    12
mw1069    14
mw1070    10
mw1072    7
mw1073    17
mw1075    6
mw1076    29
mw1078    18
mw1079    28
mw1081    22
mw1082    23
mw1083    5
mw1084    9
mw1085    10
mw1086    6
mw1090    5
mw1091    16
mw1093    24
mw1094    27
mw1096    11
mw1099    13
mw1100    3
mw1101    4
mw1102    19
mw1103    6
mw1105    11
mw1107    6
mw1108    9
mw1109    12
mw1110    4
mw1113    2
mw1114    3
mw1117    4
mw1119    5
mw1124    15
mw1125    16
mw1128    0
mw1130    2
mw1131    0
mw1132    0
mw1133    3
mw1134    9
mw1136    1
mw1137    16
mw1140    2
mw1141    0
mw1142    3
mw1143    7
mw1144    0
mw1145    17
mw1146    7
mw1147    0
mw1148    5
mw1149    16
mw1150    27
mw1151    17
mw1152    0
mw1153    0
mw1154    0
mw1155    0
mw1157    1
mw1158    1
mw1159    0
mw1160    0
mw1163    7
mw1165    9
mw1167    19
mw1169    1
mw1170    11
mw1171    27
mw1173    15
mw1175    12
mw1176    14
mw1177    16
mw1178    18
mw1179    12
mw1180    13
mw1185    22
mw1187    21
mw1188    7
mw1189    5
mw1190    3
mw1192    0
mw1194    17
mw1197    11
mw1198    1
mw1200    10
mw1201    4
mw1202    3
mw1203    7
mw1204    0
mw1205    5
mw1207    12
mw1208    4
mw1209    10
mw1210    23
mw1211    15
mw1214    12
mw1215    14
mw1216    15
mw1217    21
mw1218    13
mw1221    11
mw1222    4
mw1224    10
mw1226    9
mw1228    0
mw1230    14
mw1231    0
mw1232    0
mw1233    14
mw1234    2
mw1235    2
mw1236    13
mw1237    29
mw1238    28
mw1239    44
mw1240    29
mw1241    25
mw1242    17
mw1243    19
mw1244    20
mw1248    16
mw1250    14
mw1252    31
mw1253    18
mw1254    14
mw1257    40
mw1258    23

Change 228211 had a related patch set uploaded (by Ori.livneh):
Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199

https://gerrit.wikimedia.org/r/228211

Change 228211 merged by jenkins-bot:
Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199

https://gerrit.wikimedia.org/r/228211

Change 228216 had a related patch set uploaded (by Ori.livneh):
Revert "Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199"

https://gerrit.wikimedia.org/r/228216

Change 228216 merged by jenkins-bot:
Revert "Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199"

https://gerrit.wikimedia.org/r/228216

I used Salt to aggregate timeout errors in the nutcracker logs by destination server IP:

sudo salt --out=raw -b25% -t60 -G 'deployment_target:scap/scap' cmd.run "grep -Po \"(?<=nc_core.c:237 close s \d\d ')[^']+(?=.*Connection timed out$)\" /var/log/nutcracker/nutcracker.log | sort | uniq -c | sort -rn | head -3"

I then mapped each server to its rack and aggregated timeout failures by source / destination racks:

Destination    Source    Timeouts
A5             A6        247
A5             A7        519
A5             B6        519
A5             B7        329
A5             B8        147
A5             C6        455
A5             D5        453
C8             A7        8
C8             B6        3
D8             A6        4
D8             A7        17
D8             B6        34
D8             B7        33
D8             C6        40

Memcache / Redis hosts on the A5 rack represent only a third of the Memcache / Redis cluster, but they are conspicuously overrepresented in the error logs.
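
The mapping step described above could look roughly like the following sketch (hypothetical: the host-to-rack map and the parsed timeout counts are illustrative placeholders, not the real data pipeline):

<?php
// Sum nutcracker timeout counts per (destination rack, source rack) pair.
// $hostToRack and $timeouts are placeholder inputs for illustration only.
$hostToRack = [
	'mw1001' => 'A6', // app server (source); mapping is illustrative
	'mc1001' => 'A5', // memcache/redis host (destination); mapping is illustrative
	// ...
];
$timeouts = [
	// parsed from the nutcracker logs gathered with the salt command above
	[ 'src' => 'mw1001', 'dst' => 'mc1001', 'count' => 18 ],
	// ...
];

$byRackPair = [];
foreach ( $timeouts as $t ) {
	$key = ( $hostToRack[ $t['dst'] ] ?? '?' ) . "\t" . ( $hostToRack[ $t['src'] ] ?? '?' );
	$byRackPair[ $key ] = ( $byRackPair[ $key ] ?? 0 ) + $t['count'];
}
arsort( $byRackPair );
foreach ( $byRackPair as $pair => $count ) {
	echo "$pair\t$count\n"; // destination rack, source rack, total timeouts
}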

One of the interfaces on asw2-a5-eqiad.mgmt.eqiad.wmnet is persistently getting input errors:

Network.png (337×597 px, 24 KB)

http://torrus.wikimedia.org/torrus/Network?nodeid=if//asw2-a5-eqiad.mgmt.eqiad.wmnet//ae0//inerr

There are most likely two distinct issues here. The first is the subtle but longstanding fault in network equipment, shown in the graph above, and now tracked in T107635. This fault is plausibly causing occasional session loss errors, but it cannot explain any recent spike in such errors, if indeed there is one.

The second issue is a software defect somewhere in MediaWiki that is causing sessions to expire more quickly than we want them to. I haven't isolated this yet, but rMW646fdc9 and rMW532ef78 look suspect.
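
For context on why those commits look suspect: with TTL-based session storage, a write handler that skips "pointless" writes never pushes the expiry time back, so a session can lapse even while the user still has an edit form open. A minimal illustration of that failure mode (a sketch under that assumption, not the actual ObjectCacheSessionHandler code):

<?php
// Sketch only: a session write handler that skips the backend write when the
// session data is unchanged. Saving the round trip also means the stored
// key's TTL is never refreshed, so long-lived edit sessions can expire
// underneath an active user.
class SketchSessionHandler {
	private $cache;       // e.g. a RedisBagOStuff in production
	private $ttl = 3600;  // assumed session lifetime in seconds
	private $lastData = [];

	public function __construct( $cache ) {
		$this->cache = $cache;
	}

	public function write( $id, $data ) {
		if ( isset( $this->lastData[$id] ) && $this->lastData[$id] === $data ) {
			// "Pointless" write skipped: no TTL refresh happens here.
			return true;
		}
		$this->lastData[$id] = $data;
		return $this->cache->set( "session:$id", $data, $this->ttl );
	}
}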

I reverted rMW646fdc9 at 20:14 UTC and the ratio of session loss errors to edits dipped considerably:

download.png (385×688 px, 49 KB)

The instrumentation that produces the data for that graph is likely underreporting the magnitude of the problem, because it is not consistent with what editors have been reporting. But it is probably underreporting by some constant factor. So the decline in errors that we're seeing is substantial and real.

Change 228423 had a related patch set uploaded (by Ori.livneh):
Revert I4afaecd8: Avoiding writing sessions for no reason

https://gerrit.wikimedia.org/r/228423

Change 228430 had a related patch set uploaded (by Ori.livneh):
Revert I4afaecd8: Avoiding writing sessions for no reason

https://gerrit.wikimedia.org/r/228430

Change 228423 merged by jenkins-bot:
Revert I4afaecd8: Avoiding writing sessions for no reason

https://gerrit.wikimedia.org/r/228423

Change 228430 merged by Ori.livneh:
Revert I4afaecd8: Avoiding writing sessions for no reason

https://gerrit.wikimedia.org/r/228430

Hello Ori - Since the change you made, I have had NO unexpected losses of session data. This is a massive improvement, as the failure rate was verging toward 90% on checkuserwiki when I had the page open for as little as 2 minutes, and 100% if it was open for more than 10 minutes.

Yeah, 'badtoken' events have gone down very significantly for VisualEditor (and I imagine even more so for the wikitext editor, which has a much worse user outcome when it fails): https://edit-analysis.wmflabs.org/compare/ (narrow the date range to 2015-07-27 – 2015-08-02). Down from ~5% (a third of all edit failures) to 0.9% today.

Should we consider this fixed?

ori claimed this task.

@ori, thank you for investing so much time in this!