
Reports of a high number of edits being rejected due to loss of session data
Closed, Resolved · Public

Description

NOTE: Since the creation of this report, statistics on session loss are available to monitor progress.

Recently (i.e. over the past few weeks), it has happened far more often than usual that I have to save an edit twice to get it through, because the error

"session_fail_preview": "<strong>Sorry! We could not process your edit due to a loss of session data.</strong>\nPlease try again.\nIf it still does not work, try [[Special:UserLogout|logging out]] and logging back in.",

has become far more frequent. I haven't changed my editing patterns; for instance, I am not waiting any longer between loading action=edit and submitting than I used to, so the issue is server-side.

I have seen multiple reports of this around, on IRC and on at least two wikis. There were many caching-related changes recently, so that is the most obvious suspect.

Aside from the annoyance and productivity loss, the most obvious damage is that an unknown number of edits are lost forever (when editors do not notice that the edit was not saved).

Per @Whatamidoing-WMF's note to the operations list:

There's been a significant uptick in the number of complaints about people losing session data during the last few days. Some editors report that it's happening for a majority of sessions. The discussion at en.wp is here: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#.22Loss_of_session_data.22_error_on_Save_page

Related Objects

Mentioned In
T112446: Upload wizard fails with api-error-badtoken, cannot resubmit
T108985: Monitor MediaWiki sessions
T88635: ObjectCacheSessionHandler should avoid pointless writes in write()
rMW0c82a7a039f5: Revert I4afaecd8: Avoiding writing sessions for no reason
rMWe48fec5a8ab5: Revert I4afaecd8: Avoiding writing sessions for no reason
rOMWCd2813e1b8ae7: Revert "Set $wgAjaxEditStash to false, on suspicion of being implicated in…
rOMWCc3ee63d33229: Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199
rMW160f69871cea: Debug logging for T102199
rMWc72b7c435f00: Debug logging for T102199 (take 2)
rMWeb281630ce36: Debug logging for T102199
rOMWC9efc62ba3c87: Add a debug log channel for bug T102199
T106986: High number of (session) redis connection failures
T104326: Numerous sv.wp users get repeatedly logged out again
T106066: Don't show "Nonce already used" error on memcache failure
T104795: Frequent session data lost
rMW646fdc978c01: Added pre-emptive session renewal to avoid "random" submission errors
T101224: Stale pages after saved edit
T103236: Forces users to relogin every couple hundred edits ("loss of session data" error)
rMW1e7076c6e166: Instrument edit failures
rMWcf7df757f2b4: Instrument edit failures
Mentioned Here
rMW532ef7851c69: Avoiding writing sessions for no reason
rMW646fdc978c01: Added pre-emptive session renewal to avoid "random" submission errors
T107635: investigate ethernet errors: asw2-a5-eqiad port xe-0/0/36
T106986: High number of (session) redis connection failures
T106066: Don't show "Nonce already used" error on memcache failure
rMW833bdbab37cd: Fixed $flags bit operation precedence fail in User::loadFromDatabase()
rMW5399fba68b99: Use less fuzzy User::getDBTouched() in ApiStashEdit::getStashKey()
rMW5b2670b31b91: Made User::touch no longer call load()
T102928: when saving after editing for a long time, try to get a token automatically instead of showing session_fail_preview

Event Timeline


I wonder whether it has anything to do with @aaron's change for T102928.

"It" what? That was after this bug, in an attempt to improve things.

This bug. If something became worse in the last few days, it may or may not be related.

I agree with Amire80, this is getting significantly worse. I made 13 edits and performed 13 checkusers between 0056 and 0415 hours UTC today (July 8). Of those, all but two of the edits failed on the first attempt, and at least 5 of the checkusers failed on the first attempt (usually when doing a second check but switching checking parameters). I am concerned that new users in particular may become frustrated and abandon attempts to edit.

Oh. Probably related: at https://edit-analysis.wmflabs.org/compare/, "failure rates by type" shows that "bad-token" errors doubled from 2015-06-04 to 2015-06-06. There has been no improvement in recent days.

Yeah, 3% is pretty high for this issue (historically it's been ~1–2% for VisualEditor users), but from our end it's unclear what's caused it; our investigations have not found anything useful, and in particular I've not been able to reproduce it.

I haven't seen this bug when I used the "Preview" button in one of the open browser tabs before clicking the "Save" button in the other tabs.

→ The Preview button is a workaround.

For statistics: filter the edits with "save" → fail → "preview" → "save" → success.

This just happened to me (16:15 CET) on https://nl.wikipedia.org/wiki/Wikipedia:Te_beoordelen_pagina%27s/Toegevoegd_20150714#Toegevoegd_14.2F07:_Deel_1. I edited the text and immediately pressed save without previewing. After getting the error message, I pressed save again and the text got saved.

Glaisher raised the priority of this task from High to Unbreak Now! (Jul 14 2015, 5:43 PM)

There are probably lots of other unreported occurrences of this, with newbies (and regular users) not knowing what happened and thus losing constructive edits... Raising priority.

Memcached is still used for this stuff, right? Is this trend in mc_evictions on the memcached servers normal? The eviction rate seems to pick up around the end of May.

Update: no, $wgSessionCacheType is configured to use Redis (which, unlike memcached, is not contacted through nutcracker):

if ( $wmgUseClusterSession ) {
	require( getRealmSpecificFilename( "$wmfConfigDir/session.php" ) );

	$wgObjectCaches['sessions'] = array(
		'class' => 'RedisBagOStuff',
		'servers' => $sessionRedis[$wmfDatacenter],
		'password' => $wmgRedisPassword,
		'loggroup' => 'redis',
	);
}
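
For reference, a minimal sketch of how that cache ends up being selected for sessions (assumed wiring; the exact lines in wmf-config may differ):

// $wgSessionCacheType is a core MediaWiki setting; the value 'sessions'
// refers to the $wgObjectCaches entry defined above.
$wgSessionCacheType = 'sessions';

// MediaWiki then resolves that key to the RedisBagOStuff instance,
// roughly equivalent to:
$sessionCache = ObjectCache::getInstance( $wgSessionCacheType );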

Hmm https://ganglia.wikimedia.org/latest/?r=custom&cs=05%2F27%2F2015+00%3A00&ce=07%2F15%2F2015+00%3A00&c=Redis+eqiad

Yes, I think this issue was introduced with version 1.26wmf6 (or 1.26wmf5?).

Possibly fixed by restart of some misbehaving memcache servers (T106066#1458967).

We still have problems on it.wikiversity; most of the time I need to save twice.

It is better now, but the issue is not solved:
It just happened in [[de:Veröffentlichungen_von_WikiLeaks]] and [[de:UN-Klimakonferenz_in_Kopenhagen_2009]], but not in [[de:Klimaskeptizismus]] a few seconds earlier.

"Preview" is no longer required; a second "Save" is enough now. The frequency of the issue has declined.

Unsure if this is related to T106986, so if anyone is still experiencing this, please let us know.

Change 227920 had a related patch set uploaded (by Ori.livneh):
Debug logging for T102199

https://gerrit.wikimedia.org/r/227920

Change 227921 had a related patch set uploaded (by Ori.livneh):
Debug logging for T102199

https://gerrit.wikimedia.org/r/227921

Change 227923 had a related patch set uploaded (by Ori.livneh):
Debug logging for T102199

https://gerrit.wikimedia.org/r/227923

Unless I'm misreading them, the statistics already confirm the severity of the issue: Wikipedias have about 4 edits/second (cf. editswiki which became en.wiki-only) and the logging counts 5 session failures per second. No improvement recently.

Change 228141 had a related patch set uploaded (by Ori.livneh):
Add a debug log channel for bug T102199

https://gerrit.wikimedia.org/r/228141

Change 228142 had a related patch set uploaded (by Ori.livneh):
Debug logging for T102199 (take 2)

https://gerrit.wikimedia.org/r/228142

Change 228142 merged by Ori.livneh:
Debug logging for T102199 (take 2)

https://gerrit.wikimedia.org/r/228142

Change 228141 merged by jenkins-bot:
Add a debug log channel for bug T102199

https://gerrit.wikimedia.org/r/228141

Distribution of session loss errors by rack of server which logged the error:

Rack    % of Errors    % of Traffic
A6      7.26%          7.34%
A7      18.30%         16.77%
B6      15.30%         16.77%
B7      12.78%         12.05%
B8      4.57%          4.72%
C6      18.45%         16.98%
D5      23.34%         25.37%

Based on log data from 2015-07-30 21:06:41 UTC - 2015-07-31 04:52:37 UTC, representing 1,301 errors.

[palladium:~] $ sudo salt --out=raw -b25% -t60 -G 'deployment_target:scap/scap' cmd.run "grep -c 'error [^0]' /var/log/nutcracker/nutcracker.log"
Server    Errors
mw1001    18
mw1002    16
mw1003    23
mw1004    14
mw1007    12
mw1008    29
mw1009    6
mw1011    3
mw1012    6
mw1013    20
mw1014    12
mw1017    0
mw1019    7
mw1020    17
mw1022    11
mw1024    9
mw1025    12
mw1028    5
mw1030    12
mw1031    0
mw1033    9
mw1036    19
mw1037    20
mw1038    5
mw1039    18
mw1040    6
mw1041    11
mw1042    14
mw1044    21
mw1045    13
mw1046    24
mw1048    9
mw1049    15
mw1050    11
mw1052    4
mw1053    27
mw1054    14
mw1055    18
mw1056    24
mw1057    8
mw1059    20
mw1061    6
mw1062    25
mw1063    21
mw1065    14
mw1066    15
mw1067    12
mw1068    12
mw1069    14
mw1070    10
mw1072    7
mw1073    17
mw1075    6
mw1076    29
mw1078    18
mw1079    28
mw1081    22
mw1082    23
mw1083    5
mw1084    9
mw1085    10
mw1086    6
mw1090    5
mw1091    16
mw1093    24
mw1094    27
mw1096    11
mw1099    13
mw1100    3
mw1101    4
mw1102    19
mw1103    6
mw1105    11
mw1107    6
mw1108    9
mw1109    12
mw1110    4
mw1113    2
mw1114    3
mw1117    4
mw1119    5
mw1124    15
mw1125    16
mw1128    0
mw1130    2
mw1131    0
mw1132    0
mw1133    3
mw1134    9
mw1136    1
mw1137    16
mw1140    2
mw1141    0
mw1142    3
mw1143    7
mw1144    0
mw1145    17
mw1146    7
mw1147    0
mw1148    5
mw1149    16
mw1150    27
mw1151    17
mw1152    0
mw1153    0
mw1154    0
mw1155    0
mw1157    1
mw1158    1
mw1159    0
mw1160    0
mw1163    7
mw1165    9
mw1167    19
mw1169    1
mw1170    11
mw1171    27
mw1173    15
mw1175    12
mw1176    14
mw1177    16
mw1178    18
mw1179    12
mw1180    13
mw1185    22
mw1187    21
mw1188    7
mw1189    5
mw1190    3
mw1192    0
mw1194    17
mw1197    11
mw1198    1
mw1200    10
mw1201    4
mw1202    3
mw1203    7
mw1204    0
mw1205    5
mw1207    12
mw1208    4
mw1209    10
mw1210    23
mw1211    15
mw1214    12
mw1215    14
mw1216    15
mw1217    21
mw1218    13
mw1221    11
mw1222    4
mw1224    10
mw1226    9
mw1228    0
mw1230    14
mw1231    0
mw1232    0
mw1233    14
mw1234    2
mw1235    2
mw1236    13
mw1237    29
mw1238    28
mw1239    44
mw1240    29
mw1241    25
mw1242    17
mw1243    19
mw1244    20
mw1248    16
mw1250    14
mw1252    31
mw1253    18
mw1254    14
mw1257    40
mw1258    23

Change 228211 had a related patch set uploaded (by Ori.livneh):
Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199

https://gerrit.wikimedia.org/r/228211

Change 228211 merged by jenkins-bot:
Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199

https://gerrit.wikimedia.org/r/228211

Change 228216 had a related patch set uploaded (by Ori.livneh):
Revert "Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199"

https://gerrit.wikimedia.org/r/228216

Change 228216 merged by jenkins-bot:
Revert "Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199"

https://gerrit.wikimedia.org/r/228216

I used Salt to aggregate timeout errors in the nutcracker logs by destination server IP:

sudo salt --out=raw -b25% -t60 -G 'deployment_target:scap/scap' cmd.run "grep -Po \"(?<=nc_core.c:237 close s \d\d ')[^']+(?=.*Connection timed out$)\" /var/log/nutcracker/nutcracker.log | sort | uniq -c | sort -rn | head -3"

I then mapped each server to its rack and aggregated timeout failures by source / destination racks:

Destination    Source    Timeouts
A5             A6        247
A5             A7        519
A5             B6        519
A5             B7        329
A5             B8        147
A5             C6        455
A5             D5        453
C8             A7        8
C8             B6        3
D8             A6        4
D8             A7        17
D8             B6        34
D8             B7        33
D8             C6        40

Memcache / Redis hosts on the A5 rack represent only a third of the Memcache / Redis cluster, but they are conspicuously overrepresented in the error logs.
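
The mapping step described above could look roughly like the following sketch (hypothetical: the host-to-rack map and the parsed timeout counts are illustrative placeholders, not the real data pipeline):

<?php
// Sum nutcracker timeout counts per (destination rack, source rack) pair.
// $hostToRack and $timeouts are placeholder inputs for illustration only.
$hostToRack = [
	'mw1001' => 'A6', // app server (source); mapping is illustrative
	'mc1001' => 'A5', // memcache/redis host (destination); mapping is illustrative
	// ...
];
$timeouts = [
	// parsed from the nutcracker logs gathered with the salt command above
	[ 'src' => 'mw1001', 'dst' => 'mc1001', 'count' => 18 ],
	// ...
];

$byRackPair = [];
foreach ( $timeouts as $t ) {
	$key = ( $hostToRack[ $t['dst'] ] ?? '?' ) . "\t" . ( $hostToRack[ $t['src'] ] ?? '?' );
	$byRackPair[ $key ] = ( $byRackPair[ $key ] ?? 0 ) + $t['count'];
}
arsort( $byRackPair );
foreach ( $byRackPair as $pair => $count ) {
	echo "$pair\t$count\n"; // destination rack, source rack, total timeouts
}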

One of the interfaces on asw2-a5-eqiad.mgmt.eqiad.wmnet is persistently getting input errors:

Network.png (337×597 px, 24 KB)

http://torrus.wikimedia.org/torrus/Network?nodeid=if//asw2-a5-eqiad.mgmt.eqiad.wmnet//ae0//inerr

There are most likely two distinct issues here. The first is the subtle but longstanding fault in network equipment, shown in the graph above, and now tracked in T107635. This fault is plausibly causing occasional session loss errors, but it cannot explain any recent spike in such errors, if indeed there is one.

The second issue is a software defect somewhere in MediaWiki that is causing sessions to expire more quickly than we want them to. I haven't isolated this yet, but rMW646fdc9 and rMW532ef78 look suspect.
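
For context on why those commits look suspect: with TTL-based session storage, a write handler that skips "pointless" writes never pushes the expiry time back, so a session can lapse even while the user still has an edit form open. A minimal illustration of that failure mode (a sketch under that assumption, not the actual ObjectCacheSessionHandler code):

<?php
// Sketch only: a session write handler that skips the backend write when the
// session data is unchanged. Saving the round trip also means the stored
// key's TTL is never refreshed, so long-lived edit sessions can expire
// underneath an active user.
class SketchSessionHandler {
	private $cache;       // e.g. a RedisBagOStuff in production
	private $ttl = 3600;  // assumed session lifetime in seconds
	private $lastData = [];

	public function __construct( $cache ) {
		$this->cache = $cache;
	}

	public function write( $id, $data ) {
		if ( isset( $this->lastData[$id] ) && $this->lastData[$id] === $data ) {
			// "Pointless" write skipped: no TTL refresh happens here.
			return true;
		}
		$this->lastData[$id] = $data;
		return $this->cache->set( "session:$id", $data, $this->ttl );
	}
}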

I reverted rMW646fdc9 at 20:14 UTC and the ratio of session loss errors to edits dipped considerably:

download.png (385×688 px, 49 KB)

The instrumentation that produces the data for that graph is likely underreporting the magnitude of the problem, because it is not consistent with what editors have been reporting. But it is probably underreporting by some constant factor. So the decline in errors that we're seeing is substantial and real.

Change 228423 had a related patch set uploaded (by Ori.livneh):
Revert I4afaecd8: Avoiding writing sessions for no reason

https://gerrit.wikimedia.org/r/228423

Change 228430 had a related patch set uploaded (by Ori.livneh):
Revert I4afaecd8: Avoiding writing sessions for no reason

https://gerrit.wikimedia.org/r/228430

Change 228423 merged by jenkins-bot:
Revert I4afaecd8: Avoiding writing sessions for no reason

https://gerrit.wikimedia.org/r/228423

Change 228430 merged by Ori.livneh:
Revert I4afaecd8: Avoiding writing sessions for no reason

https://gerrit.wikimedia.org/r/228430

Hello Ori - Since the change you made, I have had NO unexpected losses of session data. This is a massive improvement, as the failure rate was verging toward 90% on checkuserwiki when I had the page open for as little as 2 minutes, and 100% if it was open for more than 10 minutes.

Yeah, 'badtoken' events have gone down very significantly for VisualEditor (and I imagine even more so for the wikitext editor, which has a much worse user outcome when it fails): https://edit-analysis.wmflabs.org/compare/ (narrow the date range to 2015-07-27 – 2015-08-02). Down from ~5% (a third of all edit failures) to 0.9% today.

Should we consider this fixed?

ori claimed this task.

@ori, thank you for investing so much time in this!