
Unexplained edit token errors
Closed, Resolved, Public

Description

@TheDJ reports frequent edit token errors on the web interface:

11:43 < thedj> ehm, guys, i'm getting a lot of session loss warnings when trying to save after preview... 
11:47 < thedj> anomie: en.wp
13:27 < thedj> i can already see that both browser windows have the same enwiki session id, so i suspect there was no new session cookie
13:28 < thedj> tgr: centralauth tokens were also the same in both windows
13:33 < thedj> tgr: failed again. session cookies unchanged
13:33 < thedj> can't see if there is a header, since Safari filters that out in its inspector view... <head> </desk>
13:41 < thedj> tgr: right. so it definitely seems that the api preview is causing it...
13:41 < thedj> if i have the problem and do save (error), save again, it works.
13:43 < thedj> but if I do save (error), preview, save again, then it seems to error again as well
13:45 < thedj> tgr: or not. 
13:45 < thedj> no, it happens all the time. and it seems to be a lot worse now than it was before.
13:46 < thedj> oh wait, i also have livepreview on of course, so it actually previews 'all the time' if i have the edit window open.
13:47 < thedj> this also explains why not everyone is noticing it as bad as i am

(heavily edited)

Visible on graphite as well:
http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1455047738.237&lineMode=connected&target=MediaWiki.edit.failures.session_loss.count

editing_session_loss.png (308×586 px, 22 KB)

(big spike is T124440)

Timing matches this entry from the SAL: 17:26 <elukey> mc1004.eqiad put back into redis/memcached pool

Event Timeline

Tgr raised the priority of this task from to Needs Triage.
Tgr updated the task description.
Tgr added a project: MediaWiki-Page-editing.
Tgr added subscribers: Tgr, TheDJ, Anomie, bd808.

So, today @elukey installed mc1004, upgrading it to jessie, and the standard jessie redis config ships with

bind 127.0.0.1

and for this reason redis listened only on the loopback interface and was unreachable from other hosts. @TheDJ's session was probably hashed to the "shard4" bucket, so it kept failing over and over.
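To illustrate why one user could hit the failure again and again while most others noticed nothing: a session key that is hashed to a shard always lands on the same shard. The sketch below is a simplified stand-in, not MediaWiki's actual hashing; the shard names, the `pick_shard` and `affected_keys` helpers, and the MD5-modulo scheme are all assumptions made for illustration.

```python
import hashlib

# Hypothetical shard names; the real pool layout is not shown in this task.
SHARDS = ["shard1", "shard2", "shard3", "shard4"]


def pick_shard(session_key: str, shards=SHARDS) -> str:
    """Map a session key to a shard deterministically (plain hash-mod sketch)."""
    digest = hashlib.md5(session_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def affected_keys(keys, down_shard: str):
    """Return the keys whose sessions would live on the unreachable shard."""
    return [k for k in keys if pick_shard(k) == down_shard]
```

Because the mapping is deterministic, a session that happens to fall on the unreachable shard fails on every request, while sessions on the other shards are unaffected.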

Joe triaged this task as High priority. (Feb 9 2016, 11:49 PM)
Joe added a project: SRE.
Joe set Security to None.

I see the error count has normalized since 12:30, so I guess my manual action (I disabled puppet on the server and added a 'bind 0.0.0.0' rule by hand) had a positive effect. I'll close the ticket tomorrow morning if no more reports come in.
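A check like the following could catch a loopback-only `bind` before a host goes back into the pool. It is only a sketch: the `bind_addresses` and `is_loopback_only` helpers are hypothetical, it assumes at most one `bind` directive with plain IP literals, and it treats a config with no `bind` line as not loopback-only, which may not hold for every Redis version.

```python
import ipaddress


def bind_addresses(conf_text: str):
    """Collect the addresses listed on `bind` lines of redis.conf-style text."""
    addrs = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if parts[0].lower() == "bind":
            addrs.extend(parts[1:])
    return addrs


def is_loopback_only(conf_text: str) -> bool:
    """True if every configured bind address is a loopback address."""
    addrs = bind_addresses(conf_text)
    if not addrs:
        # No bind line: treated as "not loopback-only" here (an assumption).
        return False
    return all(ipaddress.ip_address(a).is_loopback for a in addrs)
```

For example, `is_loopback_only("bind 127.0.0.1")` returns `True`, while `is_loopback_only("bind 0.0.0.0")` returns `False`.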

Change 269616 had a related patch set uploaded (by Giuseppe Lavagetto):
role::memcached: explicitly bind redis to 0.0.0.0

https://gerrit.wikimedia.org/r/269616

Change 269616 merged by Giuseppe Lavagetto:
role::memcached: explicitly bind redis to 0.0.0.0

https://gerrit.wikimedia.org/r/269616

@TheDJ reports his issues are over; it's extremely likely the problem was caused by this misconfiguration. Resolved.

Let's also set the resolved state in that case :)