Wrong sidebar cached on sites
Closed, Resolved · Public

Description

Some pages on fr.wikisource display the wrong sidebar for some users (looks like the default, instead of the sidebar defined at https://fr.wikisource.org/wiki/MediaWiki:Sidebar). I always see the good one when logged in, and the bad one when logged out. @Yann can sometimes see the bad one when logged in (not sure about the details).

My screenshots (page: https://fr.wikisource.org/wiki/Page:Tolstoï_-_Le_salut_est_en_vous.djvu/55):

[Screenshots: good when logged in; bad when logged out]

Yann's screenshots:

[Screenshots: good; bad]

Event Timeline

matmarex created this task. Apr 19 2016, 5:15 PM
Restricted Application added subscribers: TerraCodes, Aklapper. Apr 19 2016, 5:15 PM

I'm seeing similar issues on https://wikimediafoundation.org/wiki/Home. The sidebar was the default version for both anon and authed. A page purge didn't seem to make a difference. I did a null edit of https://wikimediafoundation.org/wiki/MediaWiki:Sidebar which seemed to fix things for the authed use-case.

Now I have a strange reproduction case for anons:

x-cache:cp2016 hit(1), cp4017 hit(2), cp4016 frontend hit(10)

server:mw2196.codfw.wmnet
x-cache:cp2023 hit(1), cp4016 miss(0), cp4016 frontend hit(6)

server:mw2181.codfw.wmnet
x-cache:cp2019 hit(2), cp4017 miss(0), cp4016 frontend hit(13)

server:mw2118.codfw.wmnet
x-cache:cp2016 hit(1), cp4017 hit(2), cp4016 frontend hit(11)

The different x-cache header on a hard reload feels a bit like there may be Varnish cache pollution for redir pages.

Purging https://wikimediafoundation.org/w/index.php?title=Questions_for_Wikimedia%3F&redirect=no fixed that one instance. This is easily explained by the fact that [[Questions_for_Wikimedia?]] and [[Answers]] are separate cache objects in Varnish. There are likely many other instances of this (pages that were cached in Varnish when the parser cache representation of the sidebar was incorrect).
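
To illustrate why purging one URL leaves the other untouched, here is a toy model in Python. It is only a sketch: the `UrlCache` class, the URL paths, and the page bodies are made up for illustration, and this is not how Varnish is implemented internally.

```python
class UrlCache:
    """Toy URL-keyed cache; each URL maps to its own stored object."""

    def __init__(self):
        self._objects = {}

    def store(self, url, body):
        self._objects[url] = body

    def purge(self, url):
        # A purge removes only the object stored under this exact URL.
        self._objects.pop(url, None)

    def lookup(self, url):
        # Returns None on a miss.
        return self._objects.get(url)


cache = UrlCache()

# Both pages were rendered while the sidebar was wrong (hypothetical bodies).
cache.store("/wiki/Questions_for_Wikimedia?", "page with default sidebar")
cache.store("/wiki/Answers", "page with default sidebar")

# Purging the redirect page drops that object only.
cache.purge("/wiki/Questions_for_Wikimedia?")

redirect_after_purge = cache.lookup("/wiki/Questions_for_Wikimedia?")
target_after_purge = cache.lookup("/wiki/Answers")
```

After the purge, the redirect URL is a miss and will be re-rendered on the next request, while the target page still holds the stale copy.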

Still unknown is the root cause of the default sidebar being seen. Speculation runs towards parser cache corruption, but that is not yet confirmed. If it were parser cache corruption, it also seems unlikely that it would have affected only [[MediaWiki:Sidebar]]. Other local message overrides, at least, would also be suspect.

Addshore added a subscriber: Addshore. Edited Apr 19 2016, 6:29 PM

Vaguely similar, the link to the main page for wikidata.org now links to the incorrect place.

See https://www.wikidata.org/wiki/Wikidata_talk:Main_Page#Sidebar

Location should be:
https://www.wikidata.org/wiki/Wikidata:Main_Page
but is instead:
https://www.wikidata.org/wiki/Main_Page

I think at some point MessageCache failed and started returning the defaults for everything, which ended up getting cached by other things (sidebar cache, varnish, etc.). Some relevant discussion in #wikimedia-tech.

Joe added a subscriber: Joe. Apr 19 2016, 6:51 PM

So, during the switchover we first wiped the codfw memcached clean, then when moving the traffic over we had a temporary overload of the externalstorage cluster (es* servers) that resulted in quite a few errors. Talking on irc with @Legoktm it seems confirmed that MessageCache would return the default value if a) memcached returns a miss b) externalstorage fails to return the custom data.
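
A minimal Python sketch of the failure mode described above, assuming the fallback works roughly as stated (miss in memcached plus a failed storage fetch yields the default). All names here (`get_message`, `StorageError`, `downstream_cache`) are hypothetical stand-ins, not MediaWiki code.

```python
class StorageError(Exception):
    """Stand-in for a failed external storage fetch (hypothetical)."""


SOFTWARE_DEFAULT = "default sidebar"


def get_message(key, memcached, fetch_from_storage):
    """Return the custom message, hiding storage failures behind the default."""
    if key in memcached:
        return memcached[key]
    try:
        value = fetch_from_storage(key)
    except StorageError:
        # The failure is swallowed here: callers cannot tell this apart
        # from "no custom message exists".
        return SOFTWARE_DEFAULT
    memcached[key] = value
    return value


def overloaded_storage(key):
    raise StorageError("external storage overloaded")


def healthy_storage(key):
    return "custom sidebar"


memcached = {}         # wiped clean before the switchover
downstream_cache = {}  # e.g. sidebar cache or Varnish, modelled as a dict

# During the overload: memcached miss, storage fails, the default is returned
# and then stored by a downstream cache that trusts the result.
downstream_cache["sidebar"] = get_message("sidebar", memcached, overloaded_storage)

# After storage recovers, a direct lookup is correct again...
fresh = get_message("sidebar", memcached, healthy_storage)
# ...but the downstream copy stays wrong until it is purged.
stale = downstream_cache["sidebar"]
```

This is why recovery of the storage cluster alone did not fix the pages: every cache layer that stored the default during the outage needs its own purge.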

Mentioned in SAL [2016-04-19T19:06:39Z] <legoktm> purging sidebar cache across all wikis (T133069)

The script completed in about 4 minutes. Now we need a varnish purge for every page cached after the switchover till my script finished.

Mentioned in SAL [2016-04-19T19:53:14Z] <paravoid> staggered varnish bans for 'obj.http.server ~ "^mw2.+"' as a workaround for T133069

BBlack added a subscriber: BBlack. Apr 19 2016, 8:32 PM

> So, during the switchover we first wiped the codfw memcached clean, then when moving the traffic over we had a temporary overload of the externalstorage cluster (es* servers) that resulted in quite a few errors. Talking on irc with @Legoktm it seems confirmed that MessageCache would return the default value if a) memcached returns a miss b) externalstorage fails to return the custom data.

My $0.02 on this is we really should change the behavior of this code going forward, at whatever layer this is going wrong. Somewhere (perhaps multiple somewheres?) some code is hiding a failure and then injecting defaults in place of the custom data that it failed to fetch. IMHO, we'd be better served by passing on the error as a 500 in a case like this.

mark added a subscriber: mark.
faidon triaged this task as Unbreak Now! priority. Apr 19 2016, 9:10 PM
faidon added a subscriber: faidon.

Varnish bans for obj.http.server ~ ^mw2.+ were gradually deployed over the course of the past hour, so caches for that issue should be "purged". Other than that... fully agreed with Brandon; this shouldn't have happened in the first place and should be fixed before Thursday (day of the switchback).
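
For readers unfamiliar with the ban expression, the following Python sketch shows which cached objects a pattern like `^mw2.+` would select when matched against the `Server` response header. It is a simplified model: the URL paths and the `mw1017.eqiad.wmnet` hostname are hypothetical, and real Varnish bans are evaluated lazily rather than by filtering a dictionary.

```python
import re

# Same regular expression as in the ban; Varnish's "~" is an unanchored
# regex match, which re.search mirrors here.
BAN_PATTERN = re.compile(r"^mw2.+")

# Cached objects keyed by URL, with the value of their Server header.
cached_objects = {
    "/wiki/A": "mw2196.codfw.wmnet",
    "/wiki/B": "mw2181.codfw.wmnet",
    "/wiki/C": "mw1017.eqiad.wmnet",  # hypothetical host outside codfw
}


def apply_ban(objects, pattern):
    """Return the objects that survive the ban (Server header does not match)."""
    return {
        url: server
        for url, server in objects.items()
        if not pattern.search(server)
    }


survivors = apply_ban(cached_objects, BAN_PATTERN)
```

Only objects rendered by hosts whose name starts with `mw2` are invalidated, so content cached from other application servers is left alone.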

Restricted Application added a subscriber: Urbanecm. Apr 19 2016, 9:10 PM
Addshore renamed this task from Wrong sidebar cached? on fr.wikisource to Wrong sidebar cached? on sites. Apr 20 2016, 9:54 AM
Addshore renamed this task from Wrong sidebar cached? on sites to Wrong sidebar cached on sites.
Joe added a comment. Apr 20 2016, 11:18 AM

> So, during the switchover we first wiped the codfw memcached clean, then when moving the traffic over we had a temporary overload of the externalstorage cluster (es* servers) that resulted in quite a few errors. Talking on irc with @Legoktm it seems confirmed that MessageCache would return the default value if a) memcached returns a miss b) externalstorage fails to return the custom data.

> My $0.02 on this is we really should change the behavior of this code going forward, at whatever layer this is going wrong. Somewhere (perhaps multiple somewheres?) some code is hiding a failure and then injecting defaults in place of the custom data that it failed to fetch. IMHO, we'd be better served by passing on the error as a 500 in a case like this.

Either that or (my own two cents) we make the "faulty response" uncachable or cachable for a very short time (as this will happen during outages/switchovers, typically, and that would help with the load). It amounts to emitting the correct cache headers from MediaWiki if fetching data fails completely.
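
The two options discussed here (fail with a 500, or serve the fallback with a short cache lifetime) could look roughly like this in Python. This is a sketch under stated assumptions: `build_response`, `FetchError`, and both TTL values are hypothetical, and the header strings are only examples of standard `Cache-Control` directives, not what MediaWiki actually emits.

```python
class FetchError(Exception):
    """Stand-in for a complete failure to fetch custom data (hypothetical)."""


NORMAL_TTL = 86400  # hypothetical shared-cache lifetime in seconds
DEGRADED_TTL = 10   # hypothetical short lifetime for fallback responses


def build_response(fetch_sidebar, on_failure="short_ttl"):
    """Return (status, headers, body) for a page that needs the sidebar."""
    try:
        sidebar = fetch_sidebar()
    except FetchError:
        if on_failure == "error":
            # Option 1: surface the failure and forbid caching it.
            return 500, {"Cache-Control": "no-store"}, "temporary failure"
        # Option 2: serve the default, but let shared caches keep it
        # only briefly so it expires soon after the outage ends.
        headers = {"Cache-Control": f"s-maxage={DEGRADED_TTL}, max-age=0"}
        return 200, headers, "default sidebar"
    headers = {"Cache-Control": f"s-maxage={NORMAL_TTL}, max-age=0"}
    return 200, headers, sidebar


def failing_fetch():
    raise FetchError("storage unavailable")


def working_fetch():
    return "custom sidebar"


strict = build_response(failing_fetch, on_failure="error")
degraded = build_response(failing_fetch, on_failure="short_ttl")
normal = build_response(working_fetch)
```

The short TTL keeps some cache offload during an outage, at the cost of briefly showing defaults; the 500 avoids wrong content entirely but sends every request to the backend while the failure lasts.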

Sidebar is built by Skin::buildSidebar:

  • Skin::buildSidebar()
    • cached in WANObjectCache.
    • computed by Skin::addToSidebar() using wfMessage()->inContentLanguage()->plain() to fetch the sidebar configuration from an interface message. As for any interface message, it can be overridden via an on-wiki MediaWiki-namespace page by the same name.
  • wfMessage -> Message::fetchMessage -> MessageCache::get -> MessageCache::getMessageFromFallbackChain -> MessageCache::getMessageForLang -> MessageCache->getMsgFromNamespace.
  • MessageCache->getMsgFromNamespace
    • cached in WANObjectCache.
    • computed with Revision::getContent().
    • If it returns null, then a "temporary load failure" is assumed. All callers only support "string" and "boolean false" returns. (FIXME) It's unclear to me how this null results in the default being used.
    • If it returns false, then MessageCache will keep looking in other languages and eventually fall back to the software default.
    • If it returns string, the message is considered found and returned.
  • Revision::getContent() returns null if:
    • the revision is not accessible to the public (e.g. revision delete/oversight).
    • getContentInternal/loadText() returns false.
    • getContentInternal/loadText() returns null.
    • getContentInternal/ContentHandler::unserializeContent() fails.
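
The three return states in the list above can be sketched in Python to show one plausible way a load failure could end up as the default. This is an illustration of the hazard only, not a confirmed explanation and not MediaWiki code: `resolve_lenient` and `resolve_strict` are hypothetical, with `None` standing in for PHP null and `False` for boolean false.

```python
def resolve_lenient(lookups, software_default):
    """Walk per-language lookup results in fallback order.

    Each result is a str (found), False (not defined), or None (load failure).
    A caller that only asks "is it a string?" treats None exactly like False.
    """
    for result in lookups:
        if isinstance(result, str):
            return result
    return software_default


def resolve_strict(lookups, software_default):
    """Same walk, but a load failure is reported instead of skipped."""
    for result in lookups:
        if result is None:
            raise RuntimeError("temporary load failure")
        if isinstance(result, str):
            return result
    return software_default


# A custom message exists but failed to load (None): the lenient walk
# silently falls through to the default.
lenient_result = resolve_lenient([None], "default sidebar")

# A genuinely missing override (False) followed by a fallback-language hit.
fallback_result = resolve_lenient([False, "custom sidebar"], "default sidebar")

# The strict walk refuses to turn a load failure into a default.
try:
    resolve_strict([None], "default sidebar")
    strict_raised = False
except RuntimeError:
    strict_raised = True
```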
Restricted Application added a subscriber: Luke081515. Apr 20 2016, 4:42 PM

Change 284512 had a related patch set uploaded (by Aaron Schulz):
Make MessageCache handle lock timeouts better

https://gerrit.wikimedia.org/r/284512

> [...] wfMessage -> Message::fetchMessage -> MessageCache::get -> MessageCache::getMessageFromFallbackChain -> MessageCache::getMessageForLang -> MessageCache->getMsgFromNamespace. [...]

I can't resist! http://steve-yegge.blogspot.com/2006/03/execution-in-kingdom-of-nouns.html

Change 284512 merged by Ori.livneh:
Make MessageCache handle lock timeouts better

https://gerrit.wikimedia.org/r/284512

Change 284696 had a related patch set uploaded (by Ori.livneh):
Make MessageCache handle lock timeouts better

https://gerrit.wikimedia.org/r/284696

Change 284696 merged by Ori.livneh:
Make MessageCache handle lock timeouts better

https://gerrit.wikimedia.org/r/284696

aaron closed this task as Resolved. Apr 21 2016, 3:44 PM
aaron claimed this task.