Page MenuHomePhabricator

Wikis show more and more pages in an old cached version instead of the current one
Closed, ResolvedPublic

Description

There are more and more user on dewiki, which reports that the will see old version when visiting a page or the history of the page

Some time ago there was a big problem with purge requested not processed by the cache server. Is this problem back? I did not found the bug numbers at the moment.
Or is there another problem?


Version: wmf-deployment
Severity: major

Details

Reference
bz54647

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 2:17 AM
bzimport set Reference to bz54647.

Just to make sure, this is not about redirects is it? (We had an issue with redirects some time ago.)

Am I right that this only happens to users who are not logged in?

According to comments on de.wp [[WP:BYPASS]] in Firefox helps.

greg added a comment.Sep 27 2013, 4:04 PM

cc'ing a couple more people (Bryan and Brad) to ask for them to take a quick look at this one.

It seems we've been having a lot of caching issues lately, for example bug 54513.

What we really need to start with are captures of the HTTP request and response headers that are returning out-of-date content.

10:40 paravoid we had a multicast/htcp outage from the 22nd and until yesterday
10:40 paravoid it was a network issue

orangevolker wrote:

Hi,
"until yesterday": Yes, I notice something has changed. But it's not solved yet completely.
Let me give a summary of my experience:

  • Users logged in see current versions, but users not logged in don't
  • The above mentioned change concerns the following: Initially users not logged in didn't see the current version in the version history list page neither, but now this is the case
  • Finally something that struggles me: Soft and hard reload (like Ctrl + F5 in Firefox) don't help when using IE or Opera. Same is true for even entirely deleting cache and profile paths (testet with Opera 12). Amazingly only Firefox' hard reload (Ctrl + F5) had an effect, namely showing the desired current version of a page. Aaaand: After doing so, browsing that page with IE or Opera also showed the correct current version, regardless of having refreshed the cache again or not.

Strange, isn't it? At least for me. Maybe someone of you tech experts has an explanation.
From my feeling I would suppose that the difference between Firefox' vs Opera 12's hard reloading methods and its effects could lead to an explanation. What do you think?
Liebe Grüße, kind regards from Berlin,
Volker Alexander

orangevolker wrote:

I would like to add following information:

  • The problem doesn't occur with pages which are edited now (testet 2013 09 27, ca. 20:00 +02:00). It occurs with paged edited in the last days.
  • Might be an unnecessary detail: Being logged out after havin been logged in (i. e. with some cookies left, I assume) I see the current version. After erasing cache I see the outdated one.
  • BTW I forgot to mention that I believe to remember we had this or a quite similar problem already. I know it's not helpful this way. Just want to mention if anyone else remembers that better than I do.
  • PerfektesChaos mentioned an http snapshot. I'm not sure what it is, but I browsed "http://de.wikipedia.org/wiki/Convair_F2Y", which is one of those problematic pages, and recorded traffic with Capsa 7 Free, Version 7.4.1, Build 2626. Hope this is helpful.

Weekend starting! Everybody have a nice one!
Volker

orangevolker wrote:

Traffic record of an example of the outdated version issue

Attached:

Looks like we need to send a bunch of purge messages to the cache layer to drop things that were affected by the htcp outage.

I'm working on figuring out how to figure out what those things are. :) If someone already knows how to do this they should, but I'll keep poking at it and try to make some progress.

With Reedy's help we just sent purge messages for 34,447 pages on dewiki and 146,760 pages on enwiki. Hopefully this will make things a little better for anon users.

Writeup of steps taken at https://www.mediawiki.org/wiki/User:BDavis_(WMF)/Notes/Finding_Files_To_Purge.

TL;DR:

  • ran sql against tools-lab slaves to find probably affected pages
  • massaged data into full URLs
  • Reedy sent the purges

The purges that were sent were only for the 0-namespace (main wiki). There are doubtless many other pages that are still stale in varnish. It was also only done for enwiki and dewiki. Other wikis are likely affected as well.

Update: the data that I gave Reedy for purging earlier was malformed. He re-ran the purge with (hopefully) correct data around 2013-09-27T23:15Z.

orangevolker wrote:

I would like to ask something:
How is it possible that Firefox' hard reload has another effect on wiki servers than the one of other browsers?
An empty cache (I mean really deleted, with an entirely new profile) loads a page the same way like a hard reload, I assume. Just to clarify: I consider a soft reload as a reload from browser's cache and a hard reload as a reload from network. So how come that wiki servers are delivering an old version to Opera and IE but a current one to Firefox?
Greetings,
Volker

(In reply to comment #13)

I would like to ask something:
How is it possible that Firefox' hard reload has another effect on wiki
servers
than the one of other browsers?
An empty cache (I mean really deleted, with an entirely new profile) loads a
page the same way like a hard reload, I assume. Just to clarify: I consider a
soft reload as a reload from browser's cache and a hard reload as a reload
from
network. So how come that wiki servers are delivering an old version to Opera
and IE but a current one to Firefox?
Greetings,
Volker

Were you previously logged in on firefox (even if you weren't when you did the test)? Cookies from your login session (or a recent login session, or if you made an edit recently and weren't logged in when doing it) will cause the browser to skip the varnish cache layer.

Otherwise possible coincidence (If you did the IE and safari test first, and did the firefox test last - and if the bug was fixed just before starting the firefox test. However that seems kind of unlikely).

Other then those possible issues, I have no idea

(In reply to comment #13)

How is it possible that Firefox' hard reload has another effect on wiki
servers than the one of other browsers?
An empty cache (I mean really deleted, with an entirely new profile) loads a
page the same way like a hard reload, I assume.

A normal page load is likely to send headers that allow intermediate caches to return a cached version of the page, while a hard reload will probably include headers that instruct intermediate caches not to cache.

In both Firefox and Chromium here a normal page load doesn't send any "Cache-Control" or "Pragma" headers, a normal reload (F5) sends "Cache-Control: max-age=0" (but no "Pragma"), and a hard reload (Ctrl+F5) sends both "Cache-Control: no-cache" and "Pragma: no-cache". Opera 16 seems to do the same thing.

On the other hand, IE 9 and 10 and Opera 12 seem to send a "Cache-Control: no-cache" with no "Pragma" header on any reload (both F5 and Ctrl+F5). I don't know whether that's actually the reason, but it's certainly plausible that Wikimedia's caching layer was only revalidating in response to "Pragma: no-cache" and not "Cache-Control: no-cache".

(In reply to comment #15)

(In reply to comment #13)

How is it possible that Firefox' hard reload has another effect on wiki
servers than the one of other browsers?
An empty cache (I mean really deleted, with an entirely new profile) loads a
page the same way like a hard reload, I assume.

A normal page load is likely to send headers that allow intermediate caches
to
return a cached version of the page, while a hard reload will probably
include
headers that instruct intermediate caches not to cache.
In both Firefox and Chromium here a normal page load doesn't send any
"Cache-Control" or "Pragma" headers, a normal reload (F5) sends
"Cache-Control:
max-age=0" (but no "Pragma"), and a hard reload (Ctrl+F5) sends both
"Cache-Control: no-cache" and "Pragma: no-cache". Opera 16 seems to do the
same
thing.
On the other hand, IE 9 and 10 and Opera 12 seem to send a "Cache-Control:
no-cache" with no "Pragma" header on any reload (both F5 and Ctrl+F5). I
don't
know whether that's actually the reason, but it's certainly plausible that
Wikimedia's caching layer was only revalidating in response to "Pragma:
no-cache" and not "Cache-Control: no-cache".

Interesting. OTOH wget -S --header 'cache-control: no-cache' --header 'pragma: no-cache' 'http://en.wikipedia.org/wiki/Main_Page' has headers that still suggest a cache hit.

orangevolker wrote:

@Brian in comment #14:
No, I wasn't logged in, neither before. I just deleted the cache and startet Firefox again.
And yes, I agree with you. A conicidence seems to be unlikely, especially when we consider that this phenomenon already occured on at least 25th.

@Brad in comment #15:
Thanks for the lesson. I didn't know that and didn't think in this direction neither.

@all:
Would it be much effort to instruct the servers to deliver a cached version on first call and always purge their cache on subsequent calls, whichever method and whichever browser those subsequent calls may be generated?

Volker, the server don't know if you reloaded the page or it's the first time you visit it.

Perhaps those HIT mean that Varnish checked with the apaches and they returned that the page didn't change (and thus they could reuse the stored page) ?

Thanks for the long explanation Brad (I would have had to dig the behaviors). If Varnish is accepting Pragma: no-cahe but not Cache-Control: no-cache they should be fixed: «HTTP/1.1 caches SHOULD treat "Pragma: no-cache" as if the client had sent "Cache-Control: no-cache".»

If that were the case, I would expect that doing hard refresh on firefox of a specific thumbnail, during one of the varnish cache issues would result in the correct thumb being returned, which doesn't happen.

Bawolff? For refreshing thumbnails you need to purge the image page.

(In reply to comment #19)

If that were the case, I would expect that doing hard refresh on firefox of a
specific thumbnail, during one of the varnish cache issues would result in
the
correct thumb being returned, which doesn't happen.

I believe thumbnails are handled differently, in that the underlying file at the requested size is created if it doesn't already exist and is then served for subsequent requests (possibly with more caching in front). Whereas there isn't any "underlying file" for articles, it's regenerated (although not necessarily reparsed) every time a request for it makes it past Varnish/Squid.

I could be mistaken though.

It is possible to install a monitoring for the htcp purge requests?
It take some days to be found and that means many readers gets old versons, that is bad.

A cron job with a script is needed, that pings from the web server to the cache server to look, if that network connection is still active and send mails or give another warning, when it is not.
Or a other way to found outage here. Thanks.

(In reply to comment #22)

It is possible to install a monitoring for the htcp purge requests?

That was more or less bug 43449 - Brandon might be able to elaborate.

(In reply to comment #23)

(In reply to comment #22)

It is possible to install a monitoring for the htcp purge requests?

That was more or less bug 43449 - Brandon might be able to elaborate.

There are graphs in ganglia (look at vhtcpd_packets_deqeued metric for the various varnish servers. It should be an increasing slope. During the outage it was a flat line). I've heard talk of also making the metric be a delta instead of total number of purges ever and making alarm bells go off for that metric instead of the current silent recording, but I don't know the status of either idea.

(In reply to comment #24)

(In reply to comment #23)

(In reply to comment #22)

It is possible to install a monitoring for the htcp purge requests?

That was more or less bug 43449 - Brandon might be able to elaborate.

There are graphs in ganglia (look at vhtcpd_packets_deqeued metric for the
various varnish servers. It should be an increasing slope. During the outage
it
was a flat line). I've heard talk of also making the metric be a delta
instead
of total number of purges ever and making alarm bells go off for that metric
instead of the current silent recording, but I don't know the status of
either
idea.

Hmm looks like somebody did the delta thing (yay). Also I got the metric name wrong. I meant http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=vhtcpd_inpkts_dequeued&s=by+name&c=Upload+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4

Change 86883 had a related patch set uploaded by BryanDavis:
New maintenance script to purge edited pages.

https://gerrit.wikimedia.org/r/86883

Tim suggested that a new maintenance script be developed to make recovering from this type of HTCP outage/network partition easier. https://gerrit.wikimedia.org/r/86883 is my attempt at creating said script.

My intent is that it could be used something like:

mwscript purgeEditedFiles.php --starttime 20130922000000 \
  --endtime 20130926000000 \
  --htcp-dest $IP_OF_AMS_RELAY \
  --wiki $SOME_AFFECTED_WIKI

Thanks; broadening the subject, as de.wiki is currently the only one (with en.wiki) where the problem was fixed and the script should allow recovering from it on all other wikis too.

Change 86883 merged by jenkins-bot:
Add script to purge changed pages from cache

https://gerrit.wikimedia.org/r/86883

Change 88757 had a related patch set uploaded by BryanDavis:
Add script to purge changed pages from cache

https://gerrit.wikimedia.org/r/88757

Change 88757 merged by jenkins-bot:
Add script to purge changed pages from cache

https://gerrit.wikimedia.org/r/88757

Hmm. At this point we are not that far away from the point in time where things will naturally fall out of cache (i believe there is a maxage of 30 days)

(In reply to comment #32)

Hmm. At this point we are not that far away from the point in time where
things will naturally fall out of cache (i believe there is a maxage of
30 days)

I have a deploy window reserved starting at 2013-10-16T20:00:00Z to purge additional pages. It has taken longer than I would have liked to get to this point, but on the positive side we are getting tools in place that will make dealing with this sort of event easier in the future.

The job finished at 2013-10-17T02:43Z. All in all we sent HTCP purges
for a little over 1.6M pages in esams caches.

orangevolker wrote:

Hi,
it just happened again. To be able to see the updated version in IE, I had to use Firefox and do a hard reload. Only then IE showed the recent version. It's like before, with emptied caches and so on.

orangevolker wrote:

Here are two examples:
https://de.wikipedia.org/wiki/Massenmedien
and:
https://de.wikipedia.org/wiki/Massendrucksache
Edited as user (being logged in) with Opera. Afterwards viewed with other browsers, namely Opera, IE and Firefox, and not logged in.
The problem is back. Only I don't know the reason. What do you think?
Greetings,
Volker

forked to a new bug: bug 56545

reclosing