htcp cache purges for images do not seem to clear europe upload squid caches
Closed, ResolvedPublic

Description

+++ This bug was initially created as a clone of Bug #41130 comment 87 +++

Splitting off from bug 41130 comment 87 and 88.

It appears that the htcp purges are not clearing the upload squids in esams. It appeared to briefly be working, but is broken again.

Furthermore, it appears that non-image related htcp purges are working fine (So articles are getting purged properly). I did the equivalent procedure described below but with an article, and everything worked fine.

Examples of broken files include http://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Wappen_Landkreis_Aurich.svg/140px-Wappen_Landkreis_Aurich.svg.png


Steps to reproduce:

  1. Find a file that is in the caches (note, its ok if the correct version is in the cache). In this example lets take http://upload.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg (I'm using the svg source, to ensure that this bug is not possibly mixed up with bug 44428)
  1. Get the file, accessing from both esams and eqiad. Note the http headers, specifically the age header.

To test from esams I used the command:
wget -S -U bawolff --header 'host: upload.wikimedia.org' 'http://upload-lb.esams.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg'

(Occasionally a varnish responded to that, which didn't have the described problem, If that happens just make the request again).

This had an age header like:
Age: 753820
Which is ~8 days

To test from eqiad I used the command (Actually, I originally just used upload.wikimedia.org since I'm in north america, but I'm trying to make the steps to reproduce generic):
wget -S -U bawolff --header 'host: upload.wikimedia.org' 'http://upload-lb.eqiad.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg'

It had an age header of
Age: 11199

  1. Purge the image description page: http://commons.wikimedia.org/wiki/File:Wappen_Landkreis_Aurich.svg?action=purge

Expected behaviour:

Doing

wget -S -U bawolff  --header 'host: upload.wikimedia.org' 'http://upload-lb.esams.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg'

and

wget -S -U bawolff  --header 'host: upload.wikimedia.org' 'http://upload-lb.eqiad.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg'

Should both result in responses that either have cache miss headers, or an age header that is very small.

Actual behavior:

North america ( wget -S -U bawolff --header 'host: upload.wikimedia.org' 'http://upload-lb.eqiad.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg' ) gets the expected response:

[..]
Age: 0
X-Cache: cp1030 miss (0), cp1029 frontend miss (0)

Europe ( wget -S -U bawolff --header 'host: upload.wikimedia.org' 'http://upload-lb.esams.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg' ) does not have the cache cleared

[..]
Age: 754915
X-Cache: HIT from sq84.wikimedia.org
X-Cache: MISS from amssq62.esams.wikimedia.org
X-Cache: MISS from amssq61.esams.wikimedia.org

In previous tests I did, things worked yesterday, and there have been some users reporting success here and there, which suggests this issue is either intermittent or recently broken. However, I did try this procedure several times, so its not just that my packet happened to be lost (Unless a significant portion of the htcp packets are being lost).

Several users have reported that broken thumbs are magically fixed one out of every so often page loads. I think that has to do with sometimes a varnish server responds from esams, and it doesn't appear to have the cache clearing issue (the headers say in that case that the cache hit is from cp1030, which also seems to respond from eqiad, so I'm guessing that server is actually in eqiad(?). This is going quite a bit beyond my knowladge of wmf's network layout).


Version: wmf-deployment
Severity: critical

bzimport set Reference to bz44508.
Bawolff created this task.Via LegacyJan 30 2013, 4:22 PM
Aklapper added a comment.Via ConduitJan 30 2013, 8:24 PM

bawolff: Thanks for the great initial analysis, and isolating it from the noise in bug 41130. I really really appreciate this.

For the records, I've also copied this to RT# 4454 as told on operations IRC.

RobLa-WMF added a comment.Via ConduitJan 30 2013, 9:33 PM

Hi folks, Leslie is looking into the problem now. Brian, thanks for the detailed analysis (and for splitting this out)!

bzimport added a comment.Via ConduitJan 30 2013, 9:39 PM

lcarr wrote:

I should never have let people know I exist ;)

multicast forwarding is working between dobson and hooft and a lot of purges are being sent -- therefore the problem is either with upload varnishes not sending or with the purging in esams itself. I'm checking these as the next two steps

bzimport added a comment.Via ConduitJan 30 2013, 9:58 PM

lcarr wrote:

cp3009 had varnishhtcpd locked up - restarted

bzimport added a comment.Via ConduitJan 30 2013, 10:28 PM

lcarr wrote:

the varnish upload frontends are properly purging cache now - the squid frontends are not. investigating this

bzimport added a comment.Via ConduitJan 30 2013, 10:48 PM

lcarr wrote:

the eqiad mediawiki boxes did not have permissions to send purges to the tampa and esams squids, and therefore those purges were not happening. this was semi obscured by the varnish issues. But now, all should be complete. Please doulbe check and let me know.

Also thank you very much for the very detailed steps for reproduction. These kinds of well written bugs are very very helpful :)

Leslie

Nemo_bis added a comment.Via ConduitJan 30 2013, 10:52 PM

The thumb and original in the example are definitely fixed. Wonderful!

Bawolff added a comment.Via ConduitJan 30 2013, 10:59 PM

I just tested Moscow_metro_map_ru_sb_future.svg doing both the svg and the png thumbnail. The caches appear to be purged all around.

Hip, Hip Hurrah! Thank you for the quick response.

Aklapper added a comment.Via ConduitJan 31 2013, 10:51 AM

423px thumbnail for example http://commons.wikimedia.org/wiki/File:Wappen_Landkreis_Aurich.svg also constantly shows the new version now here in Europe.

Awesome. Thanks so much to Leslie and bawolff!

I'm closing this as RESOLVED FIXED.

McZusatz added a comment.Via ConduitFeb 2 2013, 6:16 PM
Nemo_bis added a comment.Via ConduitFeb 2 2013, 6:33 PM

(In reply to comment #10)

Problem seems to persist for
https://upload.wikimedia.org/wikipedia/commons/1/18/Kochendes_wasser02.jpg .
At
least for me

Same here, but I purged it a few times and it seems ok now, doesn't it?

Aklapper added a comment.Via ConduitFeb 4 2013, 4:21 PM

(In reply to comment #10)

Problem seems to persist for
https://upload.wikimedia.org/wikipedia/commons/1/18/Kochendes_wasser02.jpg

Works for me now (probably thanks to Nemo).

McZusatz added a comment.Via ConduitMar 19 2013, 11:54 PM

New instance: https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/PlacaenmonumentoaMarconi.jpg/120px-PlacaenmonumentoaMarconi.jpg

"Broken" response from esams:

HTTP request sent, awaiting response...

HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Tue, 19 Mar 2013 23:35:35 GMT
Content-Type: image/jpeg
Content-Length: 3637
Connection: keep-alive
X-Object-Meta-Sha1base36: q311dfj910y0loxw51q2fm79rw8j32o
Content-Disposition: inline;filename*=UTF-8''PlacaenmonumentoaMarconi.jpg
X-Timestamp: 1363609261.89105
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Last-Modified: Mon, 18 Mar 2013 12:21:01 GMT
ETag: b737ec5a44b9f7975b69facfde655da0
X-Cache: HIT from sq82.wikimedia.org
X-Cache-Lookup: HIT from sq82.wikimedia.org:3128
Age: 41471
X-Cache: HIT from amssq59.esams.wikimedia.org
X-Cache-Lookup: HIT from amssq59.esams.wikimedia.org:3128
X-Cache: MISS from knsq19.esams.wikimedia.org
X-Cache-Lookup: MISS from knsq19.esams.wikimedia.org:80
Via: 1.0 sq82.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 amssq59.esams.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 knsq19.esams.wikimedia.org:80 (squid/2.7.STABLE9)

Length: 3637 (3.6K) [image/jpeg]

"Valid" response via Varnish:

HTTP request sent, awaiting response...

HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Tue, 19 Mar 2013 23:35:24 GMT
Content-Type: image/jpeg
Content-Length: 3994
Connection: keep-alive
X-Object-Meta-Sha1base36: qtiacup2waiwupgx7sma7puadwpqeig
Content-Disposition: inline;filename*=UTF-8''PlacaenmonumentoaMarconi.jpg
Last-Modified: Tue, 19 Mar 2013 23:24:09 GMT
Etag: 92f154cb562f20cf648199f1bd88c464
X-Timestamp: 1363735449.88825
X-Varnish: 2316312654, 2904564637, 670145209
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
Accept-Ranges: bytes
Age: 0
X-Cache: cp1028 miss (0), cp3009 miss (0), cp3009 frontend miss (0)
Access-Control-Allow-Origin: *

Length: 3994 (3.9K) [image/jpeg]

Please confirm. (I already purged the file page)

Aklapper added a comment.Via ConduitMar 20 2013, 9:15 AM

Marco: Thanks for reporting this and the initial analysis. As this bug report was closed about six weeks ago it might be a different technical issue this time, so I've moved the last comment to a separate bug report: bug 46350.

Add Comment