Page MenuHomePhabricator

htcp cache purges for images do not seem to clear europe upload squid caches
Closed, ResolvedPublic

Description

+++ This bug was initially created as a clone of Bug #41130 comment 87 +++

Splitting off from bug 41130 comment 87 and 88.

It appears that the htcp purges are not clearing the upload squids in esams. It appeared to briefly be working, but is broken again.

Furthermore, it appears that non-image related htcp purges are working fine (So articles are getting purged properly). I did the equivalent procedure described below but with an article, and everything worked fine.

Examples of broken files include http://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Wappen_Landkreis_Aurich.svg/140px-Wappen_Landkreis_Aurich.svg.png


Steps to reproduce:

  1. Find a file that is in the caches (note, its ok if the correct version is in the cache). In this example lets take http://upload.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg (I'm using the svg source, to ensure that this bug is not possibly mixed up with bug 44428)
  1. Get the file, accessing from both esams and eqiad. Note the http headers, specifically the age header.

To test from esams I used the command:
wget -S -U bawolff --header 'host: upload.wikimedia.org' 'http://upload-lb.esams.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg'

(Occasionally a varnish responded to that, which didn't have the described problem, If that happens just make the request again).

This had an age header like:
Age: 753820
Which is ~8 days

To test from eqiad I used the command (Actually, I originally just used upload.wikimedia.org since I'm in north america, but I'm trying to make the steps to reproduce generic):
wget -S -U bawolff --header 'host: upload.wikimedia.org' 'http://upload-lb.eqiad.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg'

It had an age header of
Age: 11199

  1. Purge the image description page: http://commons.wikimedia.org/wiki/File:Wappen_Landkreis_Aurich.svg?action=purge

Expected behaviour:

Doing

wget -S -U bawolff  --header 'host: upload.wikimedia.org' 'http://upload-lb.esams.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg'

and

wget -S -U bawolff  --header 'host: upload.wikimedia.org' 'http://upload-lb.eqiad.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg'

Should both result in responses that either have cache miss headers, or an age header that is very small.

Actual behavior:

North america ( wget -S -U bawolff --header 'host: upload.wikimedia.org' 'http://upload-lb.eqiad.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg' ) gets the expected response:

[..]
Age: 0
X-Cache: cp1030 miss (0), cp1029 frontend miss (0)

Europe ( wget -S -U bawolff --header 'host: upload.wikimedia.org' 'http://upload-lb.esams.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg' ) does not have the cache cleared

[..]
Age: 754915
X-Cache: HIT from sq84.wikimedia.org
X-Cache: MISS from amssq62.esams.wikimedia.org
X-Cache: MISS from amssq61.esams.wikimedia.org

In previous tests I did, things worked yesterday, and there have been some users reporting success here and there, which suggests this issue is either intermittent or recently broken. However, I did try this procedure several times, so its not just that my packet happened to be lost (Unless a significant portion of the htcp packets are being lost).

Several users have reported that broken thumbs are magically fixed one out of every so often page loads. I think that has to do with sometimes a varnish server responds from esams, and it doesn't appear to have the cache clearing issue (the headers say in that case that the cache hit is from cp1030, which also seems to respond from eqiad, so I'm guessing that server is actually in eqiad(?). This is going quite a bit beyond my knowladge of wmf's network layout).


Version: wmf-deployment
Severity: critical

Details

Reference
bz44508

Event Timeline

bzimport raised the priority of this task from to Unbreak Now!.Nov 22 2014, 1:23 AM
bzimport set Reference to bz44508.

bawolff: Thanks for the great initial analysis, and isolating it from the noise in bug 41130. I really really appreciate this.

For the records, I've also copied this to RT# 4454 as told on SRE IRC.

Hi folks, Leslie is looking into the problem now. Brian, thanks for the detailed analysis (and for splitting this out)!

lcarr wrote:

I should never have let people know I exist ;)

multicast forwarding is working between dobson and hooft and a lot of purges are being sent -- therefore the problem is either with upload varnishes not sending or with the purging in esams itself. I'm checking these as the next two steps

lcarr wrote:

cp3009 had varnishhtcpd locked up - restarted

lcarr wrote:

the varnish upload frontends are properly purging cache now - the squid frontends are not. investigating this

lcarr wrote:

the eqiad mediawiki boxes did not have permissions to send purges to the tampa and esams squids, and therefore those purges were not happening. this was semi obscured by the varnish issues. But now, all should be complete. Please doulbe check and let me know.

Also thank you very much for the very detailed steps for reproduction. These kinds of well written bugs are very very helpful :)

Leslie

The thumb and original in the example are definitely fixed. Wonderful!

I just tested Moscow_metro_map_ru_sb_future.svg doing both the svg and the png thumbnail. The caches appear to be purged all around.

Hip, Hip Hurrah! Thank you for the quick response.

423px thumbnail for example http://commons.wikimedia.org/wiki/File:Wappen_Landkreis_Aurich.svg also constantly shows the new version now here in Europe.

Awesome. Thanks so much to Leslie and bawolff!

I'm closing this as RESOLVED FIXED.

(In reply to comment #10)

Problem seems to persist for
https://upload.wikimedia.org/wikipedia/commons/1/18/Kochendes_wasser02.jpg .
At
least for me

Same here, but I purged it a few times and it seems ok now, doesn't it?

(In reply to comment #10)

Problem seems to persist for
https://upload.wikimedia.org/wikipedia/commons/1/18/Kochendes_wasser02.jpg

Works for me now (probably thanks to Nemo).

New instance: https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/PlacaenmonumentoaMarconi.jpg/120px-PlacaenmonumentoaMarconi.jpg

"Broken" response from esams:

HTTP request sent, awaiting response...

HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Tue, 19 Mar 2013 23:35:35 GMT
Content-Type: image/jpeg
Content-Length: 3637
Connection: keep-alive
X-Object-Meta-Sha1base36: q311dfj910y0loxw51q2fm79rw8j32o
Content-Disposition: inline;filename*=UTF-8''PlacaenmonumentoaMarconi.jpg
X-Timestamp: 1363609261.89105
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Last-Modified: Mon, 18 Mar 2013 12:21:01 GMT
ETag: b737ec5a44b9f7975b69facfde655da0
X-Cache: HIT from sq82.wikimedia.org
X-Cache-Lookup: HIT from sq82.wikimedia.org:3128
Age: 41471
X-Cache: HIT from amssq59.esams.wikimedia.org
X-Cache-Lookup: HIT from amssq59.esams.wikimedia.org:3128
X-Cache: MISS from knsq19.esams.wikimedia.org
X-Cache-Lookup: MISS from knsq19.esams.wikimedia.org:80
Via: 1.0 sq82.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 amssq59.esams.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 knsq19.esams.wikimedia.org:80 (squid/2.7.STABLE9)

Length: 3637 (3.6K) [image/jpeg]

"Valid" response via Varnish:

HTTP request sent, awaiting response...

HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Tue, 19 Mar 2013 23:35:24 GMT
Content-Type: image/jpeg
Content-Length: 3994
Connection: keep-alive
X-Object-Meta-Sha1base36: qtiacup2waiwupgx7sma7puadwpqeig
Content-Disposition: inline;filename*=UTF-8''PlacaenmonumentoaMarconi.jpg
Last-Modified: Tue, 19 Mar 2013 23:24:09 GMT
Etag: 92f154cb562f20cf648199f1bd88c464
X-Timestamp: 1363735449.88825
X-Varnish: 2316312654, 2904564637, 670145209
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
Accept-Ranges: bytes
Age: 0
X-Cache: cp1028 miss (0), cp3009 miss (0), cp3009 frontend miss (0)
Access-Control-Allow-Origin: *

Length: 3994 (3.9K) [image/jpeg]

Please confirm. (I already purged the file page)

Marco: Thanks for reporting this and the initial analysis. As this bug report was closed about six weeks ago it might be a different technical issue this time, so I've moved the last comment to a separate bug report: bug 46350.