
beta cluster: deployment-cache-upload02 does not seem to purge content when getting PURGE
Closed, Resolved, Public

Description

Problem encountered by Ally on the Beta Cluster and reported on the glamtools mailing-list

Quoting her email:


When I’ve uploaded larger files (2500px) to the Beta cluster, the image re-sizes to the smaller, 500px version whenever I try to view or download the image at full resolution, despite all indications being that the current version should be the larger format.

I’m not sure whether this is an issue with the Beta cluster or with the Toolset, or if it only has to do with uploading subsequent versions of files.

Example link: http://commons.wikimedia.beta.wmflabs.org/wiki/File:Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg


Full thumbnailing is possible though:
http://upload.beta.wmflabs.org/wikipedia/commons/thumb/c/ca/Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg/2000px-Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg


Version: unspecified
Severity: normal
URL: http://lists.wikimedia.org/pipermail/glamtools/2014-May/000180.html

Details

Reference
bz65683

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:17 AM
bzimport set Reference to bz65683.
bzimport added a subscriber: Unknown Object (MLST).
*** Bug 65684 has been marked as a duplicate of this bug. ***

It's a stale entry in the Varnish cache. http://upload.beta.wmflabs.org/wikipedia/commons/c/ca/Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg?break-the-cache gives you the real version.

It's unclear whether this is gwtoolset not sending cache purges, Varnish being set up wrong on beta labs (which has been the case before), intermittent HTCP packet drops, or something else.
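The ?break-the-cache trick above works because the extra query string makes the request a different cache object. A minimal Python sketch of the same check, assuming the cache keys on the full URL including the query string (which is what the behaviour above suggests); the `cachebust` parameter name and the helper are made up for illustration:

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_busted(url, token=None):
    """Return `url` with a throwaway query parameter appended.

    Assumption: the cache treats the query string as part of the cache key,
    so the modified URL is fetched from the backend instead of the cache.
    """
    parts = urlsplit(url)
    # Preserve any query parameters the URL already carries.
    query = parse_qsl(parts.query, keep_blank_values=True)
    # "cachebust" is a hypothetical parameter name; any unused name would do.
    query.append(("cachebust", token if token is not None else str(int(time.time()))))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Fetching both the plain URL and the cache-busted one and comparing Content-Length is a quick way to tell a stale cache entry from a wrong file on disk.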

After some testing:

*Does not appear to be a gwtoolset problem, but a beta cluster configuration problem. This issue will happen on the live site.

*text varnishes (deployment-cache-text.*) seem to receive htcp purges fine
*upload varnishes (deployment-cache-upload02) do not seem to receive any purges. I tested this for thumbnails as well as the original asset.

*Does not appear to be a gwtoolset problem, but a beta cluster configuration
problem. This issue will happen on the live site.

If I get you correctly, you mean it will *not* happen on the live site, right? :)

The purge destination is configured in operations/mediawiki-config.git in wmf-config/squid-labs.php :

$wgHTCPRouting = array(
    '|^https?://upload\.beta\.wmflabs\.org|' => array(
        'host' => '10.68.17.51',  # deployment-cache-upload02
        'port' => 4827,
    ),
    ...
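As a rough illustration of how a routing table like this picks a purge destination, here is a Python sketch. It assumes, as I understand the MediaWiki documentation, that keys are tried in order and the first matching regex wins, with an empty key matching everything. The table contains only the one entry quoted above (the rest is elided in the original), and `route_for` is a hypothetical helper, not MediaWiki code:

```python
import re

# Only the entry quoted above; the remaining entries are elided in the source.
# The PHP pattern uses "|" as its regex delimiter, which is dropped here.
HTCP_ROUTING = [
    (r"^https?://upload\.beta\.wmflabs\.org", {"host": "10.68.17.51", "port": 4827}),
]


def route_for(url, routing=HTCP_ROUTING):
    """Return the destination of the first rule matching `url`, or None.

    Assumption: first match wins, and an empty pattern matches every URL.
    """
    for pattern, destination in routing:
        if pattern == "" or re.search(pattern, url):
            return destination
    return None
```

With this table, a thumbnail URL under upload.beta.wmflabs.org routes to 10.68.17.51:4827, while a commons.wikimedia.beta.wmflabs.org page URL matches nothing here and would fall through to whichever later rule applies.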

I tried a manual purge with:

http://upload.beta.wmflabs.org/wikipedia/commons/thumb/c/ca/Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg/800px-Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg

I.e.:

mwdeploy@deployment-bastion:~$ mwscript purgeList.php
http://upload.beta.wmflabs.org/wikipedia/commons/thumb/c/ca/Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg/800px-Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg
Purging 1 urls
Done!
mwdeploy@deployment-bastion:~$

On varnish side:

deployment-cache-upload02.eqiad.wmflabs 25 2014-05-23T17:56:19 0.000068188 127.0.0.1 -/204 0 PURGE http://upload.beta.wmflabs.org/wikipedia/commons/thumb/c/ca/Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg/800px-Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg - - - - vhtcpd
$

Maybe gwtoolset hit some code path which ends up not sending a purge request? We might be able to find out from the debug log, if purge requests are logged there.

[mid-air collision]
Hmm, vhtcpd seems to at least be getting some packets: http://ganglia.wmflabs.org/latest/graph_all_periods.php?c=deployment-prep&h=deployment-cache-upload02&r=hour&z=default&jr=&js=&st=1400866550&event=hide&ts=0&v=7816&m=vhtcpd_inpkts_sane&vl=pkts&ti=Sane%20packets&z=large Maybe varnish is rejecting the purge request or something. I couldn't find where the varnish config files for beta are.

(In reply to Jean-Fred from comment #4)

*Does not appear to be a gwtoolset problem, but a beta cluster configuration
problem. This issue will happen on the live site.

If I get you correctly, you mean it will *not* happen on the live site,
right? :)

I think it's extremely likely to be a configuration issue on the beta cluster that would *not* affect the live sites. From what I understand, the configuration of the upload varnishes on beta differs from the real ones. My only basis for this claim, though, is previous experience with caching issues specific to the beta cluster, and that the beta config seems rather different from the main config for the upload cache.

On varnish side:

deployment-cache-upload02.eqiad.wmflabs 25 2014-05-23T17:56:19 0.000068188
127.0.0.1 -/204 0 PURGE
http://upload.beta.wmflabs.org/wikipedia/commons/thumb/c/ca/
Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg/800px-
Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg - - - - vhtcpd
$

bawolff@Bawolff-L:~$ wget -S 'http://upload.beta.wmflabs.org/wikipedia/commons/thumb/c/ca/Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg/800px-Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg'
--2014-05-23 15:12:42-- http://upload.beta.wmflabs.org/wikipedia/commons/thumb/c/ca/Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg/800px-Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg
Resolving upload.beta.wmflabs.org (upload.beta.wmflabs.org)... 208.80.155.136
Connecting to upload.beta.wmflabs.org (upload.beta.wmflabs.org)|208.80.155.136|:80... connected.
HTTP request sent, awaiting response...

HTTP/1.1 200 OK
Server: nginx/1.1.19
Content-Type: image/jpeg
X-Powered-By: PHP/5.3.10-1ubuntu3.10+wmf1
X-Wikimedia-Thumb: http://10.68.16.16/w/thumb.php?f=Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg&width=800
X-Varnish: 435262255 435256949, 1820643391 1820643388
Via: 1.1 varnish, 1.1 varnish
Content-Length: 114849
Accept-Ranges: bytes
Date: Fri, 23 May 2014 18:12:06 GMT
Age: 5016
Connection: keep-alive
X-Cache: deployment-cache-upload02 hit (12), deployment-cache-upload02 frontend hit (1)
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Age, Content-Length, Date, X-Cache, X-Varnish

Length: 114849 (112K) [image/jpeg]
Saving to: `800px-Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg'

100%[======================================>] 114,849 541K/s in 0.2s

2014-05-23 15:12:42 (541 KB/s) - `800px-Imaginative_depiction_of_the_completed_Forth_Rail_Bridge.jpg' saved [114849/114849]

In particular, note the Age header (Age: 5016). Unless you did that purge 83 minutes ago, it didn't work. I suspect there's some ACL in the varnish config discarding the purge (that's a total guess).
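The Age reasoning above can be checked mechanically. A small Python sketch using the Date and Age headers from the response and the timestamp from the varnish log line; it assumes the log timestamp is in UTC and that the clocks are roughly in sync, and both function names are made up for illustration:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def cached_at(date_header, age_header):
    """Estimate when the object entered the cache: response Date minus Age."""
    served = parsedate_to_datetime(date_header)
    return served - timedelta(seconds=int(age_header))


def purge_took_effect(date_header, age_header, purge_time):
    """True if the served object was cached at or after the purge time."""
    return cached_at(date_header, age_header) >= purge_time


# Values from the wget output and the varnish log line in this thread.
date_header = "Fri, 23 May 2014 18:12:06 GMT"
age_header = "5016"
# Assumption: the log timestamp 2014-05-23T17:56:19 is UTC.
purge_time = datetime(2014, 5, 23, 17, 56, 19, tzinfo=timezone.utc)

print(cached_at(date_header, age_header))                      # 2014-05-23 16:48:30+00:00
print(purge_took_effect(date_header, age_header, purge_time))  # False
```

The object served at 18:12:06 had been in cache since about 16:48:30, over an hour before the purge at 17:56:19, so the purge cannot have evicted it.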

Maybe gwtoolset hit some code path which ends up not sending a purge request? We
might be able to find out from the debug log, if purge requests are logged there.

Not just gwtoolset. Normal mediawiki purging wasn't working either.

Purge requests are sent to the debug log via wfDebugLog with the 'squid' log name.

Purging the upload.beta.wmflabs.org URLs definitely sends PURGE requests to varnish. I double-checked it via varnishncsa on both the frontend and backend cache.

For some reason, the object doesn't seem to be purged, since I still get hits:

X-Cache: deployment-cache-upload02 hit (16), deployment-cache-upload02 frontend hit (3)

I have absolutely no clue. Will need to poke ops about it I guess.
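For repeated checks after a purge, the X-Cache header quoted above can be parsed rather than eyeballed. A Python sketch based only on the two header values seen in this thread; the `miss` and `pass` tokens are assumptions about what other values may appear, and `parse_x_cache` is a hypothetical helper:

```python
import re

# Matches entries such as "deployment-cache-upload02 hit (16)" and
# "deployment-cache-upload02 frontend hit (3)". Tokens other than "hit"
# are assumed, not taken from the thread.
_ENTRY = re.compile(
    r"^(?P<host>\S+)\s+(?P<layer>frontend\s+)?"
    r"(?P<result>hit|miss|pass)(?:\s+\((?P<count>\d+)\))?$"
)


def parse_x_cache(value):
    """Split an X-Cache header into one dict per cache layer."""
    entries = []
    for raw in value.split(","):
        match = _ENTRY.match(raw.strip())
        if match is None:
            continue  # skip anything that does not look like the observed format
        entries.append({
            "host": match.group("host"),
            "frontend": match.group("layer") is not None,
            "result": match.group("result"),
            "hits": int(match.group("count")) if match.group("count") else 0,
        })
    return entries


def still_cached(value):
    """True if any layer reports a hit, i.e. the purge did not clear it."""
    return any(entry["result"] == "hit" for entry in parse_x_cache(value))
```

A purge that worked should make `still_cached` return False on the first request afterwards, at least for the backend layer.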

Changing title to reflect that the HTCP part of purging is working fine; it's what varnish does with the purges that is apparently the problem.

greg added a subscriber: BBlack.

cc'ing Brandon per:

Changing title to reflect that the HTCP part of purging is working fine; it's what varnish does with the purges that is apparently the problem.

Brandon: Can you take a look at this and let us know what's going on? :)

greg set Security to None.
greg added a subscriber: Milimetric.

Reopening as nothing was done here. (cc @Cmcmahon @Milimetric)

Fix going in here, assuming it works: https://gerrit.wikimedia.org/r/#/c/177576/

(Also, apologies for the delay, but relatedly: is our normal workflow to CC and wait for people to claim? If you think something's logically mine to look at, please assign it to me!)

Fix going in here, assuming it works: https://gerrit.wikimedia.org/r/#/c/177576/

Thanks!

(Also, apologies for the delay, but relatedly: is our normal workflow to CC and wait for people to claim? If you think something's logically mine to look at, please assign it to me!)

I try not to assign things to people in other teams too much :) But in the future I'll assign it all to you :P (Seriously, I'll assign when we need your help.)

@JeanFred / @Bawolff: can you confirm the fix works as expected?

BBlack mentioned this in Unknown Object (Diffusion Commit). Dec 10 2014, 8:37 PM

Spoke to Brandon on IRC. As far as anyone can tell, this is now working correctly.