Maniphest T119038

Image cache issue when 'over-writing' an image on commons
Closed, DuplicatePublic
Actions

Assigned To

Authored By

	Stemoc
	Nov 19 2015, 5:29 AM

Description

Recently i have seen that when you upload over a previous image, the changes don't seem to show up (for the picture atleast) [the resolution and size remain current], i thought it was an issue with my browser in the beginning but then i tried with 4 other browser on my PC and they all came looking the same as the previous image, The thumbnail on the file history may look updated or may not looked updated, it depends, here are two examples,

I made a major change to this image (https://commons.wikimedia.org/wiki/File:Kim_Jong-Un_Photorealistic-Sketch.jpg) but it doesn't look changed at all, unless you look at the thumbnail below

I tried this on another image, a higher resolution one and as previously, "ctrl +f5" multiple times but it still did not update to the "less pinkish" recent version (https://commons.wikimedia.org/wiki/File:Bernie_Sanders_September_2015.jpg)

An admin on commons told me he had issue like this as well and another user on IRC told me to change the pixels by 1 to get the "current" version" of the image...is this a known issue?..

I'm on FF43b1 on WinVista ..just in case..

Details

	Subject	Repo	Branch	Lines +/-
	Revert "upload purging: do not listen on text/mobile addr"	operations/puppet	production	+1 -1
	Revert "purging: do not VCL-filter on domain regex"	operations/puppet	production	+15 -2

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved	aaron	T97562 WANObjectCache relay daemon or mcrouter support
Resolved	Ottomata	T123954 Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters.
Duplicate	BBlack	T119038 Image cache issue when 'over-writing' an image on commons
Resolved	• ema	T133821 Make CDN purges reliable
Resolved	aaron	T124418 Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
Resolved	• ema	T122881 Install XKey vmod
Resolved	• ema	T131499 Upgrade all cache clusters to Varnish 4
Resolved	• ema	T126206 Upgrade to Varnish 4: things to remember
Resolved	• ema	T128788 Port varnishlog.py to new VSL API
Resolved	• ema	T131353 Port remaining scripts depending on varnishlog.py to new VSL API
Resolved	• ema	T131501 Convert misc cluster to Varnish 4
Resolved	• ema	T134989 WDQS empty response - transfer clsoed with 15042 bytes remaining to read
Resolved	• ema	T131502 Convert upload cluster to Varnish 4
Resolved	BBlack	T131761 Solve large-object/stream/pass/chunked in upload cluster better
Resolved	• ema	T142076 Analyze Range requests on cache_upload frontend
Resolved	• ema	T142233 Varnish 4 stalls with two consecutive Range requests using HTTP persistent connections
Resolved	• ema	T131503 Convert text cluster to Varnish 4
Resolved	BBlack	T135696 Sort out vcl_deliver vs vcl_synth mess with v4 VCL
Resolved	• ema	T150660 Post Varnish 4 migration cleanup
Resolved	daniel	T102476 RFC: Requirements for change propagation
Resolved	• GWicke	T84923 Reliable publish / subscribe event bus
Resolved	Ottomata	T88459 Implementing the reliable event bus using Kafka
Invalid	Ottomata	T110748 Event Bus
Resolved	Ottomata	T110750 Investigate improving Confluent REST Proxy and Schema Registry for Event Bus
Resolved	Ottomata	T114443 EventBus MVP
Resolved	RobH	T114191 Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams
		Unknown Object (Task)
		Unknown Object (Task)
		Unknown Object (Task)
		Unknown Object (Task)
Resolved	Ottomata	T121553 setup kafka1001 & kafka1002
Resolved	• Cmjohnson	T121578 Rack 8 new misc servers
Resolved	elukey	T121558 setup kafka2001 & kafka2002
Resolved	RobH	T120885 codfw: rack 8 new misc systems
Resolved	• mobrovac	T116247 Define edit related events for change propagation
Resolved	Eevans	T116786 Integrate eventbus-based event production into MediaWiki
Declined	• csteipp	T120133 security review of ramsey/uuid
Resolved	• csteipp	T120212 Security review of EventBus extension
Resolved	• GWicke	T120409 RESTBase should honor wiki-wide deletion/suppression of users
Resolved	• ssastry	T125266 Remove user name and edit comment from html <head>
Resolved	• Pchelolo	T122079 Update EventBus extension to produce User-block events
Resolved	Ottomata	T122077 Define schema for a User-block event
Resolved	Ottomata	T118578 Package EventLogging and dependencies for Jessie
Resolved	Ottomata	T118761 Move EventLogging/server to its own repo and set up CI
Resolved	Ottomata	T118780 Puppetize eventlogging-service
Resolved	• madhuvishy	T118903 Make eventlogging logs configurable via python config file [5 pts] {oryx}
Resolved	Ottomata	T118863 Deploy eventlogging from new repository [5 pts]
Resolved	• madhuvishy	T118869 Send HTTP stats about eventlogging-service to statsd [3 pts]
Resolved	Ottomata	T121112 Build tornado-sprocket python packages
Resolved	• mobrovac	T128463 New Service Request - Change Propagation
Resolved	• mobrovac	T130948 Scap3 promote stage not working
Resolved	BBlack	T124954 Decrease max object TTL in varnishes
		Restricted Task
Resolved	• ema	T249325 cache_text cluster consistently backlogged on purge requests
Open	None	T252127 Improve resource-purge request_id and dt propagation
Resolved	CDanis	T256446 monitoring & alerting for purged

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

ori set Security to None.Nov 19 2015, 5:33 AM

Legoktm subscribed.Nov 19 2015, 5:53 AM

Stemoc updated the task description. (Show Details)Nov 19 2015, 5:56 AM

zhuyifei1999 subscribed.Nov 19 2015, 10:05 AM

Aklapper added a project: MediaWiki-File-management.Nov 19 2015, 10:05 AM

Restricted Application added a project: Multimedia. · View Herald TranscriptNov 19 2015, 10:05 AM

zhuyifei1999 moved this task from Incoming to MediaWiki Interface and i18n on the Commons board.Nov 19 2015, 10:05 AM

Restricted Application added a subscriber: Matanya. · View Herald TranscriptNov 19 2015, 10:05 AM

I can confirm this issue again in:

File:Faro_card_game.jpg -- reverts did not update "original file"
Help desk thread on File:Continuity_of_the_Exponential_at_0.svg in this case only png thumbs work as expected. "original file" fail.

Possible regression?

(Setting priority to high as this issue affect a wide range of reuploads)

zhuyifei1999 moved this task from MediaWiki Interface and i18n to Thumbnail and file renderings on the Commons board.Nov 20 2015, 1:36 PM

Issue likely to be related to varnish cache (with X-Wikimedia-Debug test):

$ curl https://upload.wikimedia.org/wikipedia/commons/b/be/Continuity_of_the_Exponential_at_0.svg 2> /dev/null | md5sum
f9b13ad5cc4d5a44b0f2554570097186  -
$ curl -H 'X-Wikimedia-Debug: 1' https://upload.wikimedia.org/wikipedia/commons/b/be/Continuity_of_the_Exponential_at_0.svg 2> /dev/null | md5sum
5c7493f4df7baa9d9e4a15107245d709  -

RP88 subscribed.Nov 20 2015, 1:48 PM

zhuyifei1999 added subscribers: Revent, Natuur12, Bawolff, Tsungam.Nov 20 2015, 2:24 PM

faidon edited projects, added Traffic; removed Varnish.Nov 20 2015, 2:49 PM

faidon added a subscriber: BBlack.

Steinsplitter awarded a token.Nov 20 2015, 2:49 PM

ori removed ori as the assignee of this task.Nov 20 2015, 2:51 PM

faidon raised the priority of this task from High to Unbreak Now!.Nov 20 2015, 2:51 PM

the vhtcpd stats look weird to me ( http://ganglia.wikimedia.org/latest/stacked.php?m=vhtcpd_inpkts_sane&c=Upload%20caches%20eqiad&r=month&st=1448034519&host_regex= ). Right now vhtcpd_inpkts_sane seems to be hovering in the single digits, where last week it was between 100-200. And on monday there seemed to be some giant spike.
(Edit: I realize now, that might be a reasonable result of splitting up upload and normal htcp streams)

Anyways, unless there was some change in measurement methodology, that doesn't look right to me.

Given the timing of this all, I wonder if it has something to do with moving upload htcp to its own ip ( 290639a96e381d8 )

Perhelion awarded a token.Nov 20 2015, 10:13 PM

Perhelion subscribed.

Denniss subscribed.Nov 20 2015, 11:13 PM

More complains here: https://commons.wikimedia.org/wiki/Commons:Forum#Irgendwas_spinnt_.28Cache.29 (just FYI)

In T119038#1820653, @Bawolff wrote:

the vhtcpd stats look weird to me ( http://ganglia.wikimedia.org/latest/stacked.php?m=vhtcpd_inpkts_sane&c=Upload%20caches%20eqiad&r=month&st=1448034519&host_regex= ). Right now vhtcpd_inpkts_sane seems to be hovering in the single digits, where last week it was between 100-200. And on monday there seemed to be some giant spike.
(Edit: I realize now, that might be a reasonable result of splitting up upload and normal htcp streams)

Anyways, unless there was some change in measurement methodology, that doesn't look right to me.

Yeah, the rates actually did look "correct" in the sense that when filtering only for upload purges, the packet rate in the original combined stream was very low compared to the much higher rate of text purges. That was one of the drivers for the split, to avoid sending all of the pointless text-cluster purges to the numerous upload caches.

We should try to reproduce this reliably and test per-datacenter on the commandline (using curl with --resolve to point at specific upload-lb.$site.wikimedia.org).

In T119038#1823034, @BBlack wrote:

In T119038#1820653, @Bawolff wrote:

the vhtcpd stats look weird to me ( http://ganglia.wikimedia.org/latest/stacked.php?m=vhtcpd_inpkts_sane&c=Upload%20caches%20eqiad&r=month&st=1448034519&host_regex= ). Right now vhtcpd_inpkts_sane seems to be hovering in the single digits, where last week it was between 100-200. And on monday there seemed to be some giant spike.
(Edit: I realize now, that might be a reasonable result of splitting up upload and normal htcp streams)

Anyways, unless there was some change in measurement methodology, that doesn't look right to me.

Yeah, the rates actually did look "correct" in the sense that when filtering only for upload purges, the packet rate in the original combined stream was very low compared to the much higher rate of text purges. That was one of the drivers for the split, to avoid sending all of the pointless text-cluster purges to the numerous upload caches.

We should try to reproduce this reliably and test per-datacenter on the commandline (using curl with --resolve to point at specific upload-lb.$site.wikimedia.org).

I tried testing for https://commons.wikimedia.org/wiki/File:2014_ebola_virus_epidemic_in_West_Africa_simplified.svg yesterday. I think i tested all data centres and none of the data centres were getting purges

I observed this effect today on two images from multi-page DjVu file: https://commons.wikimedia.org/wiki/File:Henryk_Sienkiewicz_-_Wiry_01.djvu :

New version of the file has been uploaded yesterday.

I resolved this problem purging manually the appropriate file pages on Commons few times:
https://commons.wikimedia.org/w/index.php?title=File%3AHenryk_Sienkiewicz_-_Wiry_01.djvu&page=146&action=purge
https://commons.wikimedia.org/w/index.php?title=File%3AHenryk_Sienkiewicz_-_Wiry_01.djvu&page=147&action=purge
Single purge did not help (maybe I should wait longer for the effect or maybe it behaves a bit randomly?)

Just FYI: While working on Wikisource with Commons-based DjVu files I observe this effect for years. It likely happens when a new file version is uploaded while thumbnails from old version are being generated or extensively used on other wikis. And it may be also correlated with long time response from commons servers (high load?). But unsure.

Glaisher subscribed.Nov 22 2015, 4:03 AM

Akoopal subscribed.Nov 22 2015, 7:26 PM

• MZMcBride subscribed.Nov 22 2015, 7:30 PM

So I was just helping someone with a thumb caching issue on irc -

https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/Police_-_rue_de_la_R%C3%A9publique_Saint-Denis_-_18_nov_2015.jpg/220px-Police_-_rue_de_la_R%C3%A9publique_Saint-Denis_-_18_nov_2015.jpg was returning an old cached thumb on esams (but not other clusters). Purging it worked to clear the file, so perhaps the issue in this bug is intermittent, or maybe there's just a lot of packet loss around.

Bug again reported here: https://commons.wikimedia.org/wiki/Commons:Bistro#File:La_derni.C3.A8re_charrette_de_Thermidor.jpg for https://commons.wikimedia.org/wiki/File:La_derni%C3%A8re_charrette_de_Thermidor.jpg

intracer subscribed.Nov 25 2015, 9:01 AM

Antanana subscribed.Nov 25 2015, 10:53 AM

Apparently its impossible to download the over written version of the image unless you change the pixels and the "Original file" link will download the previous version instead ..

BurritoBazooka subscribed.Nov 27 2015, 1:06 PM

It seems this image is also affected: https://commons.wikimedia.org/wiki/File:Glasl's_Model_of_Conflict_Escalation.svg

I came here through here: https://commons.wikimedia.org/wiki/Commons:Graphics_village_pump#Stuck_rendering_on_this_SVG_file.3F

And the problem is mentioned here: https://en.wikipedia.org/wiki/User_talk:Swinnall

zhuyifei1999 awarded a token.Nov 27 2015, 2:05 PM

Bug is still being reported. Latest incident: https://commons.wikimedia.org/wiki/Commons:Help_desk#Deletion_of_old_version.2C_possible_caching_problem

& https://commons.wikimedia.org/wiki/Commons:Village_pump#Problems_in_new_file_version

Southparkfan subscribed.Nov 28 2015, 2:52 PM

Change 255807 had a related patch set uploaded (by BBlack):
Revert "purging: do not VCL-filter on domain regex"

https://gerrit.wikimedia.org/r/255807

Change 255808 had a related patch set uploaded (by BBlack):
Revert "upload purging: do not listen on text/mobile addr"

https://gerrit.wikimedia.org/r/255808

Change 255807 merged by BBlack:
Revert "purging: do not VCL-filter on domain regex"

https://gerrit.wikimedia.org/r/255807

Change 255808 merged by BBlack:
Revert "upload purging: do not listen on text/mobile addr"

https://gerrit.wikimedia.org/r/255808

Above I've done some partial reversion of the multiple commits involved in the multicast split work as an experiment.

At this point, MediaWiki will still be splitting HTCP to separate addresses for the text/mobile and upload sets, but the original domain filters are back in place and upload is once again listening to the text/mobile multicast address in addition to its own. If this experiment doesn't resolve the issue (for new upload->purge; it won't fix existing issues), the next step would be to revert the mediawiki-config commit https://gerrit.wikimedia.org/r/#/c/249121/ . If that also doesn't resolve it, then it's mostly likely unrelated to the multicast changes.

No further reports of this bug AFAIK. Bug presumably resolved.

zhuyifei1999 reassigned this task from zhuyifei1999 to BBlack.Nov 30 2015, 10:36 AM

See https://commons.wikimedia.org/wiki/File:Jennifer_Winget_at_the_launch_of_Watch_Time%27s_magazine_11.jpg and there may be many more of them
https://commons.wikimedia.org/wiki/File:John_Lennon_Rock%27n_Roll_-_Hamburg.jpg is another one

In T119038#1837890, @Denniss wrote:

See https://commons.wikimedia.org/wiki/File:Jennifer_Winget_at_the_launch_of_Watch_Time%27s_magazine_11.jpg and there may be many more of them
https://commons.wikimedia.org/wiki/File:John_Lennon_Rock%27n_Roll_-_Hamburg.jpg is another one

https://commons.wikimedia.org/wiki/File:Jennifer_Winget_at_the_launch_of_Watch_Time's_magazine_11.jpg <= reverted to wrong version? :P
https://commons.wikimedia.org/wiki/File:John_Lennon_Rock%27n_Roll_-_Hamburg.jpg <= Did you try clearing browser cache? I'm finding nothing wrong here. Maybe geographic difference loading cached image from different datacenters, or simply browser cached old version.

Regardless, even if the bug appears "resolved" now (at least, back to normal rare levels of occurrence), my changes were just experimental - we'll still need to track down what the real cause is and then put real fixes in place that still allow the traffic to be split.

geographic location may play a role here but then we would have a data sync issue somewhere. The thumb from first file is still wrong (multiple browsers tested, purge had no effect), full screen version may be OK. The second image seems to be resolved.
This is actually a major problem if a purge command is not passed through at all.

For me, the first one linked above in zhuyifei's comment looks like this:

Screenshot_2015-11-30_15-25-33.png (833×1 px, 565 KB)

(the picture contents on the right in the above screenshot has a license noted here) >.>

Thumbnail is wrong.

BBlack moved this task from Backlog to Traffic team actively servicing on the Traffic board.Nov 30 2015, 6:03 PM

For 388px size of https://commons.wikimedia.org/wiki/File:Jennifer_Winget_at_the_launch_of_Watch_Time's_magazine_11.jpg

esams was returning wrong version (I've since purged that file, so you can't see anymore). The age header was set to
94485 (26.2 hours ago) [At time of writing its Mon Nov 30 20:26:02 UTC]. Thus this is an example that happened after the revert in question went through.

For the 77px size of the same image - https://upload.wikimedia.org/wikipedia/commons/thumb/6/61/Jennifer_Winget_at_the_launch_of_Watch_Time%27s_magazine_11.jpg/77px-Jennifer_Winget_at_the_launch_of_Watch_Time%27s_magazine_11.jpg

esams was showing wrong version with Age: 96131; X-Cache: cp1050 hit (3), cp3038 hit (40), cp3033 frontend miss (0); Last-Modified: Sun, 29 Nov 2015 18:02:03 GMT. codfw was showing wrong version with: Age: 96141; X-Cache: cp1050 hit (1), cp2014 hit (9), cp2014 frontend miss (0); Last-Modified: Sun, 29 Nov 2015 18:02:03 GMT. Note that the version being incorrectly displayed was uploaded at 2015-11-29T17:51:53 UTC, and the later version which should have overridden it, was uploaded at 2015-11-29T20:32:24‎. Thus a race condition seems unlikely

[subsequently, the cache has been purged]. ulsfo and eqiad were returning correct versions.

All this sounds like there is just exceptionally high packet loss on 239.128.0.113 relative to 239.128.0.112 (Otherwise, why is it so intermittent?). Given we go to all the trouble to give htcp packets sequential ids per host, it would be nice if we could set something up to keep track of how many ids are received, so we could quanitfy rate of packet loss.

Well anything's possible, but AFAIK there's no special treatment of 239.128.0.113 vs .112 in terms of multicast routing. We've always faced some small percentage of loss due to dropped UDP packets, but for the moment that's considered acceptable vs the costs of re-engineering a more-reliable mechanism for transmitting purges... but if anything that situation should have improved with the multicast split, as they're now queued separately on the sending and receiving host sockets.

The other general issue we face (both in historically and now) is the cache layering race: these images are stored in multiple layers of cache (2-3, depending on the datacenter of the request), and no attempt is currently made to order the invalidations to happen in the bottom layers first, so it's possible for an upper layer to process an invalidation and then refresh the image from a lower layer before said lower layer has processed the same invalidation, and thus end up with the wrong image back in cache.

The split of the multicast addresses could also potentially have negatively affected those races... perhaps being queued with (interleaved with) the higher-volume text invalidation traffic slowed them down enough in certain cases to make it less an issue. The best option right now would be to actually trace the HTCP traffic in response to manual purges and debug it from there, which I'll try to do this week...

The other general issue we face (both in historically and now) is the cache layering race: these images are stored in multiple layers of cache (2-3, depending on the datacenter of the request), and no attempt is currently made to order the invalidations to happen in the bottom layers first, so it's possible for an upper layer to process an invalidation and then refresh the image from a lower layer before said lower layer has processed the same invalidation, and thus end up with the wrong image back in cache.

Sample size of 1, but the last-modified timestamps on 77px of File:Jennifer_Winget_at_the_launch_of_Watch_Time's_magazine_11.jpg don't seem consistent with that happening. [I've ammended my previous comment to include the last-modified header]

Thibaut120094 awarded a token.Nov 30 2015, 10:21 PM

I received a user complaint about https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/PL_Karol_May_-_Dolina_%C5%9Amierci.djvu/page191-1024px-PL_Karol_May_-_Dolina_%C5%9Amierci.djvu.jpg - esams indeed had an older version it was serving (other caching centers were ok). The odd thing was, I had to do the purge 3 time before the purge seemed to go through for esams (And the reporter indicated that s/he had tried purging too, to no effect).

BBlack mentioned this in T119016: Find and fix duplicate CDN purges from MediaWiki.Dec 1 2015, 12:50 AM

So, I did some tracing this morning using one of the files mentioned earlier in this ticket. The commons file in question is: https://commons.wikimedia.org/wiki/File:Continuity_of_the_Exponential_at_0.svg . You'll note on the page it lists the upload.wm.o links for the original as well as scaled 4x different widths of "thumbs" at:

I pulled all of these up in a browser and they're all valid and working, and at least visually appear to have the same data currently.
Then I did a purge of this file using the browser URL https://commons.wikimedia.org/wiki/File:Continuity_of_the_Exponential_at_0.svg?action=purge while tracing HTCP network traffic on cp1071.eqiad.wmnet (an upload cache which is local to eqiad, so there should be no routing loss issues to even think about) several times.

In my first several purges, I observed the following HTCP (note commons URLs on .112 and upload URLs on .113 as expected with the split):

U 10.64.32.45:59104 -> 239.128.0.112:4827
  .q...k....$.....HEAD.Mhttp://commons.wikimedia.org/wiki/File:Continuity_of_the_Exponential_at_0.svg..HTTP/1.0....

U 10.64.32.45:59104 -> 239.128.0.112:4827
  ..........$.....HEAD.ihttp://commons.wikimedia.org/w/index.php?title=File:Continuity_of_the_Exponential_at_0.svg&action=history..HTTP/1.0....

U 10.64.32.45:33277 -> 239.128.0.113:4827
  ..........$.....HEAD..https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Continuity_of_the_Exponential_at_0.svg/1000px-Continuity_of_the_Exponential_at_0.svg.png..HTTP/1.0....

U 10.64.32.45:51868 -> 239.128.0.113:4827
  .~...x....$.....HEAD.Zhttps://upload.wikimedia.org/wikipedia/commons/b/be/Continuity_of_the_Exponential_at_0.svg..HTTP/1.0....

Then a few minutes later when I tried several times again, I got all of the above but without even the 1000px variant (just 2x commons + 1x upload HTCP for the main image itself). None of the purges ever sent an HTCP for the 200px, 500px, or 2000px variants. So I can definitely see how action=purge isn't going to fix all broken thumbnails - whether by bug or by design, it doesn't even attempt to purge them all, and that has nothing to do with the multicast split.

If we assume uploading a new image runs the same code as action=purge for purging the upload.wm.o image and thumbs, then the same problem is going to exist there as well...

Testing further, there's currently a complaint on commons about: https://commons.wikimedia.org/wiki/File:Rolamite.svg at https://commons.wikimedia.org/wiki/Commons:Help_desk#New_uploaded_version_of_.SVG_file_won.27t_become_the_current_version - the scaled image shown there on the commons page itself as a preview has the upper wheel on the right, whereas the correct image (and all the scaled upload links below the preview) have the upper wheel on the left. action=purge gives these HTCP:

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....|....K-....HEAD.^https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/1000px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K.....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/100px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K/....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/120px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K0....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/180px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....|....K1....HEAD.^https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/2000px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K2....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/200px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K3....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/220px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K4....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/240px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K5....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/500px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K6....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/768px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:56915 -> 239.128.0.113:4827
  .d...^....K7....HEAD.@https://upload.wikimedia.org/wikipedia/commons/f/f5/Rolamite.svg..HTTP/1.0....

U 10.64.48.90:44306 -> 239.128.0.112:4827
  .W...Q....K8....HEAD.3http://commons.wikimedia.org/wiki/File:Rolamite.svg..HTTP/1.0....

U 10.64.48.90:44306 -> 239.128.0.112:4827
  .s...m....K9....HEAD.Ohttp://commons.wikimedia.org/w/index.php?title=File:Rolamite.svg&action=history..HTTP/1.0....

Which seems to be covering a lot of thumbs, but not the thumb that's actually being used as the preview image on commons, which is: https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/512px-Rolamite.svg.png .

Trying two more times immediately afterwards, I only got:

U 10.64.48.90:45347 -> 239.128.0.113:4827
  .d...^....Kd....HEAD.@https://upload.wikimedia.org/wikipedia/commons/f/f5/Rolamite.svg..HTTP/1.0....

U 10.64.48.90:46624 -> 239.128.0.112:4827
  .W...Q....Ke....HEAD.3http://commons.wikimedia.org/wiki/File:Rolamite.svg..HTTP/1.0....

U 10.64.48.90:46624 -> 239.128.0.112:4827
  .s...m....Kf....HEAD.Ohttp://commons.wikimedia.org/w/index.php?title=File:Rolamite.svg&action=history..HTTP/1.0....

U 10.64.32.42:55295 -> 239.128.0.112:4827
  .W...Q..........HEAD.3http://commons.wikimedia.org/wiki/File:Rolamite.svg..HTTP/1.0....

U 10.64.32.42:55295 -> 239.128.0.112:4827
  .s...m..........HEAD.Ohttp://commons.wikimedia.org/w/index.php?title=File:Rolamite.svg&action=history..HTTP/1.0....

U 10.64.32.42:44759 -> 239.128.0.113:4827
  .....{..........HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/320px-Rolamite.svg.png..HTTP/1.0....

U 10.64.32.42:45333 -> 239.128.0.113:4827
  .d...^..........HEAD.@https://upload.wikimedia.org/wikipedia/commons/f/f5/Rolamite.svg..HTTP/1.0....

Thinking this through logically - either there's a code/data bug in how we're purging thumbnails in the general case at the application level, or the remaining infrastructure-level possibility is that the outbound purge packets from the appserver which is processing my action=purge request are overflowing a socket buffer and simply unreliable/lossy on the outbound side (it wouldn't be the inbound side I don't think). My gut feeling from observation is that it's the former though, as I don't see full thumbnail purge lists even occasionally coming through in repeated tests.

Digging through some of our repos looking for a possible related code regression as a shot in the dark, I do see a few code change that could be peripherally related depending on the real deployment date of the merge to production wikis:

https://gerrit.wikimedia.org/r/#/c/252696/
https://gerrit.wikimedia.org/r/#/c/251201/

In general, I don't fully understand how we're getting from action=purge to HTCP packets for specific thumbnails - is there someone on the MW side who understands it deeply. Black box observation indicates that we're saving a list of generated thumbnail sizes somewhere (memcache? mysql?), and we're maybe tracking timestamps on when they were last generated and/or purged so that we don't do duplicated purge work, which might be working against us here? The fact that the purged list seems to shrink as I test leans that direction - we should probably always purge all known thumbs, not just ones we think haven't already been purged since the last change...

I've also staged a revert of the mediawiki-config side of the multicast split at https://gerrit.wikimedia.org/r/#/c/256730/ . With the current pair of reversions on the puppet/cache side of things, it's fine to apply this one as well (but they'll have to be re-applied in correct order later if we go back again). I still don't think this is the direct cause, but it may be worth trying this going back to the theory that basically thumbnail purging has always been slightly-broken and mostly-saved by timing coincidences, and that the split affected those timing coincidences significantly. If we want to try this, I think it's fair game to go ahead and merge->deploy it regardless of current freeze(s)...

The list of thumbnails to purge comes from the list of thumbnails in swift. Since these are also deleted on purge, the next ?action=purge will have less thumbnails (only ones recreated since then).

Lokal_Profil subscribed.Dec 3 2015, 10:16 PM

In T119038#1849717, @aaron wrote:

The list of thumbnails to purge comes from the list of thumbnails in swift. Since these are also deleted on purge, the next ?action=purge will have less thumbnails (only ones recreated since then).

That seems problematic since swift and varnish are caching thumbnails separately. Is Swift also listening to HTCP, or does it get the purge via some other mechanism? Either way, the nasty issue here is then that if the purge succeeds on Swift and fails on varnish, further purges will never try to remove the invalid ones still stuck in varnish.

And to expound a bit earlier on the cache-layering problem (and Swift is another layer to consider!). When a thumbnail is generated in swift and broadly accessed by users through the various cache endpoints, the places in which it's cached and the diagram of which caches fetch from which sources looks like this:

slnqlmsyQg4Ei8bS_5lbgFA.png (720×960 px, 38 KB)

Each of those boxes with varnish-fe or varnish-be on them also represent multiple hosts: in the fe case, different users hit different equivalent fe hosts with separate caches. In the be case, there's hashing by-filename, so all users would hit the same host for the same image. In all cases, a cache miss for any box in the diagram would immediately re-fetch the image from the next connected thing down on the chart.

So, if a purge request is broadcast to swift and all varnishes simultaneously, immediately, and only once, we have a big race condition here. If the image was already cached in all of those places, and the invalidation arrives at varnish-fe (esams), then a user fetches the image from varnish-fe (esams) before the invalidation has been processed at varnish-be (esams), then varnish-fe (esams) will re-fetch the outdated image and keep it long after the invalidation has finished processing at all the other layers. This is true anytime invalidate->re-fetch-for-user happens at any layer before the invalidation completes at a lower layer. The only way to avoid the race is to purge sequentially from the bottom up in the diagram...

In T119038#1852376, @BBlack wrote:

In T119038#1849717, @aaron wrote:

The list of thumbnails to purge comes from the list of thumbnails in swift. Since these are also deleted on purge, the next ?action=purge will have less thumbnails (only ones recreated since then).

That seems problematic since swift and varnish are caching thumbnails separately. Is Swift also listening to HTCP, or does it get the purge via some other mechanism? Either way, the nasty issue here is then that if the purge succeeds on Swift and fails on varnish, further purges will never try to remove the invalid ones still stuck in varnish.

Yeah that's a very known suckiness in the system. The VCL bucketing based thumbnail purges gilles worked on would solve the listing sync problem.

Swift just gets DELETEs via HTTP, not the UDP purger.

We can try the mediawiki-config revert during one of the Monday SWATs I think ( https://gerrit.wikimedia.org/r/#/c/256730/ ) - it may still have some pattern-effect that reduce the rate back down.

https://commons.wikimedia.org/wiki/File:Megabalanus_coccopoma.jpg
Another file to run experiments with - Logo was removed but still shows in bottom right on many thumbs.

More reports on WP:VP/T keep coming in.

The revert was pushed off for tomorrow, because SWAT today was too full of post-freeze changes...

This was merged in the SWAT that happened around 00:00-01:00 UTC Dec 8 (so about 17 hours before this post). Any new reports or observations of bad behavior since then (note this won't fix existing broken thumbnails, but may affect the rate of breakage for new uploads / overwrites).

Seems to be behaving correctly for me now. Should probably wait a bit more before closing?

Thank you for fixing this :D

We shouldn't close at all, as these fixes are just workarounds and don't get to the heart of the issue, but regardless I do want to wait for more reports for another day or so before calling this a good temporary workaround.

BBlack lowered the priority of this task from Unbreak Now! to High.Dec 9 2015, 8:30 PM

Happened again today: https://commons.wikimedia.org/wiki/File:KutlugAtaman.JPG
Was overwritten, reverted by me and intermediate file deleted. Despite multiple purges the deleted image is still shown although some thumbs either show the most recent version (or have not been updated from first overwrite).

@Denniss - problems today are unrelated, they're from general random purge loss due to: T124418

Denniss unsubscribed.Apr 24 2016, 10:59 AM

Restricted Application added a subscriber: Poyekhali. · View Herald TranscriptApr 24 2016, 10:59 AM

BBlack mentioned this in T133821: Make CDN purges reliable.Apr 28 2016, 12:08 AM

BBlack added a subtask: T133821: Make CDN purges reliable.Apr 28 2016, 12:10 AM

BBlack closed this task as a duplicate of T133821: Make CDN purges reliable.May 31 2016, 9:26 PM

Liuxinyu970226 awarded a token.Jan 11 2018, 11:10 AM

BBlack closed subtask T133821: Make CDN purges reliable as Resolved.Sep 29 2020, 9:26 PM

	F3029094: Screenshot_2015-11-30_15-25-33.png
	Nov 30 2015, 3:30 PM

Image cache issue when 'over-writing' an image on commonsClosed, DuplicatePublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Image cache issue when 'over-writing' an image on commons
Closed, DuplicatePublic
Actions

Related Objects
Search...