Invalidation of Varnish thumbnail cache sometimes doesn't work
Closed, Invalid (Public)

bzimport set Reference to bz41130.Via ConduitNov 22 2014, 12:56 AM
bzimport created this task.Via LegacyOct 17 2012, 8:27 PM
MZMcBride added a comment.Via ConduitOct 17 2012, 9:42 PM

I'm CC'ing a few people who may be able to help. This seems Swift-related.

MZMcBride added a comment.Via ConduitOct 17 2012, 9:54 PM

In discussion with paravoid on IRC, there seem to be three separate issues going on here:

(1) ?action=purge on the file description page no longer regenerates thumbnails [does this have a bug?]

(2) reuploading isn't properly purging thumbnails (bug 31680)

(3) some image links are returning HTTP error code 404 (not found) HTML in a HTTP response code 200 (OK) reply [does this have its own bug?]

Aklapper added a comment.Via ConduitOct 18 2012, 3:33 PM

It looks like bug 41113 has the best description and analysis of the issue so far, hence marking as a duplicate.

*** This bug has been marked as a duplicate of bug 41113 ***
bzimport added a comment.Via ConduitOct 20 2012, 6:54 PM

mr.heat wrote:

Bug 31680 has been unsolved for over a year. A year. I assume hundreds of users have been wasting their time purging and reuploading broken images for over a year now. I don't do uploads very often but almost every time this damn bug gets in my way.

This image has been wrong for two weeks now.

http://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Mediawiki-versionsvergleich.png/640px-Mediawiki-versionsvergleich.png

Bug 31680 is assigned to the wrong developers. It's marked as a MediaWiki bug but it obviously is a problem with the Wikimedia servers. That's why I posted it again.

Is anybody working on this problem?

Aklapper added a comment.Via ConduitOct 23 2012, 1:02 AM

(In reply to comment #2)

In discussion with paravoid on IRC, there seems to be three separate issues
going on here:

MZMcBride: Thanks for taking the time to analyze & split this up. It's welcome.

(1) ?action=purge on the file description page no longer regenerates thumbnails
[does this have a bug?]

None that I know of.

(2) reuploading isn't properly purging thumbnails (bug 31680)

(3) some image links are returning HTTP error code 404 (not found) HTML in a
HTTP response code 200 (OK) reply [does this have its own bug?]

(3) is covered by bug 41113 it seems.

This leaves us with (1) which this bug report is now about. Adjusting summary to make this clear.
Also decreasing priority and severity - while this is highly annoying, this is nothing to immediately interrupt work on any other tasks (plus it's not that *all* thumbnails were broken or such). I'm not trying to belittle the problem, just putting it into the bigger picture.

RobLa-WMF added a comment.Via ConduitOct 23 2012, 7:20 PM

Aaron, can you look at this one, and let us know if we need to get ops involved in fixing this one?

Aklapper added a comment.Via ConduitOct 23 2012, 8:05 PM

(In reply to comment #6)

Aaron, can you look at this one, and let us know if we need to get ops involved
in fixing this one?

Taking the bigger picture into account, these all seem related when it comes to the outcome for users (and their comments on random feedback places): https://bugzilla.wikimedia.org/buglist.cgi?bug_id=41130,41174,31680,41113

aaron added a comment.Via ConduitOct 23 2012, 8:27 PM

Doing a purge and monitoring the swift logs shows no DELETE issued for 640px-Mediawiki-versionsvergleich.png (though it does for a good number of other sizes). After HEAD/GETing the object from swift (which triggered its creation in swift), I then purged the file again. This purged the squids and fixed that problem.

The general problem is the following:

  • A user does ?action=purge or uploads a new file
  • MW gets the list of thumbnails in swift
  • MW purges those thumbnails out of swift
  • MW purges the squids for those thumbnails but fails for some/all (maybe due to some random network problem)
  • A user purges the pages again
  • MW gets a list of the thumbnails in swift (which does not include the bad ones the user sees since they *were* purged from swift, just not the squids)
  • MW purges those thumbnails out of swift
  • MW purges the squids for those thumbnails (which does not include the bad ones)
  • User is still confused...
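
The sequence above can be reproduced in a small simulation. This is only a sketch: the `Swift` and `EdgeCache` classes and the `lost` parameter are hypothetical stand-ins, not the real Swift or Squid interfaces, and the file names are made up.

```python
class Swift:
    """Toy stand-in for the thumbnail store (hypothetical, not the real Swift API)."""

    def __init__(self):
        self.thumbs = set()

    def list_thumbs(self):
        return sorted(self.thumbs)

    def delete(self, url):
        self.thumbs.discard(url)


class EdgeCache:
    """Toy stand-in for a Squid/Varnish node that may never see a purge."""

    def __init__(self):
        self.cached = set()

    def purge(self, url, delivered=True):
        # An undelivered purge changes nothing, and nobody is told about it.
        if delivered:
            self.cached.discard(url)


def purge_file(swift, cache, lost=()):
    """List thumbnails from the store, delete them there, then purge the cache."""
    urls = swift.list_thumbs()
    for url in urls:
        swift.delete(url)
    for url in urls:
        cache.purge(url, delivered=url not in lost)
    return urls


swift, cache = Swift(), EdgeCache()
swift.thumbs = {"120px-a.png", "640px-a.png"}
cache.cached = {"120px-a.png", "640px-a.png"}

first = purge_file(swift, cache, lost={"640px-a.png"})  # one purge is dropped
second = purge_file(swift, cache)                       # the store lists nothing now
print(first, second, cache.cached)
```

The second purge has an empty list to work from, so the stale 640px copy stays in the cache no matter how often the user retries.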
aaron added a comment.Via ConduitOct 24 2012, 1:24 AM

This isn't really easy to prevent. One can do a few things:
a) Change SquidPurgeClientPool to return some status info on run() that checks if any of the sockets were marked as "down" (got a non 200/404 message). Then change the squid purge code to run first, and to not purge anything in swift unless the squid purge succeeded. This would still be vulnerable to race conditions (you can purge afterwards again but have the same issue as before).
b) Make the same SquidPurgeClientPool change as (a) and have a list of thumbnails logged somewhere if anything failed to purge. This list could be reused by ?action=purge or job runners automatically re-trying in the background or something.
c) Hack squid to support prefix PURGE requests somehow.
d) Have fixed thumbnail sizes so there are a reasonable number of a priori urls to purge, so ?action=purge can fix failed purges.
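
Option (b) could look roughly like the sketch below. It assumes the purge client can report success or failure per URL, which is exactly the status information option (a) asks for; `send_purge`, `failed_log` and the flaky sender are hypothetical names used for illustration only.

```python
from collections import deque


def purge_with_retry_log(urls, send_purge, failed_log):
    """Send one purge per URL and remember the URLs whose purge did not succeed."""
    for url in urls:
        if not send_purge(url):
            failed_log.append(url)


def retry_failed(send_purge, failed_log):
    """Retry every logged URL once; anything that fails again stays in the log."""
    for _ in range(len(failed_log)):
        url = failed_log.popleft()
        if not send_purge(url):
            failed_log.append(url)


# Hypothetical sender: the first purge of the 640px thumbnail fails, later ones succeed.
attempts = {}


def flaky_send(url):
    attempts[url] = attempts.get(url, 0) + 1
    return not (url == "640px-a.png" and attempts[url] == 1)


log = deque()
purge_with_retry_log(["120px-a.png", "640px-a.png"], flaky_send, log)
pending_after_first = list(log)
retry_failed(flaky_send, log)  # e.g. triggered by ?action=purge or a job runner
print(pending_after_first, list(log))
```

The point of the log is that the retry no longer depends on the thumbnail still being listed in swift.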

One could also change the squid settings to revalidate more using HEADs to swift (which would trigger 404 handling and a fresh thumbnail in this broken case where they don't exist). This would increase latency and traffic, and still wouldn't deal with the immediate confusion upon some re-uploads.

Another workaround, to some extent, is to reduce the network problems (amount of dropped packets? htcp purge relay script?) or purge request failures so this at least occurs less often. It might be useful to have Mark look at this.

bzimport added a comment.Via ConduitOct 24 2012, 8:16 AM

mr.heat wrote:

(In reply to comment #8)

  • MW purges the squids for those thumbnails but fails for some/all

I'm sure this is a stupid question but why don't you simply purge *ALL* thumbnail sizes for a given image? At least this is what should happen from a user's point of view. I'm doing a purge for a file. Not for an (incomplete) list of thumbnails.

And why is it not possible to purge a thumbnail by appending ?action=purge to it? This would be a nice workaround in such cases as long as the underlying problem is not fixed.

As said in the other bug, I'll donate $100 to the foundation if you fix this longstanding bug.

mark added a comment.Via ConduitOct 24 2012, 9:42 AM

See http://lists.wikimedia.org/pipermail/wikitech-l/2012-October/064011.html for a related idea that would also help improve this.

TheDJ added a comment.Via ConduitOct 25 2012, 10:09 AM

@TMg: It's not a stupid question. It has to do with how caching systems work. Every url (for every potential size) is cached independently. In general, you can also only purge the URLs you know to exist. There is no 'wildcard' system in caching systems (it's not a database, it's a 'flat' lookup table), because that is terribly inefficient in such systems (it's also what makes them fast for caching).

There is an idea now to make all urls of one image use the same 'identifier' and then purge that identifier instead of the 'url'. How feasible this is is not yet clear, there might be performance implications.
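
A minimal sketch of that idea, assuming the cache keeps a secondary index from a per-image identifier to the cache keys stored under it. The `TaggedCache` class and its method names are hypothetical and do not describe how Squid or Varnish are implemented.

```python
import hashlib
from collections import defaultdict


class TaggedCache:
    """Toy cache: flat lookup by hashed URL, plus an index from identifier to keys."""

    def __init__(self):
        self.objects = {}               # hash(url) -> body
        self.by_tag = defaultdict(set)  # identifier -> set of hash keys

    @staticmethod
    def _key(url):
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def store(self, url, body, tag):
        key = self._key(url)
        self.objects[key] = body
        self.by_tag[tag].add(key)

    def get(self, url):
        return self.objects.get(self._key(url))

    def purge_tag(self, tag):
        """Drop every object stored under one identifier, whatever its URL was."""
        for key in self.by_tag.pop(tag, set()):
            self.objects.pop(key, None)


cache = TaggedCache()
cache.store("/thumb/x/xx/xxx.png/120px-xxx.png", b"old", tag="xxx.png")
cache.store("/thumb/x/xx/xxx.png/640px-xxx.png", b"old", tag="xxx.png")
cache.purge_tag("xxx.png")
print(cache.get("/thumb/x/xx/xxx.png/640px-xxx.png"))
```

The extra index is where the possible performance cost mentioned above would come from: every store and every purge has to maintain it.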

bzimport added a comment.Via ConduitOct 25 2012, 11:43 AM

mr.heat wrote:

If I look at the file names there is kind of a "wildcard" system:

http://upload.wikimedia.org/wikipedia/commons/thumb/x/xx/xxx.png/1px-xxx.png
http://upload.wikimedia.org/wikipedia/commons/thumb/x/xx/xxx.png/2px-xxx.png
http://upload.wikimedia.org/wikipedia/commons/thumb/x/xx/xxx.png/3px-xxx.png

... and so on up to the maximum image width.

Why not simply purge all possible thumbnail sizes? This should be a no-op for non-existing thumbnail sizes and therefore not affect the performance.

Why is it not possible to get a list of all non-purged thumbnail sizes? According to the explanation above you are asking "swift" for the list of thumbnails to purge the "squids". Why not ask the squids? Why don't you purge the squids first and skip purging the swift if something went wrong?

I know all this is just working around the problem. The main question is: Why does purging the squids fail sometimes?

ArielGlenn added a comment.Via ConduitOct 25 2012, 12:52 PM

We can't actually ask the squids for a wildcard list, nor purge by providing one, because items aren't referenced by their urls but only by hashes of those. See

http://wiki.squid-cache.org/SquidFaq/OperatingSquid#How_can_I_purge_multiple_objects_from_my_cache.3F

for more about this.
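
To illustrate why a hashed store cannot answer a prefix query, here is a small sketch. MD5 over the URL string is used purely as an example hash; the exact key derivation inside Squid or Varnish is not taken from this thread.

```python
import hashlib


def cache_key(url):
    """Illustrative cache key: a hash of the full URL (assumption for this sketch)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


prefix = "http://upload.wikimedia.org/wikipedia/commons/thumb/x/xx/xxx.png/"
urls = [prefix + f"{width}px-xxx.png" for width in (1, 2, 3)]

# The store only knows hash -> object; the original URL is not kept.
store = {cache_key(url): b"thumbnail bytes" for url in urls}

print(cache_key(urls[0]) in store)  # an exact URL can be looked up and purged
print(cache_key(prefix) in store)   # the shared prefix hashes to an unrelated key
```

Sibling URLs share a long prefix, but their keys have nothing in common, so there is nothing for a wildcard to match against.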

TheDJ added a comment.Via ConduitOct 25 2012, 1:34 PM

@TMg: purging sometimes fails because cache purging is a fire-and-forget system. It is too resource intensive to request confirmation for every single 'purge' request sent to every single caching server. If the squid/varnish that has been sent the purge request never receives it (lost UDP packet, dropped connection between America and Europe, or stray electron), then it will never know it was supposed to purge it.
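
As a rough worked example of how unconfirmed delivery adds up: if each cache server independently misses a given purge with probability p, the chance that at least one of n servers keeps a stale copy is 1 - (1 - p)^n. The loss rate and server count below are made-up numbers, not measurements from this thread, and independence between servers is an assumption.

```python
def chance_of_stale_copy(loss_rate, servers):
    """Probability that at least one of `servers` caches never sees a given purge."""
    return 1 - (1 - loss_rate) ** servers


# Hypothetical inputs: 0.1% loss per server, 20 cache servers.
p = chance_of_stale_copy(loss_rate=0.001, servers=20)
print(f"{p:.4f}")
```

Even a small per-server loss rate becomes noticeable once it is multiplied across many servers and thousands of purges per hour.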

Now the problem here is that when this has happened, we no longer have an accurate list of what we NEED to purge (the thumbnails have disappeared from the filesystem). This, combined with the problem that you cannot create 'wildcard' requests on squid (or varnish) because they don't store the original urls, just unique hashes of those urls, causes the problems that we see. There is no way left to get rid of the cached copy until the cached copy expires automatically (30 days?).

For wildcard, you propose sending a purge request for every possible image size. This would be highly inefficient, however. You would replace one bounded set of requests (often no more than 10 or so) with 10,000 requests (one for each possible size), and that for thousands of images that are purged every hour.

Caching really is a rather special problem that breaks the rules of many of the expectations that people tend to have.

bzimport added a comment.Via ConduitOct 25 2012, 3:43 PM

mr.heat wrote:

A user told me some thumbnails at Commons have been wrong for 7 months now (sorry, I don't know the filename).

Thank you very much for the explanation. I think we agree this is kind of "broken by design". Currently we are stuck in the following situation:

  1. The user can see the error but he can not fix it.
  2. The system does not know there is an error.

This is kind of a deadlock situation. There is data loss (one subsystem thinks the thumbnail is gone but it is not) and nobody can fix this. Basically the solution is simple: We need to make the system aware of the error. Either automatically or by user request.

An idea (already suggested above) is to always send purge requests for the most common thumbnail sizes. Currently we can choose the following thumbnail and image sizes in the preferences: 120, 150, 180, 200, 220, 250, 300, 320, 640, 800 (default), 1024, 1280. That's about 10 requests (I would stop at the default) instead of 10000. I don't like this idea because it's a shotgun: You will hit something but also miss a lot.
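
For illustration, the fixed list of purge URLs for one image could be built like this. The helper `thumb_urls` is hypothetical; it takes the thumbnail directory URL as given instead of deriving it, and the widths are the preference values listed above.

```python
PREFERENCE_WIDTHS = [120, 150, 180, 200, 220, 250, 300, 320, 640, 800, 1024, 1280]


def thumb_urls(thumb_dir_url, thumb_name, widths):
    """Build one thumbnail URL per width, following the <width>px-<name> pattern."""
    return [f"{thumb_dir_url}/{width}px-{thumb_name}" for width in widths]


base = ("http://upload.wikimedia.org/wikipedia/commons/thumb/7/73/"
        "Mediawiki-versionsvergleich.png")
urls = thumb_urls(base, "Mediawiki-versionsvergleich.png", PREFERENCE_WIDTHS)

print(len(urls))  # one purge per preference size
print(urls[8])    # the 640px thumbnail discussed in this bug
```

This keeps the number of purges per image small and bounded, at the cost of missing any size that is not on the list.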

Another idea is to allow ?action=purge on the thumbnail URL. I tried this once, e.g.

http://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Mediawiki-versionsvergleich.png/640px-Mediawiki-versionsvergleich.png

becomes

http://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Mediawiki-versionsvergleich.png/640px-Mediawiki-versionsvergleich.png?action=purge

This returned the correct image. But the first URL continued to show the wrong image. From what you describe I assume the first URL is found in the squid cache. The squid responds with the (wrong) thumbnail and that's it.

When I add something to the URL the squid cache is bypassed and a new thumbnail is created and cached in the squid with that different URL.

I'm not sure which subsystems are involved in this example but the idea is simple: Read the ?action=purge parameter. Purge that specific thumbnail size. Don't create a new thumbnail. Instead respond with a 301 redirect to the URL without the ?action=purge parameter.
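
A sketch of that request handling, independent of any real server code: `handle_thumb_request` and the `purge` callback are hypothetical names, and how the purge would actually reach the caches is left open.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def handle_thumb_request(url, purge):
    """Return (status, location). Purge and redirect if ?action=purge is present."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if ("action", "purge") not in query:
        return 200, url  # normal request, served as usual

    # Strip only the purge parameter so the redirect targets the canonical URL.
    remaining = [pair for pair in query if pair != ("action", "purge")]
    clean_url = urlunsplit(parts._replace(query=urlencode(remaining)))
    purge(clean_url)       # purge that specific thumbnail size
    return 301, clean_url  # do not render a second copy under the longer URL


purged = []
status, location = handle_thumb_request(
    "http://upload.wikimedia.org/wikipedia/commons/thumb/7/73/"
    "Mediawiki-versionsvergleich.png/640px-Mediawiki-versionsvergleich.png?action=purge",
    purged.append,
)
print(status, location)
```

Redirecting instead of rendering avoids the situation described above, where the purge URL ends up cached as a second, separate object.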

Or do what Mark wrote in the mailing list. Sounds very good to me.

http://lists.wikimedia.org/pipermail/wikitech-l/2012-October/064011.html

RobLa-WMF added a comment.Via ConduitOct 26 2012, 12:04 AM

Bumping down to high priority now. We want to fix this, but we need to fix this right, and that probably means getting a few other quicker issues fixed rather than blocking them on this.

TheDJ added a comment.Via ConduitNov 8 2012, 11:21 AM

I found a workaround to fix thumbnails affected by this I think. (though we should fix the root cause of course).

If I run /w/thumb.php?f=name.jpg&width= for the size in question, then I think this will force generation of a thumb (bypassing the cached copy). This causes a thumb file to be in the filesystem again, which makes the cached copy of this image purgeable.

So thumb.php -> purge page -> purge browser cache

aaron added a comment.Via ConduitNov 8 2012, 3:53 PM

(In reply to comment #18)

I found a workaround to fix thumbnails affected by this I think. (though we
should fix the root cause of course).

If I run /w/thumb.php?f=name.jpg&width= for the size in question, then I think
this will force generation of a thumb (bypassing the cached copy). This causes
a thumb file to be in the filesystem again, which makes the cached copy of this
image purgeable..

So thumb.php -> purge page -> purge browser cache

Yes, this is like what happened in comment 8 (triggering generation of the thumbnail).

aaron added a comment.Via ConduitNov 16 2012, 8:29 PM

I think the solution to look into is what mark proposed on the mailing list?

@mark: how much work does this sound like?

mark added a comment.Via ConduitNov 19 2012, 11:07 AM

I've got an untested prototype sitting in Gerrit for testing, but I can't really deploy Varnish during the fundraiser.

aaron added a comment.Via ConduitNov 19 2012, 10:31 PM

Some people here have reported this as seeming to happen for all of their uploads. Is it literally happening for handfuls of re-uploads in a row and for a large number of people? It would be helpful to have more reports here. The bug where stale thumbnails can get stuck should still be rare, unless PURGE requests are failing all the time (sounds like an operations issue).

bzimport added a comment.Via ConduitDec 12 2012, 1:14 PM

mr.heat wrote:

As I said above, I do not upload many images. But when I do, I always get the same error. How is this possible?

Wild guess: I'm using Opera. Opera uses a different cache algorithm. It basically ignores some of the HTTP headers and shows images from the browser cache for a few hours. What this means is, when I re-upload an image and the file description page is shown, my browser does *not* send HTTP requests for all the images on the page since they are in my browser cache. Does this confuse the software? I have to force reloading the images either by pressing Shift+Reload or by right-clicking the images and choosing "Reload image". This refreshes most images but some always fail.

http://upload.wikimedia.org/wikipedia/commons/thumb/1/14/Button_base.svg/119px-Button_base.svg.png
http://upload.wikimedia.org/wikipedia/commons/thumb/1/14/Button_base.svg/120px-Button_base.svg.png
http://upload.wikimedia.org/wikipedia/commons/thumb/1/14/Button_base.svg/121px-Button_base.svg.png

120px never gets purged, everything else is fine.

Bawolff added a comment.Via ConduitDec 13 2012, 7:51 PM

(In reply to comment #18)

I found a workaround to fix thumbnails affected by this I think. (though we
should fix the root cause of course).

If I run /w/thumb.php?f=name.jpg&width= for the size in question, then I think
this will force generation of a thumb (bypassing the cached copy). This causes
a thumb file to be in the filesystem again, which makes the cached copy of this
image purgeable..

So thumb.php -> purge page -> purge browser cache

Wouldn't it be easier to view a url like
http://upload.wikimedia.org/wikipedia/commons/thumb/1/14/Button_base.svg/119px-Button_base.svg.png?someStuffAtEndToForceAPurgeMiss

which should force the backend to generate the image, and then do ?action=purge on image description page.

Another case is https://upload.wikimedia.org/wikipedia/bcl/b/bc/Wiki.png (provided you're in north america). Note: that's the full-sized version of the image, so it should always be there in the backend, and ?action=purge on the image description page seems to have no effect on its cached value.

a) Change SquidPurgeClientPool to return some status info on run() that checks
if any of the sockets were marked as "down" (got a non 200/404 message). Then
change the squid purge code to be first, and to not purge anything in swift
unless the squid purge succeeded. This would still be vulnerable to race
conditions (you can purge afterwards again but have the same issue as before).

Unless I missed something, that's not an option unless we give up on the whole multicast udp htcp thing.


Possibly related - I've looked at two specific instances of images not getting purged out of squid cache. In both cases the image was purged if you were visiting from europe, but not from N. America. The headers on the request to North America seem to indicate that it was coming from varnish servers, and the europe request from squid servers. I imagine there's a pretty good chance that is coincidence, but thought I should mention that anyhow.

(In reply to comment #22)

Some people here have reported this as seeming to happen for all of their
uploads. Is it literally happening for handfuls of re-uploads in a row and for
a large number of people? It would be helpful for more reports here. The bug
where stale thumbnails can get stuck should still be rare, unless PURGE requests
are failing all the time (sounds like an operations issue).

I think there might be significant issues with PURGE failures. See my comment at bug 38879 comment 10.

bzimport added a comment.Via ConduitDec 13 2012, 8:44 PM

mr.heat wrote:

(In reply to comment #18)

/w/thumb.php?f=name.jpg&width= [...]

I tried this for the broken Button_base.svg but it does not help.

(In reply to comment #24)

Wouldn't it be easier to view a url like [...]

Yes, I already suggested this. I would like to add ?action=purge to the thumb URL. Currently this creates a second thumb with the same size but a different URL.

the image was purged if you were visiting from europe, but not from
N. America.

Maybe it's the other way around in my case? I'm sitting in Germany.

By the way, I donated the 100 Euro I promised. I really did. Even if the bug is not fixed.

Bawolff added a comment.Via ConduitDec 13 2012, 9:29 PM

(In reply to comment #25)

(In reply to comment #18)
> /w/thumb.php?f=name.jpg&width= [...]

I tried this for the broken Button_base.svg but it does not help.

(In reply to comment #24)
> Wouldn't it be easier to view a url like [...]

Yes, I already suggested this. I would like to add ?action=purge to the thumb
URL. Currently this creates a second thumb with the same size but a different
URL.

> the image was purged if you were visiting from europe, but not from
> N. America.

Maybe it's the other way around in my case? I'm sitting in Germany.

In my test, Europe has the correct thumb, and N. America has the old thumb for the 120px size of [[file:Button_base.svg]]. (I'm not all that familiar with wmf server infrastructure, so I don't know if what I'm calling the europe servers actually serve all of europe). [which makes three images where everything is ok in europe but not north america]

I'm figuring out which is the "right" thumb by comparing with the 119px size. Both images look very similar, with a slight difference in the angle of the cut-off of the bottom border

By the way, I donated the 100 Euro I promised. I really did. Even if the bug
is not fixed.

Technical problems with the sites - driving donations since roughly 2003 ;)

Bawolff added a comment.Via ConduitDec 14 2012, 7:01 PM
*** Bug 43117 has been marked as a duplicate of this bug. ***
Bawolff added a comment.Via ConduitDec 14 2012, 7:03 PM

renaming bug to better reflect what is currently being reported. bug 43117 is another 2 reports of this, both instances only occurring for the north america caches.

bzimport added a comment.Via ConduitDec 14 2012, 7:57 PM

mr.heat wrote:

(In reply to comment #26)

Technical problems with the sites - driving donations since roughly 2003

Adding bugs to the system to make the people donate? ;-)

(In reply to comment #28)

renaming bug to better reflect what is currently being reported. bug 43117
is another 2 reports of this, both instances only occurring for the north
america caches.

I'm confused. As said, I'm sitting in Germany. I always work with a user account (from what I know this affects some of the caching stuff going on). I'm always logged in. How is it possible that some caches in North America affect the image uploads I do from Germany?

For the sake of completeness I attached the last half of my traceroute for the broken thumb. I think the trace ends in the Netherlands. Not in America. By the way, the thumb is fixed now.

8 54 ms 53 ms 56 ms ge0-1-0-cr0.ixf.de.as6908.net [80.81.192.244]
9 59 ms 59 ms 59 ms te3-4-3502-cr0.nik.nl.as6908.net [78.41.154.17]
10 57 ms 58 ms 57 ms ge-2-2.br1-knams.wikimedia.org [78.41.155.38]
11 58 ms 58 ms 59 ms ve7.te-8-1.csw1-esams.wikimedia.org [91.198.174.250]
12 60 ms 58 ms 59 ms upload-lb.esams.wikimedia.org [91.198.174.234]

http://upload.wikimedia.org/wikipedia/commons/thumb/1/14/Button_base.svg/120px-Button_base.svg.png

Bawolff added a comment.Via ConduitDec 14 2012, 8:27 PM

I always work with an user
account (from what I know this affects some of the caching stuff going on)

That's true for normal page views, but for images it doesn't really matter if you're logged in.

For the sake of completeness I attached the last half of my traceroute for the
broken thumb. I think the trace ends in the Netherlands. Not in America. By the
way, the thumb is fixed now.

Thanks. You're right, you're accessing through the European caching servers. In my tests (still to this moment) the 120px-Button_base.svg.png is the new (correct) version if accessing through upload-lb.esams.wikimedia.org (Netherlands) but is still incorrect if accessing via upload-lb.eqiad.wikimedia.org (N. America). There are several other examples that show this trend. Perhaps there were two separate issues here that got conflated (with something else affecting you); perhaps for some reason you used to be connecting to the North America one and suddenly you started accessing via the Europe one, so now the issue disappears for you. (There's also always the possibility that I misunderstand something, as some of this is getting much farther into WMF server architecture than I am familiar with.)
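
A comparison like this can be made mechanical by hashing the body each cache site returns for the same URL and grouping sites by hash. The sketch below only covers the comparison step; how the bodies are fetched from each site is left out, and the byte strings are placeholders, not real responses.

```python
import hashlib
from collections import defaultdict


def group_by_content(bodies):
    """Group cache sites by the SHA-256 of the body they returned for one URL."""
    groups = defaultdict(list)
    for site, body in bodies.items():
        groups[hashlib.sha256(body).hexdigest()].append(site)
    return sorted(sorted(sites) for sites in groups.values())


# Placeholder bodies standing in for what each site served.
bodies = {
    "upload-lb.esams.wikimedia.org": b"new thumbnail bytes",
    "upload-lb.eqiad.wikimedia.org": b"old thumbnail bytes",
}
groups = group_by_content(bodies)
print(groups)
```

More than one group means the sites disagree, which is the symptom reported here for the 120px thumbnail.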

Bawolff added a comment.Via ConduitDec 18 2012, 11:28 PM
*** Bug 42963 has been marked as a duplicate of this bug. ***
Bawolff added a comment.Via ConduitDec 18 2012, 11:28 PM

(In reply to comment #31)

Other issues potentially related to this:
https://bugzilla.wikimedia.org/show_bug.cgi?id=42963

Marked as dupe.

https://bugzilla.wikimedia.org/show_bug.cgi?id=42653

I think that's an unrelated issue

RobLa-WMF added a comment.Via ConduitDec 21 2012, 8:14 PM

The plan is to move fully to Varnish for images (sometime after the datacenter migration), after which point we can address this problem in earnest. Until that time, this isn't Aaron's highest priority, so I'm moving this to high priority. It's something we're aware of, but the Squid-based solutions are unreliable and expensive enough that it doesn't make sense to invest in that as a solution.

The workarounds described above (in particular, comment #18) will hopefully work for now.

Bawolff added a comment.Via ConduitDec 21 2012, 9:35 PM

(In reply to comment #34)

The plan is to move fully to Varnish for images (sometime after the datacenter
migration), after which point we can address this problem in earnest. Until
that time, this isn't Aaron's highest priority, so I'm moving this to high
priority. It's something we're aware of, but the Squid-based solutions are
unreliable and expensive enough that it doesn't make sense to invest in that as
a solution.

Seems like before we move to a new system, we should fix the issues with the new system. Serving outdated versions of images causes users a lot of frustration.

The workarounds described above (in particular, comment #18) will hopefully
work for now.

This bug might have multiple issues conflated, but the workaround in comment 18 doesn't work.

Bawolff added a comment.Via ConduitDec 24 2012, 9:59 PM
*** Bug 43364 has been marked as a duplicate of this bug. ***
Betacommand added a comment.Via ConduitDec 26 2012, 3:23 AM

attempted to follow instructions in comment 18 and it did not work. see http://en.wikipedia.org/wiki/File:Compact_Cassette_Logo.svg

RobLa-WMF added a comment.Via ConduitDec 26 2012, 7:49 PM

(In reply to comment #37)

attempted to follow instructions in comment 18 and it did not work. see
http://en.wikipedia.org/wiki/File:Compact_Cassette_Logo.svg

Betacommand, I'm not seeing an obvious problem there. What is it that I should be looking for?

bzimport added a comment.Via ConduitDec 26 2012, 8:05 PM

lexein-w wrote:

I just looked at https://en.wikipedia.org/wiki/File:Compact_Cassette_Logo.svg and there's no main SVG image displayed at top of page. I see this text:
"File:Compact Cassette Logo.svg"
(which is superimposed over a transparent background, visible if moused over)
"Compact_Cassette_Logo.svg ‎(SVG file, nominally 170 × 40 pixels, file size: 12 KB)"
"This image rendered as PNG in other sizes: 200px, 500px, 1000px, 2000px."
The PNG version links are correct.
The thumbnail (page bottom) is displayed.
I'm on the U.S. West Coast.

RobLa-WMF added a comment.Via ConduitDec 26 2012, 9:21 PM

Ok, sorry about that, I see the problem now. I had forgotten to access this anonymously. We don't have many people here who can debug this right now (due to the holidays), but I'll see what I can do.

tstarling added a comment.Via ConduitDec 26 2012, 11:15 PM

(In reply to comment #9)

This isn't really easy to prevent. One can do a few things:
a) Change SquidPurgeClientPool [...]

We don't use SquidPurgeClientPool, we use HTCP.

The bug summary is "North america upload caches aren't responding to squid purges". This is confusing since:

  • Upload uses Varnish, not Squid.
  • The Varnish caches in question are in fact receiving and responding to HTCP CLR messages.
  • There is no reason to think the problem is limited to North America.

I confirmed that HTCP delivery is working using tcpdump on all Varnish servers in the cp1022-1036 range, and I checked one of them with purgeList.php to ensure that the HTCP CLR message was correctly acted on.

Apparently the problem is not that purging doesn't work, but that on specific occasions at some time in the past, it didn't work. Debugging that is a very different kind of problem to debugging a complete and current failure.

Bawolff added a comment.Via ConduitDec 26 2012, 11:55 PM

(In reply to comment #42)

The bug summary is "North america upload caches aren't responding to squid
purges". This is confusing since:

You're right, that was a poor choice for bug title. Sorry about that.

Apparently the problem is not that purging doesn't work, but that on specific
occasions at some time in the past, it didn't work. Debugging that is a very
different kind of problem to debugging a complete and current failure.

I can confirm that previous cases I tested doing ?action=purge on image description page which had no effect (if accessing upload.wikimedia.org via 208.80.154.235) now work correctly.


I had forgotten to access this anonymously.

It shouldn't matter if you're anonymous for images. (And furthermore login cookies aren't sent to upload.wikimedia.org so the servers wouldn't even know if you're an anon)

RobLa-WMF added a comment.Via ConduitDec 27 2012, 12:06 AM

Asher restarted varnishhtcpd on all of Eqiad (North American caching center) machines about 30 minutes ago (simultaneously with Tim investigating this problem). Thus, probably the most urgent of the problems associated with this issue has been solved.

Asher also spotted and solved another problem, where 404'd images would get cached for 30 days. He fixed that problem as well with Gerrit #40762.
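
The contents of that Gerrit change are not shown in this thread. As a generic illustration only, a cache can avoid long-lived 404s by choosing the TTL from the response status; the function name and the 300-second value below are assumptions, not what was deployed.

```python
THIRTY_DAYS = 30 * 24 * 3600  # seconds


def ttl_for(status, default_ttl=THIRTY_DAYS, not_found_ttl=300):
    """Pick a cache lifetime: keep 404 responses only briefly (assumed value)."""
    return not_found_ttl if status == 404 else default_ttl


print(ttl_for(200), ttl_for(404))
```

With a short lifetime for 404s, a thumbnail that was briefly missing stops being reported as missing once the short TTL runs out.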

There's another problem that he spotted that isn't fixed yet, which I've filed as bug 43448. Until that bug is fixed, we may have to occasionally kick varnishhtcpd.

It seems in retrospect like bug 42963 should not have been closed as a duplicate of this one, since the time that bug 42963 was filed could have been when varnishhtcpd stopped responding to HTCP requests. The problem that we're tracking with this bug is any problem related to the fact that we currently don't have a mechanism for purging *all* thumbnails associated with a given original. We may need to close this issue and open a new one to avoid having the same problem again.

bzimport added a comment.Via ConduitDec 31 2012, 11:34 AM

mr.heat wrote:

(In reply to comment #42)

There is no reason to think the problem is limited to North America.

It is not.

Apparently the problem is not that purging doesn't work, but that on
specific occasions at some time in the past, it didn't work. Debugging
that is a very different kind of problem to debugging a complete and
current failure.

I know that and I'm sorry. I found comment #15 very helpful.

(In reply to comment #44)

we currently don't have a mechanism for purging *all* thumbnails associated
with a given original.

As suggested above several times (e.g. in comment #16) I would like to add ?action=purge to a specific thumbnail URL. In my opinion this would be a very helpful workaround to purge such broken thumbnails. From what I understand this is easy to implement. Possibly two lines of code in thumb.php (one being a redirect) and maybe a [qsappend] in an Apache configuration. Could you please take a look and tell us if it's possible to add this feature? Thank you.

Bawolff added a comment.Via ConduitJan 6 2013, 8:51 PM

(In reply to comment #44)

It seems in retrospect like bug 42963 should not have been closed as a
duplicate of this one, since the time that bug 42963 was filed could have been
when varnishhtcpd stopped responding to HTCP requests. The problem that we're
tracking with this bug is any problem related to the fact that we currently
don't have a mechanism for purging *all* thumbnails associated with a given
original. We may need to close this issue and open a new one to avoid having
the same problem again.

Reading over this bug - comment 2 suggests this was always the varnishhtcpd issue imho. While the inability to purge all thumbs is a potential issue, without the other issue I imagine it is rather rare, and even rarer that it would get noticed, although I have no evidence to back that assertion up. Furthermore, appending action=purge to the thumb url and then to the image description page would probably fix it, which users may try on their own. Obviously it should be fixed if possible, but I don't think it is as bad as it has been made out to be.

Bawolff added a comment.Via ConduitJan 14 2013, 4:32 PM

reports at commons of http://upload.wikimedia.org/wikipedia/commons/8/85/Complexe_sonore.png (as well as its thumbs) not purging. Notes should be a darker green colour ( compare with http://upload.wikimedia.org/wikipedia/commons/8/85/Complexe_sonore.png?nocache ). Tested from North America.

McZusatz added a comment.Via ConduitJan 14 2013, 7:52 PM

(In reply to comment #47)

Tested from North America.

works fine in Germany

bzimport added a comment.Via ConduitJan 15 2013, 5:51 PM

johnson487682 wrote:

This seems to be another instance of this problem:
http://en.wikipedia.org/wiki/Wikipedia:Graphics_Lab/Map_workshop#Map_thumbnail_problem
(or maybe bug 31680?)

bzimport added a comment.Via ConduitJan 19 2013, 1:55 PM

mailmilan wrote:

I'm not sure if this is the right place. We have a problem on sr.wp restoring our official logo after the holidays ( https://sr.wikipedia.org/wiki/File:Wiki.png ). I tried everything, but it didn't work. Any ideas?

Bawolff added a comment.Via ConduitJan 19 2013, 2:58 PM

(In reply to comment #50)

I'm not sure is this a right place. We have a problem on sr.wp to restore our
official logo after holidays ( https://sr.wikipedia.org/wiki/File:Wiki.png ). I
tried everything, but it didn't work. Any ideas?

Sounds like this bug. If it's the same cause as last time, somebody with access to the servers just has to restart the program (varnishhtcpd) that makes sure old versions of images go away.

If it's critical for your wiki to make the logo go back to normal now, you could put some CSS in MediaWiki:Common.css. Important: remember to remove it after this bug is fixed, as it could cause problems with cache clearing in the future. Note that I haven't tested this, but it should work:

#p-logo a {background-image: url(//upload.wikimedia.org/wikipedia/sr/b/bc/Wiki.png?1); }

bzimport added a comment.Via ConduitJan 19 2013, 3:46 PM

mailmilan wrote:

Thanks Brian. It's not so critical. We can wait a few days. But I'll try that css customization if the problem remains.

Nemo_bis added a comment.Via ConduitJan 22 2013, 7:28 AM

(In reply to comment #51)

(In reply to comment #50)
> I'm not sure is this a right place. We have a problem on sr.wp to restore our
> official logo after holidays ( https://sr.wikipedia.org/wiki/File:Wiki.png ). I
> tried everything, but it didn't work. Any ideas?

Sounds like this bug. If its the same cause as last time somebody with access
to the servers just has to restart the program (varnishhtcpd) that makes sure
old versions of images go away.

I don't know about restarting, but the problem was split to bug 43448.

Bawolff added a comment.Via ConduitJan 23 2013, 4:49 AM

*** Bug 44269 has been marked as a duplicate of this bug. ***

tstarling added a comment.Via ConduitJan 23 2013, 11:32 AM

I wrote:

(In reply to comment #42)

I confirmed that HTCP delivery is working using tcpdump on all Varnish servers
in the cp1022-1036 range, and I checked one of them with purgeList.php to
ensure that the HTCP CLR message was correctly acted on.

I'm not sure how this is possible, since from my analysis and testing of varnishhtcpd today, it appears to have been completely broken since October 25.

It should work better now, although there may be some packet loss. I changed varnishhtcpd to be single-threaded, but it's threatening to exhaust a single CPU.

bzimport added a comment.Via ConduitJan 23 2013, 11:35 AM

mailmilan wrote:

Problem on sr.wp is now fixed.

bzimport added a comment.Via ConduitJan 23 2013, 2:27 PM

johnson487682 wrote:

The problem with http://en.wikipedia.org/wiki/Wikipedia:Graphics_Lab/Map_workshop#Map_thumbnail_problem still remains, but it's been slowly correcting itself. When I view the thumbnails on the page http://en.wikipedia.org/wiki/File:USA_Wisconsin_GSUSA_council_boundaries.png, they all render correctly now except the 585px thumbnail.

Bawolff added a comment.Via ConduitJan 23 2013, 3:48 PM

(In reply to comment #57)

The problem with
http://en.wikipedia.org/wiki/Wikipedia:Graphics_Lab/Map_workshop#Map_thumbnail_problem
still remains, but it's been slowly correcting itself. When I view the
thumbnails on the page
http://en.wikipedia.org/wiki/File:USA_Wisconsin_GSUSA_council_boundaries.png,
they all render correctly now except the 585px thumbnail.

The 585px thumb should be fixed now. I used the workaround of going to the thumb's URL and adding some junk to the end of it to bypass Varnish (e.g. http://upload.wikimedia.org/wikipedia/en/thumb/5/53/USA_Wisconsin_GSUSA_council_boundaries.png/585px-USA_Wisconsin_GSUSA_council_boundaries.png?bypasscache ), in order to ensure there's a copy on the server so that MediaWiki knows to send a purge for that size. I then purged the file description page: http://en.wikipedia.org/wiki/File:USA_Wisconsin_GSUSA_council_boundaries.png?action=purge .

bzimport added a comment.Via ConduitJan 23 2013, 9:17 PM

rriuslaw wrote:

Bawolff marked 44269 as a duplicate of this one, but I don't see why. This bug is about thumbs not displaying properly. Bug 44269 is about the main image not updating. Thumbs for the file that prompted 44269 actually work fine, both on the file version list and when the image is used with the thumb parameter on the English Wikipedia. Unfortunately, the file is mostly used in infoboxes, so the fact thumbs work properly doesn't help.

bzimport added a comment.Via ConduitJan 24 2013, 7:56 PM

johnson487682 wrote:

Thanks Brian--the workaround works great!

bzimport added a comment.Via ConduitJan 25 2013, 8:55 PM

mr.heat wrote:

This may be a cross post (see bug 31680) but to be honest I have no idea why this was split into multiple bugs. It's the same problem over and over again. Here is a fresh example from the German Wikipedia:

File description page:
http://de.wikipedia.org/wiki/Datei:Radiobuttons.gif

Broken garbage:
http://upload.wikimedia.org/wikipedia/de/d/dc/Radiobuttons.gif
http://upload.wikimedia.org/wikipedia/de/thumb/d/dc/Radiobuttons.gif/120px-Radiobuttons.gif

How it should look:
http://upload.wikimedia.org/wikipedia/de/d/dc/Radiobuttons.gif?dummy

Tracert:
Tracing route to upload-lb.esams.wikimedia.org [91.198.174.234] over a maximum of 30 hops:
[I removed the first few steps]

 8    53 ms    52 ms    51 ms  ge0-1-0-cr0.ixf.de.as6908.net [80.81.192.244]
 9    58 ms    57 ms    56 ms  te3-4-3502-cr0.nik.nl.as6908.net [78.41.154.17]
10    56 ms    54 ms    56 ms  ge-2-2.br1-knams.wikimedia.org [78.41.155.38]
11    56 ms    56 ms    56 ms  ve7.te-8-1.csw1-esams.wikimedia.org [91.198.174.250]
12    56 ms    58 ms    64 ms  upload-lb.esams.wikimedia.org [91.198.174.234]

Bawolff added a comment.Via ConduitJan 25 2013, 9:08 PM

(In reply to comment #61)

This may be a cross post (see bug 31680) but to be honest I have no idea why
this was split into multiple bugs. It's the same problem over and over again.
Here is a fresh example from the German Wikipedia:

File description page:
http://de.wikipedia.org/wiki/Datei:Radiobuttons.gif

Broken garbage:
http://upload.wikimedia.org/wikipedia/de/d/dc/Radiobuttons.gif
http://upload.wikimedia.org/wikipedia/de/thumb/d/dc/Radiobuttons.gif/120px-Radiobuttons.gif

How it should look:
http://upload.wikimedia.org/wikipedia/de/d/dc/Radiobuttons.gif?dummy

Tracert:
Tracing route to upload-lb.esams.wikimedia.org [91.198.174.234] over a maximum of 30 hops:
[I removed the first few steps]

 8    53 ms    52 ms    51 ms  ge0-1-0-cr0.ixf.de.as6908.net [80.81.192.244]
 9    58 ms    57 ms    56 ms  te3-4-3502-cr0.nik.nl.as6908.net [78.41.154.17]
10    56 ms    54 ms    56 ms  ge-2-2.br1-knams.wikimedia.org [78.41.155.38]
11    56 ms    56 ms    56 ms  ve7.te-8-1.csw1-esams.wikimedia.org [91.198.174.250]
12    56 ms    58 ms    64 ms  upload-lb.esams.wikimedia.org [91.198.174.234]

Sounds like a different root cause (since esams still uses Squid instead of Varnish, afaik).

Bawolff added a comment.Via ConduitJan 25 2013, 9:32 PM

If all the new reuploads (uploads of a new version) are failing to be visible when accessed in Europe (via esams), my initial guess would be that the UDP multicast tunnel was forgotten about when moving data centers. (That's pure speculation, though. I've done no investigation and didn't even look at the docs to see if HTCP multicast at WMF works like I think it does.)

Nemo_bis added a comment.Via ConduitJan 26 2013, 10:16 AM

(In reply to comment #63)

If all the new reuploads (upload of a new version) are failing to be visible
when accessing in europe (via esams) my initial guess would be that the udp
multicast tunnel was forgotten about when moving data centers ( that's pure
speculation though. Ive done no investigation and didnt even look at the docs
to see if htcp multicast at wmf works like I think it does).

LeslieCarr was investigating it when you wrote this.

  • 01:21 LeslieCarr: htcp purging across datacenters now "works". dobson is now receiving purge requests on multicast group 239.128.0.112 port 4827 and transmitting them via udpmcast.py (started by rc.local) to hooft in esams
  • 01:14 LeslieCarr: deactivating multicast for 1 minute in order to try and flush the multicast forwarding table

https://wikitech.wikimedia.org/index.php?title=Server_admin_log&diff=55822&oldid=55819&diffonly=yes

The most recent reported reuploads seem to work now?
https://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=88876606#Trouble_with_uploading_new_versions

bzimport added a comment.Via ConduitJan 26 2013, 12:38 PM

mr.heat wrote:

Reuploads from today, January 26, 12:00, are still failing.

http://commons.wikimedia.org/wiki/File:Wappen_Landkreis_Aurich.svg

bzimport added a comment.Via ConduitJan 27 2013, 6:55 PM

mr.heat wrote:

(In reply to comment #62)

Sounds like different root cause (since esams still uses squid instead of
varnish afaik).

I don't know why the title of the bug was changed to something with "Varnish". Originally the bug was about invalidation not working when re-uploading files from *Germany*. If there are multiple problems (which is obviously the case) you should split this into multiple bugs. Here is another example from today:

File description page:
http://commons.wikimedia.org/wiki/File:Laser_mirror_reflection.jpg

Broken garbage including the original size (!):
http://upload.wikimedia.org/wikipedia/commons/8/82/Laser_mirror_reflection.jpg
http://upload.wikimedia.org/wikipedia/commons/thumb/8/82/Laser_mirror_reflection.jpg/170px-Laser_mirror_reflection.jpg

How it should look:
http://upload.wikimedia.org/wikipedia/commons/thumb/8/82/Laser_mirror_reflection.jpg/120px-Laser_mirror_reflection.jpg

Tracert is the same as above. This is so damn annoying. What's the problem with fixing this bug? There must be *thousands* and *thousands* of examples. It's reported by dozens of users in almost every discussion page in every project:

http://commons.wikimedia.org/wiki/Commons:Forum#Falsche_Bildversion
http://de.wikipedia.org/wiki/Wikipedia_Diskussion:Fotowerkstatt
http://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Bildaktualisierung_schl.C3.A4gt_fehl_trotz_.E2.80.9E.3Faction.3Dpurge.E2.80.9C

McZusatz added a comment.Via ConduitJan 27 2013, 7:02 PM

All(!) reuploads failing. Please fix this because it is impossible to update the latest version of any file.

Aklapper added a comment.Via ConduitJan 27 2013, 7:25 PM

TMg: This is neither immediate nor a blocker as it does not block reading articles (or such). See http://www.mediawiki.org/wiki/Bugzilla/Fields#Priority for some information. Reverting.

(In reply to comment #66)

I don't know why the title of the bug was changed to something with
"Varnish".

Likely because it describes the bug better. Varnish is a caching HTTP accelerator (a reverse proxy) that sits in front of the web servers and answers requests from its cache where it can, passing the rest on to the appropriate backends. It is used by Wikimedia. See http://en.wikipedia.org/wiki/Varnish_%28software%29 for more information.

What's the problem with fixing this bug?

See comment 64. It was explained there.

There must be *thousands* and *thousands* of examples.

The problem here is not missing examples, the problem is finding reasons.

I am going to quote some more info from Leslie here in case that comment 64 wasn't sufficient. I've added a few explanatory words for some terms to hopefully make this more understandable:

The problem is about multicast requests to send HTCP purge requests to group 239.128.0.112 port 4827.
The requests arrive at a machine in Tampa ("dobson") where they are put through a multicast to unicast relay to send them to our European AS for cache purges. Until earlier this week, these notices were sourced from the old Tampa datacenter as well, so if this problem existed, it was masked. This week, we switched them to be sourced from the new eqiad datacenter and then discovered that the traffic for group 239.128.0.112 (and ONLY group 239.128.0.112) was not being delivered.
[...]
In addition, if we look at the multicast routing on cr1-sdtpa.wikimedia.org and cr2-pmtpa.wikimedia.org, it does not seem to be seeing the sources of traffic for 239.128.0.112, even though it is fine for all of the other groups.
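
The multicast-to-unicast relay described in the quote can be sketched roughly as follows. This is a minimal illustration in Python, not the actual udpmcast.py script (its contents are not shown in this thread). The group and port are the ones quoted above; the function names and the unicast destination passed to relay() are hypothetical placeholders.

```python
import socket

HTCP_GROUP = "239.128.0.112"  # multicast group named in the thread
HTCP_PORT = 4827              # port named in the thread


def membership_request(group: str, interface: str = "0.0.0.0") -> bytes:
    """Build the 8-byte ip_mreq value expected by IP_ADD_MEMBERSHIP."""
    return socket.inet_aton(group) + socket.inet_aton(interface)


def relay(unicast_host: str, unicast_port: int = HTCP_PORT) -> None:
    """Forward every datagram seen on the multicast group to one unicast peer.

    unicast_host is a placeholder for the remote relay endpoint.
    """
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    rx.bind(("", HTCP_PORT))
    # Joining the group is what makes the kernel (and upstream routers)
    # deliver traffic for 239.128.0.112 to this host.
    rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                  membership_request(HTCP_GROUP))
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        payload, _sender = rx.recvfrom(65535)
        tx.sendto(payload, (unicast_host, unicast_port))
```

If the routers never forward traffic for the group to the relay host, the recvfrom() call simply never returns for those packets, which would match the symptom of purges silently not reaching the remote caches.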

Bawolff added a comment.Via ConduitJan 27 2013, 7:47 PM

The server admin log says the multicast issues were fixed (yesterday). So are they fixed or not? Your quote from Leslie is unclear in that regard. (Furthermore, out of curiosity, where is that quote from? Somewhere public, or a secret ops mailing list?) In any case, the continued reports suggest there are still outstanding issues.

In regards to priority: while it doesn't prevent reading, I would point out that wiki is Hawaiian for quick, not for "thirty days or so until entries fall out of cache". The frustration building in regards to this bug is understandable.

bzimport added a comment.Via ConduitJan 27 2013, 11:07 PM

jonathandodd wrote:

I'm afraid I can't offer any insight into solutions, but having run into this today it was requested on IRC that I post this example, with screenshots (to make it real for y'all? :)

http://i48.tinypic.com/2rnkh89.jpg
http://pruebita.com/commons-thumbnail-looks-fine-for-me-2013-01-27.png

bzimport added a comment.Via ConduitJan 27 2013, 11:10 PM

jonathandodd wrote:

(In reply to comment #70) Sorry, in case it isn't clear: the screenshots show the file simultaneously not working for me yet working for somebody else.

Nemo_bis added a comment.Via ConduitJan 28 2013, 7:51 AM

The current summary mentions only thumbnails and is incorrect: for reuploads, everything is currently broken, originals as well.
I've added the current situation to the summary, leaving also the previous summary as possible explanation.

Nikerabbit added a comment.Via ConduitJan 28 2013, 7:59 AM

I've wasted considerable amount of time by reading an old version of design document for many days.

Nemo_bis added a comment.Via ConduitJan 28 2013, 9:15 AM

This bug no longer has an assignee.
It matches the definition of "immediate" priority; the guideline is to have an assignee, setting Robla until he finds one...

RobLa-WMF added a comment.Via ConduitJan 28 2013, 4:45 PM

Hi everyone. Our apologies for the problems with caching. We're monitoring the situation, and will continue to monitor it. As of right now, things *should* be better, but there's still work that Ops needs to do to know that for sure.

This bug, unfortunately, isn't a useful description of anything anymore, and I can't in good conscience assign this to a developer or an ops engineer. It has become a place for people to vent about caching issues generally, and is more heat than light.

There are at least three different issues that I can see:

  1. Squid/Varnish cache contains images that aren't in Swift, and they don't purge because only images in Swift get purged. This is the problem that Aaron described in comment 8 and comment 9 of this bug. We plan to fix this problem after we fully migrate to Varnish, but it's a complicated fix, and won't happen right away. The workaround for this problem is in comment 18. My response is unchanged from comment 34.
  2. Multicast purge requests don't make it to all of the Varnish & Squid caches. This is most likely the problem that people are actually complaining about the most in the past week. This is generally an Ops issue, and in fact, Leslie Carr is working on the problem, with help from others. The issue is that some of our routing/switch equipment doesn't reliably route the purge requests to all of the places they need to be. They have periodic fixes for the problem, but we don't know yet if they have things configured such that the problem will remain fixed.
  3. Cache purging issues are not monitored sufficiently. This is bug 43449, and is another issue that needs to be addressed by the Ops team.

I'm going to leave this issue open for now, but once the other issues get filed, I plan to close this issue and use the other bug reports to track this.

Aklapper added a comment.Via ConduitJan 28 2013, 5:09 PM

Mid-air collision with RobLa, anyway:

(In reply to comment #69)

The server admin log says multicast issues fixed (yesterday). So are they
fixed or not?

See https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_%28technical%29&oldid=535338671#Latest_update_from_Operations

(In reply to comment #61)

This may be a cross post (see bug 31680) but to be honest I have no idea why
this was split into multiple bugs.

This was already explained in comment 2: The original posting (comment 0) covered three different issues. In general, bug reports should only be about ONE issue, otherwise they become messy and unfixable. Like this report now.

The numbers of the already existing bug reports for the other two issues that you face(d) were provided in comment 2 and 3.

The remaining issue (1) that this report was left to be about became
"?action=purge on the file description page no longer regenerates thumbnails"
Comment 18 mentions a workaround for this specific problem.

Issue (2) about "reuploading isn't properly purging thumbnails" is supposed to be bug 31680, but recent postings mixed this all together in this report again. Please use bug 31680.

Plus comments 67 and 71 then added "all reuploads failing, not only thumbnails". Different issue, unrelated to thumbnails.
This needs a separate report.

Reverting today's summary change which broadened the topic.

Side note speculation: When the scope of a report becomes very blurry I can understand if developers unassign themselves.

Please try to file separate bug reports for problems differing from the issue that a bug report is about, otherwise reports like this one become impossible to fix.

bzimport added a comment.Via ConduitJan 28 2013, 8:37 PM

mr.heat wrote:

(In reply to comment #76)

The original posting (comment 0) covered three different issues.

Two. The 404 issue was split into a separate bug and fixed. Everything else is the same issue. A purge is a purge no matter if it is triggered by an ?action=purge or by reuploading a file. From what I know both do the same. Both fail randomly for over a year now.

Comment 75 explains this very well (thank you very much). His point 1 ("Squid/varnish cache contains images that aren't in Swift") is not a separate issue. It is an effect of point 2 ("Multicast purge requests don't make it to all of the Varnish & Squid caches").

Bug 41130 was created because bug 31680 was assigned to the wrong developers. It became clear that this is not a MediaWiki bug. However bug 31680 was not closed or reassigned because there *is* something the MediaWiki developers should do. They should implement a workaround to help us break out of that deadlock situation. See comment 16 and several others. No one ever replied to that (maybe because it should be moved to bug 31680).

Five days ago bug 31680 was reassigned. Now it is a duplicate.

needs a separate report.

I will not create more reports with the same issue. In the past they were all closed as duplicates.

Nemo_bis added a comment.Via ConduitJan 28 2013, 8:56 PM

(In reply to comment #77)

> needs a separate report.

I will not create more reports with the same issue. In the past they were all
closed as duplicates.

I think this:

(In comment #75)

I'm going to leave this issue open for now, but at this point, but once the
other issues get filed, I plan to close this issue and use the other bug
reports to track this.

means Rob or others at WMF will split this bug into more bugs as they feel needed/best to track and address the specific issues. Probably, given comment 75, it's something like:
A) a bug for comment 34 (migration to Varnish) if one doesn't exist,
B) one to ask for and investigate mass workarounds/fixes as in comment 18,
C) for multicast purge request failures etc., one or more bugs for whatever is deemed viable among options a-d in comment 9 plus its last two paragraphs (Leslie is working on the last one, if I understand correctly).

Indeed this is a bit too confused for random users like us to do. :-)

Bawolff added a comment.Via ConduitJan 28 2013, 9:04 PM

(In reply to comment #77)

(In reply to comment #76)
> The original posting (comment 0) covered three different issues.

Two. The 404 issue was split into a separate bug and fixed. Everything else is
the same issue. A purge is a purge no matter if it is triggered by an
?action=purge or by reuploading a file. From what I know both do the same. Both
fail randomly for over a year now.

Comment 75 explains this very well (thank you very much). His point 1
("Squid/varnish cache contains images that aren't in Swift") is not a separate
issue. It is an effect of point 2 ("Multicast purge requests don't make it to
all of the Varnish & Squid caches").

What? No. Not sending a purge at all is completely different from the purge being lost along the way. Different causes and different symptoms. Of course, without the multicast issue the local file cache would not be so out of sync, and as a result the "Squid/Varnish cache contains images that aren't in Swift" bug would appear very rarely, to the point where it probably wouldn't be noticed.

Bug 41130 was created because bug 31680 was assigned to the wrong developers.
It became clear that this is not a MediaWiki bug. However bug 31680 was not
closed or reassigned because there *is* something the MediaWiki developers
should do. They should implement a workaround to help us break out of that
deadlock situation. See comment 16 and several others. No one ever replied to
that (maybe because it should be moved to bug 31680).

There is a workaround (albeit an unintuitive one). The reason it didn't work (at the time comment 16 was made) was that (in my not entirely substantiated opinion) this bug was misdiagnosed as being about files in Swift being out of sync with files in Varnish, when actually the bug was due to varnishhtcpd being borked.

Five days ago bug 31680 was reassigned. Now it is a duplicate.

> needs a separate report.

I will not create more reports with the same issue. In the past they were all
closed as duplicates.

bzimport added a comment.Via ConduitJan 28 2013, 9:38 PM

rriuslaw wrote:

For some reason the file referred to in bug 44269 is mostly corrected, but the 2000px PNG is showing a version of the file from July 2012. Ten uploads have been made since then: three in September; one each in October, November, and December; and four in January. I'm not sure whether that fits the borked varnishhtcpd theory because I don't know what that means.

File: https://commons.wikimedia.org/wiki/File:41st_Can_Senate.svg

2000px version: http://upload.wikimedia.org/wikipedia/commons/thumb/4/41/41st_Can_Senate.svg/2000px-41st_Can_Senate.svg.png

Bawolff added a comment.Via ConduitJan 28 2013, 9:46 PM

(In reply to comment #80)

For some reason the file referred to in bug 44269 is mostly corrected, but the
2000px PNG is showing a version of the file from July 2012. Ten uploads have
been made since then: three in September; one each in October, November, and
December; and four in January. I'm not sure whether that fits the borked
varnishhtcpd theory because I don't know what that means.

That's the old issue. The current issue is something else (something to do with UDP multicast, probably involving the tunnel between data centers. It's kind of unclear what the precise cause is, afaik).

However, at first glance the thumb seems consistent with neither (because adding ?Randomstring to the end of the URL doesn't seem to show the most recent version, which should happen for all presented theories), which is interesting.

Need to test with something better than my cell phone ;)

File: https://commons.wikimedia.org/wiki/File:41st_Can_Senate.svg

2000px version:
http://upload.wikimedia.org/wikipedia/commons/thumb/4/41/41st_Can_Senate.svg/2000px-41st_Can_Senate.svg.png

MZMcBride added a comment.Via ConduitJan 29 2013, 7:43 AM

(In reply to comment #75)

The workaround for this problem is in comment 18.

The rest of your post looked fine, but this part (referring to Derk-Jan's suggestion to use thumb.php)... God help me, are we really going to encourage this thumb.php hack? This doesn't seem remotely sane.

Nemo_bis added a comment.Via ConduitJan 29 2013, 7:51 AM

(In reply to comment #82)

(In reply to comment #75)
> The workaround for this problem is in comment 18.

The rest of your post looked fine, but this part (referring to Derk-Jan's
suggestion to use thumb.php)... God help me, are we really going to encourage
this thumb.php hack? This doesn't seem remotely sane.

At least it helps debugging: bawolff has just de-duped bug 44269, and in that case thumb.php doesn't work; it also doesn't work in some cases where there are size rounding problems, anyway (or was that fixed in MediaWiki?).

bzimport added a comment.Via ConduitJan 29 2013, 9:39 AM

mr.heat wrote:

(In reply to comment #79)

Not sending a purge at all is completely different from the purge
being lost along the way.

Doesn't reuploading a file send a purge? I have the feeling you are talking about a different issue. Please consider creating a separate report for this.

bzimport added a comment.Via ConduitJan 29 2013, 10:08 AM

mr.heat wrote:

I'm pretty sure this won't be helpful. However the following is new to me and explains a lot (like people calling me an idiot because they see the right image but I don't). Currently when I load these thumbnails:

http://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Wappen_Landkreis_Aurich.svg/140px-Wappen_Landkreis_Aurich.svg.png
http://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Wappen_Landkreis_Aurich.svg/423px-Wappen_Landkreis_Aurich.svg.png

and press F5 multiple times, sometimes it shows the old version and sometimes the new. Obviously this is an effect of the load balancer. I would be happy to give you more information but tracert always outputs the same (see comment 29).

You are right, there are a lot of things to do:

a) Stop all this from happening. If there are multiple causes it's *your* job to split this into multiple reports. We (the users) can't do this. All we see is random garbage.
b) Fix the broken thumbnails.
c) Fix the broken original size images.
d) Give us one or more tools to fix this ourselves. None of the ideas above works: neither adding a random parameter to the thumbnail URL, nor the /thumb.php?... call.

Bawolff added a comment.Via ConduitJan 29 2013, 5:29 PM

Just to clarify, Here's a summary of this bug:

There are 3 main issues mentioned in this bug, corresponding to three stages of an HTCP packet's life cycle (Generation, in transit, destination)

  1. The generation of purge packets. Sometimes MW doesn't realize that it needs to send a purge for certain thumb sizes. This is what (I believe) TMg is referring to by a "deadlock" situation in comment 9 (Not to be confused with the varnishhtcpd deadlock which is entirely separate). This is also the issue that Robla refers to as "Squid/varnish cache contains images that aren't in Swift". This is tracked by bug 44428.

The symptoms of this bug are that some (most) thumbnails are regenerated perfectly fine, but a few are "stuck". If all new thumbnails are failing, then it is not this bug. Doing action=purge on the image description page does not fix the stuck thumbnails. However, the workarounds (either the one from comment 18 or mine from comment 43; both work the same way) will fix stuck thumbnails that are suffering from this issue. Note that these workarounds do not work if the other bugs are present.

  2. (All) the purge packets get lost in transit going to certain caches. It's not 100% clear if this is fixed or not. Something is wrong with the program (or hardware, etc.) that "tunnels" purge packets generated in North America to the cache servers in Amsterdam. In effect, MediaWiki yells to the world that "Image foo has been updated", but only the North America cache servers can hear MediaWiki. The symptoms of this bug are: Only people accessing via Europe see the issue. All thumbs (along with full-size versions of recently updated images) are outdated for people in Europe. Additionally, anonymous users accessing via Europe will see outdated versions of pages (since the HTCP tunnel also tunnels non-image purges). Bug 44391 was specifically about that.
  3. Issues at the destination. This is the fixed issue. Varnish servers (and only Varnish; this does not include the Squid servers. The Amsterdam caching servers are Squids, whereas the North America caching servers are Varnish, at least last I heard) use a program called varnishhtcpd. When MediaWiki sends an "Image foo has been updated" message to the caching servers, varnishhtcpd translates the message from the format MediaWiki uses (HTCP) to the format that Varnish uses (HTTP PURGE). There were some problems with this program that caused it to freeze. If the program froze, the Varnish servers could no longer understand the purge messages MediaWiki was sending. The symptom of this bug was that only people accessing via North America saw the issue, since only North America had the Varnishes (the Squid servers in Europe could understand HTCP purges by themselves). This bug was tracked by bug 43448.
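
To make the "HTTP PURGE" half of that translation concrete, here is a rough Python sketch of what such a request could look like on the wire. It is an illustration only: how varnishhtcpd actually builds its requests is not shown in this thread, the function name build_purge_request is made up, and PURGE is a non-standard method that a cache only honours if it has been configured to.

```python
from urllib.parse import urlsplit


def build_purge_request(url: str) -> bytes:
    """Return the raw bytes of an HTTP/1.1 PURGE request for the given URL."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"PURGE {target} HTTP/1.1",  # method and path of the cached object
        f"Host: {parts.netloc}",     # the cache keys objects by host and path
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    print(build_purge_request(
        "http://upload.wikimedia.org/wikipedia/commons/thumb/8/84/"
        "Example.svg/200px-Example.svg.png").decode("ascii"))
```

Note that the request names one exact URL. That fits the description in item 1: a thumbnail size that MediaWiki does not know about never has a purge generated for it.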

Note: there was a brief period, after (3) was fixed and before (2) cropped up, when people were mostly happy.

Additionally there is bug 43449, to add monitoring so that warning bells (other than a horde of angry users) go off when cache purging fails.

There is also bug 44269, which seems not to have anything to do with the caching servers as far as I can tell.


Current situation: just testing now, purging an image seems to result in the cache being cleared both in Europe and North America. This suggests that problem 2 is indeed fixed (yay Leslie and anyone else involved!), which leaves us with just problem 1. If that is the case, the workarounds should generally work.

To reiterate, the workaround is as follows. If File:Example.svg was not updating for the 200px size, you would do the following:

  1. Go to http://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Example.svg/200px-Example.svg.png?RandomNumberHere1234 (replacing RandomNumberHere1234 with something random)
  2. Go to http://commons.wikimedia.org/wiki/File:Example.svg?action=purge
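
The two URLs used in these steps can be generated with a small helper. This is only a convenience sketch in Python; the function names are made up for illustration, and it builds the URLs without requesting them, so both still need to be visited in order.

```python
import secrets
from urllib.parse import urlsplit, urlunsplit


def with_query(url: str, query: str) -> str:
    """Return url with its query string replaced by query."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


def workaround_urls(thumb_url: str, description_url: str) -> tuple:
    """Build (cache-bypass thumbnail URL, purge URL) for the two steps."""
    # Step 1: a random query string makes the cache treat this as a new object,
    # so the request reaches the backend and the thumbnail gets stored there.
    bypass = with_query(thumb_url, secrets.token_hex(8))
    # Step 2: purging the description page then covers that thumbnail size too.
    purge = with_query(description_url, "action=purge")
    return bypass, purge


if __name__ == "__main__":
    for url in workaround_urls(
        "http://upload.wikimedia.org/wikipedia/commons/thumb/8/84/"
        "Example.svg/200px-Example.svg.png",
        "http://commons.wikimedia.org/wiki/File:Example.svg",
    ):
        print(url)
```
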

Bawolff added a comment.Via ConduitJan 30 2013, 3:25 AM

Current situation - Just testing now. Purging an image seems to result in the
cache being cleared both in europe and North America. This suggests that
problem 2 is indeed fixed (yay Leslie and anyone else involved!), which leaves
us just with problem 1. If that is the case, the work arounds should generally
work.

After testing a bunch more, it seems the workingness is a bit intermittent.

In one test, I did https://commons.wikimedia.org/wiki/File:Moscow_metro_map_ru_sb_future.svg?action=purge . I then looked at the age header. When accessing via caches in north america, the age header was reset as expected (yay!).

However, when accessing via the europe caching servers [1] there was a rather unexpected result. Sometimes a varnish server responded (specifically the response had the header X-Cache: cp1033 hit (1), cp3010 miss (0), cp3009 frontend hit (2) ). When this happened the age header had been recently reset as expected. This goes beyond my knowledge of WMF's network setup, but I'm guessing that sometimes cache requests gets forwarded from esams to eqiad(?) since cp1033 is the cache server that seems to respond from eqiad too (Then again, I could be totally confused here).

When I got a response from a Squid server at esams, the Age header was not reset. (It was 57479 seconds, roughly 16 hours, so nothing was horrendously old; hence HTCP purges were getting there recently, but they aren't at the precise moment of me writing this.)

At the same time, tests I did of purging articles resulted in the cache being cleared both for the Squids at esams and for the Varnish servers in eqiad, so it seems like HTCP purges are being delivered properly.

The conclusion I draw from this is:
*I really have no idea :s Wild guesses include: only the upload Squid servers are for some reason not getting the HTCP multicast purges, and only sometimes? The Squid servers are overloaded? (However, the timing seems too coincidental for that, and I would expect Varnish to get overloaded first, as it has the extra overhead of converting HTCP -> HTTP purge requests.)

It would be nice if ops (or other powers that be) could comment on whether they think multicast HTCP purges are currently working. In various places there have been comments of "we think this is fixed now", but no one has explicitly said any of the following:
*"The issue is 100% fixed and we're not worrying about it any longer"
*"We managed to get things sort of working, but there's still some issues, and we're looking into them"
*"Things are horribly horribly broken, and we're doing the best we can to sort things out"
*The issue has some other status.

It would be really nice to have a comment of that kind about this issue.


[1] To simulate accessing from europe, I used commands of the form:

wget -U bawolff -S --header 'host: upload.wikimedia.org' --no-check-certificate 'https://upload-lb.esams.wikimedia.org/wikipedia/commons/3/3d/Moscow_metro_map_ru_sb_future.svg'
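
For repeated checks, roughly the same request can be made from Python's standard library. This is a sketch, not a tested replacement for the wget command: the function names are made up, and disabling certificate verification mirrors --no-check-certificate, so it is only suitable for this kind of debugging.

```python
import ssl
import urllib.request


def build_esams_request(path: str, user_agent: str = "bawolff") -> urllib.request.Request:
    """Request the given path from the esams load balancer with the public Host header."""
    url = "https://upload-lb.esams.wikimedia.org" + path
    return urllib.request.Request(
        url,
        headers={"Host": "upload.wikimedia.org", "User-Agent": user_agent},
    )


def fetch_headers(req: urllib.request.Request) -> dict:
    """Send the request and return the response headers (like wget -S)."""
    # The certificate will not match the load balancer hostname, so skip checks.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=context, timeout=30) as response:
        return dict(response.headers.items())


req = build_esams_request("/wikipedia/commons/3/3d/Moscow_metro_map_ru_sb_future.svg")
print(req.full_url, req.get_header("Host"))
```

Calling fetch_headers(req) and reading the Age and X-Cache entries gives the same information as the -S output.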

bzimport added a comment.Via ConduitJan 30 2013, 12:43 PM

mr.heat wrote:

(In reply to comment #86)

the work around is as follows. If File:Example.svg was not updating for
the 200px size, you would do the following:

  1. Go to http://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Example.svg/200px-Example.svg.png?RandomNumberHere1234 (replacing RandomNumberHere1234 with something random)
  2. Go to http://commons.wikimedia.org/wiki/File:Example.svg?action=purge

Tried again. Does not work. The files

http://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Wappen_Landkreis_Aurich.svg/140px-Wappen_Landkreis_Aurich.svg.png
http://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Wappen_Landkreis_Aurich.svg/423px-Wappen_Landkreis_Aurich.svg.png

still show the old and the new version randomly when pressing F5.

Bawolff added a comment.Via ConduitJan 30 2013, 4:23 PM

(In reply to comment #88)

(In reply to comment #86)
> the work around is as follows. If File:Example.svg was not updating for
> the 200px size, you would do the following:
> 1. Go to http://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Example.svg/200px-Example.svg.png?RandomNumberHere1234
> (replacing RandomNumberHere1234 with something random)
> 2. Go to http://commons.wikimedia.org/wiki/File:Example.svg?action=purge

Tried again. Does not work. The files

http://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Wappen_Landkreis_Aurich.svg/140px-Wappen_Landkreis_Aurich.svg.png
http://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Wappen_Landkreis_Aurich.svg/423px-Wappen_Landkreis_Aurich.svg.png

still show the old and the new version randomly when pressing F5.

Confirmed, I'm seeing the same thing. I think this issue was only fixed for a couple of hours and then broke again.

I'm splitting comment 87 - 88 off into bug 44508.

Aklapper added a comment.Via ConduitJan 31 2013, 11:32 AM

This report mixes a number of issues which are now handled in separate reports, with one problem per report (see comment 86 and comment 89 for links).

If cache purging issues come up that are not handled by one of the existing reports, please file a new report. General recommendations are available at https://www.mediawiki.org/wiki/How_to_report_a_bug

Because several issues are reported here, it has become rather impossible to define at which state (after some actions or "partial fixes") this report could be considered FIXED. Hence I am closing this report as INVALID.
This only refers to this report; it does not mean that the described problems are INVALID. They are just handled in separate tickets now.

I'd like to thank everybody for the helpful input which led to identifying and separating several problems in Wikimedia's infrastructure, some of them having received fixes already.

Aklapper added a comment.Via ConduitJan 31 2013, 11:33 AM

(In reply to comment #89)

I'm splitting comment 87 - 88 off into bug 44508.

Thanks a lot for doing that!
Bug 44508 has been handled and three testers state that it is fixed now.

Gilles raised the priority of this task from "Normal" to "Unbreak Now!".Via WebDec 4 2014, 10:23 AM
Gilles moved this task to Closed on the Multimedia workboard.
Gilles added a project: Multimedia.
Gilles lowered the priority of this task from "Unbreak Now!" to "Normal".Via ConduitDec 4 2014, 11:23 AM
