Page MenuHomePhabricator

Thumbor incorrectly normalizes .jpe and .jpeg into .jpg for Swift thumbnail storage
Closed, ResolvedPublic

Description

Please see https://commons.wikimedia.org/wiki/Commons:Village_pump#File_not_updated for a discussion of the problem. We can't get rid of the image with the cloud on page https://commons.wikimedia.org/wiki/File:Ariano_Irpino_ZI.jpeg .

Revisions and Commits

rTHMBREXT Thumbor Plugins
Restricted Differential Revision

Event Timeline

Thanks for reporting this.
https://commons.wikimedia.org/wiki/File:Ariano_Irpino_ZI.jpeg shows an image with a cloud and in the "File History" section the "current" version is an image with a cloud.
It would be helpful if this task followed https://www.mediawiki.org/wiki/How_to_report_a_bug and had specific sections for 1) steps to reproduce, 2) expected outcome, 3) actual outcome.

Aklapper changed the task status from Open to Stalled.Mar 29 2018, 5:29 PM

If this is about reverting to the non-cloud version and the preview on top of the File: page still shows the cloud version, you may want to try https://en.wikipedia.org/wiki/Wikipedia:Purge

zhuyifei1999 changed the task status from Stalled to Open.EditedMar 29 2018, 7:11 PM

Look like it's thumbnails did not get purged. 'Original file' https://upload.wikimedia.org/wikipedia/commons/c/ca/Ariano_Irpino_ZI.jpeg is correct, but old thumbnails like https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/320px-Ariano_Irpino_ZI.jpeg are not. Thumbor do get the correct version without the cloud if the size of the thumbnails is replaced to a random value.

you may want to try https://en.wikipedia.org/wiki/Wikipedia:Purge

AFAIK, client-side purging on upload.wm.o has been intentionally disabled on the varnish side.

zhuyifei1999 renamed this task from Caching problem with file description page on Commons to Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads.Mar 29 2018, 7:13 PM

AFAIK, client-side purging on upload.wm.o has been intentionally disabled on the varnish side.

It's not disabled, it's that Varnish doesn't have any client-facing PURGE functionality at all. Client-operated purging goes via MediaWiki, which later tells Varnish what to do. So to purge an image from the upload.wikimedia.org caches, you purge via MediaWiki on the origin wiki of the image (commons.wikimedia.org in this case).

So to purge an image from the upload.wikimedia.org caches, you purge via MediaWiki on the origin wiki of the image (commons.wikimedia.org in this case).

From my experience, I don't think action=purge goes issues a purge on the image thumbnails, but only a purge to the html file description. We have been issuing purges via reuploads (see the file upload history), but they did not work.

It definitely does purge Varnish for the original file on upload.wikimedia.org. An in general, when a replacement file is updated, MW does track the thumbnails in a database, and generally does purge them when new copies of the original are updated. Whether this is properly hooked up and tracked on the MW side for manual purges, I couldn't say.

To be sure of what I'm saying, and perhaps provide some trace data that may help pinpoint whatever the actual problem is, I traced all raw PURGE traffic at one of the upload.wikimedia.org Varnish caches, while executing my own request to action=purge for https://commons.wikimedia.org/wiki/File:Ariano_Irpino_ZI.jpeg . This is the set of URLs that were immediately purged in near-realtime on upload.wikimedia.org as a result of my action=purge (seems to clearly include both thumbnails and old revs of the file):

/wikipedia/commons/archive/c/ca/20171217123029%21Ariano_Irpino_ZI.jpeg
/wikipedia/commons/archive/c/ca/20180303205055%21Ariano_Irpino_ZI.jpeg
/wikipedia/commons/archive/c/ca/20180308111850%21Ariano_Irpino_ZI.jpeg
/wikipedia/commons/archive/c/ca/20180308113335%21Ariano_Irpino_ZI.jpeg
/wikipedia/commons/archive/c/ca/20180327050717%21Ariano_Irpino_ZI.jpeg
/wikipedia/commons/archive/c/ca/20180327064617%21Ariano_Irpino_ZI.jpeg
/wikipedia/commons/archive/c/ca/20180327065649%21Ariano_Irpino_ZI.jpeg
/wikipedia/commons/archive/c/ca/20180329135201%21Ariano_Irpino_ZI.jpeg
/wikipedia/commons/archive/c/ca/20180329135918%21Ariano_Irpino_ZI.jpeg
/wikipedia/commons/c/ca/Ariano_Irpino_ZI.jpeg
/wikipedia/commons/thumb/archive/c/ca/20171217123029%21Ariano_Irpino_ZI.jpeg/120px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20171217123029%21Ariano_Irpino_ZI.jpeg/180px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180303205055%21Ariano_Irpino_ZI.jpeg/120px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180303205055%21Ariano_Irpino_ZI.jpeg/180px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180308111850%21Ariano_Irpino_ZI.jpeg/120px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180308111850%21Ariano_Irpino_ZI.jpeg/180px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180308113335%21Ariano_Irpino_ZI.jpeg/120px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180308113335%21Ariano_Irpino_ZI.jpeg/180px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180327050717%21Ariano_Irpino_ZI.jpeg/120px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180327050717%21Ariano_Irpino_ZI.jpeg/180px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180327064617%21Ariano_Irpino_ZI.jpeg/120px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180327064617%21Ariano_Irpino_ZI.jpeg/180px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180327065649%21Ariano_Irpino_ZI.jpeg/120px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180327065649%21Ariano_Irpino_ZI.jpeg/180px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180329135201%21Ariano_Irpino_ZI.jpeg/120px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180329135201%21Ariano_Irpino_ZI.jpeg/180px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/archive/c/ca/20180329135918%21Ariano_Irpino_ZI.jpeg/120px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/100px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/1024px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/120px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/150px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/180px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/200px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/220px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/240px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/250px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/320px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/330px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/341px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/440px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/450px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/454px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/457px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/500px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/640px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/800px-Ariano_Irpino_ZI.jpg
/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/80px-Ariano_Irpino_ZI.jpg

Interesting. I'm unaware of that. Thanks.

I issued two more purges, but https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/800px-Ariano_Irpino_ZI.jpeg still clearly shows the cloud. A random pixel value like 801px https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/801px-Ariano_Irpino_ZI.jpeg shows the thumbnail without clouds, so it's very unlikely that it's an issue with thumbor making the clouds.

Hmmm, now I'm noting what is probably the critical discrepancy here....

When I visit https://commons.wikimedia.org/wiki/File:Ariano_Irpino_ZI.jpeg , the link there for Other resolutions: 320 × 205 pixels, which is also what @zhuyifei1999 quoted as a problematic 320px thumbnail, is :

/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/320px-Ariano_Irpino_ZI.jpeg

Whereas in the PURGE logs, the 320px thumb is being purge as:

/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/320px-Ariano_Irpino_ZI.jpg

Note they differ in jpg vs jpeg. Apparently both URLs exist and work from a client perspective. This duplication (with apparently-different senses of which is canonical for links on commons vs purges?) is a problem in general. It's like the encoding normalization issues, but even worse/deeper.

Does someone on the MW / Commons / Thumbor side have a good handle on whatever the current situation is for apparently jpg-vs-jpeg filename aliasing?

It looks like there may be two issue going on here.

  1. MediaWiki's canonical url of thumbnails can use .jpeg if the original does so too. Despite this, something somewhere (possibly specific to Wikimedia's setup), is causing the PURGE sent by MediaWiki to use .jpg. This is a bug.
  1. The thumbnail stack seems to tolerate any suffix on the thumbnail, without validation. Indeed. The non-standard .jpg suffix isn' the only one that "works". Extension .xyz also works, but also .svg, or .tiff.exe.

This second issue is likely what's been hiding the problem so far, because it causes the invalid or non-canonical url to get a successful response, instead of the HTTP 4xx error it should've been given.

I can reproduce this issue on plain MediaWiki and Apache when adding 404 thumb handler, and disabling $wgGenerateThumbnailOnParse.

It seems the pattern router in MediaWiki only looks at the original file name and the parameters at the start of the last segment. Anything dash-separated segments that are unrecognised are silently ignored. Any content after the last dash, including the file name and extension, are ignored.

On my localhost with the default rewrite rule documented on mediawiki.org, even the following url worked:

/mediawiki/images/docker/default/thumb/a/a9/Example.jpg/10px-Example.jpg
/mediawiki/images/docker/default/thumb/a/a9/Example.jpg/10px-Any_thing
/mediawiki/images/docker/default/thumb/a/a9/Example.jpg/10px- # Nothing

With anything after 16px- being both optional and ignored. In wmf production, the rewrite rules and routers end up a tad bit more strict. The only additional requirement in Wikimedia's configuration is that the portion after the last dash must be non-empty and have an extension of some sort. In the case of our example file that means 320px-a.b works, as well as 320px-Any.thing.exe.

Never mind about the part where I said I can reproduce it locally on plain MediaWiki. It is true that taking the rewrite rules for thumb.php (permalink) result in all the madness I described above.

However, I have no idea why that page contains rewrite rules. The file that is meant to be invoked for MediaWiki's transformVia404 feature (aka "404 thumb handler") is thumb_handler.php. That page is documented at https://www.mediawiki.org/wiki/Manual:Thumb_handler.php and contains very similar rewrite rules, but differ in two important aspects:

- RewriteRule ^/?w/images/thumb/[0-9a-f]/[0-9a-f][0-9a-f]/([^/]+)/([0-9]+)px-.*$ /w/thumb.php?f=$1&width=$2 [L,QSA,B]
+ RewriteRule ^/?w/images/thumb/[0-9a-f]/[0-9a-f][0-9a-f]/[^/]+/[^/]+$ /w/thumb_handler.php [L,QSA]
  1. They forward to thumb_handler.php, not thumb.php
  2. They don't set query parameters, instead they let MediaWiki do the routing.

Due to the former (thumb.php) being explicitly called with just two query parameters extracted from the url, it naturally allows everything else to be bogus because MediaWiki never sees it in the first place. I've removed this from the thumb.php documentation page because that file should never be used as the target of the 404 handler. That's what thumb_handler.php is for.

thumb_handler.php is really just a wrapper around thumb.php, but the important difference is that it reads the whole url, and validates it against the form it expects. And, it throws an error for anything that's invalid or unexpected. So we're back "issue 2" being specific to Wikimedia's configuration and not in MediaWiki core.

The first part of the thumb URL is insufficient information, because you're forgetting that there's no 1-to-1 mapping between original format and output format. We support generating JPGs and PNGs for PDF originals, for example. What's after px- isn't optional nor ignored, the very end of the thumbnail url is precisely what tells us what output format is needed. This is the regex worth looking at, from Thumbor's images.py. It's much clearer than MediaWiki where the logic is spread out in PHP:

def regex(cls):
    return (
        r'/'
        r'(?P<project>[^-/]+)/'
        r'(?P<language>[^/]+)/'
        r'thumb/'
        r'(?:(?P<specialpath>temp|archive)/)?'
        r'(?P<shard1>[0-9a-zA-Z]+)/'
        r'(?P<shard2>[0-9a-zA-Z]+)/'
        r'(?P<filename>[^/]+)\.'
        r'(?P<extension>[^/]+)/'
        r'(?:(?P<qlow>qlow-))?'
        r'(?:(?P<lossy>lossy-))?'
        r'(?:(?P<lossless>lossless-))?'
        r'(?:page(?P<page>\d+)-)?'
        r'(?:lang(?P<lang>[0-9a-zA-Z-]+)-)?'
        r'(?P<width>\d+)px-'
        r'(?:(?:seek=|seek%3D)(?P<seek>\d+)-)?'
        r'(?P<end>[^/]+)'
        r'\.(?P<format>[0-9a-zA-Z]+)'
    )

Whatever happens in thumb.php/thumb-handler.php is irrelevant for public wikis, as it's entirely bypassed.

MediaWiki is the source of the issue because it generates inconsistent display URLs and PURGE URLs for the same thumbnails. That's the real problem here.

All of this would be a lot more foolproof purging by XKey. Thumbor includes an XKey in every thumbnail in the form of the full original "path" with namespace. Eg.

xkey: File:Ariano_Irpino_ZI.jpeg

Thumbor will attempt to convert to whatever format is at the end, or default to JPG output. Sure, we could block unknown extensions thrown at it, but that requires maintaining a configuration whitelist that might go out of sync. Not only that, blacklisting ".jpeg" would just result in broken thumbnails if MediaWiki keeps pointing to such thumbnail URLs, as it seems to do now. Changing that is looking at the wrong end of the problem, attempting to fix the symptoms and not the cause. You could generate just as much cache splitting inserting junk between px- and the extension part (something that's likely also possible with plain MediaWiki). What matters at the end of the day is that MediaWiki isn't a source of cache splitting malformed thumbnail URLs.

I think XKeys are the better fix here, because they would catch anything Thumbor has generated involving that file, including junk URLs that might not have originated from MediaWiki. It's a more foolproof way of purging without worrying about all clients using an identical format for thumbnail URLs. Uniformity is desirable for caching but hard to maintain, and not as critical as making sure that purging something purges all variants of it.

We would still need to verify that the part of purging that deletes Swift objects does delete all the thumbnails. Which is unclear at this point.

I think I've got it. MediaWiki just goes by what it finds in Swift, as is visible in this log call: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.03.30/mediawiki?id=AWJ12Ek5hjEUnI8oaSmq&_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-1h,mode:quick,to:now)) that includes all the junk extensions we've triggered manually.

But I suspect that Thumbor does some normalization of jp* extension that happens too early:

if kw['format'].lower() in ('jpe', 'jpeg'):
    kw['format'] = 'jpg'

Which most likely results in the thumbnail being fetched and stored in Swift under .jpg no matter what. MediaWiki pulls the list from Swift and feeds the purge list with the .jpg version.

That normalization in Thumbor is for internal logic and IM's sake, Swift storage being affected is an undesirable side-effect.

Gilles renamed this task from Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads to Thumbor incorrectly normalizes .jpe and .jpeg into .jpg for Swift thumbnail storage.Mar 30 2018, 7:51 AM
Gilles claimed this task.
Gilles triaged this task as Medium priority.
Gilles added a revision: Restricted Differential Revision.Mar 30 2018, 8:07 AM

You could generate just as much cache splitting inserting junk between px- and the extension part (something that's likely also possible with plain MediaWiki)

For what it's worth, when testing with thumb_handler.php, I found handling to be extremely strict. (Much more than I thought!) It rejects bogus prefix options, unmatching filename, and invalid file extensions. They all get an early 4xx response.

I didn't look at the implementation, but from the outside, it behaves the same as if it were to parses the url, looks at what it understood, construct the canonical url for that, and reject if it doesn't match. I didn't try very long, but in quick testing I could not find two ways to get the same (successful) response. (The only exception being the query string.)

thumb_handler.php is for private wikis, though, afaik, public wikis hit thumb.php

thumb.php is for URLs which contain the filename and size as query parameters (either hardcoded into the HTML or via an URL rewrite). thumb_handler.php (which is really just a non-standard entry point into the thumb.php code) is for loading images via the 404 handler, with an URL that matches the image directory structure (so calling the same URL the second time would result in the web server finding the thumbnail and not calling PHP). They both work for private wikis (and there is also img_auth.php for serving original files for private wikis).

Change 419172 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.16

https://gerrit.wikimedia.org/r/419172

Change 419172 merged by Filippo Giunchedi:
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 1.16

https://gerrit.wikimedia.org/r/419172

The issue should fix itself (and it has, for the file mentioned in the task description) as thumbnails of .jpe and .jpeg files gradually go out of Varnish. Swift's rewrite.py will look for them in their canonical location, not find them, have them rendered by Thumbor who will know save them in the right location. From that point on, purges on that file will catch the broken outdated file locations as well as the correct ones.