Page MenuHomePhabricator

Thumbor should reject thumbnail requests that are the same size as the original or bigger
Closed, DeclinedPublic

Description

Mediawiki currently 400s where Thumbor serves the result:

Nov 15 11:20:37 ms-fe1001 proxy-server: HTTP status code mismatch. Mediawiki: 400 Thumbor: 200 URL: http://upload.wikimedia.org/wikipedia/en/thumb/c/cc/Windows_Media_Center_logo.png/64px-Windows_Media_Center_logo.png

Upscaling and same-size requests should 400 in Thumbor too.

The selected solution is store an extra X-Content-Dimensions header on originals in Swift with dimension information about the file. The header is to be populated on upload, and Thumbor will check it as early as possible to reject invalid requests.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 353078 merged by Filippo Giunchedi:
[operations/puppet@production] Whitelist X-Content-Dimensions in swift

https://gerrit.wikimedia.org/r/353078

Change 353077 merged by jenkins-bot:
[mediawiki/vagrant@master] Whitelist x-content-dimensions in Swift

https://gerrit.wikimedia.org/r/353077

Change 351788 merged by jenkins-bot:
[mediawiki/extensions/PdfHandler@master] Store original media dimensions as additional header

https://gerrit.wikimedia.org/r/351788

Change 351789 merged by jenkins-bot:
[mediawiki/extensions/PagedTiffHandler@master] Store original media dimensions as additional header

https://gerrit.wikimedia.org/r/351789

Change 353112 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/TimedMediaHandler@wmf/1.30.0-wmf.1] Store original media dimensions as additional header

https://gerrit.wikimedia.org/r/353112

Change 353113 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/PdfHandler@wmf/1.30.0-wmf.1] Store original media dimensions as additional header

https://gerrit.wikimedia.org/r/353113

Change 353114 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/PagedTiffHandler@wmf/1.30.0-wmf.1] Store original media dimensions as additional header

https://gerrit.wikimedia.org/r/353114

Change 351787 merged by jenkins-bot:
[mediawiki/extensions/TimedMediaHandler@master] Store original media dimensions as additional header

https://gerrit.wikimedia.org/r/351787

Change 353126 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@wmf/1.30.0-wmf.1] Update git submodules

https://gerrit.wikimedia.org/r/353126

Change 353114 merged by jenkins-bot:
[mediawiki/extensions/PagedTiffHandler@wmf/1.30.0-wmf.1] Store original media dimensions as additional header

https://gerrit.wikimedia.org/r/353114

Change 353112 merged by jenkins-bot:
[mediawiki/extensions/TimedMediaHandler@wmf/1.30.0-wmf.1] Store original media dimensions as additional header

https://gerrit.wikimedia.org/r/353112

Change 353126 abandoned by Gilles:
Update git submodules

Reason:
This is done automatically now...

https://gerrit.wikimedia.org/r/353126

Change 353113 merged by jenkins-bot:
[mediawiki/extensions/PdfHandler@wmf/1.30.0-wmf.1] Store original media dimensions as additional header

https://gerrit.wikimedia.org/r/353113

Mentioned in SAL (#wikimedia-operations) [2017-05-10T19:26:31Z] <dereckson@tin> Synchronized php-1.30.0-wmf.1/extensions/PagedTiffHandler/PagedTiffHandler_body.php: Store original media dimensions as additional header (T150741) (duration: 00m 42s)

Mentioned in SAL (#wikimedia-operations) [2017-05-10T19:28:36Z] <dereckson@tin> Synchronized php-1.30.0-wmf.1/extensions/PdfHandler/PdfHandler_body.php: Store original media dimensions as additional header (T150741) (duration: 00m 42s)

Mentioned in SAL (#wikimedia-operations) [2017-05-10T19:29:26Z] <dereckson@tin> Synchronized php-1.30.0-wmf.1/extensions/TimedMediaHandler/: Store original media dimensions as additional header (T150741) (duration: 00m 43s)

Deployed on testwiki, works for most formats, except:

  • djvu
  • webm
  • ogv

I find it a bit suspicious that both video ones don't work. They get the X-Content-Duration header correctly, though. Will investigate further.

Change 353268 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Add X-Content-Dimensions support to DjVu

https://gerrit.wikimedia.org/r/353268

Change 353268 merged by jenkins-bot:
[mediawiki/core@master] Add X-Content-Dimensions support to DjVu

https://gerrit.wikimedia.org/r/353268

Change 353504 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@wmf/1.30.0-wmf.1] Add X-Content-Dimensions support to DjVu

https://gerrit.wikimedia.org/r/353504

Change 353504 merged by jenkins-bot:
[mediawiki/core@wmf/1.30.0-wmf.1] Add X-Content-Dimensions support to DjVu

https://gerrit.wikimedia.org/r/353504

Mentioned in SAL (#wikimedia-operations) [2017-05-15T13:47:10Z] <addshore@tin> Synchronized php-1.30.0-wmf.1/extensions/TimedMediaHandler/handlers: SWAT: [[gerrit:353505|Fix X-Content-Dimensions support]] T150741 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-05-15T13:48:41Z] <addshore@tin> Synchronized php-1.30.0-wmf.1/includes/media/DjVu.php: SWAT: [[gerrit:353504|Add X-Content-Dimensions support to DjVu]] T150741 (duration: 00m 39s)

Mentioned in SAL (#wikimedia-operations) [2017-05-15T20:48:16Z] <gilles> run refreshImageMetadata --force for group1 + group2 wikis except commons on terbium T150741

Change 353919 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Improve output of refreshImageMetadata and refreshFileHeaders

https://gerrit.wikimedia.org/r/353919

Gilles updated the task description. (Show Details)May 15 2017, 9:23 PM

I've just realized that X-Content-Dimensions doesn't apply EXIF rotation, which is necessary as the Swift headers alone won't tell if the current document/page is soft-rotated...

I.e. this file: https://commons.wikimedia.org/wiki/File:EXIF_rotation_90.jpg whose file page quotes the rotated size correctly, gets X-Content-Dimensions => 40x60:1

I will have to verify to what extent each format supports soft rotation. I believe PDF might?

Change 353990 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Apply EXIF rotation to X-Content-Dimensions

https://gerrit.wikimedia.org/r/353990

I didn't write similar integration tests for DjVu, because those depend on command line tools that aren't present by default. However I manually crafted a djvu document with a rotated page and the rotation is applied by djvutoxml, which means that Mediawiki doesn't need to do anything special.

Change 353919 merged by jenkins-bot:
[mediawiki/core@master] Improve output of refreshImageMetadata and refreshFileHeaders

https://gerrit.wikimedia.org/r/353919

TIFF support for orientation on Linux seems to be mostly broken, including on Mediawiki. Any orientation that should swap dimensions (eg. 90 CW) is interpreted as a different orientation value that doesn't swap dimensions. In this sea of brokenness, this means that for X-Content-Dimensions we should ignore metadata orientation for TIFFs (single or multipage). I.e. we don't need to anything more than what we're doing currently.

I've checked and PdfHandler already handles rotation like a champ, all the way to X-Content-Dimensions. I think the change I've made in core for JPGs is all we need to do, as the remaining formats I haven't mentioned here don't have metadata rotation features.

Change 353990 merged by jenkins-bot:
[mediawiki/core@master] Apply EXIF rotation to X-Content-Dimensions

https://gerrit.wikimedia.org/r/353990

Since the current migration scripts are very slow on Terbium when applied to all file types, we probably have to start looking at alternatives:

  • Special case single-page file types to not rely on getMetadata() but instead on getWidth()/getHeight() (plus rotation information) in getContentHeaders(). This means that we can skip refreshImageMetadata for those file types, but we don't avoid refreshHeaders. refreshHeaders on its own might still be too slow, though.
  • Lazy-populate the dimension information from thumbor. When the header is missing, we look at the dimension information and populate X-Content-Dimensions from there. This requires the Thumbor implementation of the code generating X-Content-Dimensions to match the Mediawiki one exactly.
  • Revisit the premise of the error itself and generate an original-sized thumbnails when the width requested is larger than the original.
  • Revisit the premise of the error itself and upscale to honor the requested width. Possibly with a maximum width possible in that case to avoid abuse, or a throttling mechanism for that case specifically, where you can request more than X upscaled thumbnails per minute for a given original. Throttling has the advantage of making the thumbnail "API" predictable and consistent, but not storing large thumbnails in swift becomes a requirement.

I've just realized that the existing migration scripts probably don't care about oldimage... which means that those wouldn't get the X-Content-Dimensions header. The cost of the migration thus gets even worse, since thumbnails can be generated for those old revisions.

We could always go the lazy-populating route, where X-Content-Dimensions is set by Mediawiki for new files and by Thumbor for old files, but refreshHeaders and its underlying functions should leave X-Content-Dimensions alone in Swift if Thumbor set it, in cases where Mediawiki is still dealing with stale metadata.

Maybe we can avoid that problem if I can work around the metadata issue completely for getContentHeaders, and Mediawiki can generate X-Content-Headers with existing data, not needing to add anything to the metadata field. However I tried that at some point and ran into complications because getContentHeaders only gets the metadata, but that might be worth another shot. Essentially we'd get rid of any change to the metadata field and the need to run refreshImageMetadata, only needing to run refreshHeaders. In which case it's a lot easier to do lazy updating of the Swift headers from Mediawiki and/or Thumbor.

Of course all of this can be avoided if we decide to get rid of the error and do fallback/upscaling instead...

Change 354724 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Use file width/height instead of metadata for getContentHeaders

https://gerrit.wikimedia.org/r/354724

Assuming the above change works and we only need to run refreshFileHeaders for the migration, let's estimate how long it would take for commons if we simply ran the migration script on Terbium.

Running refreshFileHeaders for testwiki, we're processing 4908 + 1212 = 6120 files. This took 71 minutes.

Commons has 39432392 + 3518817 = 42951209 files. Which means at this rate it should take about a year to migrate Commons by running refreshFileHeaders on Terbium...

At first glance it seems like refreshFileHeaders is being very dumb and talking to Swift over a single connection, waiting for Swift to acknowledge a change before modifying the next file, etc. I'll try to make it parallelise things more to see how much faster we can make it.

Change 354950 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/TimedMediaHandler@master] Make getContentHeaders rely on fallback width/height

https://gerrit.wikimedia.org/r/354950

Change 354951 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/PdfHandler@master] Update getContentHeaders signature

https://gerrit.wikimedia.org/r/354951

Change 354956 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/PagedTiffHandler@master] Update getContentHeaders signature

https://gerrit.wikimedia.org/r/354956

Change 355188 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Batch/pipeline backend operations in refreshFileHeaders

https://gerrit.wikimedia.org/r/355188

Change 355188 merged by jenkins-bot:
[mediawiki/core@master] Batch/pipeline backend operations in refreshFileHeaders

https://gerrit.wikimedia.org/r/355188

Change 354951 merged by jenkins-bot:
[mediawiki/extensions/PdfHandler@master] Update getContentHeaders signature

https://gerrit.wikimedia.org/r/354951

Change 354724 merged by jenkins-bot:
[mediawiki/core@master] Use file width/height instead of metadata for getContentHeaders

https://gerrit.wikimedia.org/r/354724

Change 354956 merged by jenkins-bot:
[mediawiki/extensions/PagedTiffHandler@master] Update getContentHeaders signature

https://gerrit.wikimedia.org/r/354956

Change 354950 merged by jenkins-bot:
[mediawiki/extensions/TimedMediaHandler@master] Make getContentHeaders rely on fallback width/height

https://gerrit.wikimedia.org/r/354950

Change 355388 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/PdfHandler@wmf/1.30.0-wmf.2] Update getContentHeaders signature

https://gerrit.wikimedia.org/r/355388

Change 355389 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@wmf/1.30.0-wmf.2] Use file width/height instead of metadata for getContentHeaders

https://gerrit.wikimedia.org/r/355389

Change 355390 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@wmf/1.30.0-wmf.2] Batch/pipeline backend operations in refreshFileHeaders

https://gerrit.wikimedia.org/r/355390

Change 355405 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/PagedTiffHandler@wmf/1.30.0-wmf.2] Update getContentHeaders signature

https://gerrit.wikimedia.org/r/355405

Change 355406 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/TimedMediaHandler@wmf/1.30.0-wmf.2] Make getContentHeaders rely on fallback width/height

https://gerrit.wikimedia.org/r/355406

Change 355388 merged by jenkins-bot:
[mediawiki/extensions/PdfHandler@wmf/1.30.0-wmf.2] Update getContentHeaders signature

https://gerrit.wikimedia.org/r/355388

Change 355389 merged by jenkins-bot:
[mediawiki/core@wmf/1.30.0-wmf.2] Use file width/height instead of metadata for getContentHeaders

https://gerrit.wikimedia.org/r/355389

Change 355405 merged by jenkins-bot:
[mediawiki/extensions/PagedTiffHandler@wmf/1.30.0-wmf.2] Update getContentHeaders signature

https://gerrit.wikimedia.org/r/355405

Change 355406 merged by jenkins-bot:
[mediawiki/extensions/TimedMediaHandler@wmf/1.30.0-wmf.2] Make getContentHeaders rely on fallback width/height

https://gerrit.wikimedia.org/r/355406

Change 355390 merged by jenkins-bot:
[mediawiki/core@wmf/1.30.0-wmf.2] Batch/pipeline backend operations in refreshFileHeaders

https://gerrit.wikimedia.org/r/355390

Mentioned in SAL (#wikimedia-operations) [2017-05-24T18:54:48Z] <thcipriani@tin> Synchronized php-1.30.0-wmf.2/extensions/PdfHandler/PdfHandler_body.php: SWAT: [[gerrit:355388|Update getContentHeaders signature]] T150741 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-05-24T18:55:54Z] <thcipriani@tin> Synchronized php-1.30.0-wmf.2/extensions/PagedTiffHandler/PagedTiffHandler_body.php: SWAT: [[gerrit:355405|Update getContentHeaders signature]] T150741 (duration: 00m 42s)

Mentioned in SAL (#wikimedia-operations) [2017-05-24T18:56:52Z] <thcipriani@tin> Synchronized php-1.30.0-wmf.2/extensions/TimedMediaHandler/handlers: SWAT: [[gerrit:355406|Make getContentHeaders rely on fallback width/height]] T150741 (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2017-05-24T18:57:17Z] <thcipriani@tin> Started scap: SWAT: [[gerrit:355389|Use file width/height instead of metadata for getContentHeaders]] [[gerrit:355390|Batch/pipeline backend operations in refreshFileHeaders]] T150741

Mentioned in SAL (#wikimedia-operations) [2017-05-24T19:00:29Z] <thcipriani@tin> Finished scap: SWAT: [[gerrit:355389|Use file width/height instead of metadata for getContentHeaders]] [[gerrit:355390|Batch/pipeline backend operations in refreshFileHeaders]] T150741 (duration: 03m 12s)

refreshFileHeaders is still dreadfully slow even with batching and not needing to look at metadata, unfortunately. I think I'll focus on implementing the fallback in Thumbor first...

Change 356166 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Process content headers when metadata is empty

https://gerrit.wikimedia.org/r/356166

Change 356166 merged by jenkins-bot:
[mediawiki/core@master] Process content headers when metadata is empty

https://gerrit.wikimedia.org/r/356166

Gilles added a comment.Jun 8 2017, 3:45 PM

It seems like this type of file is still problematic and doesn't get an X-Content-Dimensions header: https://www.mediawiki.org/wiki/File:Access_control_lists.png

https://gerrit.wikimedia.org/r/#/c/356166/ was supposed to fix that particular case.

Gilles added a comment.EditedJun 8 2017, 3:55 PM

Nevermind, it's because mediawiki.org is still on wmf.2 when it should be on wmf.4???

Filed T167428: Mediawiki.org stuck on wmf.2 about it

Gilles added a comment.Jun 8 2017, 6:30 PM

Note to self: SVG needs to have upscaling unrestricted (the limit code should make an exception).

Gilles added a comment.Jun 8 2017, 7:44 PM

OK, I've looked at the different file types on Commons and realized that this limit is completely inconsistent:

WebP, returns original-sized thumbnail: https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Capture_of_Kantipur.webp/12000000px-Capture_of_Kantipur.webp.png

XCF, returns original-sized thumbnail: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Abridgement_of_Christian_divinitie_page_9_I.xcf/120000px-Abridgement_of_Christian_divinitie_page_9_I.xcf.png

Djvu, unlimited upscaling, OOMs/500s quickly when asking for giant sizes:
https://upload.wikimedia.org/wikipedia/commons/thumb/d/d5/109_potraw.djvu/page1-2000px-109_potraw.djvu.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/d/d5/109_potraw.djvu/page1-120000px-109_potraw.djvu.jpg

TIFF, returns original-sized thumbnail: https://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/%22%22Socker%27s%22_Alley%22%2C_1962_-_NARA_-_558925.tif/lossy-page1-4000px-%22%22Socker%27s%22_Alley%22%2C_1962_-_NARA_-_558925.tif.jpg

SVG, seems capped at a maximum height of 4096, but returns the 4096 version for any bigger size requested: https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/%22Countless_squares%22.svg/100000px-%22Countless_squares%22.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/%2212_Wikivoyage_compass2_definitivo.svg/100000px-%2212_Wikivoyage_compass2_definitivo.svg.png

WebM, returns original-sized thumbnail: https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/%22High_Hopes%22_with_Jack_Kennedy.webm/100000px--%22High_Hopes%22_with_Jack_Kennedy.webm.jpg

OGV, returns original-sized thumbnail: https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/The_Commons.ogv/100000px--The_Commons.ogv.jpg

PDF, unlimited upscaling, OOMS/500s when asking for something gigantic:
https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/%22State_of_the_Movement%22_-_CEE_Meeting_2015.pdf/page2-10000px-%22State_of_the_Movement%22_-_CEE_Meeting_2015.pdf.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/%22State_of_the_Movement%22_-_CEE_Meeting_2015.pdf/page2-100000px-%22State_of_the_Movement%22_-_CEE_Meeting_2015.pdf.jpg

The limit only works for JPG, GIF and PNG. I started off this task assuming the limit worked everywhere, but now that I see how inconsistent it is, without the issue bothering anyone, I think it's just not worth reproducing in any form after all. We should just attempt to upscale to whatever size the user requests.

If users request something gigantic, it will fail due to OOM like with DjVU, hit the absolute size limit or get caught by the failure rate limiter.

I'm going to finish up the X-Content-Dimensions header on the Mediawiki side of things, but doing so isn't a blocker to launch. I'll put the consuming of the header on the Thumbor side behind a configuration flag and turn it off by default. We can revisit that if there are complaints about the limit being gone.

Gilles added a comment.Jun 8 2017, 8:06 PM

On second thought, let's get rid of it all. It was a fool's errand and nobody I've ever talked to who knew about the limit had any argument to keep it around.

Change 357895 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Remove X-Content-Dimensions header

https://gerrit.wikimedia.org/r/357895

Change 357896 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/PdfHandler@master] Remove X-Content-Dimensions header

https://gerrit.wikimedia.org/r/357896

Change 357897 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/PagedTiffHandler@master] Remove X-Content-Dimensions header

https://gerrit.wikimedia.org/r/357897

Change 357898 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/TimedMediaHandler@master] Remove X-Content-Dimensions header

https://gerrit.wikimedia.org/r/357898

The limit only works for JPG, GIF and PNG. I started off this task assuming the limit worked everywhere, but now that I see how inconsistent it is, without the issue bothering anyone, I think it's just not worth reproducing in any form after all. We should just attempt to upscale to whatever size the user requests.

The main difference between JPG/GIF/PNG and others is that the others are transformed in format. So basically, requesting 1000px size JPEG when the original is 1000px, the original is returned. Same for GIF and PNG afaik. The others, however, shouldn't return the original since they will be in the wrong format, we only render JPG/GIF/PNG to end-users as thumbnails.

I support creating thumbnails consistently for all formats, including these three. See also:
T67383: Generate optimised thumbnail even when dimensions match original

Change 357896 merged by jenkins-bot:
[mediawiki/extensions/PdfHandler@master] Remove X-Content-Dimensions header

https://gerrit.wikimedia.org/r/357896

Change 357898 merged by jenkins-bot:
[mediawiki/extensions/TimedMediaHandler@master] Remove X-Content-Dimensions header

https://gerrit.wikimedia.org/r/357898

Change 357897 merged by jenkins-bot:
[mediawiki/extensions/PagedTiffHandler@master] Remove X-Content-Dimensions header

https://gerrit.wikimedia.org/r/357897

Change 357895 merged by jenkins-bot:
[mediawiki/core@master] Remove X-Content-Dimensions header

https://gerrit.wikimedia.org/r/357895

@Gilles What that it, or is there a next step here?

Gilles closed this task as Declined.Jun 13 2017, 7:50 AM