Page MenuHomePhabricator

Attempt to validate images before operating on them
Open, Needs TriagePublic

Description

Given the volume and variety of images Thumbor deals with it's inevitable that it will encounter corrupt or otherwise invalid images. Currently when we attempt to transform an image that is not valid for its filetype we will make a best effort and then fail with a 500 error. This is neither specific enough to be useful to users nor for operators of the service as it is indistinguishable from genuine service failures unless specifically investigated. We saw an example of this in T327941.

We should make an attempt to validate that a file is valid for its format before attempting to resize or convert it, and either

  • refuse to operate on it outright, or
  • note that the file is potentially invalid and, upon failure to transform it, return this as the reason for failure rather than a generic service failure.

In any case it should be clear both to the user and the service operator that the image conversion failed because of the source image. This will help identify other, real internal service failures.

Obviously dealing with a large number of formats makes this a broad problem - Tool candidates for this are (amongst others):

  • jpeginfo -c
  • pngcheck
  • libtiff-tools

Event Timeline

Id be careful with this. There is lots of invalid and non compliant media out there, and like html and browsers, clients are generally rather tolerant to them. Making this stricter could invalidate lots of older uploads.

Id be careful with this. There is lots of invalid and non compliant media out there, and like html and browsers, clients are generally rather tolerant to them. Making this stricter could invalidate lots of older uploads.

Yeah I definitely lean more towards annotating and differentiating between failures that are already happening rather than refusing to operate on existing images for this reason. We'll need to broaden our unit test cases pretty actively before implementing this also

IME most thumbnail failures are not caused by strictly invalid files, but by rendering that exceeds the 1 minute timeout or the ImageMagick resource limits. There are a few instances where this is not the case (T285875: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" comes to mind), but this should generally be solved on the MediaWiki side, not the Thumbor side. ImageMagick and the other converters are also often perfectly happy to thumbnail invalid files while failing on apparently-valid files.

If validation is implemented, it may be a better idea to only attempt validation for files that fail rendering. That would prevent false-positive detection as well as prevent an increase in total thumbnailing time for most files.

Anecdata: Over the course of 5 minutes I analysed the few 503 responses gathered from the k8s workers - of 15 gathered, 6 of them were due to invalid PNGs that could not be thumbnailed due to invalid data according to pngcheck.

I'd in no way advocate for refusing to operate on invalid images, but being able to return an appropriate 4xx error (would it be an abuse of a 415?) rather than a 5xx error would make detecting real issues much easier.

In general, a 4XX response code would be incorrect for thumbnail requests for already-uploaded images. There is nothing that the HTTP client can do about a bad file having been uploaded.