
Limit number of image scaling attempts
Closed, Declined · Public

Description

Currently, requests for thumbnails are attempted, then possibly failed (e.g. due to memory limits reached, or imagemagick crashing), then retried until the end of time, as long as someone requests such a URL. There are some known offenders, like a 3MB animated DNA Helix image I've been seeing on imagescalers for months. There are also incidents such as this:

-rw-r--r-- 1 apache apache 19K May 25 11:46 localcopy_0ef74f0b53b6-1.svg
-rw-r--r-- 1 apache apache 19K May 25 21:30 localcopy_ca923e0f99a6-1.svg
-rw-r--r-- 1 apache apache 19K May 26 10:48 localcopy_572f861613b9-1.svg
-rw-r--r-- 1 apache apache 19K May 26 11:18 localcopy_def9292ef643-1.svg
-rw-r--r-- 1 apache apache 19K May 26 11:19 localcopy_0572c50518be-1.svg
-rw-r--r-- 1 apache apache 19K May 26 12:01 localcopy_304cb07a6b45-1.svg
-rw-r--r-- 1 apache apache 19K May 26 13:14 localcopy_fb22d6ef358b-1.svg
-rw-r--r-- 1 apache apache 19K May 26 13:19 localcopy_16568ad8072b-1.svg
-rw-r--r-- 1 apache apache 19K May 26 13:44 localcopy_0a9a32f4877c-1.svg
-rw-r--r-- 1 apache apache 19K May 26 14:23 localcopy_813c0c9416bc-1.svg
-rw-r--r-- 1 apache apache 19K May 26 15:58 localcopy_969e57b3aec0-1.svg
-rw-r--r-- 1 apache apache 19K May 26 16:34 localcopy_de13c2c069d6-1.svg
-rw-r--r-- 1 apache apache 19K May 26 17:54 localcopy_6e9b4a4aafbc-1.svg
-rw-r--r-- 1 apache apache 19K May 27 09:09 localcopy_15e927467d8f-1.svg
-rw-r--r-- 1 apache apache 19K May 27 10:09 localcopy_8e31a32bf69a-1.svg
-rw-r--r-- 1 apache apache 19K May 27 12:37 localcopy_9fb1ac64e9d2-1.svg
-rw-r--r-- 1 apache apache 19K May 27 12:38 localcopy_7f2b88f782a3-1.svg
-rw-r--r-- 1 apache apache 19K May 28 02:46 localcopy_72abccb50bdf-1.svg
-rw-r--r-- 1 apache apache 19K May 28 02:46 localcopy_aa5f4680fab2-1.svg
-rw-r--r-- 1 apache apache 19K May 28 02:50 localcopy_8023d355393a-1.svg
-rw-r--r-- 1 apache apache 19K May 28 11:19 localcopy_f2dab828e134-1.svg
-rw-r--r-- 1 apache apache 19K May 29 16:05 localcopy_0ebbf8feb6a7-1.svg
-rw-r--r-- 1 apache apache 19K May 29 18:45 localcopy_c5f2dbd0d36d-1.svg
-rw-r--r-- 1 apache apache 19K May 29 18:58 localcopy_4a87fe4aa75f-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:40 localcopy_b9df425c39f7-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:42 localcopy_68fafb94b201-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:43 localcopy_28e26296e19c-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:51 localcopy_83c7795c9538-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:54 localcopy_923d3ff45ad6-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:55 localcopy_ff8795390f2e-1.svg
-rw-r--r-- 1 apache apache 19K May 30 05:35 localcopy_728ca2ce9d32-1.svg
-rw-r--r-- 1 apache apache 19K May 30 10:14 localcopy_3a483ff65908-1.svg
-rw-r--r-- 1 apache apache 19K May 30 16:58 localcopy_bde12d83680d-1.svg
-rw-r--r-- 1 apache apache 19K May 30 16:59 localcopy_12aa3e76ef45-1.svg
-rw-r--r-- 1 apache apache 19K May 30 17:00 localcopy_078234140633-1.svg
-rw-r--r-- 1 apache apache 19K May 31 16:44 localcopy_1b44bb384189-1.svg
-rw-r--r-- 1 apache apache 19K Jun 1 15:14 localcopy_d71bc6e4f94d-1.svg
-rw-r--r-- 1 apache apache 19K Jun 2 17:53 localcopy_c3d7a0a674a5-1.svg
-rw-r--r-- 1 apache apache 19K Jun 3 15:16 localcopy_f17379c9ae4a-1.svg

Please implement a limit on the number of attempts (3-5 should be enough) in the backend. Other options, such as Swift/Varnish caching the 500 after two or three restarts, were considered and discussed, but they're uglier and harder to implement. Having MediaWiki do it when a scaling attempt fails instead looks like the more reasonable way to go.

We should also have a way of resetting such a counter, either per image or en masse, as the underlying causes of the failure could be fixed in the meantime (e.g. a newer ImageMagick version, an increase in our memory limits, a MediaWiki bug getting fixed, etc.).
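The behavior the description asks for can be pictured with a small sketch (illustrative Python, not MediaWiki code; the class and method names are hypothetical): count failures per image, stop attempting past a limit, and support resetting one image's counter or all of them.

```python
MAX_ATTEMPTS = 4  # the description suggests 3-5 attempts is enough

class AttemptLimiter:
    """Track failed scaling attempts per image and cap further retries."""

    def __init__(self, max_attempts=MAX_ATTEMPTS):
        self.max_attempts = max_attempts
        self._failures = {}  # image name -> failure count

    def may_attempt(self, image):
        """True if we are still allowed to try scaling this image."""
        return self._failures.get(image, 0) < self.max_attempts

    def record_failure(self, image):
        """Called when a scaling attempt fails (OOM, crash, timeout)."""
        self._failures[image] = self._failures.get(image, 0) + 1

    def reset(self, image=None):
        """Reset one image's counter, or all counters when image is None."""
        if image is None:
            self._failures.clear()
        else:
            self._failures.pop(image, None)
```

The `reset(None)` path corresponds to the "en masse" reset requested above, for when an external fix (new ImageMagick, higher limits) invalidates old failures.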


Version: 1.22.0
Severity: normal
Whiteboard: perfsprint-13
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=52045
https://bugzilla.wikimedia.org/show_bug.cgi?id=65217

Details

Reference
bz49118

Event Timeline

bzimport raised the priority of this task to Normal. Nov 22 2014, 1:46 AM
bzimport set Reference to bz49118.
bzimport added a subscriber: Unknown Object (MLST).
faidon created this task. Jun 4 2013, 1:31 PM

Change 107419 had a related patch set uploaded by Aaron Schulz:
Limit attempts to render the same thumbnail after failures

https://gerrit.wikimedia.org/r/107419

aaron added a comment. Apr 21 2014, 4:46 PM

Also improved in https://gerrit.wikimedia.org/r/#/c/127642/. This can be closed.

The patchset is a good first step, but I don't think it's enough.

First of all, if I'm reading the code right, the attempt key is keyed by the original _and_ the thumb size; my impression is that almost always, scaling errors are caused by properties of the original file, not the thumb size specifically (very large, like the 100MB TIFFs, or very complex, like the animated DNA helix). I think it'd be better to count hits with just the original, as otherwise we are getting DoSed by people hitting the same image in different pages, e.g. Special:NewFiles' 120px vs. Special:ListFiles' 180px.
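The keying distinction above can be made concrete with a tiny sketch (hypothetical key formats, not the patch's actual memcached keys): keying by original plus size gives each requested width its own fresh failure budget, while keying by the original alone shares one budget across all sizes.

```python
def key_per_size(image, width):
    """One failure counter per (original, thumb width) pair."""
    return f"thumbfail:{image}:{width}"

def key_per_file(image, width=None):
    """One failure counter per original, regardless of requested size."""
    return f"thumbfail:{image}"
```

With per-size keys, requests for the same broken original at Special:NewFiles' 120px and Special:ListFiles' 180px land on different counters, so each size gets retried up to the limit independently.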

More importantly, the code seems to add a key in memcached with a TTL of 3600. This won't actually fix the behavior that the /tmp listing in the original description shows, since those attempts were hours, or even days, apart. It will also not protect us from the general issue of "unscalable images": images that we can never render with our current thresholds but will forever try to (hundreds-of-megabytes old multipage TIFFs, for instance).

I think a short memcached-based approach is fine for short temporary errors, like protecting us from bursts, but my original thought (and one that would probably save us from today's recurring rendering outages) would be to actually store this as a flag in the image table in the database and *never* try again, unless the flag is reset by an admin or the epoch is bumped. The flag could even be set to the current date, so as to be able to quickly reset it with a query if our infrastructure goes nuts for a day and we get flooded with false positives.
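The date-valued flag proposed above can be sketched against an in-memory SQLite table (the table and column names here are illustrative stand-ins, not the real MediaWiki `image` schema): storing the date of the failure makes a mass reset for one bad day a single `UPDATE`.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image (img_name TEXT PRIMARY KEY, img_scale_failed TEXT)"
)

def flag_unscalable(name, on=None):
    """Permanently mark an image as unscalable, recording the date."""
    conn.execute(
        "INSERT INTO image (img_name, img_scale_failed) VALUES (?, ?) "
        "ON CONFLICT(img_name) DO UPDATE "
        "SET img_scale_failed = excluded.img_scale_failed",
        (name, (on or date.today()).isoformat()),
    )

def is_unscalable(name):
    """True if the image carries a (non-reset) failure flag."""
    row = conn.execute(
        "SELECT img_scale_failed FROM image WHERE img_name = ?", (name,)
    ).fetchone()
    return row is not None and row[0] is not None

def reset_flags_for_day(day):
    """Mass reset: clear every flag set on a given (bad) day."""
    conn.execute(
        "UPDATE image SET img_scale_failed = NULL WHERE img_scale_failed = ?",
        (day.isoformat(),),
    )
```

Unlike the memcached key with its one-hour TTL, this flag survives restarts and days-apart retries, while staying cheap to bulk-clear after a false-positive flood.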

aaron added a comment. Apr 21 2014, 6:23 PM

(In reply to Faidon Liambotis from comment #3)

> The patchset is a good first step, but I don't think it's enough.
>
> First of all, if I'm reading the code right, the attempt key is keyed by the original _and_ the thumb size; my impression is that almost always, scaling errors are caused by properties of the original file, not the thumb size specifically (very large, like the 100MB TIFFs, or very complex, like the animated DNA helix). I think it'd be better to count hits with just the original, as otherwise we are getting DoSed by people hitting the same image in different pages, e.g. Special:NewFiles' 120px vs. Special:ListFiles' 180px.

Sometimes thumb.php calls fail due to the requested size being too big, with smaller sizes working. A per-file limit would punish all renderings. We already have per-user rate limiting on thumbnails/min and "non-standard" thumbnails/min; I'd hope that helps with the DoS aspect. For the non-DoS aspect, the failure limiting should mostly be enough.
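The per-user throttle referred to above can be pictured as a fixed-window counter (a hypothetical sketch, not MediaWiki's actual ping limiter): each user gets at most `limit` renders per window, independent of whether the attempts succeed.

```python
import time

class PerUserThrottle:
    """Fixed-window rate limiter: at most `limit` requests per user per window."""

    def __init__(self, limit, window_seconds=60, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable for testing
        self._counts = {}   # user -> (window start time, count so far)

    def allow(self, user):
        """Return True and count the request, or False if over the limit."""
        now = self.clock()
        start, count = self._counts.get(user, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # window expired, start fresh
        if count >= self.limit:
            self._counts[user] = (start, count)
            return False
        self._counts[user] = (start, count + 1)
        return True
```

Note this limits request *rate* per user; it is orthogonal to the failure counting discussed in this task, which limits retries per image.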

> More importantly, the code seems to add a key in memcached with a TTL of 3600. This won't actually fix the behavior that the /tmp listing in the original description shows, since those attempts were hours, or even days, apart. It will also not protect us from the general issue of "unscalable images": images that we can never render with our current thresholds but will forever try to (hundreds-of-megabytes old multipage TIFFs, for instance).
>
> I think a short memcached-based approach is fine for short temporary errors, like protecting us from bursts, but my original thought (and one that would probably save us from today's recurring rendering outages) would be to actually store this as a flag in the image table in the database and *never* try again, unless the flag is reset by an admin or the epoch is bumped. The flag could even be set to the current date, so as to be able to quickly reset it with a query if our infrastructure goes nuts for a day and we get flooded with false positives.

If the media handler class bound to the file knows that it cannot be rendered (based on the format, size, and configured resource limits, etc.), it could store that in img_metadata and bail out quickly. There isn't really a way for thumb.php itself to make that determination, AFAIK. Maybe the Multimedia team can look into media handler tweaks.
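The media-handler idea above amounts to a cheap static pre-check: decide from the file's already-known properties, before spawning any scaler process, that a render can never succeed. A minimal sketch (the function, limits, and their values are illustrative assumptions, not MediaWiki's actual configuration):

```python
# Illustrative resource limits; real deployments configure their own values.
MAX_IMAGE_AREA = 75_000_000          # total pixels we are willing to decode
MAX_SOURCE_BYTES = 100 * 1024 * 1024 # refuse originals larger than 100MB

def can_ever_render(width, height, size_bytes, pages=1):
    """Cheap static check from file metadata; True means an attempt is worth making."""
    if width * height * pages > MAX_IMAGE_AREA:
        return False  # e.g. huge old multipage TIFFs
    if size_bytes > MAX_SOURCE_BYTES:
        return False  # too big to fetch/decode within limits
    return True
```

Such a verdict could be computed once at upload or metadata-refresh time and cached with the file's metadata, so rendering requests for known-hopeless files fail fast instead of repeatedly hitting the scalers.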

Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 6:39 PM
Restricted Application added subscribers: Steinsplitter, Matanya, Aklapper. Sep 4 2015, 6:39 PM
Restricted Application added a project: Commons. Mar 27 2017, 11:51 PM
jijiki closed this task as Resolved. Feb 22 2019, 7:59 PM
jijiki claimed this task.
jijiki reopened this task as Open. Feb 22 2019, 8:02 PM
jijiki closed this task as Resolved.

I am closing this since we don't have imagescalers anymore.

Tgr changed the task status from Resolved to Declined. Feb 22 2019, 9:52 PM
Tgr added a subscriber: Tgr.

This was not done on a per-file level (per the reasoning in T51118#544170, I imagine), so I am adjusting the task status to reflect that better.