Page MenuHomePhabricator

HTTP 500 error for some larger thumbnails of 14MB file File:Lençóis_Maranhenses_2018.jpg on Commons
Open, LowPublic

Event Timeline

Yann created this task.Jul 25 2018, 9:11 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 25 2018, 9:11 AM
ema added a subscriber: ema.Jul 25 2018, 9:17 AM

Here's the full thumbor response:

$ curl -v http://ms-fe.svc.eqiad.wmnet/wikipedia/commons/thumb/2/2a/Voando_num_mar_de_areia.jpg/3000px-Voando_num_mar_de_areia.jpg
* Hostname was NOT found in DNS cache
*   Trying 10.2.2.27...
* Connected to ms-fe.svc.eqiad.wmnet (10.2.2.27) port 80 (#0)
> GET /wikipedia/commons/thumb/2/2a/Voando_num_mar_de_areia.jpg/3000px-Voando_num_mar_de_areia.jpg HTTP/1.1
> User-Agent: curl/7.38.0
> Host: ms-fe.svc.eqiad.wmnet
> Accept: */*
> 
< HTTP/1.1 429 Error
< Content-Length: 0
< Thumbor-Memcache-Get-Time: 1
< Thumbor-Request-Date: Wed, 25 Jul 2018 09:10:14 GMT
* Server nginx/1.13.6 is not blacklisted
< Server: nginx/1.13.6
< Connection: close
< Cache-Control: no-cache
< Date: Wed, 25 Jul 2018 09:10:14 GMT
< Proxy-Request-Date: None
< Content-Type: text/html; charset=UTF-8
< X-Trans-Id: tx3d6d25c74b7c4313b5e96-005b583e76
< 
* Excess found in a non pipelined read: excess = 17 url = /wikipedia/commons/thumb/2/2a/Voando_num_mar_de_areia.jpg/3000px-Voando_num_mar_de_areia.jpg (zero-length body)
* Closing connection 0

Looks like thumbor/imagemagick are running into resource exhaustion when trying to scale this image (error below) resulting in 500s. Then poolcounter kicks in for this original due to repeated 500s while scaling and 429 are returned instead.

I'm going to retitle the task with the scaling failure itself

Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: 2018-07-25 09:02:11,246 8818 thumbor:ERROR [ImagesHandler] Exception during _load_results
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: Traceback (most recent call last):
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: File "/usr/lib/python2.7/dist-packages/wikimedia_thumbor/handler/images/images.py", line 571, in _load_results
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: results, content_type = BaseHandler._load_results(self, context)
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: File "/usr/lib/python2.7/dist-packages/thumbor/handlers/__init__.py", line 334, in _load_results
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: results = context.request.engine.read(image_extension, quality)
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: File "/usr/lib/python2.7/dist-packages/wikimedia_thumbor/engine/proxy/proxy.py", line 133, in read
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: ret = self.__getattr__('read')(extension, quality)
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: File "/usr/lib/python2.7/dist-packages/wikimedia_thumbor/engine/imagemagick/imagemagick.py", line 321, in read
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: raise ImageMagickException('Failed to convert image %s' % stderr)  # pragma: no cover
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: ImageMagickException: Failed to convert image convert: DistributedPixelCache '127.0.0.1' @ error/distribute-ca
che.c/ConnectPixelCacheServer/217.
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: convert: cache resources exhausted `/srv/thumbor/tmp/thumbor@8818/tmpVBjsFF' @ error/cache.c/OpenPixelCache/36
59.
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: convert: no images defined `jpg:-' @ error/convert.c/ConvertImageCommand/3210.
Jul 25 09:02:11 thumbor1003 thumbor@8818[46987]: 2018-07-25 09:02:11,248 8818 tornado.access:ERROR 500 GET /wikipedia/commons/thumb/2/2a/Voando_num_mar_de_areia.jpg/3000px-Voando_num_mar_de_areia.jpg (10.64.32.221) 579.09ms
fgiunchedi renamed this task from Thumbnails not created: Error: 429, Too Many Requests to cache resources exhausted while scaling File:Voando_num_mar_de_areia.jpg.Jul 25 2018, 9:40 AM
ema added a comment.Jul 25 2018, 9:44 AM

Then poolcounter kicks in for this original due to repeated 500s while scaling and 429 are returned instead.

The 429s are returned with empty body, and we do add a generic WMF error page at the caching layer for 4xx errors with Content-Length: 0. It would be very useful if the application layer could provide a specific error message instead.

jcrespo added subscribers: Gilles, jcrespo.

Looks scaling related, not storage, then. CC @Gilles (sorry if I am pinging the wrong person, please comment if that is the case)

This comment was removed by Krinkle.
Krinkle renamed this task from cache resources exhausted while scaling File:Voando_num_mar_de_areia.jpg to Resource exhausted while scaling File:Voando_num_mar_de_areia.jpg.Jul 30 2018, 8:14 PM
Gilles claimed this task.Aug 27 2018, 8:14 PM
Gilles moved this task from Inbox to Backlog: Small & Maintenance on the Performance-Team board.

The issue is reproducible on Vagrant, which has the same memory and disk limits as production for ImageMagick. This image has a ton of EXIF data, a lot of it related to visual properties, probably coming from Lightroom. I'm guessing that some of it is causing ImageMagick to consume a lot of memory to resize the image. Beyond a certain thumbnail size, it just hits the limit.

We have to draw the line somewhere when it comes to individual files consuming a lot of resources to be processed. We can't increase the already generous limits for the sake of this one file (900MB max memory, 900MB max disk usage). As a workaround I would advise stripping most of the EXIF this file contains and seeing it it changes anything.

Gilles removed Gilles as the assignee of this task.Aug 29 2018, 1:49 PM
Aklapper renamed this task from Resource exhausted while scaling File:Voando_num_mar_de_areia.jpg to HTTP 500 error for some larger thumbnails of 14MB file File:Lençóis_Maranhenses_2018.jpg on Commons.Mar 29 2019, 2:03 PM
aaron triaged this task as Low priority.Jun 6 2019, 10:09 AM
aaron moved this task from Backlog: Small & Maintenance to Radar on the Performance-Team board.
aaron edited projects, added Performance-Team (Radar); removed Performance-Team.

There's a number of files for which we cannot generate a thumbnail due to exceeding various types of limits. As far as I know we handle those fairly well in terms of UX and caching by falling back to a generic/broken thumbnail.

The interesting bit here is that the file is valid and allowed to have thumbnails generated for it. And if the requested thumbnail size is under 2500px, Thumbor is able to read the original fine without issue and produce a thumbnail.

Yet, when the requested thumbnail size is between 2550 and 3000px, it ends up serving an error. Whereas our the maximum size of thumbnail we allow for e.g. Mediaviewer is above 3000px, which means the software will under normal conditions make these requests without fallback.

It's interesting case that perhaps could be mitigated by a smarter default. I would expect the amount of memory used after initial file read and stripping of meta data to be fairly predictable for a given file type and thumbnail size. That is, we can determine the most amount of memory needed to produce a thumbnail of N size where N is the maximum thumbnail size we allow (or conversely, where we set N based on how much memory we want to allow). Then separate from that, we could maybe find a limit for the memory used for the original file and metadata which is not proportional to the output thumbnail.

Alternatively, if that isn't possible, perhaps we can maybe detect somehow that we're in a situation where the requested size is valid, and the original file/metadata is not too large, but that we failed for this reason, and then maybe do something specific to this use case. E.g. serve a cacheable redirect to half the size.

Zoranzoki21 added a subscriber: Zoranzoki21.EditedSep 27 2019, 9:46 PM
kizule@kizule:/tmp$ curl -v https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Len%C3%A7%C3%B3is_Maranhenses_2018.jpg/3000px-Len%C3%A7%C3%B3is_Maranhenses_2018.jpg
* Expire in 0 ms for 6 (transfer 0x55e743c96330)
* Expire in 1 ms for 1 (transfer 0x55e743c96330)
* Expire in 0 ms for 1 (transfer 0x55e743c96330)
* Expire in 1 ms for 1 (transfer 0x55e743c96330)
* Expire in 0 ms for 1 (transfer 0x55e743c96330)
* Expire in 0 ms for 1 (transfer 0x55e743c96330)
* Expire in 2 ms for 1 (transfer 0x55e743c96330)
* Expire in 0 ms for 1 (transfer 0x55e743c96330)
* Expire in 0 ms for 1 (transfer 0x55e743c96330)
* Expire in 2 ms for 1 (transfer 0x55e743c96330)
* Expire in 0 ms for 1 (transfer 0x55e743c96330)
* Expire in 0 ms for 1 (transfer 0x55e743c96330)
* Expire in 2 ms for 1 (transfer 0x55e743c96330)
* Expire in 1 ms for 1 (transfer 0x55e743c96330)
* Expire in 1 ms for 1 (transfer 0x55e743c96330)
* Expire in 2 ms for 1 (transfer 0x55e743c96330)
* Expire in 1 ms for 1 (transfer 0x55e743c96330)
* Expire in 1 ms for 1 (transfer 0x55e743c96330)
* Expire in 4 ms for 1 (transfer 0x55e743c96330)
* Expire in 2 ms for 1 (transfer 0x55e743c96330)
* Expire in 2 ms for 1 (transfer 0x55e743c96330)
* Expire in 4 ms for 1 (transfer 0x55e743c96330)
* Expire in 2 ms for 1 (transfer 0x55e743c96330)
* Expire in 3 ms for 1 (transfer 0x55e743c96330)
* Expire in 4 ms for 1 (transfer 0x55e743c96330)
* Expire in 3 ms for 1 (transfer 0x55e743c96330)
* Expire in 3 ms for 1 (transfer 0x55e743c96330)
* Expire in 8 ms for 1 (transfer 0x55e743c96330)
* Expire in 5 ms for 1 (transfer 0x55e743c96330)
* Expire in 5 ms for 1 (transfer 0x55e743c96330)
* Expire in 8 ms for 1 (transfer 0x55e743c96330)
* Expire in 7 ms for 1 (transfer 0x55e743c96330)
* Expire in 7 ms for 1 (transfer 0x55e743c96330)
* Expire in 8 ms for 1 (transfer 0x55e743c96330)
* Expire in 9 ms for 1 (transfer 0x55e743c96330)
* Expire in 9 ms for 1 (transfer 0x55e743c96330)
* Expire in 8 ms for 1 (transfer 0x55e743c96330)
* Expire in 10 ms for 1 (transfer 0x55e743c96330)
* Expire in 10 ms for 1 (transfer 0x55e743c96330)
* Expire in 16 ms for 1 (transfer 0x55e743c96330)
* Expire in 13 ms for 1 (transfer 0x55e743c96330)
* Expire in 13 ms for 1 (transfer 0x55e743c96330)
* Expire in 16 ms for 1 (transfer 0x55e743c96330)
* Expire in 14 ms for 1 (transfer 0x55e743c96330)
* Expire in 14 ms for 1 (transfer 0x55e743c96330)
* Expire in 16 ms for 1 (transfer 0x55e743c96330)
* Expire in 50 ms for 1 (transfer 0x55e743c96330)
* Expire in 50 ms for 1 (transfer 0x55e743c96330)
* Expire in 16 ms for 1 (transfer 0x55e743c96330)
* Expire in 50 ms for 1 (transfer 0x55e743c96330)
* Expire in 50 ms for 1 (transfer 0x55e743c96330)
* Expire in 50 ms for 1 (transfer 0x55e743c96330)
*   Trying 91.198.174.208...
* TCP_NODELAY set
* Expire in 149947 ms for 3 (transfer 0x55e743c96330)
* Expire in 200 ms for 4 (transfer 0x55e743c96330)
* Connected to upload.wikimedia.org (91.198.174.208) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-CHACHA20-POLY1305
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=Wikimedia Foundation, Inc.; CN=*.wikipedia.org
*  start date: Nov  8 21:21:04 2018 GMT
*  expire date: Nov 22 07:59:59 2019 GMT
*  subjectAltName: host "upload.wikimedia.org" matched cert's "*.wikimedia.org"
*  issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign Organization Validation CA - SHA256 - G2
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55e743c96330)
> GET /wikipedia/commons/thumb/a/a7/Len%C3%A7%C3%B3is_Maranhenses_2018.jpg/3000px-Len%C3%A7%C3%B3is_Maranhenses_2018.jpg HTTP/2
> Host: upload.wikimedia.org
> User-Agent: curl/7.64.0
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 429 
< date: Fri, 27 Sep 2019 21:44:03 GMT
< server: Varnish
< x-varnish: 756530395
< x-cache: cp3036 int
< x-cache-status: int-front
< server-timing: cache;desc="int-front"
< strict-transport-security: max-age=106384710; includeSubDomains; preload
< set-cookie: WMF-Last-Access=27-Sep-2019;Path=/;HttpOnly;secure;Expires=Tue, 29 Oct 2019 12:00:00 GMT
< x-analytics: https=1;nocookies=1
< x-client-ip: 109.245.95.114
< content-type: text/html; charset=utf-8
< content-length: 1824
< age: 0
< 
<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8">
<title>Wikimedia Error</title>
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { float: left; margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
.content-text { overflow: hidden; overflow-wrap: break-word; word-wrap: break-word; -webkit-hyphens: auto; -moz-hyphens: auto; -ms-hyphens: auto; hyphens: auto; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645ad; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
.text-muted { color: #777; }
</style>
<div class="content" role="main">
<a href="https://www.wikimedia.org"><img src="https://www.wikimedia.org/static/images/wmf-logo.png" srcset="https://www.wikimedia.org/static/images/wmf-logo-2x.png 2x" alt="Wikimedia" width="135" height="101">
</a>
<h1>Error</h1>
<div class="content-text">
<p>Our servers are currently under maintenance or experiencing a technical problem.

Please <a href="" title="Reload this page" onclick="window.location.reload(false); return false">try again</a> in a few&nbsp;minutes.</p>

<p>See the error message at the bottom of this page for more&nbsp;information.</p>
</div>
</div>
<div class="footer"><p>If you report this error to the Wikimedia System Administrators, please include the details below.</p><p class="text-muted"><code>Request from 109.245.95.114 via cp3036 frontend, Varnish XID 756530395<br>Upstream caches: cp3036 int<br>Error: 429, Too Many Requests at Fri, 27 Sep 2019 21:44:03 GMT</code></p>
</div>
</html>
* Connection #0 to host upload.wikimedia.org left intact
kizule@kizule:/tmp$

I don't know what means error 429 for this. Update: I saw reply from @Krinkle so we "comment conflicted" :).

Krinkle added a comment.EditedSep 27 2019, 9:47 PM

HTTP 429 means "Too Many Requests". Our servers use this when too many requests are happening for the same error. It is a protection mechanism.

For the purpose of this task, it is basically the same as HTTP 500 "Internal Error", and means that the request failed due to a limitation or problem on the server.

Zoranzoki21 added subscribers: Jasper, Matanya, Steinsplitter and 5 others.

HTTP 429 means "Too Many Requests". Our servers use this when too many requests are happening for the same error. It is a protection mechanism.
For the purpose of this task, it is basically the same as HTTP 500 "Internal Error", and means that the request failed due to a limitation or problem on the server.

I understand it already, but thank you very much!

There is a proper solution to this class of large JPG issues, which is to use a streaming resize like VIPS for them. The problem is that currently JPGs are processed by ImageMagick, which loads the entire data in memory to resize an image. Switching to a streaming resizer is no small task, though, and it's unclear until you've really tried if VIPS can really handle all types of JPGs as well as ImageMagick in practice.