
Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408
Open, MediumPublicBUG REPORT

Description

Steps to Reproduce:
Open the 1000px version of https://commons.wikimedia.org/wiki/File:Dojikko2.3.svg:
https://upload.wikimedia.org/wikipedia/commons/thumb/3/36/Dojikko2.3.svg/1000px-Dojikko2.3.svg.png

Actual Results:

[Screenshot: Wikimedia Error page (Screenshot_2019-06-22 Wikimedia Error.png)]

Error

Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes.

See the error message at the bottom of this page for more information.

If you report this error to the Wikimedia System Administrators, please include the details below.

Request from 217.149.174.137 via cp3038 frontend, Varnish XID 635598966
Upstream caches: cp3038 int
Error: 429, Too Many Requests at Sat, 22 Jun 2019 20:30:32 GMT

from: https://en.wikipedia.org/wiki/Wikipedia:SVG_help#Server_error_when_generating_SVG_thumbnail

Request from 87.112.20.249 via cp1080 cp1080, Varnish XID 228725525
Error: 500, Internal Server Error at Mon, 01 Apr 2019 20:38:04 GMT

Expected Results:
Quoting Glrx from https://en.wikipedia.org/wiki/Wikipedia:SVG_help#Server_error_when_generating_SVG_thumbnail

My guess is the larger image is too complex to render in the time allotted by the server. Thus it is not a bug but a resource decision by the server managers. The image does appear to make unreasonable resource demands by filtering many small circles. Perhaps the real error is the server issued a 500 internal error rather than a 408 timeout error. Glrx (talk) 18:32, 2 April 2019 (UTC)

Event Timeline

Aklapper renamed this task from SVG-Rendering of huge SVG leads to "Error: 500, Internal Server Error" or "Error: 429, Too Many Requests" instead of a "Error: 408, Timeout Error" to Thumbnail rendering of huge SVG file leads to Error 500 or Error 429 instead of Error 408. Jun 22 2019, 9:21 PM
Aklapper edited projects, added Thumbor; removed Wikimedia-SVG-rendering.

Changed the project tags and task summary: No SVG rendering involved, as this is about bitmap (non-SVG) thumbnails of an SVG image.

Cmglee renamed this task from Thumbnail rendering of huge SVG file leads to Error 500 or Error 429 instead of Error 408 to Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408. (Edited) Jun 23 2019, 12:02 AM

In the case Glrx described (http://commons.wikimedia.org/wiki/File:Apollonian_gasket_symmetrical_3.svg), the SVG was not huge, but overly complex as SVG filters were applied to over 4000 elements. Nevertheless, the current error message is misleading.

TheDJ subscribed.

Pretty sure this is a wider operations issue for esams connections. I noticed it yesterday around 19:26 CEST and again at 22:40 (for cp1077):

Request from 82.197.207.202 via cp1077 cp1077, Varnish XID 183713489
Error: 503, Backend fetch failed at Sat, 22 Jun 2019 20:40:13 GMT

Around that time, Krenair noticed it as well.
We couldn't find anything, but there are weirdly consistent round robin-style peaks per cp server of increased failed backend fetches:

https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?panelId=6&fullscreen&orgId=1&from=now%2Fw&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend

[Screenshot: Grafana graph of failed backend fetches (Screenshot 2019-06-23 at 09.15.14.png)]

TheDJ renamed this task from Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 to Increased failure rate of varnish be fetches. Jun 23 2019, 7:17 AM

Pretty sure this is a wider operations issue for esams connections.
[...]
We couldn't find anything, but there are weirdly consistent round robin-style peaks per cp server of increased failed backend fetches: [...]

@TheDJ: Sounds very similar to T226048: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster)? Maybe not fixed?

So the peaks on the graph stop right around the time this log appeared in the -operations channel:

21:31 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0:

https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down

21:31 <+icinga-wm> PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status

Possibly related?

We have a higher volume of thumbnail rendering than usual due to the deployment of T216339: Normalize thumbnail request URLs in Varnish to avoid cachebusting, which results in twice the amount of 503s.

What I find weird about this is that the haproxy queue on the thumbor cluster is always at 0. If this were a capacity problem on the Thumbor cluster, we should see some request queueing. And while the amount of 200s has clearly increased on the Thumbor cluster, the amount of 429s hasn't. This suggests that the requests might complete on the Thumbor end, and the Swift proxy gives up on them too early. I'll double-check the timeout settings at the Swift proxy level.
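
For anyone retracing this, the haproxy queue depth can be read off the stats socket on a thumbor host; a rough sketch, assuming a socket path that may not match the production setup:

# Illustrative only: the socket path is an assumption.
# "show stat" returns CSV; the third field (qcur) is the current queue depth.
echo "show stat" | socat stdio /run/haproxy/admin.sock | cut -d, -f1,2,3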

I suspect it's an issue at the Swift level, possibly a capacity problem with the added Thumbnail miss load. I'm seeing these errors happening frequently on the swift proxy:

Jun 24 06:26:24 ms-fe1005 proxy-server: ERROR with Object server 10.64.16.80:6000/sdm1 re: Trying to GET /v1/AUTH_mw/wikipedia-commons-local-thumb.ff/f/ff/Ryttyl%C3%A4n_asema.jpg/250px-Ryttyl%C3%A4n_asema.jpg: ConnectionTimeout (0.5s) (txn: tx4586dacb45a44ed5ad3d9-005d106d10) (client_ip: x.x.x.x)

It seems to happen in bursts, where one object server will have a few of those timeouts in short succession, then it happens to another object server, etc.

While the amount of errors has doubled as a result of the increased thumbnail traffic, it's definitely a pre-existing issue that's simply been made more visible (entries in server.log.4 end on Jun 19 06:24:53):

gilles@ms-fe1005:~$ cat server.log.2 | grep "ConnectionTimeout" | wc -l
6868
gilles@ms-fe1005:~$ cat server.log.4 | grep "ConnectionTimeout" | wc -l
2803
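
Those 0.5-second timeouts line up with Swift's default proxy-server connection timeout. For reference, a minimal illustrative excerpt of the relevant proxy-server.conf settings (the values shown are Swift's defaults, not necessarily what production uses):

# /etc/swift/proxy-server.conf -- illustrative excerpt, not the production config
[app:proxy-server]
use = egg:swift#proxy
# Time allowed to establish a connection to an object server; 0.5 is Swift's default
# and matches the "ConnectionTimeout (0.5s)" entries above.
conn_timeout = 0.5
# Time allowed for each read from an object server once connected (Swift default: 10).
node_timeout = 10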
ema renamed this task from Increased failure rate of varnish be fetches to Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408. Jun 24 2019, 9:44 AM
ema triaged this task as Medium priority.

We couldn't find anything, but there are weirdly consistent round robin-style peaks per cp server of increased failed backend fetches:

The task description mentions 429 and 500 errors, which are unrelated to varnish backend fetch failures. Task title reverted to the previous one, and T226375 created to track the fetch error issue.

@Gilles we've just had an interesting report on #wikimedia-operations that seems related.

11:11 < odder> Error: 429, Too Many Requests at Tue, 25 Jun 2019 11:11:08 GMT
11:12 < odder> https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf/page257-836px-Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf.jpg

I could reproduce and capture a full varnishlog on the frontend, see it here: P8653

Confusingly, the response seems at first to be due to varnish rate limiting, as X-Cache-Status indicates an int response:

-  RespHeader X-Cache-Status: int-front

The reason for int is that varnish rewrites the status message:

-   RespReason     Unknown HTTP Status
-   RespReason     Too Many Requests

Instead, the response is coming straight from thumbor as shown by the varnishlog above and the relevant ATS logs:

Date:2019-06-25 Time:11:29:45 ConnAttempts:0 ConnReuse:16 TTFetchHeaders:174 OriginServerTime:174 CacheResultCode:TCP_MISS CacheWriteResult:- ReqMethod:GET RespStatus:429 OriginStatus:429 ReqURL:http://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf/page257-836px-Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf.jpg BereqURL:GET https://swift.discovery.wmnet/wikipedia/commons/thumb/c/ca/Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf/page257-836px-Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf.jpg HTTP/1.1 ReqHeader:User-Agent:curl/7.64.0 ReqHeader:Host:upload.wikimedia.org ReqHeader:X-Client-IP:82.82.91.100 RespHeader:X-Cache-Int:cp3045 miss
ema moved this task from Backlog to Caching on the Traffic board.

kibana entries for the https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf/page257-836px-Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf.jpg failure: https://logstash.wikimedia.org/goto/1f1b3fb6e8821924e1d9a185efb9a869

The first entry:

[ExiftoolRunner] error: 'Warning: Skipped unknown 99 byte header - /srv/thumbor/tmp/thumbor@8802/tmpt2WcOM\n'

The second entry:

ImageMagickException: Failed to convert image convert: no decode delegate for this image format `' @ error/constitute.c/ReadImage/504.
convert: no images defined `jpg:-' @ error/convert.c/ConvertImageCommand/3258.

The rest of the errors are like the above, in groups of two.

Note that thumbor occasionally returns 500 for that object. Hitting ATS directly to skip varnish-fe transformations:

< HTTP/1.1 500 Internal Error
< Date: Tue, 25 Jun 2019 12:24:50 GMT
< Content-Type: text/html; charset=UTF-8
< Content-Length: 0
< Thumbor-Request-Id: 0A4020DC:278A_0A020218:2260_5D12128A_106E9:42E5
< Proxy-Request-Date: 25/Jun/2019:12:24:42 +0000
< Thumbor-Request-Date: Tue, 25 Jun 2019 12:24:42 GMT
< Thumbor-Gs-Time: 773
< Thumbor-Swift-Thumbnail-Miss-Time: 47
< Server: ATS/8.0.3
< Thumbor-Convert-Time: 16, 25, 29
< Thumbor-Exiftool-Time: 202, 154
< Thumbor-Swift-Original-Success-Time: 6626
< Proxy-Response-Date: 25/Jun/2019:12:24:42 +0000
< Thumbor-Poolcounter-Time: 12
< Thumbor-Memcache-Set-Time: 22
< Thumbor-Memcache-Get-Time: 1
< X-Upstream: 127.0.0.1:8811
< Thumbor-Engine: wikimedia_thumbor.engine.ghostscript
< X-Trans-Id: txf36a6299045e43aab7b04-005d12128a
< Age: 9
< Connection: keep-alive
< X-Cache-Int: cp3045 miss

A lot of files fail to render for various reasons, and end up as 429s because we don't want to constantly retry rendering a failing file; that would be a waste of resources. See the failure throttling documentation here: https://wikitech.wikimedia.org/wiki/Thumbor#Throttling
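
The mechanism behind that page is essentially a per-file failure counter with an expiry; here is a minimal sketch of the concept using the memcached text protocol (host, key name, TTL and threshold are made up for illustration and are not Thumbor's actual values):

# Illustration only -- not Thumbor's real key names, thresholds or hosts.
# Each failed render bumps a per-file counter with a TTL; once the counter passes
# a threshold, further requests for that file get a 429 without re-rendering.
KEY="thumbor-failure-Dojikko2.3.svg"
printf 'add %s 0 3600 1\r\n1\r\n' "$KEY" | nc -w1 mc.example 11211   # first failure, 1h expiry
printf 'incr %s 1\r\n' "$KEY" | nc -w1 mc.example 11211              # later failures bump the counter
printf 'get %s\r\n' "$KEY" | nc -w1 mc.example 11211                 # consulted before each render attempt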

At this point it's been a long time since I've seen an issue that had to do with Thumbor itself; all remaining bugs come down to the underlying command/library that processes a given image being unable to render particular files, or hitting memory/duration limits. Regarding the SVG in this bug report, it's probably failing due to excessive use of filters, as pointed out by @Cmglee.

As for the PDF you've just mentioned in the comments, it seems like Ghostscript can't process it. It deserves its own task, whose solution is highly likely to be "upgrade to a more recent version of Ghostscript", like most bugs of that nature.
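
To confirm that Ghostscript itself chokes on the file, a quick local test along these lines should be enough (the local file name, page number and resolution are assumptions chosen to roughly match the failing page257-836px thumbnail):

# Rough local check: render page 257 of the PDF with Ghostscript alone.
# Assumes the original has been downloaded as O_rolnictwie.pdf.
gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=png16m -r150 \
   -dFirstPage=257 -dLastPage=257 \
   -o /tmp/page257.png O_rolnictwie.pdf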

Just bumped into another very frequent one:

08:18:54 ema@cp5005.eqsin.wmnet:~
$ curl -i -w "Connect: %{time_connect} TTFB: %{time_starttransfer} Total time: %{time_total}\n" https://swift.discovery.wmnet/wikipedia/commons/thumb/4/4d/Constitution_of_India.jpg/113px-Constitution_of_India.jpg.webp 
HTTP/2 429 
date: Thu, 27 Jun 2019 08:19:59 GMT
content-type: text/html; charset=UTF-8
content-length: 0
thumbor-request-id: 0AC010BE:3328_0A020118:2260_5D147C27_113468:65DA
thumbor-memcache-get-time: 11
thumbor-request-date: Thu, 27 Jun 2019 08:19:51 GMT
server: TornadoServer/4.4.3
thumbor-poolcounter-time: 8013
proxy-response-date: 27/Jun/2019:08:19:51 +0000
cache-control: no-cache
proxy-request-date: 27/Jun/2019:08:19:51 +0000
x-upstream: 127.0.0.1:8814
x-trans-id: tx182d805b02ef4b7a81797-005d147c27

Connect: 0.447759 TTFB: 9.175094 Total time: 9.175171

I've seen this one stuck in poolcounter throttling for a while; it's definitely quite hot. I'll check whether it's renderable at all. Usually images that can't render get throttled by the failure throttle rather than by poolcounter.

@Gilles FWIW, I just got:
Request from 176.207.117.69 via cp3038 frontend, Varnish XID 415308351
Upstream caches: cp3038 int
Error: 429, Too Many Requests at Wed, 16 Oct 2019 14:06:59 GMT

while trying to access other resolutions for https://commons.wikimedia.org/wiki/File:A_Universal_Code_of_Conduct_-_with_input_from_the_CEE_communities.pdf,
since it won't display any preview for pages from 7 onward.
If this is irrelevant or should be moved elsewhere, please let me know.

@Elitre it should be its own task, since it's a PDF failing to render and this task is about an SVG.

OK, will file separately then. Thank you.

rsvg-convert 2.40.16 processed this file at 1000px in 75 seconds, which is over the 60s timeout. 2.48.4 did it in about 5 seconds. There are over 700 uses of Gaussian blur in the file, which is the most likely culprit. T200866 is the task for that, but I'll leave this one open because of the question about Memcached or Poolcounter doing the throttling.
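
For anyone who wants to repeat that timing comparison locally, something along these lines should work (assumes the original has been downloaded as Dojikko2.3.svg; --width sets the output width in pixels):

# Rough reproduction sketch: time a 1000px-wide PNG render of the SVG locally.
time rsvg-convert --width 1000 --keep-aspect-ratio -o /tmp/Dojikko2.3_1000px.png Dojikko2.3.svg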

As for the issue mentioned in the task title, I don't believe 408 would be a correct response code. A 408 means the server closed a connection because the client made no request before the timeout, whereas this error is internal to the server and occurs while it is processing a request.

BBlack subscribed.

Removing Traffic during a ticket cleanup, as it seems like this issue lies within thumbor and not the traffic infra layers. Feel free to add us back if some VCL support is needed for any eventual changes here!