Page MenuHomePhabricator

ThumbnailRender job spam due to HTTP 500s
Closed, ResolvedPublic

Description

Probably due to the thumb size < original size rule. Those should give HTTP 412 or something, and then the job can return "true" in run() for such cases, rather than logging and retrying two more times (and logging).

2015-07-23 08:31:09 mw1010 commonswiki runJobs ERROR: ThumbnailRender File:Silene_otites_002.JPG transformParams={"width":"640"} (uuid=fbfbc971ef4745f39f4c1dae81c87570,timestamp=1437640260,QueuePartition=rdb1) t=8620 error=ThumbnailRenderJob::run: incorrect HTTP status 500
2015-07-23 08:32:08 mw1004 commonswiki runJobs ERROR: ThumbnailRender File:Capitaine_de_vaisseau-IMG_9252.JPG transformParams={"width":"1280"} (uuid=e10f0d79ba4e4e6d820d0ccc4a162c66,timestamp=1437640319,QueuePartition=rdb1) t=8224 error=ThumbnailRenderJob::run: incorrect HTTP status 500
2015-07-23 08:32:08 mw1007 commonswiki runJobs ERROR: ThumbnailRender File:Capitaine_de_vaisseau-IMG_9252.JPG transformParams={"width":"800"} (uuid=b01d5e67754b40c2bca5149b4ff10470,timestamp=1437640319,QueuePartition=rdb1) t=8227 error=ThumbnailRenderJob::run: incorrect HTTP status 500
2015-07-23 08:32:08 mw1004 commonswiki runJobs ERROR: ThumbnailRender File:Capitaine_de_vaisseau-IMG_9252.JPG transformParams={"width":"1024"} (uuid=cdd28d7220054f02a9bd1138f65865ab,timestamp=1437640319,QueuePartition=rdb1) t=8257 error=ThumbnailRenderJob::run: incorrect HTTP status 500
2015-07-23 08:32:08 mw1002 commonswiki runJobs ERROR: ThumbnailRender File:Capitaine_de_vaisseau-IMG_9252.JPG transformParams={"width":"320"} (uuid=7f0ca2623ed24392ad8ca7c1ee8b1a96,timestamp=1437640319,QueuePartition=rdb1) t=8184 error=ThumbnailRenderJob::run: incorrect HTTP status 500
2015-07-23 08:32:08 mw1002 commonswiki runJobs ERROR: ThumbnailRender File:Capitaine_de_vaisseau-IMG_9252.JPG transformParams={"width":"640"} (uuid=5ccea45bd1254be0a30d2c2072a07d9f,timestamp=1437640319,QueuePartition=rdb1) t=8199 error=ThumbnailRenderJob::run: incorrect HTTP status 500
2015-07-23 08:32:42 mw1002 commonswiki runJobs ERROR: ThumbnailRender File:Silene_otites_003.JPG transformParams={"width":"800"} (uuid=eeec2978013c4ce588e53ab76a8eb425,timestamp=1437640353,QueuePartition=rdb2) t=8156 error=ThumbnailRenderJob::run: incorrect HTTP status 500
2015-07-23 08:32:42 mw1002 commonswiki runJobs ERROR: ThumbnailRender File:Silene_otites_003.JPG transformParams={"width":"640"} (uuid=a9975fc34238417eaf9620758c433f0f,timestamp=1437640353,QueuePartition=rdb2) t=8203 error=ThumbnailRenderJob::run: incorrect HTTP status 500
2015-07-23 08:32:42 mw1001 commonswiki runJobs ERROR: ThumbnailRender File:Silene_otites_003.JPG transformParams={"width":"320"} (uuid=36dceb3042354b7aa0605367e9092190,timestamp=1437640353,QueuePartition=rdb2) t=8232 error=ThumbnailRenderJob::run: incorrect HTTP status 500
2015-07-23 08:33:18 mw1016 commonswiki runJobs ERROR: ThumbnailRender File:Contre-amiral-IMG_9154.JPG transformParams={"width":"2560"} (uuid=0e65942249454dadae5888e6cfb9dd38,timestamp=1437640389,QueuePartition=rdb1) t=8219 error=ThumbnailRenderJob::run: incorrect HTTP status 500

Event Timeline

aaron raised the priority of this task from to Needs Triage.
aaron updated the task description. (Show Details)
aaron subscribed.

Hah, just figured out what these files have in common. An uppercase .JPG in their page name.

What's happening isn't a huge deal, though, it just means that thumbnail pre-rendering at upload time wasn't working for those files.

Gilles triaged this task as Medium priority.Jul 28 2015, 2:43 PM

Change 227453 had a related patch set uploaded (by Gilles):
Gather more information about pre rendering 500s

https://gerrit.wikimedia.org/r/227453

Change 227456 had a related patch set uploaded (by Gilles):
Don't retry invalid thumbnail requests due to impossible width

https://gerrit.wikimedia.org/r/227456

On beta I see the 400 translated in a 500 by Varnish, but I can't pinpoint in the Varnish config where this is happening:

curl -i "http://commons.wikimedia.beta.wmflabs.org/w/thumb.php?f=Carl_Simon.jpg&width=793454"

Gives a 400, as expected. While the following gives a 500:

curl -H "Host: upload.beta.wmflabs.org" -i http://deployment-cache-upload02.eqiad.wmflabs/wikipedia/commons/thumb/3/3f/Carl_Simon.jpg/793454px-Carl_Simon.jpg

@fgiunchedi I remember that we "fixed" the 500/400 response for thumbs in T88412 and as seen above, the PHP returns the correct header at the moment. However I really can't figure out why Varnish turns that 400 into a 500. Any ideas?

Change 227453 merged by jenkins-bot:
Gather more information about pre rendering 500s

https://gerrit.wikimedia.org/r/227453

Change 227456 merged by jenkins-bot:
Don't retry invalid thumbnail requests due to impossible width

https://gerrit.wikimedia.org/r/227456

@fgiunchedi I remember that we "fixed" the 500/400 response for thumbs in T88412 and as seen above, the PHP returns the correct header at the moment. However I really can't figure out why Varnish turns that 400 into a 500. Any ideas?

mmh do you see the same in production? I can't reproduce ATM

i7:~$ curl -i "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Carl_Simon.jpg/79345px-Carl_Simon.jpg"
HTTP/1.1 400 Bad Request
Server: nginx/1.9.3
Date: Wed, 29 Jul 2015 13:10:05 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 385
Connection: keep-alive
X-Content-Type-Options: nosniff
X-Powered-By: HHVM/3.6.1
X-Mw-Thumbnail-Renderer: mw1155
Cache-Control: no-cache
X-Trans-Id: tx14125093230b4248aebd9-0055b8d098
X-Varnish: 3677239121, 1284444796 1284418870, 3300715481
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
Accept-Ranges: bytes
Age: 20
X-Cache: cp1074 miss (0), cp3036 hit (1), cp3046 frontend miss (0)
Strict-Transport-Security: max-age=31536000
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache, X-Varnish
Timing-Allow-Origin: *

<!DOCTYPE html>
<html><head>
<meta charset="UTF-8" />
<title>Error generating thumbnail</title>
</head>
<body>
<h1>Error generating thumbnail</h1>
<p>
Error creating thumbnail: Image was not scaled, is the requested width bigger than the source?
</p>
<!-- http://commons.wikimedia.org/w/thumb_handler.php/3/3f/Carl_Simon.jpg/79345px-Carl_Simon.jpg -->
<!-- mw1155 -->

</body>
</html>
i7:~$
mw1155:~$ curl -i -H 'Host: commons.wikimedia.org' -H 'X-forwarded-proto: https' 'http://commons.wikimedia.org/w/thumb_handler.php/3/3f/Carl_Simon.jpg/79345px-Carl_Simon.jpg'
HTTP/1.1 400 Bad Request
Server: Apache
X-Content-Type-Options: nosniff
Cache-control: no-cache
X-MW-Thumbnail-Renderer: mw1159
Content-Type: text/html; charset=utf-8
X-Varnish: 1357378451, 840169381
Via: 1.1 varnish, 1.1 varnish
Content-Length: 357
Accept-Ranges: bytes
Date: Wed, 29 Jul 2015 13:13:58 GMT
Age: 0
Connection: keep-alive
X-Cache: cp1065 miss (0), cp1054 frontend miss (0)
Strict-Transport-Security: max-age=31536000
Set-Cookie: GeoIP=:::::v6; Path=/; Domain=.wikimedia.org
X-Analytics: https=1
Set-Cookie: WMF-Last-Access=29-Jul-2015;Path=/;HttpOnly;Expires=Sun, 30 Aug 2015 12:00:00 GMT

<!DOCTYPE html>
<html><head>
<meta charset="UTF-8" />
<title>Error generating thumbnail</title>
</head>
<body>
<h1>Error generating thumbnail</h1>
<p>
Error creating thumbnail: Image was not scaled, is the requested width bigger than the source?
</p>
<!-- /w/thumb_handler.php/3/3f/Carl_Simon.jpg/79345px-Carl_Simon.jpg -->
<!-- mw1159 -->

</body>
</html>
mw1155:~$

mmh do you see the same in production? I can't reproduce ATM

Indeed, production looks fine and returns a 400, meaning that my change has already made the production logs less noisy.

The original issue found in @aaron's initial sample is still a mystery, now that I'm gathering more data:

The vast majority of 500s involved behave fine when requested later:

gilles@fluorine:/a/mw-log$ cat runJobs.log | grep ThumbnailRender | grep "status 500 when hitting" | awk -F " " '{print $NF}' | uniq | xargs -I thumburl curl -L -H "Host: upload.wikimedia.org" -H "X-forwarded-proto: https" -s -o /dev/null -I -w "%{http_code}\n" "http:thumburl" | sort | uniq -c
   1755 200
     96 500

More investigation is required to figure this out.

Checking the severity of this issue:

gilles@fluorine:/a/mw-log$ cat runJobs.log | grep ThumbnailRender | grep "\|\"width\":\"1920\"\"width\":\"1280\"\|\"width\":\"1024\"\|\"width\":\"800\"\|\"width\":\"640\"" | grep " good" | wc -l
37526
gilles@fluorine:/a/mw-log$ cat runJobs.log | grep ThumbnailRender | grep "\|\"width\":\"1920\"\"width\":\"1280\"\|\"width\":\"1024\"\|\"width\":\"800\"\|\"width\":\"640\"" | grep "status 500 when hitting" | awk -F " " '{print $NF}' | uniq | wc -l
9438

gilles@fluorine:/a/mw-log$ cat runJobs.log | grep ThumbnailRender | grep "\|\"width\":\"1920\"\"width\":\"1280\"\|\"width\":\"1024\"\|\"width\":\"800\"\|\"width\":\"640\"" | grep " good" | awk -F " " '{print $8}' | uniq | wc -l
25292
gilles@fluorine:/a/mw-log$ cat runJobs.log | grep ThumbnailRender | grep "\|\"width\":\"1920\"\"width\":\"1280\"\|\"width\":\"1024\"\|\"width\":\"800\"\|\"width\":\"640\"" | grep "status 500 when hitting" | awk -F " " '{print $8}' | uniq | wc -l
3157

Meaning that pre-rendering doesn't work at the time it's requested for about 12% of the files.

Change 304995 had a related patch set uploaded (by Gilles):
Delay thumbnail pre rendering job by a second

https://gerrit.wikimedia.org/r/304995

Change 304995 abandoned by Gilles:
Delay thumbnail pre rendering job by a second

Reason:
I think the answer is to move the code triggering the thumbnail prerendering to a better location, inside recordUpload2's onTransactionIdle call

https://gerrit.wikimedia.org/r/304995

Change 305010 had a related patch set uploaded (by Gilles):
Move thumbnail rendering to a more appropriate spot

https://gerrit.wikimedia.org/r/305010

Change 305010 merged by jenkins-bot:
Move thumbnail rendering to a more appropriate spot

https://gerrit.wikimedia.org/r/305010