Page MenuHomePhabricator

Wikimedia thumbnail generator often returns HTTP 500 Internal Server Error
Closed, ResolvedPublic

Description

When requesting random images from Commons' MediaWiki API and requesting a thumbnail for it, I often get a HTTP 500.

For example:

GET http://upload.wikimedia.org/wikipedia/commons/thumb/f/fb/USMC-05934.jpg/400px-USMC-05934.jpg
HTTP 500
<html><head><title>Error generating thumbnail</title></head>
<body>
<h1>Error generating thumbnail</h1>
<p>
Error creating thumbnail:
</p>
<!-- http://commons.wikimedia.org/w/thumb_handler.php/f/fb/USMC-05934.jpg/400px-USMC-05934.jpg -->
<!-- mw1154 -->
</body>
</html>

Usually when trying again, it just works.

I've been getting these errors in many different environments:

  • When reading articles and HiDPI plugin swaps the src attributes (the larger version would fail maybe)
  • When opening the VisualEditor (rendering the new DOM means we re-parse the <img>tag and thus re-request it, thus making it more likely for the error to happen again)
  • When working with gadgets that render image galleries through requesting file category members and the thumbnail url.

I don't think the scenario is relevant, there is either something wrong with the thumbnail generator script that is triggered by lots of images. Or there is a few faulty servers in the upload.wikimedia.org pool that cause the errors.


Version: wmf-deployment
Severity: critical

Details

Reference
bz54045

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 2:10 AM
bzimport set Reference to bz54045.
bzimport added a subscriber: Unknown Object (MLST).
Krinkle created this task.Sep 11 2013, 11:17 PM

See also bug 53573. (which was mostly mw1153)

I don't think the scenario is relevant, there is either something wrong with
the thumbnail generator script that is triggered by lots of images. Or there is
a few faulty servers in the upload.wikimedia.org pool that cause the errors.

Thumbnail generation should be entirely independent of page view actions, so the scenario should definitely not matter

Always mw1154?

See also bug 53573, mw1153 was (not sure if it's changed) seemingly causing a lot more errors than its counterparts

Looks like mw1153 has stopped being so noisy, and it's now mw1154. Which I guess makes this a dupe of 53753

reedy@fluorine:/a/mw-log$ grep -c error thumbnail.log
324850
reedy@fluorine:/a/mw-log$ grep mw1153 thumbnail.log |grep -c error
22013
reedy@fluorine:/a/mw-log$ grep mw1154 thumbnail.log |grep -c error
54280
reedy@fluorine:/a/mw-log$ grep mw1155 thumbnail.log |grep -c error
22011
reedy@fluorine:/a/mw-log$ grep mw1156 thumbnail.log |grep -c error
21980
reedy@fluorine:/a/mw-log$ grep mw1157 thumbnail.log |grep -c error
21847
reedy@fluorine:/a/mw-log$ grep mw1158 thumbnail.log |grep -c error
22146
reedy@fluorine:/a/mw-log$ grep mw1159 thumbnail.log |grep -c error
21905
reedy@fluorine:/a/mw-log$ grep mw1160 thumbnail.log |grep -c error
22107

I just did a test on commons - visited a page with 200 images which I suspect were not previously rendered before (note: all were tiff files). 28 of them failed, all from mw1154. There's only 8 image scalars. 200/8 = 25, so its not hard to imagine this means all requests to mw1154 are failing.

Perhaps something is wrong with the cgroups config on that host (or something else that would prevent all shell commands from succeeding)

The ganglia graph seems to indicate that the CPU usage on that host has dropped off since friday - http://ganglia.wikimedia.org/latest/graph.php?g=cpu_report&z=xlarge&c=Image%20scalers%20eqiad&h=mw1154.eqiad.wmnet&l=e2ecff&v=&r=week&su=1&st=1378942982&x=0&n=0

reedy@fluorine:/a/mw-log$ grep mw1154 thumbnail.log |grep -c error
55424
reedy@fluorine:/a/mw-log$ grep -c mw1154 thumbnail.log
111320

When you look at the 500 error message, its supposed to include all the stdout and stderr from convert. However these are empty. Perhaps https://gerrit.wikimedia.org/r/83974 would help debug the situation (including stderr from limit.sh in the 500 error message).

(In reply to comment #5)

reedy@fluorine:/a/mw-log$ grep mw1154 thumbnail.log |grep -c error
55424
reedy@fluorine:/a/mw-log$ grep -c mw1154 thumbnail.log
111320

We pass things with \n's in them to wfDebugLog for the thumbnail log. I don't think we log successful thumbnailing events at all. So I think all that means is many of the thumbnail error log messages span several lines.

  • Bug 54188 has been marked as a duplicate of this bug. ***

Btw, good debugging step would be for someone with shell to try running the convert command by hand (with limit.sh) to see what happens (e.g. if issue with cgroup config this would give an error that wouldn't be included in thumb log)

See also gerrit change 83974 which would give more useful error messages in certain circumstances (which may or may not help for the current situation)

The entries in /a/mw-log/thumbnail.log are often hard to correlate since they use temporary file names. However it appears some (or all?) contain a MediaWiki File-namespace url as well.

I just got another HTTP 500 Internal Server Error and found the following entry:

URL: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Trevor_Parscal_December_2008.jpg/660px-Trevor_Parscal_December_2008.jpg

Triggered from an <img> tag on https://commons.wikimedia.org/w/index.php?title=File:Trevor_Parscal_December_2008.jpg&action=submit when previewing an edit.

[04:11 UTC] krinkle at fluorine in /a/mw-log
$ tail thumbnail.log -n500 | ack-grep Trevor -C 10

Bytes: 0xFF 0x27 0x20 0x69" from "'/usr/bin/'rsvg-convert --no-external-files -w 120 -h 61 -o '/tmp/transform_5a95aeb43c6b-1.png' '/tmp/localcopy_61a6c00091e1-1.svg' 2>&1"
2013-09-17 04:10:06 mw1154 commonswiki: thumbnail failed on mw1154: error 1 "" from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=660x440 '/tmp/localcopy_cb6372f89daf-1.jpg' -thumbnail '660x440!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Trevor_Parscal_December_2008.jpg' -depth 8 -sharpen '0x0.8' -rotate -0 '/tmp/transform_f22fd870a3fc-1.jpg' 2>&1"
2013-09-17 04:10:06 mw1154 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_f22fd870a3fc-1.jpg". unlink() succeeded
2013-09-17 04:10:06 mw1157 commonswiki: thumbnail failed on mw1157: error 1 "convert: no decode delegate for this image format `/a/magick-tmp/magick-NyMRfqeG' @ error/constitute.c/ReadImage/532.
convert: missing an image filename `/tmp/transform_6c0a3d0fd7c4-1.jpg' @ error/convert.c/ConvertImageCommand/3011." from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=120x53 '' -thumbnail '120x53!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Banknote_5000_rubles_(1997)_front.jpg' -depth 8 -sharpen '0x0.8' -rotate -0 '/tmp/transform_6c0a3d0fd7c4-1.jpg' 2>&1"
..
2013-09-17 04:13:59 mw1159 commonswiki: thumbnail failed on mw1159: error 1 "Error reading SVG:Error domain 1 code 96 on line 1 column 17 of file:///tmp/localcopy_2362643f10ec-1.svg: Malformed declaration expecting version" from "'/usr/bin/'rsvg-convert --no-external-files -w 120 -h 96 -o '/tmp/transform_8ba6acf9d427-1.png' '/tmp/localcopy_2362643f10ec-1.svg' 2>&1"
2013-09-17 04:13:59 mw1154 commonswiki: thumbnail failed on mw1154: error 1 "" from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=660x440 '/tmp/localcopy_babba88e51a4-1.jpg' -thumbnail '660x440!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Trevor_Parscal_December_2008.jpg' -depth 8 -sharpen '0x0.8' -rotate -0 '/tmp/transform_de7473e35169-1.jpg' 2>&1"
2013-09-17 04:13:59 mw1154 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_de7473e35169-1.jpg". unlink() succeeded
2013-09-17 04:14:01 mw1154 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_55d5b98e5f7b-1.png". unlink() succeeded

Could this server be taken out of rotation until the issue is resolved?

Part of the question would be is it still the same machine? (based on the fact it's already changed once)

Ganglia graphs suggest so (mw1154 has less activity) but I haven't specificly tested.

Reedy added a comment.Sep 19 2013, 1:05 AM

Yup, still mw1154

reedy@fluorine:/a/mw-log$ grep -c mw1153 thumbnail.log
50057
reedy@fluorine:/a/mw-log$ grep -c mw1154 thumbnail.log
125850
reedy@fluorine:/a/mw-log$ grep -c mw1155 thumbnail.log
50451
reedy@fluorine:/a/mw-log$ grep -c mw1156 thumbnail.log
50584
reedy@fluorine:/a/mw-log$ grep -c mw1157 thumbnail.log
51015
reedy@fluorine:/a/mw-log$ grep -c mw1158 thumbnail.log
50386
reedy@fluorine:/a/mw-log$ grep -c mw1159 thumbnail.log
50134
reedy@fluorine:/a/mw-log$ grep -c mw1160 thumbnail.log
51486

Reedy added a comment.Sep 19 2013, 1:10 AM

[02:06:08] <Reedy> Can someone depool mw1154 from the image scaler cluster?
[02:06:20] <paravoid> what's wrong with it?
[02:07:40] <Reedy> It's seemingly generating a lot more errors (by a factor of 2 or so) than any of the other scalers
[02:07:41] <Reedy> https://bugzilla.wikimedia.org/show_bug.cgi?id=54045
[02:08:32] <paravoid> root@mw1154:~# ls /sys/fs/cgroup/memory/mediawiki/job
[02:08:33] <paravoid> ls: cannot access /sys/fs/cgroup/memory/mediawiki/job: No such file or directory
[02:08:35] <paravoid> blergh
[02:09:01] <paravoid> fixed
[02:09:58] <Reedy> is that's what's up with it?
[02:10:05] <paravoid> yes, fixed
[02:10:09] <Reedy> awesome
[02:10:10] <Reedy> thanks

Gilles raised the priority of this task from High to Unbreak Now!.Dec 4 2014, 10:21 AM
Gilles moved this task from Untriaged to Done on the Multimedia board.
Gilles lowered the priority of this task from Unbreak Now! to High.Dec 4 2014, 11:20 AM