Starting on 2018-08-01, ThumbnailRender jobs started failing because the transformation width stopped being included in the fetch URL even though they are being as seen in logstash.
Description
Details
Related Objects
Event Timeline
A quick investigation of merged commits for includes/jobqueue/jobs/ThumbnailRender.php, includes/file/File.php and related includes/media/* files (which are used in the generation of the URL) did not turn up any recent changes.
Weird. ImageHandler::makeParamString would throw an exception if the width wouldn't be specified, and that's not happening.
Where did you find out the reason is specifically that the width is not included in the final request URI? I can't find this info anywhere
The logstash errors (the error field, not the message) are all along the lines of ThumbnailRenderJob::run: incorrect HTTP status 404 when hitting //ms-fe.svc.eqiad.wmnet/wikipedia/commons/thumb/1/1c/Foo.tif/-Foo.tif.png
... which should be //ms-fe.svc.eqiad.wmnet/wikipedia/commons/thumb/1/1c/Foo.tif/${width}-Foo.tif.png. Kudos to @fgiunchedi for noticing the missing piece.
Weird indeed. I went through all the code paths that generate the file name and couldn't find one that would either omit or silently drop the width from the file name.
The problem persists, btw. There were 8k+ failures in the last 24h.
One more data point: most of these failures come from wmf-15 code (less than 1% of the messages come from wmf-14 and wmf-13 combined), which seems to suggest a change in wmf-15 is causing this. Alas, after going through the diff, I couldn't find any (obvious) candidates.
T200346: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" probably masked this error until recently.
Running the job manually from mwscript works fine. Does this failure happen on every job run or just some fraction of the attempts?
Does this failure happen on every job run or just some fraction of the attempts?
According to Grafana we've got ~1 job/s so it's almost 1000 jobs per 15 minutes for TumbnailRender job, according to kibana we get 50 failures per 15 minutes - so no, not all of them fail definitely.
Interesting fact is that trying to do the failed job again manually seems to succeed:
curl -i -XPOST -H 'content-type: application/json' --data '{"database": "commonswiki", "mediawiki_signature": "82391e33b714f7fc1408c298730cb2a6cc25cb75e5cb16e51b38dcc1cbf5fdda", "meta": {"domain": "commons.wikimedia.org", "dt": "2018-08-07T17:41:45+00:00", "id": "2784b115-9a69-11e8-86e2-14187761316a", "request_id": "W2nZ1gpAAEMAAFSG1csAAACT", "schema_uri": "mediawiki/job/2", "topic": "mediawiki.job.ThumbnailRender", "uri": "https://commons.wikimedia.org/wiki/File:BSicon_lINTACC-M.svg"}, "page_namespace": 6, "page_title": "File:BSicon_lINTACC-M.svg", "params": {"requestId": "W2nZ1gpAAEMAAFSG1csAAACT", "transformParams": {"width": 1280}}, "type": "ThumbnailRender"}' https://jobrunner.discovery.wmnet/rpc/RunSingleJob.php HTTP/2 200
Change 451332 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] Add temporary debug logging to ThumbnailRenderJob
Change 451332 abandoned by Gergő Tisza:
Add temporary debug logging to ThumbnailRenderJob
Reason:
should be done on a deploy branch instead
Change 451663 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@wmf/1.32.0-wmf.16] Add temporary debug logging to ThumbnailRenderJob
Change 451666 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Add temporary debug logging to ThumbnailRenderJob
Change 451663 merged by Niharika29:
[mediawiki/core@wmf/1.32.0-wmf.16] Add temporary debug logging to ThumbnailRenderJob
Mentioned in SAL (#wikimedia-operations) [2018-08-09T18:58:11Z] <niharika29@deploy1001> Synchronized php-1.32.0-wmf.16/includes/jobqueue/jobs/ThumbnailRenderJob.php: Add temporary debug logging T201305 (duration: 01m 11s)
Change 451666 merged by jenkins-bot:
[operations/mediawiki-config@master] Add temporary debug logging to ThumbnailRenderJob
Mentioned in SAL (#wikimedia-operations) [2018-08-09T19:02:18Z] <niharika29@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Add temporary debug logging to ThumbnailRenderJob T201305 (duration: 00m 57s)
Change 450033 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] ThumbnailRenderJob: normalize parameters before generating thumb URL
Change 450033 merged by jenkins-bot:
[mediawiki/core@master] ThumbnailRenderJob: normalize parameters before generating thumb URL
After the fix was deployed, the original issue have disappeared from the logs, although now we still have some occasional timeouts, which are retried, and sometimes we get 429 from Thumbor.
I will close this task and open a new one regarding 429 responses as it seems unrelated