Page MenuHomePhabricator

wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure"
Closed, ResolvedPublicPRODUCTION ERROR

Description

Started immediately when wmf.14 went to group1, and stopped when it was undeployed.

https://logstash.wikimedia.org/goto/8c959f0dc2c3eb54b14e6de6dc1741e6

Source factors:

  • wikis: commons.wikimedia.org
  • domain: jobrunner.discovery.wmnet
  • hosts: many different mw* hosts.

Example:

channel: JobExecutor
type: ERROR
message:

   Failed executing job: ThumbnailRender File:Tour_des_Marques_Castle_of_Chenonceau_01.jpg requestId=W1iJuApAAEsAAAhZJ8MAAADP transformParams={"width":1024}

reqId: W1iJuApAAEsAAAhZJ8MAAADP
job_type: ThumbnailRender
error: "ThumbnailRenderJob::run: HTTP request failure"

The file name and thumbnail are just an example of one, but the actual errors are for many different files and sizes.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
zeljkofilipin triaged this task as Unbreak Now! priority.Jul 25 2018, 3:32 PM
zeljkofilipin subscribed.

Priority is UBN because it's blocking the train.

15:36:06 +marktraceur | greg-g: Do we have any more details from the various logs?
15:37:24       greg-g | marktraceur: nothing off hand
15:37:30 +marktraceur | Hmm
15:37:39 +marktraceur | "HTTP request failure" is a pretty vague message
15:38:50 +marktraceur | greg-g: Maybe gilles has thoughts but my knowledge of arcane Thumbor lore is pretty low
15:39:06 +marktraceur | If not I can try to dig
15:39:15       greg-g | isn't he on vacation?
15:39:24 +marktraceur | Oh, gosh. OK.
15:39:29       greg-g | his calendar says yes
15:39:53 +marktraceur | The other person I can think of that might have a slight edge on me for thumbnailing job trivia is brion
15:45:31       greg-g | brion: do you have time to investigate that thumbnail issue that's blocking the train? ^

cc'ing @Gilles and @brion.

Thumbor logs for the same period.

There are several things that should be improved in the logging:

  • either Thumbor should log the MediaWiki request ID, or MediaWiki should log the Thumbor error ID (the latter is more brittle as Thumbor might fail hard enough that it can't return a response at all, which seems to be the case here)
  • ThumbnailRenderJob::hitThumbUrl should log the error message from the Status returned by the HTTP call, not just discard it silently.
  • nitpick: _type:logback for thumbor is rather unintuitive. Is That something generic? If not, could it just use _type:thumbor? (also, could we have a thumbor dashbard in logstash?)

Gilles is on vacation at the moment, unlikely that he'll see this. I'll poke around in his absence to see if I can figure out what's going on.

Thumbor errors whose frequency matches the shape of the MediaWiki errors somewhat: [[https://logstash.wikimedia.org/goto/52a0c99a5580ba6e04167530164db7cc|Throttled by PoolCounter]], [[https://logstash.wikimedia.org/goto/8d3754c14750653c00444887dc4b48ba|[ExiftoolRunner] error]]. The first sounds more plausible (I seem to recall the core platform team doing some PoolCounter work recently, also could just be caused by something in the new branch generating more thumbnail requests) but there are way more of those than MediaWiki errors; also the MediaWiki error suggests that Thumbor died without returning any kind of response and I imagine PoolCounter errors would be handled nicer than that. The ExiftoolRunner errors have the right frequency but does not sound like something that would kill a request.

Of course, something that causes Thumbor to not return any HTTP status might break logging as well.

@Tgr PoolCounter throttling is not new. During the same period yesterday, 403 requests were throttled by PoolCounter, for example.

None of the errors in the Thumbor log are new, but the PoolCounter and ExiftoolRunner ones spike at the same time when the errors on the MediaWiki side do. Given that this was a train deploy (so MediaWiki changed but Thumbor or its OS environment didn't), PoolCounter would seem plausible, except there are ~150 thumbnail job errors and ~1000 PoolCounter errors (which makes sense since most thumbnailing requests do not come from the job, that's just for freshly uploaded files, but then the spike is either a coincidence or something caused the non-job requests to jump as well).

At a glance, none of the wmf.14 commits in core or VipsScaler seem related, and I don't think anything else is involved in JPG thumbnail generation.

Some debugging. First, on commonswiki (from terbium mwmaint1001) which is on wmf.13, [with extraneous command echos removed]

hphpd> $transformParams = [ 'width' => 1024 ];
hphpd> $file = wfLocalFile( Title::newFromText( 'File:Tour_des_Marques_Castle_of_Chenonceau_01.jpg' ) );
hphpd> $file->load( File::READ_LATEST );
hphpd> =$file->exists()
true
hphpd> $thumbName = $file->thumbName( $transformParams );
hphpd> $thumbUrl = $file->getThumbUrl( $thumbName );
hphpd> =$wgUploadThumbnailRenderHttpCustomDomain
"upload.svc.eqiad.wmnet"
hphpd> $parsedUrl = wfParseUrl( $thumbUrl );
hphpd> $thumbUrl = '//' . $wgUploadThumbnailRenderHttpCustomDomain . $parsedUrl['path'];
hphpd> =$thumbUrl
"//upload.svc.eqiad.wmnet/wikipedia/commons/thumb/4/44/Tour_des_Marques_Castle_of_Chenonceau_01.jpg/1024px-Tour_des_Marques_Castle_of_Chenonceau_01.jpg"
hphpd> $request = MWHttpRequest::factory( $thumbUrl, [ 'method' => 'HEAD', 'followRedirects' => true ], __METHOD__ );
hphpd> $request->setHeader( 'Host', $wgUploadThumbnailRenderHttpCustomHost );
hphpd> $status = $request->execute();
hphpd> =$status->isOk()
false
hphpd> =$status->getMessage()->plain()
"Error fetching URL: Could not resolve host: upload.svc.eqiad.wmnet"
hphpd> =$request->getStatus();
200

Huh, there's an error but the status code is 200. Let's try that request on testwiki (from mwdebug1001) which is still on wmf.14:

hphpd> $thumbUrl = "//upload.svc.eqiad.wmnet/wikipedia/commons/thumb/4/44/Tour_des_Marques_Castle_of_Chenonceau_01.jpg/1024px-Tour_des_Marques_Castle_of_Chenonceau_01.jpg"
hphpd> $request = MWHttpRequest::factory( $thumbUrl, [ 'method' => 'HEAD', 'followRedirects' => true ], __METHOD__ );
hphpd> $request->setHeader( 'Host', $wgUploadThumbnailRenderHttpCustomHost );
hphpd> $status = $request->execute();
hphpd> =$status->isOk()
false
hphpd> =$status->getMessage()->plain()
"Error fetching URL: Could not resolve host: upload.svc.eqiad.wmnet"
hphpd> =$request->getStatus();
0

Ah ha. And rMW0584339f5edf: Added non-parallel fallback to MultiHttpClient when curl is unavailable is new in wmf.14 and did indeed make a change along those lines:

diff --git a/includes/http/MWHttpRequest.php b/includes/http/MWHttpRequest.php
index 71e08a815f..257955c13d 100644
--- a/includes/http/MWHttpRequest.php
+++ b/includes/http/MWHttpRequest.php
@@ -387,6 +388,11 @@ abstract class MWHttpRequest implements LoggerAwareInterface {
        protected function parseHeader() {
                $lastname = "";
 
+               // Failure without (valid) headers gets a response status of zero
+               if ( !$this->status->isOK() ) {
+                       $this->respStatus = '0';
+               }
+
                foreach ( $this->headerList as $header ) {
                        if ( preg_match( "#^HTTP/([0-9.]+) (.*)#", $header, $match ) ) {
                                $this->respVersion = $match[1];

So it looks like the actual thumbnail calls were probably already failing because upload.svc.eqiad.wmnet can't be resolved, and wmf.14 just made MWHttpRequest->getStatus() stop returning an incorrect 200 status code with the error so ThumbnailRenderJob now actually notices.

Okay, so this is the sanest method I could find for correlating Thumbor and MediaWiki logs: look at MediaWiki log, get filename, open file page + open thumbnail in Commons to get file URL prefix, search for url:"/wikipedia/commons/thumb/<hash>/*" in the given time period. (Trying to include anything more from the URL gives no results; the number of false positives I get this way are manageable though.)

That suggests the Thumbor error does not get logged (with an url field, at least). Sometimes a PoolCounter error is logged but that seems accidental; many of the thumbnails have no matching logs in Thumbor.

The domain was removed in rODNS0593daa89b07: Remove {bits,text,upload,mobile}.svc.$site.wmnet. I guess the thumbnail job should just use the public domain name instead (ie. don't set $wgUploadThumbnailRenderHttpCustomDomain)?

I guess short term fix is to disable thumbnail prerendering since it is broken anyway (and watch the media dashboard to make sure no actually working functionality was affected).

@Tgr I think that's right. Do you mind doing so?

Change 447843 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/mediawiki-config@master] wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain

https://gerrit.wikimedia.org/r/447843

Actually, I take that back. We should be able to just remove the wgUploadThumbnailRenderHttpCustomDomain setting from wmf-config. Patch incoming.

At risk of asking the obvious – have we decided that the change that triggered this (failure becoming status=0) is intended, and that the caller was already at fault before, but we couldn't see it?

I suspect it might also be that both are the case (the Thumbnail caller was already at fault, but the change in question might also have introduced a bug). Reason being that 1) it's unclear to me where the 200 previously came from, and 2) I don't see any test that specifically expects getStatus to return zero.

Once the source of 200 is known, and we decide the behaviour is intended, we may want to document/test that as well for future reference. E.g. that getStatus() will ignore the server's status code if other headers are invalid, or that 0 is used for all such errors.

200 is the default value for that property; overriding it when there are errors but no response headers to parse is definitely the right thing to do. One might argue that it should be set to null or maybe 500 instead but all those options would have the same result here, just with slightly different error messages.

At risk of asking the obvious – have we decided that the change that triggered this (failure becoming status=0) is intended, and that the caller was already at fault before, but we couldn't see it?

Seems clear to me.

  1. it's unclear to me where the 200 previously came from,

From the default value of MWHttpRequest->respStatus being "200 Ok", which got used when the headerList did not have anything resembling an HTTP Status-Line.

  1. I don't see any test that specifically expects getStatus to return zero.

I don't see any tests for it either. MWHttpRequest in general doesn't seem to have tests, considering the state of https://doc.wikimedia.org/cover/mediawiki-core/includes/http/MWHttpRequest.php.html.

E.g. that getStatus() will ignore the server's status code if other headers are invalid, or that 0 is used for all such errors.

Neither of those is correct. "0 is returned if no HTTP Status-Line was found, or if it doesn't have a Status-Code after the HTTP-Version" would be closer.

"0 is returned if no HTTP Status-Line was found, or if it doesn't have a Status-Code after the HTTP-Version" would be right.

I see. So all network/transport level errors, and responses without a status code (in the above form) would result in 0. That seems worth testing, but thanks for confirming, that at least rules out any need for follow-up (non-test code-wise) on the http lib patch.

At this point, just waiting on someone with access to merge the config change linked above

Change 447843 merged by jenkins-bot:
[operations/mediawiki-config@master] wmf-config: Comment out wgUploadThumbnailRenderHttpCustomDomain

https://gerrit.wikimedia.org/r/447843

Mentioned in SAL (#wikimedia-operations) [2018-07-25T20:05:15Z] <krinkle@deploy1001> Synchronized wmf-config/: Remove wgUploadThumbnailRenderHttpCustomDomain override - T200346 (duration: 00m 57s)

Krinkle claimed this task.

Tentatively closing. We'll find out when we reactivate wmf.14 on group1.

I see. So all network/transport level errors, and responses without a status code

FYI, this is also why browser AJAX requests return status code 0 for CORS errors.

Change 448021 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Update custom domain for thumbnail rendering

https://gerrit.wikimedia.org/r/448021

Change 448021 merged by jenkins-bot:
[operations/mediawiki-config@master] Update custom domain for thumbnail rendering

https://gerrit.wikimedia.org/r/448021

Mentioned in SAL (#wikimedia-operations) [2018-07-26T21:19:43Z] <tgr@deploy1001> Synchronized wmf-config/ProductionServices.php: SWAT: T200346 Update custom domain for thumbnail rendering (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2018-07-26T21:21:14Z] <tgr@deploy1001> Synchronized wmf-config/LabsServices.php: SWAT: T200346 Update custom domain for thumbnail rendering (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2018-07-26T21:22:49Z] <tgr@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: T200346 Update custom domain for thumbnail rendering (duration: 00m 54s)

Change 450033 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] [WIP] Improve error logging in ThumbnailRenderJob

https://gerrit.wikimedia.org/r/450033

Change 450036 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] Set request ID when calling Thumbor

https://gerrit.wikimedia.org/r/450036

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM

Change 450036 abandoned by Krinkle:
[mediawiki/core@master] Set request ID when calling Thumbor

Reason:
Superseded, latest master has been doing $this->setHeader( 'X-Request-Id', WebRequest::getRequestId() ); for a while now.

https://gerrit.wikimedia.org/r/450036