
mw-jobrunner curl errors when talking to other services
Closed, Declined | Public | PRODUCTION ERROR

Description

I have noticed a vast number of curl timeouts coming from the mw-jobrunners:

https://logstash.wikimedia.org/goto/50d17abf3123f4330986c011f14ae177

I reckon it is worth understanding the impact, as well as investigating why they occur.

Event Timeline

MLechvien-WMF changed the subtype of this task from "Task" to "Production Error". Fri, Jan 23, 10:20 AM
MLechvien-WMF subscribed.

Volume seems very high; I'm surprised this does not fire a more visible alert.

I think this is logspam. Looking at logstash, most instances seem to happen in ThumbnailRenderJob, which sets a low timeout (1s) and is supposed to ignore timeout errors, because the point is only to hit Swift's 404 handler with a HEAD request so that the thumbnailing gets forwarded to Thumbor: https://gerrit.wikimedia.org/g/mediawiki/core/+/c3f19ea0dbcfa693e08ae573023c6128d16d5f40/includes/JobQueue/Jobs/ThumbnailRenderJob.php#112
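
For anyone unfamiliar with the pattern, here is a rough sketch of a fire-and-forget HEAD request whose timeout is treated as expected. It is written in Python purely for illustration and is not the MediaWiki code; the function name `poke_thumbnail`, the `opener` parameter, and the returned strings are all made up for this example.

```python
import socket
import urllib.error
import urllib.request


def poke_thumbnail(url, timeout=1.0, opener=urllib.request.urlopen):
    """Send a HEAD request only to trigger server-side work.

    Hypothetical helper: the caller does not need the response, so a
    timeout is reported as a non-error outcome instead of being raised.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return f"status {response.status}"
    except urllib.error.HTTPError as err:
        # A 4xx/5xx still means the backend received the request.
        return f"status {err.code}"
    except (TimeoutError, socket.timeout):
        # Timeout while reading the response: expected, so ignored.
        return "timeout (ignored)"
    except urllib.error.URLError as err:
        # Timeouts during connect are wrapped in URLError.
        if isinstance(err.reason, (TimeoutError, socket.timeout)):
            return "timeout (ignored)"
        raise  # anything else is a real failure
```

The catch here is that ignoring the timeout in the caller does nothing about whatever the lower HTTP layer has already logged, which is the situation described in this task.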

Unfortunately, looking at https://gerrit.wikimedia.org/g/mediawiki/core/+/c3f19ea0dbcfa693e08ae573023c6128d16d5f40/includes/Http/GuzzleHttpRequest.php (which handles the libcurl call) I don't see an easy way to suppress that error message.

As to why it doesn't fire an alert, that's probably because these are logged as errors and not exceptions.

As part of T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only, there is ongoing work to stop pregenerating thumbnails, which would retire ThumbnailRenderJob. It's currently blocked on T415282: MediaSearch should stop relying on render map config.

Declining this task: given our limited capacity, it's better to wait until the ongoing work retires the culprit job.