
Videos intermittently failing to transcode with error "Exception: Shellbox server returned status code 503"
Open, Medium · Public · Production Error

Description

Steps to replicate the issue:

  • (Presumably) Upload a video / reset a transcode
  • Occasionally, the transcode appears to fail with the error "Exception: Shellbox server returned status code 503"

Examples noticed (from enwiki's Special:Transcode_statistics; this probably isn't a complete list though as some may have since had the transcode successfully reset):

Judging by https://en.wikipedia.org/wiki/File:Das_indische_Grabmal_(1921)_Die_Sendung_des_Yoghi.webm, this reason for failure may have been occurring since at least 2024-09-06T06:44:18.

If I've put this Quarry query together correctly, as of filing this task, this error accounts for 60 of the 132 transcoding errors on Commons since the start of the year.

What should have happened instead?:
Transcoding is successfully completed.

Notes:
I've tagged ServiceOps new in case you're the right team to be able to investigate what's happening here (based on you being tagged in T373517), feel free to untag/change though & apologies if the tag is incorrect!

Event Timeline

jijiki triaged this task as Medium priority. (Edited Feb 3 2025, 12:18 PM)
jijiki changed the subtype of this task from "Bug Report" to "Production Error".
jijiki subscribed.

I have not gone back one by one to match timestamps, but from a quick peek I found that:

  • This error may appear during shellbox-video deployments, for example:
02:45 swfrench-wmf: scaled down shellbox-video/migration after switch to PHP 8.1 - T377038
02:44 swfrench@deploy2002: helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
02:44 swfrench@deploy2002: helmfile [codfw] START helmfile.d/services/shellbox-video: apply

matches

https://en.wikipedia.org/wiki/File:Szale%C5%84cy_(Daredevils,_1928)_by_Leonard_Buczkowski.webm (VP9 720P, error on 2025-01-30T02:44:57)

  • It may also appear when a container is being restarted due to failing readiness probes:

https://logstash.wikimedia.org/goto/a7706f106bfffe1ad7c81aa69a789eb6


matches

https://en.wikipedia.org/wiki/File:Les_Mis%C3%A9rables_(1925),_episode_2.webm (WebM 360P, error on 2025-01-25T07:33:39)
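
The timestamp matching described above could be automated along these lines. This is a minimal Python sketch, not the tooling actually used: the function name `errors_near_events` and the 60-second tolerance are hypothetical, and the deployment date is an assumption inferred from the matching transcode error, since the log excerpt only shows times of day.

```python
from datetime import datetime, timedelta


def errors_near_events(errors, events, tolerance=timedelta(seconds=60)):
    """Return (error_time, label) pairs where an error falls inside an
    event window (start, end, label), widened by `tolerance` on both sides."""
    matches = []
    for err in errors:
        for start, end, label in events:
            if start - tolerance <= err <= end + tolerance:
                matches.append((err, label))
    return matches


# Window from the helmfile START at 02:44 to the scale-down logged at 02:45.
# The date (2025-01-30) is an assumption taken from the matching error.
events = [
    (datetime(2025, 1, 30, 2, 44), datetime(2025, 1, 30, 2, 45),
     "shellbox-video helmfile apply (codfw)"),
]

errors = [
    datetime(2025, 1, 30, 2, 44, 57),  # VP9 720P error quoted above
    datetime(2025, 1, 25, 7, 33, 39),  # WebM 360P error (container restart, not a deploy)
]

matched = errors_near_events(errors, events)
for when, label in matched:
    print(f"{when.isoformat()} coincides with: {label}")
```

Only the first error lines up with the deployment window; the second would need the container restart times from Logstash as additional events.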

Noting T376914: Missing metadata on TIFF file due to api.php?action=query&prop=stashimageinfo failing with "ShellboxError: Shellbox server returned status code 503" which has a similar error description — That one is specifically tagged against MediaWiki-extensions-PagedTiffHandler, though, and as I can’t see behind the scenes I don’t know whether or not it’s the same underlying error(s) as here :)

Thanks for connecting the dots, here, @jijiki, and for reporting the issue @A_smart_kitten.

With the changes in T385225 live, errors during video encoding such as these should now be consistently retried (as originally intended).

One point of note, though: My understanding is that when a transcode does fail in this manner, a status such as Exception: Shellbox server returned status code 503 may transiently be displayed in the "Transcode status" table (e.g., between retry attempts).

Thanks, @Scott_French! Would it be safe to say that - if a 'transcode status' row now remains on "Exception: Shellbox server returned status code 503" for an extended period of time (i.e., several days (?)) - an issue still exists somewhere/things aren't working as intended? If so, I'll try and occasionally rerun https://quarry.wmcloud.org/query/90379 to keep an eye on that, and re-report the issue/leave a comment here if that occurs (although fingers crossed it doesn't :)).

Thanks, @A_smart_kitten. If a transcode that started after ~ 14:30 UTC today gets persistently stuck in Exception: Shellbox server returned status code 503 status, then that indeed suggests something unexpected is happening.

Exactly how long to call "persistent" is a tough one, because I'm not exactly sure what an internal retry of WebVideoTranscodeJob will look like in the UI (e.g., as far as I can tell, the job does not reset the error status until it eventually succeeds). Perhaps @hnowlan might know? In any case, "days" certainly sounds like a reasonable upper bound.

Unfortunately shellbox doesn't give us a lot of granularity of response here - a 503 could be a problem with the shellbox service itself, or it could be a problem with the command run within shellbox (i.e., ffmpeg fails when called against the file in question). If shellbox returns 503 errors for all versions of a file, it's more likely that the file itself is at fault rather than shellbox.
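
As a rough illustration of that rule of thumb, here is a small Python sketch. It is a heuristic only, not something shellbox or TimedMediaHandler provides: the function name `likely_503_source`, the input shape (a mapping from transcode name to whether it ended in a 503), and the returned labels are all hypothetical.

```python
def likely_503_source(failed_with_503):
    """Guess where a 503 most likely originates for one file.

    `failed_with_503` maps a transcode name (e.g. "360p.webm") to True if
    that transcode ended with a Shellbox 503, False if it completed.
    """
    if not failed_with_503:
        return "unknown"          # nothing has been attempted yet
    outcomes = list(failed_with_503.values())
    if all(outcomes):
        return "file"             # every version failed: suspect the input file
    if any(outcomes):
        return "service"          # only some failed: suspect shellbox or its surroundings
    return "none"                 # no 503 seen at all


print(likely_503_source({"360p.webm": True, "720p.vp9.webm": True}))
print(likely_503_source({"360p.webm": True, "720p.vp9.webm": False}))
```

A "service" result here is only a hint to look at deployments or container restarts around the failure time.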

A retried job will not update the status until successful.
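
To make the visible effect concrete, here is a toy Python model of that behaviour. It is not the WebVideoTranscodeJob code: `run_with_retries`, the `status` dictionary, and the fake job are hypothetical names used purely to show why the error text can linger between attempts.

```python
def run_with_retries(job, attempts, status):
    """Run `job` up to `attempts` times. The stored error is only cleared
    once an attempt succeeds, so it stays visible between retries."""
    for _ in range(attempts):
        try:
            job()
        except Exception as exc:
            status["error"] = f"Exception: {exc}"
            continue
        status["error"] = None
        return True
    return False


calls = {"count": 0}
seen = []  # what a reader of the status table would see after each attempt


def flaky_job():
    """Fail twice with a 503-style error, then succeed."""
    calls["count"] += 1
    try:
        if calls["count"] < 3:
            raise RuntimeError("Shellbox server returned status code 503")
    finally:
        seen.append(calls["count"])


status = {"error": None}
succeeded = run_with_retries(flaky_job, attempts=5, status=status)
print(succeeded, status["error"], calls["count"])
```

In this model the error message is shown after the first and second attempts and disappears only after the third, successful one.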

Given that there are very different tools at work between the two, I'd say that T376914 is not directly related to the issues we might be seeing here but there is definite overlap in that the error from shellbox is so vague that it almost seems similar.

Unfortunately shellbox doesn't give us a lot of granularity of response here […]

Based on timing and on the other 503 errors we see for services contacted through Envoy, I think it's most likely that this error doesn't come from "Shellbox" itself, or even from the Envoy instance running inside/alongside the receiving end as part of the Shellbox service. Rather, it most likely comes from the local outgoing Envoy on the MediaWiki pod, given that we shut down traffic through that _before_ shutting down the MediaWiki container.

MediaWiki reports that as a response "from Shellbox" because we configure Shellbox URLs such as http://localhost:6026/shell/pagedtiffhandler-metadata.

See also:

timestamp: 2025-04-29T13:23:57+00:00
reqId: ce67ac12-8bb2-45c6-9f61-f33db0f1bcdb
exception:

cURL error 56: Recv failure: Connection reset by peer (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:6026/shell/pagedtiffhandler-metadata
from /srv/mediawiki/php-1.44.0-wmf.25/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(276)
…
#17 /srv/mediawiki/php-1.44.0-wmf.25/extensions/PagedTiffHandler/includes/PagedTiffImage.php(153): Shellbox\Command\BoxedCommand->execute()
#18 /srv/mediawiki/php-1.44.0-wmf.25/extensions/PagedTiffHandler/includes/PagedTiffHandler.php(128): MediaWiki\Extension\PagedTiffHandler\PagedTiffImage->retrieveMetaData()
#19 /srv/mediawiki/php-1.44.0-wmf.25/includes/upload/UploadBase.php(545): MediaWiki\Extension\PagedTiffHandler\PagedTiffHandler->verifyUpload(string)
…
#22 /srv/mediawiki/php-1.44.0-wmf.25/includes/jobqueue/jobs/UploadJobTrait.php(102): MediaWiki\JobQueue\Jobs\PublishStashedFileJob->verifyUpload()
#23 /srv/mediawiki/php-1.44.0-wmf.25/extensions/EventBus/includes/JobExecutor.php(88): MediaWiki\JobQueue\Jobs\PublishStashedFileJob->run()
#24 /srv/mediawiki/rpc/RunSingleJob.php(60): MediaWiki\Extension\EventBus\JobExecutor->execute(array)

timestamp: 2025-04-30T00:02:28+00:00
reqId: f01d13b7-d20b-4677-99e8-d3f150394d19
exception:

cURL error 7: Failed to connect to localhost port 6026: Connection refused (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:6026/shell/pagedtiffhandler-metadata

from /srv/mediawiki/php-1.44.0-wmf.25/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(275)
…
#17 /srv/mediawiki/php-1.44.0-wmf.25/extensions/PagedTiffHandler/includes/PagedTiffImage.php(153): Shellbox\Command\BoxedCommand->execute()
#18 /srv/mediawiki/php-1.44.0-wmf.25/extensions/PagedTiffHandler/includes/PagedTiffHandler.php(128): MediaWiki\Extension\PagedTiffHandler\PagedTiffImage->retrieveMetaData()
#19 /srv/mediawiki/php-1.44.0-wmf.25/includes/upload/UploadBase.php(545): MediaWiki\Extension\PagedTiffHandler\PagedTiffHandler->verifyUpload(string)
…
#22 /srv/mediawiki/php-1.44.0-wmf.25/includes/jobqueue/jobs/UploadJobTrait.php(102): MediaWiki\JobQueue\Jobs\PublishStashedFileJob->verifyUpload()
#23 /srv/mediawiki/php-1.44.0-wmf.25/extensions/EventBus/includes/JobExecutor.php(88): MediaWiki\JobQueue\Jobs\PublishStashedFileJob->run()
#24 /srv/mediawiki/rpc/RunSingleJob.php(60): MediaWiki\Extension\EventBus\JobExecutor->execute(array)

timestamp: 2025-04-30T07:20:26+00:00
reqId: f8e65402-be7e-465e-9407-a9f21b0b3364
exception:

cURL error 52: Empty reply from server (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:6026/shell/pagedtiffhandler-metadata

from /srv/mediawiki/php-1.44.0-wmf.25/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(275)
…
#17 /srv/mediawiki/php-1.44.0-wmf.25/extensions/PagedTiffHandler/includes/PagedTiffImage.php(153): Shellbox\Command\BoxedCommand->execute()
#18 /srv/mediawiki/php-1.44.0-wmf.25/extensions/PagedTiffHandler/includes/PagedTiffHandler.php(128): MediaWiki\Extension\PagedTiffHandler\PagedTiffImage->retrieveMetaData()
…
#23 /srv/mediawiki/php-1.44.0-wmf.25/extensions/EventBus/includes/JobExecutor.php(88): MediaWiki\JobQueue\Jobs\PublishStashedFileJob->run()
#24 /srv/mediawiki/rpc/RunSingleJob.php(60): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
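
All three exceptions above target the same localhost URL, which fits the local outgoing proxy explanation. A minimal Python sketch for pulling the cURL error code and target host out of such messages follows. The regular expression is an assumption fitted to the three messages quoted here, and `parse_curl_error` and the `local_proxy` flag are hypothetical names, not part of MediaWiki or Guzzle.

```python
import re
from urllib.parse import urlparse

# Pattern fitted to the messages quoted in this task; other Guzzle/cURL
# messages may be shaped differently.
CURL_ERROR = re.compile(r"cURL error (\d+): (.*?) \(see \S+\) for (\S+)")


def parse_curl_error(message):
    """Extract the cURL error code, reason, and URL from an exception message.
    Returns None if the message does not look like a cURL error."""
    match = CURL_ERROR.search(message)
    if match is None:
        return None
    code, reason, url = match.groups()
    host = urlparse(url).hostname
    return {
        "code": int(code),
        "reason": reason,
        "url": url,
        # True when the request went to a proxy listening on the local pod.
        "local_proxy": host in ("localhost", "127.0.0.1"),
    }


sample = (
    "cURL error 7: Failed to connect to localhost port 6026: Connection refused "
    "(see https://curl.haxx.se/libcurl/c/libcurl-errors.html) "
    "for http://localhost:6026/shell/pagedtiffhandler-metadata"
)
parsed = parse_curl_error(sample)
print(parsed)
```

Grouping parsed results by `code` and `local_proxy` would show how many failures never left the MediaWiki pod.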