Move video transcoding to use Shellbox
Closed, Resolved (Public)

Description

This needs the following to happen:

  • Create a new shellbox deployment called shellbox-video or similar
  • Create a new flavour of the shellbox image, including ffmpeg and fluidsynth at least, to use in that deployment
  • Convert the TimedMediaHandler extension to use BoxedCommand instead of UnboxedCommand (via wfShellExec)
  • Configure MediaWiki in production to use the remote shellbox installation

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/deployment-charts | master | +6 -4
operations/deployment-charts | master | +2 -2
operations/mediawiki-config | master | +0 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +8 -0
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +27 -5
mediawiki/services/change-propagation | master | +16 -4
operations/deployment-charts | master | +14 -1
operations/mediawiki-config | master | +1 -1
operations/mediawiki-config | master | +1 -0
operations/mediawiki-config | master | +1 -0
operations/mediawiki-config | master | +1 -0
operations/mediawiki-config | master | +1 -0
operations/deployment-charts | master | +5 -0
operations/mediawiki-config | master | +1 -3
operations/deployment-charts | master | +5 -5
operations/deployment-charts | master | +5 -5
operations/mediawiki-config | master | +1 -8
operations/alerts | master | +13 -1
operations/mediawiki-config | master | +1 -0
operations/mediawiki-config | master | +5 -0
operations/mediawiki-config | master | +3 -1
operations/deployment-charts | master | +1 -0
operations/deployment-charts | master | +131 -35
operations/deployment-charts | master | +12 -2
operations/deployment-charts | master | +712 -0
operations/puppet | production | +7 -1
operations/deployment-charts | master | +1 -0
operations/deployment-charts | master | +6 -2
operations/deployment-charts | master | +1 -0
operations/deployment-charts | master | +2 -1
operations/mediawiki-config | master | +9 -0
operations/deployment-charts | master | +10 -5
operations/mediawiki-config | master | +1 -0
operations/mediawiki-config | master | +14 -0
Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-08-21T13:32:44Z] <cdanis@deploy1003> Started scap sync-world: Backport for [[gerrit:1064348|Enable shellbox-video for enwiki (T356241)]]

Mentioned in SAL (#wikimedia-operations) [2024-08-21T13:35:06Z] <cdanis@deploy1003> hnowlan, cdanis: Backport for [[gerrit:1064348|Enable shellbox-video for enwiki (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-08-21T13:40:02Z] <cdanis@deploy1003> Finished scap sync-world: Backport for [[gerrit:1064348|Enable shellbox-video for enwiki (T356241)]] (duration: 07m 18s)

Change #1064389 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] Use shellbox-video for videoscaling on group2

https://gerrit.wikimedia.org/r/1064389

Change #1064390 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] use shellbox-video for commonswiki

https://gerrit.wikimedia.org/r/1064390

Change #1064392 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/alerts@master] kubernetes-wikikube: ignore shellbox-video unavailable replicas

https://gerrit.wikimedia.org/r/1064392

Change #1064392 merged by jenkins-bot:

[operations/alerts@master] kubernetes-wikikube: ignore shellbox-video unavailable replicas

https://gerrit.wikimedia.org/r/1064392

Change #1064389 merged by jenkins-bot:

[operations/mediawiki-config@master] Use shellbox-video for videoscaling on group2

https://gerrit.wikimedia.org/r/1064389

Mentioned in SAL (#wikimedia-operations) [2024-08-22T13:18:27Z] <samtar@deploy1003> Started scap sync-world: Backport for [[gerrit:1064389|Use shellbox-video for videoscaling on group2 (T356241)]]

Mentioned in SAL (#wikimedia-operations) [2024-08-22T13:23:02Z] <samtar@deploy1003> hnowlan, samtar: Backport for [[gerrit:1064389|Use shellbox-video for videoscaling on group2 (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-08-22T13:27:38Z] <samtar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1064389|Use shellbox-video for videoscaling on group2 (T356241)]] (duration: 09m 10s)

Change #1060104 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video, admin_ng: bump resource limits and replicas

https://gerrit.wikimedia.org/r/1060104

Change #1064811 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] shellbox-video, admin-ng: big increase in resource allocation

https://gerrit.wikimedia.org/r/1064811

Change #1064811 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video, admin-ng: big increase in resource allocation

https://gerrit.wikimedia.org/r/1064811

Change #1064390 merged by jenkins-bot:

[operations/mediawiki-config@master] use shellbox-video globally (adding group2, including commons)

https://gerrit.wikimedia.org/r/1064390

Mentioned in SAL (#wikimedia-operations) [2024-08-26T13:37:33Z] <urbanecm@deploy1003> Started scap sync-world: Backport for [[gerrit:1064390|use shellbox-video globally (adding group2, including commons) (T356241)]]

Mentioned in SAL (#wikimedia-operations) [2024-08-26T13:40:19Z] <urbanecm@deploy1003> hnowlan, urbanecm: Backport for [[gerrit:1064390|use shellbox-video globally (adding group2, including commons) (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-08-26T13:45:38Z] <urbanecm@deploy1003> Finished scap sync-world: Backport for [[gerrit:1064390|use shellbox-video globally (adding group2, including commons) (T356241)]] (duration: 08m 04s)

hnowlan claimed this task.

TimedMediaHandler now uses shellbox by default.

Awesome! Thanks for taking care of this.

Change #1067963 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop-jobqueue: retry once on videoscaling jobs

https://gerrit.wikimedia.org/r/1067963

Change #1067963 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: retry once on videoscaling jobs

https://gerrit.wikimedia.org/r/1067963

Change #1070561 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] Allow copyuploads on test2wiki

https://gerrit.wikimedia.org/r/1070561

Change #1070561 merged by jenkins-bot:

[operations/mediawiki-config@master] Allow copyuploads on test2wiki

https://gerrit.wikimedia.org/r/1070561

Mentioned in SAL (#wikimedia-operations) [2024-09-04T13:44:08Z] <samtar@deploy1003> Started scap sync-world: Backport for [[gerrit:1070561|Allow copyuploads on test2wiki (T356241)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-04T14:00:58Z] <samtar@deploy1003> Started scap sync-world: Backport for [[gerrit:1070561|Allow copyuploads on test2wiki (T356241)]]

Change #1070891 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] Allow copyuploads on test2wiki

https://gerrit.wikimedia.org/r/1070891

Change #1070891 merged by jenkins-bot:

[operations/mediawiki-config@master] Allow copyuploads on test2wiki

https://gerrit.wikimedia.org/r/1070891

Mentioned in SAL (#wikimedia-operations) [2024-09-05T13:11:06Z] <hashar@deploy1003> Started scap sync-world: Backport for [[gerrit:1070891|Allow copyuploads on test2wiki (T356241)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-05T13:14:57Z] <hashar@deploy1003> hnowlan, hashar: Backport for [[gerrit:1070891|Allow copyuploads on test2wiki (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-05T13:22:52Z] <hashar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1070891|Allow copyuploads on test2wiki (T356241)]] (duration: 11m 45s)

Change #1070948 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] Enable Copyupload-allowed-domains on test2wiki

https://gerrit.wikimedia.org/r/1070948

Change #1070948 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable Copyupload-allowed-domains on test2wiki

https://gerrit.wikimedia.org/r/1070948

Mentioned in SAL (#wikimedia-operations) [2024-09-09T14:05:33Z] <jforrester@deploy1003> Started scap sync-world: Backport for [[gerrit:1071566|Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." (T66315)]], [[gerrit:1070948|Enable Copyupload-allowed-domains on test2wiki (T356241)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-09T14:08:18Z] <jforrester@deploy1003> seanleong-wmde, jforrester, hnowlan: Backport for [[gerrit:1071566|Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." (T66315)]], [[gerrit:1070948|Enable Copyupload-allowed-domains on test2wiki (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-09T14:17:50Z] <jforrester@deploy1003> Finished scap sync-world: Backport for [[gerrit:1071566|Revert "Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis." (T66315)]], [[gerrit:1070948|Enable Copyupload-allowed-domains on test2wiki (T356241)]] (duration: 12m 16s)

Change #1071628 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] Enable async uploads on test2wiki

https://gerrit.wikimedia.org/r/1071628

Change #1071628 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable async uploads on test2wiki

https://gerrit.wikimedia.org/r/1071628

Mentioned in SAL (#wikimedia-operations) [2024-09-09T16:15:08Z] <hnowlan@deploy1003> Started scap sync-world: Backport for [[gerrit:1071628|Enable async uploads on test2wiki (T356241)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-09T16:19:11Z] <hnowlan@deploy1003> hnowlan: Backport for [[gerrit:1071628|Enable async uploads on test2wiki (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-09T16:26:20Z] <hnowlan@deploy1003> Finished scap sync-world: Backport for [[gerrit:1071628|Enable async uploads on test2wiki (T356241)]] (duration: 11m 11s)

Change #1071659 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] Enable Copyupload-allowed-domain on testwiki, disable on test2

https://gerrit.wikimedia.org/r/1071659

Change #1071659 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable Copyupload-allowed-domain on testwiki, disable on test2

https://gerrit.wikimedia.org/r/1071659

Mentioned in SAL (#wikimedia-operations) [2024-09-10T10:05:04Z] <hnowlan@deploy1003> Started scap sync-world: Backport for [[gerrit:1071659|Enable Copyupload-allowed-domain on testwiki, disable on test2 (T356241)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-10T10:08:49Z] <hnowlan@deploy1003> hnowlan: Backport for [[gerrit:1071659|Enable Copyupload-allowed-domain on testwiki, disable on test2 (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-10T10:14:44Z] <hnowlan@deploy1003> Finished scap sync-world: Backport for [[gerrit:1071659|Enable Copyupload-allowed-domain on testwiki, disable on test2 (T356241)]] (duration: 09m 39s)

Change #1085579 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] admin_ng: set a very high quota for shellbox-video

https://gerrit.wikimedia.org/r/1085579

Change #1085598 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox: add optional .spec.strategy override

https://gerrit.wikimedia.org/r/1085598

Change #1085598 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: add optional .spec.strategy override

https://gerrit.wikimedia.org/r/1085598

Summarizing a bit of debugging at the end of last week:

After shellbox-video was enabled in commons last week (October 31st), we ran into capacity issues and ended up reverting the change.

One part of that is the "new" model for transcode capacity when using shellbox-video: each pod can run exactly one transcode at a time - i.e., the number of pods is the concurrency limit.

While we can add more replicas when needed (aside from complexity related to how in-use pods are considered unavailable, which https://gerrit.wikimedia.org/r/1085598 hopes to address), it was surprising that we were able to consume all 32 replicas given that webVideoTranscode and webVideoTranscodePrioritized (1) each only have a single partition and (2) had concurrency limits configured in changeprop of 5 and 4 respectively (now 3 and 4).

This would suggest either (1) we were somehow "leaking" transcodes or (2) Kafka is frequently reassigning partitions. After some investigation, there's evidence that (2) is happening: when changeprop hits the concurrency limit and stops calling into consume (and in turn poll) while waiting for in-flight jobs to complete (which can take quite some time), we can run afoul of max.poll.interval.ms, which defaults to 300s.

When the interval expires, the consumer group is rebalanced and the partition(s) reassigned to another group member. Further, when a previously timed-out member finally polls, that triggers another rebalance and can again result in reassignment. Taken together, you can end up with multiple group members with in-flight transcodes.
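To make the failure mode concrete, here is a toy simulation of it. This is only a sketch under simplifying assumptions, not a model of changeprop or Kafka internals: a single partition, a backlog that never empties, fixed-length jobs, and round-robin reassignment. It also ignores the extra rebalance triggered when a timed-out member polls again. All names (`Member`, `simulate`, the `pod-N` labels) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """One consumer group member (hypothetical stand-in for a changeprop pod)."""
    name: str
    in_flight: list = field(default_factory=list)  # finish times of running jobs
    last_poll: int = 0


def simulate(members=3, limit=5, job_seconds=3600,
             max_poll_interval=300, horizon=7200, step=10):
    """Return (peak in-flight jobs across the group, number of reassignments)."""
    group = [Member(f"pod-{i}") for i in range(members)]
    owner = 0          # index of the member currently assigned the partition
    peak = 0
    reassignments = 0
    for now in range(0, horizon, step):
        # Jobs that have finished free up concurrency slots.
        for m in group:
            m.in_flight = [t for t in m.in_flight if t > now]
        current = group[owner]
        if len(current.in_flight) < limit:
            # Below the limit: the owner polls and fills its free slots.
            current.last_poll = now
            while len(current.in_flight) < limit:
                current.in_flight.append(now + job_seconds)
        elif now - current.last_poll > max_poll_interval:
            # At the limit and silent for too long: the partition moves on,
            # but the old owner's jobs keep running.
            owner = (owner + 1) % members
            group[owner].last_poll = now
            reassignments += 1
        peak = max(peak, sum(len(m.in_flight) for m in group))
    return peak, reassignments
```

In this toy model, a 300s interval with hour-long jobs lets every member fill up to its own limit, so the group-wide peak is members x limit rather than the configured limit. Raising the interval above the job duration keeps the partition on one member.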

Errors associated with exceeding the max poll interval can be seen in https://logstash.wikimedia.org/goto/5805c35f776d19e7fcab535efbbe3a73. Similarly, you can see frequent rebalancing on the broker side (10k lines of logs was ~ 36h at the time this was collected):

kafka-main2007 $ tail -10000 /var/log/kafka/server.log | grep 'Preparing to rebalance group' | cut -f 3 -d ']' | awk '{print $6}' | sort | uniq -c | sort -nr | head -3
    598 cpjobqueue-webVideoTranscode
     52 poppy-codfw.maps.tiles_change
     16 cpjobqueue-RenderTranslationPageJob
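
For anyone who wants to repeat this count outside a shell, the same tally can be sketched in Python. It assumes, as the pipeline above does, that the group name is the token immediately following "Preparing to rebalance group". The function name and the sample lines in the comment are hypothetical and not taken from real broker logs.

```python
import re
from collections import Counter

# Matches the phrase the shell pipeline greps for and captures the group name.
REBALANCE = re.compile(r"Preparing to rebalance group (\S+)")


def count_rebalances(lines):
    """Count rebalance events per consumer group, most frequent first."""
    counts = Counter()
    for line in lines:
        match = REBALANCE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common()


# Usage sketch (path taken from the command above):
#   with open("/var/log/kafka/server.log") as log:
#       for group, n in count_rebalances(log)[:3]:
#           print(n, group)
```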

In any case, in order to reason properly about transcode job concurrency, we should investigate setting max.poll.interval.ms to something more representative of job duration (maximum allowed is 1d). @hnowlan is taking care of adding support for that to Mercurius in https://gitlab.wikimedia.org/repos/sre/mercurius/-/merge_requests/4.
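
As a small illustration of what such a setting amounts to, the helper below turns a duration into a `max.poll.interval.ms` value and rejects anything outside the one-day ceiling mentioned above. It is a sketch only: the function name and the flat dictionary shape are assumptions, and the way changeprop or Mercurius actually accept consumer property overrides may differ.

```python
from datetime import timedelta

# Upper bound quoted in the discussion above (1d).
MAX_POLL_INTERVAL = timedelta(days=1)


def poll_interval_override(interval: timedelta) -> dict:
    """Build a consumer property override for the given poll interval."""
    if interval <= timedelta(0) or interval > MAX_POLL_INTERVAL:
        raise ValueError(f"interval must be in (0, {MAX_POLL_INTERVAL}]: {interval}")
    # Kafka expresses this property in milliseconds.
    return {"max.poll.interval.ms": int(interval.total_seconds() * 1000)}
```

For reference, the 300s default corresponds to 300000 ms, and an interval of one hour to 3600000 ms.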

Even though the path forward for managing these jobs is to use Mercurius, we should consider also adding the same functionality to changeprop in the interim, particularly if it makes reasoning about and managing shellbox-video capacity easier (to facilitate the migration).

Change #1087226 had a related patch set uploaded (by Scott French; author: Scott French):

[mediawiki/services/change-propagation@master] kafka_factory: add consumer-level property overrides

https://gerrit.wikimedia.org/r/1087226

Change #1087226 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] kafka_factory: add consumer-level property overrides

https://gerrit.wikimedia.org/r/1087226

Change #1087542 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] changeprop: add per-rule consumer properties in jobqueue

https://gerrit.wikimedia.org/r/1087542

Change #1087557 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] changeprop-jobqueue: update to 2024-11-05-170900-production

https://gerrit.wikimedia.org/r/1087557

Change #1087558 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] changeprop-jobqueue: set max poll interval and revert concurrency

https://gerrit.wikimedia.org/r/1087558

Change #1087542 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: add per-rule consumer properties in jobqueue

https://gerrit.wikimedia.org/r/1087542

Change #1087557 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: update to 2024-11-05-170900-production

https://gerrit.wikimedia.org/r/1087557

Mentioned in SAL (#wikimedia-operations) [2024-11-07T18:14:31Z] <swfrench-wmf> updated changeprop-jobqueue to 2024-11-05-170900-production - T356241

Change #1087558 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: set max poll interval and revert concurrency

https://gerrit.wikimedia.org/r/1087558

Change #1090519 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] changeprop-jobqueue: double concurrency for transcodes

https://gerrit.wikimedia.org/r/1090519

Change #1090519 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: double concurrency for transcodes

https://gerrit.wikimedia.org/r/1090519

Change #1090526 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] TimedMediahandler: reenable shellbox-video for commons

https://gerrit.wikimedia.org/r/1090526

Change #1090567 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] changeprop-jobqueue: increase webVideoTranscode concurrency to 15

https://gerrit.wikimedia.org/r/1090567

Following up on T356241#10291014:

  • With max.poll.interval.ms now set to 1h around 16:07 UTC, we've seen only a single timeout / partition reassignment so far, which happened at ~ 20:16 UTC. This is a pretty impressive improvement vs. the previous state, where we were seeing the partition cycling across 3+ pods over the course of an hour.
  • Around 17:35, we bumped the concurrency to 10 and 8 for webVideoTranscode and webVideoTranscodePrioritized respectively, after noting the videoscaler hosts remain fairly lightly loaded.

From a quick spot check, we're completing webVideoTranscode jobs at about the same rate as in the hours leading up to the max.poll.interval.ms change. This is plausible: although we've only doubled the concurrency, whereas previously the partition was cycling across 3 pods, the "steady state" effective concurrency was not 3 x 5 (that's the peak).

There are two directions we could push this (in the spirit of not varying multiple knobs at once):

  1. We can increase max.poll.interval.ms yet more, now that it seems to do what we expect, in an effort to further drive down the likelihood of reassignments.
  2. We can increase the concurrency for webVideoTranscode, in an effort to clear the sizable backlog we've built over the past couple of weeks, while also to some extent driving down the likelihood of reassignment.

My vote would be to start with #2, which is what https://gerrit.wikimedia.org/r/1090567 proposes, particularly given that videoscalers remain only modestly loaded and, with the impending roll-forward to shellbox-video for commons, we'll have some elasticity to work with.

Change #1090567 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: increase webVideoTranscode concurrency to 15

https://gerrit.wikimedia.org/r/1090567

Change #1090526 merged by jenkins-bot:

[operations/mediawiki-config@master] TimedMediahandler: reenable shellbox-video for commons

https://gerrit.wikimedia.org/r/1090526

Mentioned in SAL (#wikimedia-operations) [2024-11-13T14:24:52Z] <lucaswerkmeister-wmde@deploy2002> Started scap sync-world: Backport for [[gerrit:1090526|TimedMediahandler: reenable shellbox-video for commons (T356241)]]

Mentioned in SAL (#wikimedia-operations) [2024-11-13T14:27:28Z] <lucaswerkmeister-wmde@deploy2002> hnowlan, lucaswerkmeister-wmde: Backport for [[gerrit:1090526|TimedMediahandler: reenable shellbox-video for commons (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-11-13T14:32:20Z] <lucaswerkmeister-wmde@deploy2002> Finished scap sync-world: Backport for [[gerrit:1090526|TimedMediahandler: reenable shellbox-video for commons (T356241)]] (duration: 07m 28s)

Change #1090898 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] changeprop-jobqueue: bump max.poll.interval.ms to 2h

https://gerrit.wikimedia.org/r/1090898

Change #1090898 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: bump max.poll.interval.ms to 2h

https://gerrit.wikimedia.org/r/1090898

As of the 13th of November, all video transcoding has been moved to shellbox-video. The service seems quite stable. We'll reclaim the videoscaler hardware at a later point.

Change #1085579 abandoned by Hnowlan:

[operations/deployment-charts@master] admin_ng: set a very high quota for shellbox-video

https://gerrit.wikimedia.org/r/1085579