
Port videoscaling to kubernetes
Open, High, Public

Description

We are in the process of moving all of MediaWiki to kubernetes. That includes all jobs, among them webVideoTranscodeJob, which presents a series of challenges for us.

We will be pursuing solution 2, detailed below. TimedMediaHandler has already been adapted to use Shellbox and is currently handling scaling work on bare metal.

Work that needs to be done towards this:

  • The existing work on the mw-script namespace and the mwscript-on-k8s tooling appears to fit our requirements to a reasonably large degree. We need to identify whether there are any obvious edge cases that would stop us from reusing and slightly modifying this work, either because they would constrain us or because they would break mw-script itself.
  • Write a service that pulls webVideoTranscode and webVideoTranscodePrioritised jobs from the jobqueue and calls mwscript for each of them. With some modifications, it seems we can reuse mwscript_k8s to call RunSingleJob.
  • RunSingleJob currently takes the event input via php://input, but we want to pass the event either as an argument or via stdin. We will need to modify the RunSingleJob code to do this, or find a way to feed php://input that isn't horribly hacky.
  • To pass that data to RunSingleJob, we might need to modify mwscript_k8s.py so it can optionally supply the arguments via stdin (a minimal sketch of the stdin approach follows this list).
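To make the stdin idea concrete, here is a minimal Python sketch of what a modified caller in the spirit of mwscript_k8s.py might do. The script path and the --input flag are hypothetical placeholders, not the real RunSingleJob interface:

```python
#!/usr/bin/env python3
# Sketch only: feed a JSON job event to a MediaWiki maintenance script over
# stdin instead of php://input. Script name and flag are made-up placeholders.
import json
import subprocess

def run_single_job(event: dict) -> int:
    """Serialize the job event and pass it to the maintenance script via stdin."""
    proc = subprocess.run(
        ["php", "maintenance/runSingleJob.php", "--input=stdin"],  # hypothetical entry point
        input=json.dumps(event).encode("utf-8"),
        check=False,
    )
    return proc.returncode

if __name__ == "__main__":
    example_event = {"type": "webVideoTranscode", "params": {"transcodekey": "720p.vp9.webm"}}
    raise SystemExit(run_single_job(example_event))
```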

Problem statement

Specifically:

  • Right now, these jobs are submitted to the jobrunners via HTTP by changeprop, which requires HTTP timeouts to be raised to one day, as some transcodes are extremely long-running. This means we can't restart the php-fpm daemons for every release we do, or video transcodes would never finish.
  • The job shells out to ffmpeg and a couple of other programs to transcode videos. By default, ffmpeg uses as many threads as is useful given the number of CPUs on the host. If we keep not defining limits, that makes the maximum amount of CPU a pod would use very unpredictable; defining limits, on the other hand, will most likely result in heavy throttling.
  • The job traditionally limits the memory usage of the shellout by modifying cgroups, which won't be possible on kubernetes the way it was on bare metal. Kubernetes obviously has facilities to limit the amount of memory a pod can use, but they aren't as good: there is a chance the OOM killer kills the wrong process, taking down the whole pod instead of just the shellout.

Each of these problems makes videoscaling incompatible with our current setup on kubernetes. We have ways out of all of them, but none is particularly comfortable.

CPU usage limits

Let's start with the easiest problem to solve: ffmpeg supports, at least in modern versions, the -threads switch, which limits the number of threads ffmpeg uses. That gives us a way to put an upper bound on the number of CPUs it uses, with a slight modification of TimedMediaHandler's WebVideoTranscodeJob::ffmpegEncode and a new configuration variable.
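A minimal sketch of the idea (the real change would live in TimedMediaHandler's PHP code; the MAX_TRANSCODE_THREADS name is a made-up stand-in for the new configuration variable):

```python
# Sketch only: cap ffmpeg's thread usage with -threads so the pod's CPU
# ceiling is predictable regardless of how many CPUs the node has.
import subprocess

MAX_TRANSCODE_THREADS = 4  # hypothetical configuration value

def encode(source: str, target: str) -> None:
    cmd = [
        "ffmpeg",
        "-i", source,
        "-threads", str(MAX_TRANSCODE_THREADS),  # upper bound on encoder threads
        "-c:v", "libvpx-vp9",
        target,
    ]
    subprocess.run(cmd, check=True)
```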

Timeouts

Needing to leave an HTTP request running for up to one day causes all sorts of problems, including the fact that we can't deploy to k8s without killing the running pods. One option is to run a cronjob that releases code to the videoscalers once a day, but that would leave them running on stale code for potentially a long time, with all the consequences that has for security issues, for instance; this is undesirable. A potentially better alternative is to write a piece of software (or modify changeprop to do it) that takes the jobs from kafka and runs them as kubernetes Job instances via the command line. This requires us to write both that software and a special maintenance script for MediaWiki that can take a JSON job definition as input, which is quite easy to do.
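A rough sketch of such a consumer, assuming the kafka-python and kubernetes client libraries; the topic name, image, namespace and maintenance script entry point are illustrative guesses, not the real configuration:

```python
# Sketch: read job events from a Kafka topic and spawn one kubernetes Job per event.
import json
import uuid

from kafka import KafkaConsumer        # kafka-python
from kubernetes import client, config  # official kubernetes Python client

def make_job(event: dict) -> client.V1Job:
    name = f"videoscale-{uuid.uuid4().hex[:10]}"
    container = client.V1Container(
        name="mediawiki",
        image="docker-registry.example/mediawiki:latest",  # placeholder image
        command=["php", "maintenance/runSingleJob.php"],   # hypothetical entry point
        args=[json.dumps(event)],
        resources=client.V1ResourceRequirements(
            limits={"cpu": "4", "memory": "4Gi"},  # keeps the ffmpeg shellout bounded
        ),
    )
    spec = client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
        ),
        backoff_limit=0,
    )
    return client.V1Job(metadata=client.V1ObjectMeta(name=name), spec=spec)

def main() -> None:
    config.load_incluster_config()
    batch = client.BatchV1Api()
    consumer = KafkaConsumer(
        "mediawiki.job.webVideoTranscode",  # topic name is a guess
        bootstrap_servers=["kafka.example:9092"],
        group_id="videoscaler",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        batch.create_namespaced_job(namespace="mw-videoscaler", body=make_job(message.value))

if __name__ == "__main__":
    main()
```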

Memory limits

This is more or less impossible to solve properly if the transcode runs locally. We need to either convert TimedMediaHandler to use Shellbox for executing ffmpeg, or find an alternative off-the-shelf video transcoding system that is designed for kubernetes and that we can call from the job itself. Alternatively, we can try to tune the OOM killer so that the process using the most memory in a cgroup is killed with high enough probability that the number of "dirty kills" stays small.
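For the OOM-tuning fallback, the standard Linux mechanism is oom_score_adj; a small sketch of raising the shellout's score so the kernel prefers to kill ffmpeg rather than the PHP worker (how this would be wired into the actual job code is left open):

```python
# Sketch: mark the shellout as the preferred OOM victim within the pod's cgroup.
import subprocess

def run_with_oom_preference(cmd: list[str]) -> int:
    proc = subprocess.Popen(cmd)
    # 1000 is the maximum badness adjustment: "kill me first".
    with open(f"/proc/{proc.pid}/oom_score_adj", "w") as f:
        f.write("1000")
    return proc.wait()

# e.g. run_with_oom_preference(["ffmpeg", "-i", "in.webm", "out.webm"])
```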

Potential solutions

Let's review the possible solutions we came up with during our internal kick-off meeting.

Solution 1: minimum effort

  • Create a mw-videoscaler deployment of MediaWiki with a one-day HTTP timeout
  • Run a cron every day to update the code
  • Add code to run ffmpeg with a limited number of threads

Solution 2: proper management

  • Create a mw-videoscaler namespace
  • Write a small piece of software that can read jobs from a kafka topic, then call the kubernetes API to spawn, with a preset concurrency, a Job running a MediaWiki maintenance script that can take a JSON job definition as input
  • Use Shellbox or an off-the-shelf, k8s-native video transcoding system to actually perform the transcode
    • This means adapting TimedMediaHandler's code in a much deeper way
    • It also means we need either a new Shellbox instance or a completely different piece of software

Personally, I would probably go with the latter, as it would noticeably improve how we run videoscaling instead of making it actively slightly worse than it is now. Depending on how easy it is to use Shellbox in TMH, I would actually go with that.

Event Timeline

Joe triaged this task as High priority. Jan 18 2024, 7:46 AM
Joe created this task.
Joe renamed this task from [DRAFT] Port videoscaling to kubernetes to Port videoscaling to kubernetes. Jan 18 2024, 9:37 AM
Joe updated the task description.

Adding @brion as the resident expert / maintainer of TimedMediaHandler. I'd like to get your opinion on how hard it would be to port WebVideoTranscodeJob to use shellbox :)

Couple quick notes:

  • Reducing thread count is IMO a very bad idea, as most of the time there will be few jobs and they may be high resolution videos. You want to use the maximum number of available threads to keep CPUs occupied or else you're going to waste a lot of time and make the jobs run a lot slower at high resolutions (the slowest jobs). It would be better to sometimes be at full load on a server than to have a single job that takes days instead of hours or hours instead of minutes.
  • The task roughly checks out a source file, decides on an ffmpeg command line or two, runs them, and then stores and processes the resulting output file. There are two ffmpeg passes at present for better bitrate control, but if need be a single-pass command can be used instead. This may run for up to several hours for very large or long videos -- thus this is where we have the potential to split the job into three components:
  • a regular MediaWiki job gets the info, builds the command options, and sends that to a script or service
  • which transcodes the file according to the given command settings, without any MediaWiki logic itself, and calls back to the MediaWiki API, which queues a
  • cleanup job to import the file and update database state (a rough sketch of the middle transcode service follows below)
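Something along these lines for the middle component, very rough and untested; every URL, field name and callback endpoint here is hypothetical:

```python
# Sketch: take pre-built ffmpeg arguments, transcode with no MediaWiki logic,
# then notify MediaWiki so a cleanup job can import the result.
import subprocess
import requests

def transcode_and_notify(task: dict) -> None:
    # task = {"source": "/srv/in.webm", "target": "/srv/out.webm",
    #         "ffmpeg_args": [...], "callback_url": "https://example/api.php?..."}
    cmd = ["ffmpeg", "-i", task["source"], *task["ffmpeg_args"], task["target"]]
    result = subprocess.run(cmd)
    requests.post(task["callback_url"], data={
        "status": "done" if result.returncode == 0 else "error",
        "target": task["target"],
    }, timeout=30)
```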

I'll take a more thorough look in a bit, but this sounds fairly cleanly doable.

Another complication on thread count -- the VP9 encoder can only make effective use of so many threads, based on the size of the frame (which controls the number of macroblocks that can be processed simultaneously) :D So usage is tough to predict from the job type. We *could* implement some kind of variable sizing where we "block up", so a 240p job takes 2 CPUs and a 2160p job uses 8 CPUs etc, but I don't know if that can be expressed sensibly.
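If we wanted to try it, the "block up" mapping could be as dumb as the sketch below; the cut-offs and thread counts are purely illustrative, not tuned numbers:

```python
# Hypothetical mapping from output height to a CPU/thread budget, since
# libvpx-vp9 cannot usefully parallelise beyond what the frame size allows.
def threads_for_height(height: int) -> int:
    if height <= 240:
        return 2
    if height <= 720:
        return 4
    return 8  # 1080p/2160p and up

# e.g. use threads_for_height(2160) both as the -threads value and as the pod CPU request
```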

A harder, but possibly desirable, option I mentioned on IRC: we could encode each ~10-second input chunk separately, then stitch them back together on completion. This would allow long-duration and high-resolution files to be divided into multiple chunks that can run simultaneously, each on a limited and predictable maximum core count.

It's a bit more work to do the stitching, but is straightforward to do via another ffmpeg command. Requires more work to track the chunk files though, and may require checking out the source file many times.
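Roughly, using only standard ffmpeg features (segment muxer to split, concat demuxer to rejoin); file names and encode settings below are illustrative, and each encode() call could in principle run as its own kubernetes Job:

```python
# Sketch of the chunk-and-stitch idea: split, encode chunks independently, rejoin.
import glob
import subprocess

def split(source: str) -> list[str]:
    subprocess.run(["ffmpeg", "-i", source, "-c", "copy", "-f", "segment",
                    "-segment_time", "10", "chunk%04d.mkv"], check=True)
    return sorted(glob.glob("chunk*.mkv"))

def encode(chunk: str) -> str:
    out = chunk.replace(".mkv", ".vp9.webm")
    subprocess.run(["ffmpeg", "-i", chunk, "-c:v", "libvpx-vp9", "-threads", "4", out],
                   check=True)
    return out

def stitch(parts: list[str], target: str) -> None:
    with open("parts.txt", "w") as f:
        f.writelines(f"file '{p}'\n" for p in parts)
    subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0", "-i", "parts.txt",
                    "-c", "copy", target], check=True)

# e.g. stitch([encode(c) for c in split("in.webm")], "out.webm")
```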

Change 992199 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: Add usernames for mw-videoscaler

https://gerrit.wikimedia.org/r/992199

Change 992200 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] admin_ng: add namespace for mw-videoscaler

https://gerrit.wikimedia.org/r/992200

Change 992627 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[mediawiki/extensions/TimedMediaHandler@master] Convert midiToAudioEncode to use BoxedCommand

https://gerrit.wikimedia.org/r/992627

Change 992200 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: add namespace for mw-videoscaler

https://gerrit.wikimedia.org/r/992200

Change 992199 merged by Hnowlan:

[operations/puppet@production] kubernetes: Add usernames for mw-videoscaler

https://gerrit.wikimedia.org/r/992199

At least as far as this task is concerned, T292322 isn't a problem - video processing will inevitably be somewhat slowed down by the need to transfer large files to shellbox, but transcoding is already both asynchronous and very expensive in terms of wall-clock time, so the slowdown will be mostly irrelevant.

Change #1020860 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] wip: mw-videoscaler: helmfile scaffolding

https://gerrit.wikimedia.org/r/1020860

Change #1020860 merged by jenkins-bot:

[operations/deployment-charts@master] mw-videoscaler: helmfile scaffolding

https://gerrit.wikimedia.org/r/1020860