We are in the process of moving all of MediaWiki to Kubernetes, and that includes all jobs, including webVideoTranscodeJob, which presents a series of challenges for us.
We will be pursuing solution 2 detailed below. TimedMediaHandler has already been adapted to use Shellbox and is currently handling videoscaling work on bare metal.
Work that needs to be done towards this:
- The existing work on the mw-script namespace and the mwscript-on-k8s tooling appears to fit our requirements to a reasonably large degree. We need to identify whether there are any obvious edge cases stopping us from reusing and slightly modifying this work, either by constraining us or by breaking mw-script itself.
- Write a service that pulls webVideoTranscode and webVideoTranscodePrioritised jobs from the jobqueue and invokes mwscript for each of them. With some modifications, it seems we can reuse mwscript_k8s to call RunSingleJob.
- RunSingleJob currently reads the event from php://input, but we want to pass the event either as an argument or via stdin. We will need to modify the RunSingleJob code to do this, or find a way to feed php://input that isn't horribly hacky.
- To pass that data to RunSingleJob, we might also need to modify mwscript_k8s.py to optionally supply the arguments via stdin.
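The stdin hand-off described above can be sketched as follows. This is illustrative only: the flag name and the stand-in worker are assumptions, not the real RunSingleJob interface, but the mechanism (serialize the event as JSON, write it to the child's stdin) is the one proposed.

```python
import json
import subprocess
import sys

def dispatch_job(event, argv):
    """Send a JSON job event to a worker process on stdin and return
    (exit code, stdout). In production argv would be the mwscript /
    RunSingleJob invocation (exact flags still to be defined); any
    command that reads stdin behaves the same way."""
    proc = subprocess.run(
        argv,
        input=json.dumps(event).encode(),
        capture_output=True,
    )
    return proc.returncode, proc.stdout.decode()

# Stand-in worker that reads the event from stdin and echoes its type,
# playing the role of RunSingleJob reading stdin instead of php://input.
worker = [
    sys.executable, "-c",
    "import sys, json; print(json.load(sys.stdin)['type'])",
]
code, out = dispatch_job({"type": "webVideoTranscode"}, worker)
```

The same helper works unchanged whichever concrete CLI we end up with, since only `argv` changes.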
Problem statement
Specifically:
- Right now, these jobs are submitted to the jobrunners via HTTP by changeprop, which requires HTTP timeouts to be raised to one day, as some transcodes are extremely long-running. This means we can't restart the php-fpm daemons for every release we do, or in-flight video transcodes would never finish.
- The job shells out to ffmpeg and a couple of other tools to transcode videos. By default, ffmpeg uses as many threads as is useful given the number of CPUs on the host. If we keep not defining limits, the maximum CPU a pod might use is therefore very unpredictable; defining limits, on the other hand, will most likely result in heavy throttling.
- The job has traditionally limited the shellout's memory usage by modifying cgroups, which won't be possible on Kubernetes as it was on bare metal. While Kubernetes obviously has facilities to limit the amount of memory a pod can use, that isn't as good: there is a chance the OOM killer kills the wrong process, taking down the whole pod instead of just the shellout.
Each of these problems makes videoscaling incompatible with our current setup on Kubernetes. We have some ways out of all of them, but none is particularly comfortable.
CPU usage limits
Let's start with the easiest problem to solve. Modern versions of ffmpeg support the -threads switch, which limits the number of threads ffmpeg uses, so we can put an upper bound on the CPUs it consumes with a slight modification of TimedMediaHandler's WebVideoTranscodeJob::ffmpegEncode and a new configuration variable.
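A minimal sketch of what the capped invocation could look like. The `-threads` and `-i` flags are real ffmpeg CLI options; the function itself is illustrative, not TimedMediaHandler code (which builds its command line in PHP).

```python
def ffmpeg_encode_args(source, target, max_threads):
    """Build an ffmpeg command line with a thread cap, mirroring the
    change proposed for WebVideoTranscodeJob::ffmpegEncode once a
    configuration variable for max_threads exists."""
    return [
        "ffmpeg",
        "-i", source,
        # Without -threads, ffmpeg sizes its thread pool to the host's
        # CPU count, which is what makes pod CPU usage unpredictable.
        "-threads", str(max_threads),
        target,
    ]

args = ffmpeg_encode_args("input.webm", "output.mp4", 4)
```

With the cap in place, the pod's CPU request/limit can be set to match `max_threads` instead of the host's core count.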
Timeouts
Needing to leave an HTTP request running for up to one day causes all sorts of problems, including the fact that we can't deploy to k8s without killing the running pods. One option is to run a cronjob releasing code to the videoscalers once a day, but that would leave them running stale code for potentially a long time, with all the consequences that has for security fixes, for instance, which is undesirable. A better alternative is to write a piece of software (or modify changeprop to do it) that takes the jobs from Kafka, then runs them as Kubernetes Job instances via the command line. This requires us both to write that software and to add a small maintenance script to MediaWiki that takes a JSON job definition as input, which is quite easy to do.
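The core loop of that dispatcher could look like the sketch below. To keep it self-contained it consumes any iterable of JSON strings; in production that would be a Kafka consumer on the jobqueue topics, and spawn_job would call the Kubernetes API.

```python
import json

# The two job types this dispatcher is responsible for.
WANTED = {"webVideoTranscode", "webVideoTranscodePrioritised"}

def dispatch_loop(messages, spawn_job):
    """Consume serialized job events, keep only the transcode job
    types, and hand each decoded event to spawn_job (which would
    create a Kubernetes Job running the maintenance script).
    Returns the number of events dispatched."""
    dispatched = 0
    for raw in messages:
        event = json.loads(raw)
        if event.get("type") not in WANTED:
            continue  # other job types stay on their existing runners
        spawn_job(event)
        dispatched += 1
    return dispatched
```

Concurrency control (the "preset concurrency" mentioned under solution 2) would sit around spawn_job, e.g. by blocking while too many Jobs are still running.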
Memory limits
This is actually more or less impossible to solve properly if the transcode runs locally. We clearly need either to convert TimedMediaHandler to use Shellbox for executing ffmpeg, or to find an alternative off-the-shelf video transcoding system that is designed for Kubernetes and can be called from the job itself. Alternatively, we can try to tune the OOM killer so that the process using the most memory in a cgroup is likely enough to be killed first that the number of "dirty kills" stays small.
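The OOM-killer tuning mentioned above can be sketched with oom_score_adj, the standard Linux knob for biasing victim selection. This is a best-effort sketch, not a guarantee: it makes the shellout the preferred victim, it does not make a wrong kill impossible.

```python
import subprocess
import sys

def prefer_oom_kill():
    """preexec_fn for the ffmpeg shellout: raise oom_score_adj to the
    maximum so that, if the pod hits its memory limit, the kernel's
    OOM killer picks the shellout rather than the parent worker.
    Raising the score (unlike lowering it) needs no extra privileges.
    Linux-only; a no-op elsewhere."""
    try:
        with open("/proc/self/oom_score_adj", "w") as f:
            f.write("1000")  # 1000 = most preferred OOM victim
    except OSError:
        pass  # restricted or non-Linux environment: best effort only

def run_shellout(argv):
    """Run the transcode command with the OOM preference applied."""
    return subprocess.run(argv, preexec_fn=prefer_oom_kill).returncode

# Trivial stand-in command; in production argv would be the ffmpeg
# invocation built by WebVideoTranscodeJob.
rc = run_shellout([sys.executable, "-c", "pass"])
```

Note this only shifts probabilities within the pod's cgroup; it does not replace the per-shellout byte limits we had with direct cgroup manipulation on bare metal.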
Potential solutions
Let's review the possible solutions we came up with during our internal kick-off meeting.
Solution 1: minimum effort
- Create a mw-videoscaler deployment of MediaWiki with a one-day HTTP timeout
- Run a cron every day to update the code
- Add code to run ffmpeg with a limited number of threads
Solution 2: proper management
- Create a mw-videoscaler namespace
- Write a small piece of software that reads jobs from a Kafka topic, then calls the Kubernetes API to spawn a Job running a MediaWiki maintenance script that takes a JSON job definition as input, with a preset concurrency
- Use Shellbox or an off-the-shelf, k8s-native video transcoding system to actually perform the transcode
- This means adapting TimedMediaHandler's code much more deeply
- It also means we either need a new Shellbox instance or a completely different piece of software
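To make solution 2 concrete, the dispatcher's per-job submission could build a batch/v1 Job object like the one below. Field names follow the Kubernetes Job schema; the image name and the env-var delivery of the event are illustrative assumptions (the plan above favours stdin or an argument).

```python
import json

def transcode_job_manifest(name, image, event):
    """Sketch of the Kubernetes Job the dispatcher would submit for
    one dequeued transcode event in the mw-videoscaler namespace."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": "mw-videoscaler"},
        "spec": {
            # Failed transcodes should be retried through the
            # jobqueue, not re-run blindly by Kubernetes.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "transcode",
                        "image": image,
                        "env": [{"name": "JOB_EVENT",
                                 "value": json.dumps(event)}],
                    }],
                }
            },
        },
    }

manifest = transcode_job_manifest(
    "transcode-0001",
    "mediawiki-videoscaler:example",  # hypothetical image name
    {"type": "webVideoTranscode"},
)
```

Each transcode then gets its own pod lifecycle, so code deploys no longer have to wait for a one-day HTTP request to drain.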
I would personally go with the latter, as it would noticeably improve how we run videoscaling, instead of actively making it slightly worse than it is now. Depending on how easy it is to use Shellbox in TMH, I would actually go with that.