On Friday, after T306697, a big stack of videos was enqueued for transcoding all at once. This has historically been a tricky situation for the job queue, because ffmpeg consumes 100% of a machine's CPU while it chugs through the backlog, starving out other work.
We've done some work (e.g. T279100) to make sure other job queue items keep making progress while ffmpeg is in that state, but getting into this situation has other effects too: this weekend we were alerted because, due to CPU starvation, the videoscalers stopped answering health checks from Icinga, and presumably also from LVS.
We got a number of flapping IRC alerts like this:
<icinga-wm> PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
<icinga-wm> PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
and paged a handful of times over the course of the weekend:
<icinga-wm> PROBLEM - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
This is partly a monitoring issue: the machines were still getting work done even though they weren't answering Icinga, so there was nothing substantive to fix, yet the only remedy was to downtime the LVS alert, which leaves us blind to real problems. Beyond the monitoring problem, if the LVS health checks were also being missed, that would have affected load balancing and slowed down queue processing. (I haven't checked the logs for a smoking gun, but I think this did happen; at one point I noticed a host's CPU usage drop to 0%, probably because LVS had marked it down.)
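For illustration, here's a rough sketch of what those checks amount to; the URL and timeout below just mirror the alert text above, not the real Icinga/LVS check definitions:

```
import urllib.request

# Simplified stand-in for the health probes described above: fetch an endpoint
# on the videoscaler and treat anything slower than the 10-second socket
# timeout as CRITICAL. The URL is illustrative, not the actual check config.
def probe(url: str, timeout: float = 10.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# On a CPU-starved videoscaler, php-fpm may accept the connection but not get
# scheduled to answer in time, so the check flaps between OK and CRITICAL even
# though the ffmpeg jobs are still making progress.
print(probe("https://mw1446.eqiad.wmnet/"))
```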
In the long run, moving to Kubernetes will improve this situation a lot: resource allocation becomes more elastic, so we can absorb sudden spikes in load like this one. Until then, the current cluster is sized for normal operation, and we shouldn't throw hardware at the problem by provisioning for this rare spike condition. Instead, at least within this task, I want to focus on making sure the videoscalers keep responding to health checks while under load.
As a starting point: @jhathaway noted that we're running ffmpeg at niceness -19, which is nearly the highest priority the scheduler allows (the range is -20 to 19); raising that value, so ffmpeg yields the CPU more readily, might be an easy way to relieve the pressure. I don't have the historical context for why it's set that way, but if we can change it safely, it could be a good first step.
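To make the idea concrete, here's a minimal sketch of launching ffmpeg at a higher niceness; the command line and the target niceness are placeholders, not the actual jobrunner invocation:

```
import os
import subprocess

# Placeholder value: positive niceness makes ffmpeg yield the CPU to other
# processes (like php-fpm answering health checks). -19 is near the top of the
# -20..19 priority range, which is why it can starve everything else.
TARGET_NICENESS = 10

def run_transcode(ffmpeg_args: list[str]) -> None:
    """Run one transcode with ffmpeg reniced relative to the parent process."""
    def bump_niceness() -> None:
        # Runs in the child just before exec; os.nice() adds to the current
        # niceness, so this assumes the parent is running at niceness 0.
        os.nice(TARGET_NICENESS)

    subprocess.run(["ffmpeg", *ffmpeg_args], preexec_fn=bump_niceness, check=True)

# Hypothetical usage; the real job passes its own input/output and codec options.
run_transcode(["-i", "input.webm", "-vf", "scale=-1:480", "output.webm"])
```

Equivalently, the wrapper could just prefix the command with `nice -n 10`; the point is only that anything above -19 leaves room for the health-check handlers to get scheduled.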