
Videoscalers fail health checks while CPU is maxed
Open, High, Public

Description

On Friday, after T306697, a big stack of videos was enqueued for transcoding all at once. This has historically been a tricky situation for the job queue, because ffmpeg consumes 100% of a machine's CPU while it chugs through the backlog, starving out other work.

We've done some work like T279100 to make sure other items on the job queue continue to make progress while ffmpeg is in that state, but there can be other effects of getting into this situation: this weekend, we were alerted because (due to CPU starvation) the videoscalers stopped answering health checks from Icinga, and presumably also from LVS.

We got a number of flapping IRC alerts like this:

<icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
<icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering

and paged a handful of times over the course of the weekend:

<icinga-wm>	 PROBLEM - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems

This is partly a monitoring issue; those machines were still getting work done, even if they weren't responsive to Icinga, so there was nothing substantive to fix. But the only remedy was to downtime the LVS alert, which leaves us blind to real problems. On top of the monitoring problem, if LVS health checks were also missed, that would've affected load balancing and slowed down queue processing. (I haven't checked logs for a smoking gun but I think this did happen; at one point I happened to notice a host's CPU usage drop to 0%, probably because LVS marked it down.)

In the long run, we'll improve this situation a lot by moving to Kubernetes so that resource allocations are more elastic, and we can sustain sudden spurts in load like this. Until then our current cluster is the right size for normal operation, and we shouldn't throw hardware at the problem by provisioning the system for this rare spike condition. Instead (at least in this task) I want to focus on making sure the videoscalers continue to respond to health checks while under load.

As a starting point: @jhathaway noted that we're running ffmpeg at niceness -19, which is quite assertive; raising that value might be an easy way to relieve the pressure. I don't have historical context for why it is that way, but if we can change it safely, it might be a good first step.

Event Timeline

RLazarus created this task.

Another option would be to use cpu pinning via taskset(1), where ffmpeg is assigned to cpus 1-N and cpu 0 is left free to service health checks.

> As a starting point: @jhathaway noted that we're running ffmpeg at niceness -19, which is quite assertive; raising that value might be an easy way to relieve the pressure. I don't have historical context for why it is that way, but if we can change it safely, it might be a good first step.

It's just that php-fpm is started with that niceness and ffmpeg inherits it, because it's spawned from PHP. See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/mediawiki/php.pp#258.
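To illustrate the inheritance (a standalone sketch, not code from that puppet module or from MediaWiki): any child spawned from a PHP process reports the same niceness the PHP process was started with, which is how ffmpeg ends up at -19 under php-fpm.

```php
<?php
// niceness-inheritance.php -- standalone demo, not MediaWiki code.
// The `nice` command with no operands prints the niceness it was started
// with; since it is spawned from this PHP process, it inherits PHP's value.
$childNiceness = trim( shell_exec( 'nice' ) );
echo "child processes inherit niceness {$childNiceness}\n";

// Running `php niceness-inheritance.php` prints 0; running it as
// `nice -n 10 php niceness-inheritance.php` prints 10. Under php-fpm started
// at -19, any ffmpeg it spawns therefore runs at -19 as well.
```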

We can debate whether -19 is the proper value for this, but overall:

  • it's a relative number (compared to the niceness of the rest of the processes running on the host), not an absolute one.
  • running PHP and answering requests quickly is the number one thing that mediawiki clusters are meant to do.

So I don't think there is much room for improvement in that number itself.

What would actually be somewhat better would be to have MediaWiki execute external processes with a saner priority. In the videoscalers' case, probably something close to the maximum niceness of 19 (the valid range is -20 to 19).
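One way to do that, sketched below purely as an illustration (it is not taken from MediaWiki's shell wrapper or from TimedMediaHandler): spawn the encoder and immediately raise its niceness with pcntl_setpriority(). The ffmpeg arguments are placeholders.

```php
<?php
// Sketch: spawn the encoder, then raise its niceness so it runs at the lowest
// CPU priority instead of inheriting php-fpm's -19. Requires the pcntl
// extension; the ffmpeg arguments are placeholders, not real transcode options.
$cmd = [ 'ffmpeg', '-i', '/tmp/source.webm', '/tmp/output.webm' ];

$proc = proc_open( $cmd, [
	1 => [ 'file', '/dev/null', 'w' ],
	2 => [ 'file', '/dev/null', 'w' ],
], $pipes );
if ( $proc === false ) {
	fwrite( STDERR, "failed to start encoder\n" );
	exit( 1 );
}

$pid = proc_get_status( $proc )['pid'];
// 19 is the maximum niceness (valid range -20..19); an unprivileged process
// may raise a child's niceness, it just cannot lower it again afterwards.
if ( !pcntl_setpriority( 19, $pid, PRIO_PROCESS ) ) {
	fwrite( STDERR, "could not renice encoder process {$pid}\n" );
}

// proc_close() waits for the encoder to finish and returns its exit code.
$exitCode = proc_close( $proc );
echo "encoder exited with status {$exitCode}\n";
```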

> Another option would be to use cpu pinning via taskset(1), where ffmpeg is assigned to cpus 1-N and cpu 0 is left free to service health checks.

Yup, probably in a similar change to the one suggested above about niceness. That would alleviate some problems.
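A sketch of what that pinning could look like when the command line is assembled in PHP (the command and paths are placeholders; the real change would live wherever the transcode command is actually built):

```php
<?php
// Sketch: pin the encoder to CPUs 1..N-1 so CPU 0 stays free for php-fpm
// workers answering health checks. Assumes the host has at least two CPUs.
$cpuCount = (int)trim( shell_exec( 'nproc' ) );
$lastCpu  = max( 1, $cpuCount - 1 );
$cpuList  = $lastCpu > 1 ? "1-{$lastCpu}" : '1';

$encoderCmd = [ 'ffmpeg', '-i', '/tmp/source.webm', '/tmp/output.webm' ]; // placeholder args
$wrapped    = array_merge( [ 'taskset', '-c', $cpuList ], $encoderCmd );

$shellLine = implode( ' ', array_map( 'escapeshellarg', $wrapped ) );
passthru( $shellLine, $exitCode );
echo "encoder exited with status {$exitCode}\n";
```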

Removing SRE; this has already been triaged to a more specific SRE subteam.

Neither solution proposed in this task is currently supported by Extension:TimedMediaHandler; see https://github.com/wikimedia/mediawiki-extensions-TimedMediaHandler/blob/master/includes/WebVideoTranscode/WebVideoTranscodeJob.php#L410.

I would say that allowing a niceness level to be set for such tasks would make sense.

It would still ultimately not matter if the problem was, as I think it most probably was, that there was no PHP worker left to give the health-check connection a proper response.

I'm not sure we have a good solution at the moment besides "reduce concurrency for videoscaling jobs".

I would say we could reduce the concurrency enough and add the niceness support; that should buy us some time.
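Roughly, the "niceness support" part could look like the sketch below; the names (TranscodeCommandBuilder, the constructor argument standing in for a site setting) are hypothetical and not part of TimedMediaHandler today:

```php
<?php
// Hypothetical sketch of configurable niceness for a transcode-style job.
// Nothing here is TimedMediaHandler's real API; the names are invented for
// illustration. A null niceness keeps today's behaviour (inherit php-fpm's).
class TranscodeCommandBuilder {
	private ?int $niceness;

	public function __construct( ?int $niceness = null ) {
		// Clamp to the kernel's valid range so a config typo can't break the job.
		$this->niceness = $niceness === null ? null : max( -20, min( 19, $niceness ) );
	}

	/** @param string[] $encoderCmd e.g. [ 'ffmpeg', '-i', $src, $dst ] */
	public function build( array $encoderCmd ): array {
		if ( $this->niceness === null ) {
			return $encoderCmd;
		}
		return array_merge( [ 'nice', '-n', (string)$this->niceness ], $encoderCmd );
	}
}

// With a (hypothetical) configured value of 19, the encoder runs at the lowest
// priority, leaving CPU headroom for health-check responses.
$builder = new TranscodeCommandBuilder( 19 );
print_r( $builder->build( [ 'ffmpeg', '-i', 'in.webm', 'out.webm' ] ) );
```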

Change 901602 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] changeprop-jobqueue: reduce concurrency of video transcoding

https://gerrit.wikimedia.org/r/901602

Change 901602 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: reduce concurrency of video transcoding

https://gerrit.wikimedia.org/r/901602

I wouldn't consider this task done, but we took all the actions that are reasonable on the SRE side of the issue. Retagging as necessary.

Joe removed Joe as the assignee of this task. Mar 23 2023, 7:10 AM