On Friday, after T306697, a big stack of videos was enqueued for transcoding all at once. This has historically been a tricky situation for the job queue, because ffmpeg consumes 100% of a machine's CPU while it chugs through the backlog, starving out other work.
We've done some work (e.g. T279100) to make sure other job queue items keep making progress while ffmpeg is in that state, but getting into this situation has other effects too: this weekend we were alerted because, due to CPU starvation, the videoscalers stopped answering health checks from Icinga, and presumably also from LVS.
We got a number of flapping IRC alerts like this:
<icinga-wm> PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
<icinga-wm> PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
and paged a handful of times over the course of the weekend:
<icinga-wm> PROBLEM - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
This is partly a monitoring issue: the machines were still getting work done even though they weren't answering Icinga, so there was nothing substantive to fix, yet the only remedy was to downtime the LVS alert, which leaves us blind to real problems. Beyond the monitoring problem, if the LVS health checks were also being missed, that would have affected load balancing and slowed down queue processing. (I haven't checked the logs for a smoking gun, but I think this did happen; at one point I noticed a host's CPU usage drop to 0%, probably because LVS had marked it down.)
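For illustration, here's a rough sketch of what those checks amount to; the URL and timeout below just mirror the alert text above, not the real Icinga/LVS check definitions:

```
import urllib.request

# Simplified stand-in for the health probes described above: fetch an endpoint
# on the videoscaler and treat anything slower than the 10-second socket
# timeout as CRITICAL. The URL is illustrative, not the actual check config.
def probe(url: str, timeout: float = 10.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# On a CPU-starved videoscaler, php-fpm may accept the connection but not get
# scheduled to answer in time, so the check flaps between OK and CRITICAL even
# though the ffmpeg jobs are still making progress.
print(probe("https://mw1446.eqiad.wmnet/"))
```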
In the long run, moving to Kubernetes will improve this situation a lot: resource allocation becomes more elastic, so we can absorb sudden spikes in load like this one. Until then, the current cluster is sized for normal operation, and we shouldn't throw hardware at the problem by provisioning for this rare spike condition. Instead, at least within this task, I want to focus on making sure the videoscalers keep responding to health checks while under load.
As a starting point: @jhathaway noted that we're running ffmpeg at niceness -19, which is nearly the highest priority the scheduler allows (the range is -20 to 19); raising that value, so ffmpeg yields the CPU more readily, might be an easy way to relieve the pressure. I don't have the historical context for why it's set that way, but if we can change it safely, it could be a good first step.
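To make the idea concrete, here's a minimal sketch of launching ffmpeg at a higher niceness; the command line and the target niceness are placeholders, not the actual jobrunner invocation:

```
import os
import subprocess

# Placeholder value: positive niceness makes ffmpeg yield the CPU to other
# processes (like php-fpm answering health checks). -19 is near the top of the
# -20..19 priority range, which is why it can starve everything else.
TARGET_NICENESS = 10

def run_transcode(ffmpeg_args: list[str]) -> None:
    """Run one transcode with ffmpeg reniced relative to the parent process."""
    def bump_niceness() -> None:
        # Runs in the child just before exec; os.nice() adds to the current
        # niceness, so this assumes the parent is running at niceness 0.
        os.nice(TARGET_NICENESS)

    subprocess.run(["ffmpeg", *ffmpeg_args], preexec_fn=bump_niceness, check=True)

# Hypothetical usage; the real job passes its own input/output and codec options.
run_transcode(["-i", "input.webm", "-vf", "scale=-1:480", "output.webm"])
```

Equivalently, the wrapper could just prefix the command with `nice -n 10`; the point is only that anything above -19 leaves room for the health-check handlers to get scheduled.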