
Rework job queue usage for TimedMediaHandler (video scalers)
Closed, Resolved · Public

Description

We've had a number of problems with the job queue usage for TimedMediaHandler on the video scalers:

  • the job queue's HHVM connections time out during long-running jobs, causing overscheduling that bogs down the servers (a runner set to run 4 concurrent jobs might end up with 16 running)
  • small server pool and extremely variable queue depth; most of the time there is little activity, but occasionally hundreds of batch uploads flood the system with jobs, which then back up
  • no prioritization when jobs are backed up -- everything goes first-come, first-served, meaning a new upload may have *no* transcodes at all while high-resolution transcode jobs from a batch upload, or from someone's attempt to re-encode old transcodes, are still running

I'm planning to retool the jobs for TMH to help fix or work around these:

  • On a new upload, instead of queueing immediate jobs for each output format/resolution, queue a single job.
  • When the job is processed, check the transcode table for how many transcodes are currently running; if more than a threshold, stop right there.
  • If under the threshold, check the transcode table for the best next candidate; if there are no candidates, stop. Prioritize by resolution and then by upload time, so low-resolution versions always get done first and output becomes visible more quickly.
  • When processing is complete, queue another job.

This should avoid requiring any changes to the actual job queue system; while the long-running jobs may still time out, the threshold check will prevent the explosion of extra processes. The prioritization should go a long way toward good user-visible behavior when we're flooded with large upload sets.
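
Roughly, the per-job logic would look like the sketch below. This is hypothetical helper code for illustration, not the actual TMH implementation; the table layout, field names, and the threshold of 4 are all assumptions:

  <?php
  // Hypothetical sketch of the scheduling idea above, not real TMH code.
  // $rows stands in for the transcode table; each row has a state, a target
  // height, and the original upload timestamp.

  const MAX_RUNNING_TRANSCODES = 4; // assumed threshold

  function pickNextTranscode( array $rows ) {
      $running = array_filter( $rows, function ( $r ) {
          return $r['state'] === 'running';
      } );
      // Over the threshold? Stop right there; a later job will try again.
      if ( count( $running ) >= MAX_RUNNING_TRANSCODES ) {
          return null;
      }
      $candidates = array_values( array_filter( $rows, function ( $r ) {
          return $r['state'] === 'queued';
      } ) );
      if ( !$candidates ) {
          return null; // nothing left to do
      }
      // Low resolution first so something is visible quickly, then oldest upload.
      usort( $candidates, function ( $a, $b ) {
          return [ $a['height'], $a['uploaded'] ] <=> [ $b['height'], $b['uploaded'] ];
      } );
      return $candidates[0];
  }

After the chosen transcode finishes, the job queues another instance of itself, so the backlog drains one slot at a time without ever exceeding the threshold.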

Related Objects

Status      Assigned
Duplicate   None
Duplicate   None
Duplicate   None
Duplicate   None
Duplicate   None
Resolved    Dereckson
Resolved    Dereckson
Resolved    Dereckson
Declined    Dereckson
Resolved    Dereckson
Resolved    Dereckson
Resolved    brooke
Resolved    hoo
Resolved    Dereckson
Resolved    Revent
Duplicate   None
Duplicate   None
Declined    Dereckson
Resolved    matmarex
Resolved    brooke

Event Timeline

Note -- need to be able to trigger additional jobs at _start_, not just at _end_, to make sure we're multitasking. :)

joe informs me the timeout problem is solved, which should help reduce the need to check for thresholds. :D

May want to divide into high-priority and low-priority queues; it's not too hard to set them up.
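
A rough sketch of what the MediaWiki-side split could look like, using the prioritized job type name that later appears on the Grafana dashboard; the class wiring here is illustrative, not the content of the actual patches:

  // Illustrative only, not the merged patches: register a second, high-priority
  // transcode job type alongside the existing low-priority one.
  $wgJobClasses['webVideoTranscode'] = 'WebVideoTranscodeJob';
  $wgJobClasses['webVideoTranscodePrioritized'] = 'WebVideoTranscodeJob';

  // Keep the prioritized type off the general-purpose job runners so a dedicated
  // videoscaler pool can pick it up (uses MediaWiki's standard exclusion setting).
  $wgJobTypesExcludedFromDefaultQueue[] = 'webVideoTranscodePrioritized';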

Change 331662 had a related patch set uploaded (by Brion VIBBER):
TMH job queue split into low and high priority

https://gerrit.wikimedia.org/r/331662

Change 331668 had a related patch set uploaded (by Brion VIBBER):
Split TMH transcode queue into two for prioritization

https://gerrit.wikimedia.org/r/331668

Change 331669 had a related patch set uploaded (by Brion VIBBER):
Exclude new high-priority video transcode jobs from default queue

https://gerrit.wikimedia.org/r/331669

Change 331669 merged by jenkins-bot:
Exclude new high-priority video transcode jobs from default queue

https://gerrit.wikimedia.org/r/331669

In T155098#2933628, @brion wrote:

joe informs me the timeout problem is solved, which should help reduce the need to check for thresholds. :D

I don't see the timeout problem as solved. All big transcodes (720p, 1080p, and long ones) still fail.
And we have 360,000+ failed transcodes, and growing.

I don't see the timeout problem as solved. All big transcodes (720p, 1080p, and long ones) still fail.
And we have 360,000+ failed transcodes, and growing.

I'm pretty sure he meant the apache timeout (T153488#2887091), not the transcode timeout.

In T155098#2933628, @brion wrote:

joe informs me the timeout problem is solved, which should help reduce the need to check for thresholds. :D

I don't see the timeout problem as solved. All big transcodes (720p, 1080p, and long ones) still fail.
And we have 360,000+ failed transcodes, and growing.

Transcodes might fail for a number of reasons, not necessarily timeouts. In fact, any transcode that takes less than one day to run should not fail because of issues in the way the videoscalers are set up; moreover, while monitoring TMH/videoscalers, I've seen several hundred long videos (roughly an hour or more) successfully transcoded to 1080p/720p, which mostly proves the problem is no longer system-level timeouts.

@brion, before the change to TMH goes into production, we also need to tweak the jobrunner setup in operations/puppet.

Transcodes might fail for a number of reasons, not necessarily timeouts. In fact, any transcode that takes less than one day to run should not fail because of issues in the way the videoscalers are set up; moreover, while monitoring TMH/videoscalers, I've seen several hundred long videos (roughly an hour or more) successfully transcoded to 1080p/720p, which mostly proves the problem is no longer system-level timeouts.

FYI: We still have a transcode background time limit in TMH ($wgTranscodeBackgroundTimeLimit).

FYI: We still have a transcode background time limit in TMH ($wgTranscodeBackgroundTimeLimit).

That's the reason most big transcodes fail now.
I think they should pass with a longer timeout (16h or 24h).
Is there any reason not to allow a longer timeout?
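
If the limit is raised, it would be a one-line configuration change along these lines; the 24-hour value is just the figure floated above, and the actual decision belongs in the follow-up task:

  // Hypothetical example only; the real value is being discussed separately.
  $wgTranscodeBackgroundTimeLimit = 24 * 3600; // 86400 seconds = 24 hours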

@Joe the puppet changes are in https://gerrit.wikimedia.org/r/#/c/331668/ -- can you take a look and see if it makes sense?

@Yann can you break out a separate task for raising the background time limit? I want to make sure we don't lose that. Currently most of the failures I'm seeing aren't reporting back what their errors were, which makes it tricky to see what's going on; I'll have to poke around and find out why.

In T155098#2953300, @brion wrote:

@Yann can you break out a separate task for raising the background time limit? I want to make sure we don't lose that. Currently most of the failures I'm seeing aren't reporting back what their errors were, which makes it tricky to see what's going on; I'll have to poke around and find out why.

OK done in T155750.

(Ok, I can find the error messages, it's just not linking to the popup anymore. Weird.) Yeah, I see those "Exitcode: 137" entries, which indicate a SIGKILL... but the times don't seem to match up with a strict 8-hour limit. I see some apparently killed at 6 hours or so, others less. Actually I think they're dying on the memory limit -- "Exitcode: 137 / Memory: 4194304" etc.

Ah that 'memory' is just the memory limit being reported. That doesn't help us know why it died. Sigh.
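
For what it's worth, the raw numbers decode straightforwardly (treating the reported memory figure as KiB is an assumption):

  // Exit code 137 follows the shell convention of 128 + signal number, i.e. SIGKILL (9).
  echo 137 - 128, "\n";                      // 9, the SIGKILL signal number
  // 4194304, if it is KiB, is exactly 4 GiB: a configured limit, not a measured peak.
  echo 4194304 / ( 1024 * 1024 ), " GiB\n";  // 4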

Memory usage looks sane enough after logging in... I do see a few processes at 240+ minutes of CPU time; the longest-running right now is at 346 minutes.

It also occurs to me that ulimit handles *CPU time*, not wall-clock time, which may explain why the limit seems to go off prematurely -- ffmpeg uses multithreading to do some operations in parallel (at least decode / scale / encode, and there may or may not be some parallelization in the final encode), so CPU usage sits around 175% rather than 100% on these jobs.
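
A back-of-the-envelope check of that theory, using the 8-hour limit and ~175% CPU figures mentioned above (both assumed here, not confirmed settings):

  // If ulimit counts CPU seconds, a multithreaded ffmpeg burns through the budget
  // faster than wall-clock time suggests.
  $cpuLimitSeconds   = 8 * 3600; // assumed 8-hour CPU-time limit
  $avgCpuUtilisation = 1.75;     // ~175% CPU, as observed on these jobs
  $wallClockHours = $cpuLimitSeconds / $avgCpuUtilisation / 3600;
  printf( "Effective wall-clock limit: ~%.1f hours\n", $wallClockHours ); // ~4.6 hours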

Ok switching over to T155750 for that. :)

This comment was removed by Yann.

Change 331668 merged by Filippo Giunchedi:
Split TMH transcode queue into two for prioritization

https://gerrit.wikimedia.org/r/331668

Change 331662 merged by jenkins-bot:
TMH job queue split into low and high priority

https://gerrit.wikimedia.org/r/331662

Change 336846 had a related patch set uploaded (by Brion VIBBER):
TMH job queue split into low and high priority

https://gerrit.wikimedia.org/r/336846

OK, the puppet changes went out in this morning's puppet SWAT; the MediaWiki side is ready to go for the 11am Pacific MediaWiki SWAT.

Change 336846 merged by jenkins-bot:
TMH job queue split into low and high priority

https://gerrit.wikimedia.org/r/336846

Mentioned in SAL (#wikimedia-operations) [2017-02-09T19:47:56Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.11/extensions/TimedMediaHandler/TimedMediaHandler.php: SWAT: [[gerrit:336846|TMH job queue split into low and high priority]] PART I T155098 (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2017-02-09T19:49:02Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.11/extensions/TimedMediaHandler/TimedMediaHandler.hooks.php: SWAT: [[gerrit:336846|TMH job queue split into low and high priority]] PART II T155098 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-09T19:50:04Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.11/extensions/TimedMediaHandler: SWAT: [[gerrit:336846|TMH job queue split into low and high priority]] PART III T155098 (duration: 00m 44s)

MW changes are live. Existing files will continue to process on the low-priority queue. I'll watch the server load for the next couple of days and see if we need to change the queue runner balance between the two.

The prioritized queue is working well, but I'll probably raise the number of non-prioritized workers today as we're now underutilizing the systems.

Good news:

https://grafana.wikimedia.org/dashboard/db/job-queue-health?var-jobType=webVideoTranscodePrioritized

shows the prioritized jobs have a 99th-percentile wait time below 1 hour, with most wait times on the order of a few minutes. This is a decisive improvement over what we had before.

Change 337230 had a related patch set uploaded (by Brion VIBBER):
Bump up number of queue runners for transcodes

https://gerrit.wikimedia.org/r/337230

Change 337230 merged by Giuseppe Lavagetto:
Bump up number of queue runners for transcodes

https://gerrit.wikimedia.org/r/337230