
Assign 3 more servers to video scaler duty
Closed, Duplicate · Public

Description

Enabling VP9 transcodes will increase the CPU usage of video scaling, because the VP9 encoder is slower. We'll want to at least double capacity, from 3 to 6 machines.

These can be app servers if available.

Event Timeline

brion created this task. Oct 1 2015, 12:45 AM
brion raised the priority of this task from to Needs Triage.
brion updated the task description.
brion added a subscriber: brion.
Restricted Application added subscribers: Matanya, Aklapper. Oct 1 2015, 12:45 AM
Dzahn set Security to None.
Dzahn added a subscriber: Dzahn.

added "hardware-requests"

Dzahn triaged this task as Normal priority. Oct 19 2015, 11:21 PM
RobH added subscribers: Joe, RobH. Edited Oct 21 2015, 11:52 PM

I don't think the application cluster is too busy, and we should be able to snag one from each row and append into the range.

I'd suggest we just append one to the existing ranges:

  • mw1153-1160 are imagescalers (trusty)
  • mw1161-1188 are apaches

So take mw1161 and re-allocate to imagescaler

  • mw1236-mw1258 are apaches
  • mw1259-60 are videoscalers

So we'll take mw1257 & mw1258 and re-allocate to imagescaler.

Since this is taking hosts from the general pool, I'd like some secondary review on this. Looking at the load across the eqiad apache cluster, it seems like it is lower and has more systems to pull from than the api cluster. @Joe would likely be the best to ask.

RobH added a comment. Oct 22 2015, 12:16 AM
This comment was removed by RobH.
Reedy added a subscriber: Reedy. Oct 22 2015, 12:22 AM

T116256 contains 3 eqiad appservers (non api) that have been idle for at least a month, and a 4th recently idle

RobH claimed this task. Oct 27 2015, 12:55 AM

I'll be taking three of those idle systems off T116256 as Reedy points out. Since they are idle, they won't be missed!

There are four on that task, so I'll investigate to ensure none are idle due to hw issues, and take one from there.

Joe added a comment. Oct 28 2015, 7:16 AM

Hi,

I don't think we really need 6 videoscalers, or at least I don't see a compelling reason for that given:

  1. The current videoscalers already provide 50% more capacity than we had previously
  2. If you look at the load, the average utilization of said servers is around 15% over the last week
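
As a rough check of the utilization argument above, the sketch below estimates the average utilization if the same work were spread over fewer hosts. It assumes the 15% figure applies to three hosts and that load redistributes evenly and linearly; neither is confirmed in this thread, and the function name is hypothetical.

```python
def projected_utilization(current_pct, current_hosts, new_hosts):
    """Average utilization (percent) if the same total work moves to new_hosts.

    Assumes work spreads evenly and scales linearly, which is a simplification.
    """
    if new_hosts <= 0:
        raise ValueError("new_hosts must be positive")
    return current_pct * current_hosts / new_hosts


# 15% average over 3 hosts (assumed count), redistributed over 2 hosts.
print(projected_utilization(15.0, 3, 2))  # 22.5
```

Under these assumptions two hosts would still run well below saturation at the current load, but the estimate says nothing about the extra load from re-running transcodes or from VP9.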

In fact I was going to remove the temporary one (mw1152) today and I still plan to do it.

I think having 3 videoscalers is a good starting point and we can expand the fleet from there if needed. I could agree with 4; having 6 straight away seems like a waste of resources to me.

So I will reduce their number to 2 and @RobH can reimage a couple of the idle ones for now, and then we can assess their utilization once we've migrated to vp9 encoding.

brion added a comment. Oct 28 2015, 8:11 AM

@Joe I'm planning to re-run all the Ogg transcodes for improved quality and to fix a bunch of old ones that broke; this will eat all the CPU time for a while... and when we add VP9 that'll also burn them continuously for a few weeks at least.

elukey added a subscriber: elukey. Dec 21 2016, 10:49 AM

In T153488 we repurposed two jobrunners to videoscalers (mw116[89]), so the eqiad videoscaler cluster now totals 4 hosts. We spent a bit of time solving an apache<->hhvm timeout issue, but at the moment the queue is really big and keeps growing.

Current config:

mw1259/60 -> hhvm-threads 10, runners_transcode: 5
mw116[89] -> hhvm-threads 15, runners_transcode: 10
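
To put the config above in perspective, here is a minimal sketch that adds up the transcode runners across the four hosts. It assumes `runners_transcode` is the number of concurrent transcode runners per host, which this thread does not spell out; the dictionary layout and function name are hypothetical.

```python
# Per-host values copied from the config lines above.
# The data structure itself is illustrative, not the real configuration format.
CONFIG = {
    "mw1259": {"hhvm_threads": 10, "runners_transcode": 5},
    "mw1260": {"hhvm_threads": 10, "runners_transcode": 5},
    "mw1168": {"hhvm_threads": 15, "runners_transcode": 10},
    "mw1169": {"hhvm_threads": 15, "runners_transcode": 10},
}


def total_transcode_runners(config):
    """Sum the per-host transcode runner counts (assumed to be concurrency slots)."""
    return sum(host["runners_transcode"] for host in config.values())


print(total_transcode_runners(CONFIG))  # 30
```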

Load and CPU utilization look really good, but the hosts are probably not working at maximum throughput. Before tuning further, I'd like other opinions on how to proceed from somebody who knows the jobrunner internals better than I do (probably @brion).

Current status:

webVideoTranscode: 15002 queued; 820 claimed (197 active, 623 abandoned); 0 delayed
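
A small sketch for reading a status line like the one above: it parses the counts and reports what share of claimed jobs are abandoned. The parsing is based only on the line format shown here, not on any official tool output specification, and the function name is hypothetical.

```python
import re

STATUS = (
    "webVideoTranscode: 15002 queued; 820 claimed "
    "(197 active, 623 abandoned); 0 delayed"
)


def parse_queue_status(line):
    """Split 'jobtype: N label; ...' into the job type and a {label: count} dict."""
    job_type, _, rest = line.partition(":")
    counts = {label: int(num) for num, label in re.findall(r"(\d+) (\w+)", rest)}
    return job_type.strip(), counts


job_type, counts = parse_queue_status(STATUS)
abandoned_share = counts["abandoned"] / counts["claimed"]
print(job_type, counts)
print(f"{abandoned_share:.0%} of claimed jobs are abandoned")
```

For the line above, 623 of the 820 claimed jobs are abandoned, roughly three quarters.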
RobH removed RobH as the assignee of this task. Feb 6 2017, 6:02 PM
brion changed the task status from Open to Stalled. Jan 20 2018, 12:24 AM

Abandoning this for now; will reopen if necessary once closer to VP9. (We're pretty close though with the Stretch update.)