Create a deployment for `shellbox-timedmedia`
Closed, Resolved
Public

Assigned To

Authored By

	Joe
	Feb 12 2024, 11:56 AM

Description

(The name is just a random proposal; please use whichever one fits better, keeping in mind that we're going to process audio and video here.)

This deployment should be similar to the other shellbox deployments, but there are also some differences:

We definitely need to revisit limits/requests here. Since we always set the number of threads for ffmpeg, we can predict how many cores each shellbox request needs. Overall, we need enough CPUs to run videoscaling at the current maximum concurrency: tot_php_workers = concurrency_webVideoTranscode + concurrency_webVideoTranscodePrioritized, with a CPU request of about ffmpeg_threads per worker. Each worker will most likely also need max_memory_per_transcode (currently 4 GB) of memory.
We might need to adapt some numbers in the Apache setup to support large files, and/or the file write size limits.
This might become a very noisy neighbour in terms of I/O and CPU usage. It might be sensible to think of ways to reserve some k8s nodes for async payloads like this one, which don't need low latency.
The timeout for requests needs to be set higher than the timeout we set for videoscaling jobs (so, 1 day).
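
The sizing arithmetic from the first point above can be sketched as follows. This is only an illustration: the class name, the concurrency values and the ffmpeg thread count are hypothetical placeholders, not the production settings, and "4 GB" is treated as 4 GiB for simplicity.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class TranscodeSizing:
    """Hypothetical helper mirroring the sizing rule in the task description."""

    concurrency_transcode: int        # concurrency_webVideoTranscode
    concurrency_prioritized: int      # concurrency_webVideoTranscodePrioritized
    ffmpeg_threads: int               # threads passed to ffmpeg per transcode
    max_memory_per_transcode_bytes: int = 4 * GIB  # "currently 4 GB" (assumed GiB)

    @property
    def total_php_workers(self) -> int:
        # One PHP worker per concurrently running transcode job.
        return self.concurrency_transcode + self.concurrency_prioritized

    @property
    def total_cpu_request(self) -> int:
        # Each worker needs roughly ffmpeg_threads CPUs as its request.
        return self.total_php_workers * self.ffmpeg_threads

    @property
    def total_memory_request_bytes(self) -> int:
        # Each worker may use up to max_memory_per_transcode.
        return self.total_php_workers * self.max_memory_per_transcode_bytes


# Placeholder numbers for illustration only.
sizing = TranscodeSizing(
    concurrency_transcode=10,
    concurrency_prioritized=5,
    ffmpeg_threads=2,
)
print(sizing.total_php_workers)                  # 15
print(sizing.total_cpu_request)                  # 30
print(sizing.total_memory_request_bytes // GIB)  # 60
```

How these totals are split across pods and replicas is a separate decision that this sketch does not cover.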

Finally, we need to set up LVS for this shellbox installation as well - both for the long timeouts and for handling of large files.

Details

Subject	Repo	Branch	Lines +/-
LabsServices: add port for shellbox-video	operations/mediawiki-config	master	+1 -1
shellbox-video: set timeout to one day	operations/puppet	production	+1 -1
Add shellbox-video discovery	operations/dns	master	+2 -0
service: set shellbox-video to production	operations/puppet	production	+1 -1
service: set shellbox-video to lvs_setup	operations/puppet	production	+1 -1
shellbox-video: drop timeout slightly	operations/deployment-charts	master	+2 -1
services_proxy: add shellbox-video listener	operations/puppet	production	+7 -0
shellbox-video: drop requests/replicas	operations/deployment-charts	master	+10 -1
admin_ng: bump limits for shellbox-video	operations/deployment-charts	master	+30 -1
shellbox-video: initial helmfile configuration	operations/deployment-charts	master	+113 -0
service: add basic config for shellbox-video	operations/puppet	production	+39 -0
Add records for shellbox-video service	operations/dns	master	+4 -2
shellbox: add PHP + Apache timeout settings	operations/deployment-charts	master	+11 -3

Related Objects

Status	Assigned	Task
Open	None	T355292 Port videoscaling to kubernetes
Open	None	T356241 Move video transcoding to use Shellbox
Resolved	hnowlan	T357309 Create a deployment for `shellbox-timedmedia`

Event Timeline

Joe created this task. Feb 12 2024, 11:56 AM

kamila claimed this task. Feb 12 2024, 6:15 PM

Change 1003446 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] [WIP] create a shellbox deployment for videoscalers

https://gerrit.wikimedia.org/r/1003446

gerritbot added a project: Patch-For-Review. Feb 14 2024, 3:27 PM

Change 1005139 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] shellbox: add PHP-FPM process_control_timeout setting

https://gerrit.wikimedia.org/r/1005139

This might become a very noisy neighbour in terms of I/O and CPU usage. It might be sensible to think of ways to reserve some k8s nodes for async payloads like this one, which don't need low latency.

In a similar vein, I've been wondering whether it makes sense to distinguish between prioritized and non-prioritized transcode jobs when allocating the available nodes. For prioritized jobs we want more immediacy and more standby 'idle' capacity than for non-prioritized ones, which are a more continuous, grind-through-as-much-as-you-can load.

What we don't want is for the non-prio jobs to congest all the available nodes (say, 10 very long-running, hour-plus transcodes occupying all 10 k8s nodes reserved for transcoding at the same time), making it impossible to handle any prio transcode jobs for that duration.
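
One possible way to express that constraint is a simple slot reservation, sketched below. This is not how the job queue or Kubernetes scheduling is actually configured; the class, its fields and the numbers are hypothetical and only illustrate the idea of holding back capacity for prioritized jobs.

```python
from dataclasses import dataclass


@dataclass
class TranscodeSlots:
    """Hypothetical admission check that keeps some slots free for prio jobs."""

    total_slots: int
    reserved_for_prioritized: int
    running_prioritized: int = 0
    running_regular: int = 0

    def can_start(self, prioritized: bool) -> bool:
        in_use = self.running_prioritized + self.running_regular
        if in_use >= self.total_slots:
            return False  # everything is busy, nothing can start
        if prioritized:
            return True   # prio jobs may use any free slot
        # Non-prio jobs may only fill the slots outside the reserve.
        return self.running_regular < self.total_slots - self.reserved_for_prioritized

    def start(self, prioritized: bool) -> bool:
        if not self.can_start(prioritized):
            return False
        if prioritized:
            self.running_prioritized += 1
        else:
            self.running_regular += 1
        return True


# Ten slots, two of them held back for prioritized transcodes.
slots = TranscodeSlots(total_slots=10, reserved_for_prioritized=2)
started = [slots.start(prioritized=False) for _ in range(10)]
print(started.count(True))  # 8: the last two non-prio jobs are refused
```

With this policy, long-running non-prio transcodes can never occupy the reserved slots, at the cost of leaving that capacity idle when no prioritized jobs are waiting.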

kamila changed the task status from Open to In Progress. Feb 21 2024, 10:42 AM

TehKittyCat subscribed. Apr 22 2024, 5:18 AM

Change #1005139 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: add PHP + Apache timeout settings

https://gerrit.wikimedia.org/r/1005139

Change #1043724 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] service: add basic config for shellbox-video

https://gerrit.wikimedia.org/r/1043724

Change #1043812 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] DNM: Add shellbox-video vars/config

https://gerrit.wikimedia.org/r/1043812

Change #1043815 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/dns@master] Add records for shellbox-video service

https://gerrit.wikimedia.org/r/1043815

Change #1043817 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/dns@master] Add shellbox-video discovery

https://gerrit.wikimedia.org/r/1043817

Change #1043815 merged by Hnowlan:

[operations/dns@master] Add records for shellbox-video service

https://gerrit.wikimedia.org/r/1043815

Change #1047098 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] services_proxy: add shellbox-video listener

https://gerrit.wikimedia.org/r/1047098

Change #1043724 merged by Hnowlan:

[operations/puppet@production] service: add basic config for shellbox-video

https://gerrit.wikimedia.org/r/1043724

Change #1003446 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: initial helmfile configuration

https://gerrit.wikimedia.org/r/1003446

Change #1047124 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] admin_ng: bump limits for shellbox-video

https://gerrit.wikimedia.org/r/1047124

Change #1047124 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: bump limits for shellbox-video

https://gerrit.wikimedia.org/r/1047124

Change #1047491 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] shellbox-video: drop requests/replicas

https://gerrit.wikimedia.org/r/1047491

Change #1047491 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: drop requests/replicas

https://gerrit.wikimedia.org/r/1047491

Change #1047098 merged by Hnowlan:

[operations/puppet@production] services_proxy: add shellbox-video listener

https://gerrit.wikimedia.org/r/1047098

Change #1047523 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] shellbox-video: set timeout to one day

https://gerrit.wikimedia.org/r/1047523

Change #1047537 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] shellbox-video: drop timeout slightly

https://gerrit.wikimedia.org/r/1047537

Change #1047537 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: drop timeout slightly

https://gerrit.wikimedia.org/r/1047537

Change #1047976 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] service: set shellbox-video to lvs_setup

https://gerrit.wikimedia.org/r/1047976

Change #1047976 merged by Hnowlan:

[operations/puppet@production] service: set shellbox-video to lvs_setup

https://gerrit.wikimedia.org/r/1047976

Mentioned in SAL (#wikimedia-operations) [2024-06-20T15:46:50Z] <hnowlan@cumin1002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T357309)

Mentioned in SAL (#wikimedia-operations) [2024-06-20T15:54:29Z] <hnowlan@cumin1002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T357309)

Mentioned in SAL (#wikimedia-operations) [2024-06-20T15:59:48Z] <hnowlan@cumin1002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T357309)

Mentioned in SAL (#wikimedia-operations) [2024-06-20T16:07:48Z] <hnowlan@cumin1002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T357309)