
Create a deployment for `shellbox-timedmedia`
Closed, Resolved, Public

Description

(The name is just a random proposal; please use whichever one fits better, keeping in mind that we're going to process audio and video here.)

This deployment should be similar to the other shellbox deployments, but there are also some differences:

  • We definitely need to revisit limits/requests here. Because we always set the number of threads for ffmpeg, we can predict how many cores each shellbox request needs. Overall, we need enough CPUs to run videoscaling at the current maximum concurrency: tot_php_workers = concurrency_webVideoTranscode + concurrency_webVideoTranscodePrioritized, with a CPU request of about ffmpeg_threads per worker. Each worker will most likely also need max_memory_per_transcode (currently 4 GB) of memory.
  • We might need to adapt some numbers in the Apache setup to support large files, and/or the limits on written file size.
  • This might become a very noisy neighbour in terms of I/O and CPU usage. It might be sensible to think of ways to reserve some k8s nodes for async payloads like this one, which don't need low latency.
  • Timeout for requests needs to be set higher than the timeout we set for videoscaling jobs (so, 1 day)
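As a rough illustration of the sizing arithmetic in the first bullet, the sketch below derives the total CPU and memory requests from the two concurrency settings. The class name and the concurrency and thread values are placeholders for illustration, not the production configuration; only the 4 GB per-transcode memory figure comes from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscodeSizing:
    """Hypothetical helper; field names mirror the terms used in the task."""

    concurrency_transcode: int    # concurrency_webVideoTranscode
    concurrency_prioritized: int  # concurrency_webVideoTranscodePrioritized
    ffmpeg_threads: int           # threads given to ffmpeg per transcode
    max_memory_per_transcode_gb: int = 4  # "currently, 4 GB" per the task

    @property
    def tot_php_workers(self) -> int:
        # One PHP worker per concurrently running transcode job.
        return self.concurrency_transcode + self.concurrency_prioritized

    @property
    def total_cpu_request(self) -> int:
        # About ffmpeg_threads CPUs requested per worker.
        return self.tot_php_workers * self.ffmpeg_threads

    @property
    def total_memory_request_gb(self) -> int:
        return self.tot_php_workers * self.max_memory_per_transcode_gb


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the arithmetic.
    sizing = TranscodeSizing(
        concurrency_transcode=10,
        concurrency_prioritized=5,
        ffmpeg_threads=2,
    )
    print(sizing.tot_php_workers, sizing.total_cpu_request,
          sizing.total_memory_request_gb)
```

These totals cover the whole deployment; how they are split across replicas depends on how many PHP workers each pod runs, which the task does not specify.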

Finally, we need to set up LVS for this shellbox installation as well - both for the long timeouts and for handling of large files.
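The timeout constraint above applies to every layer a request passes through: each one must allow more time than the videoscaling job itself. A minimal sketch of such a check follows; the layer names and durations are hypothetical, and the actual settings live in the puppet and deployment-charts patches listed in the timeline.

```python
from datetime import timedelta


def layers_too_short(
    layer_timeouts: dict[str, timedelta],
    job_timeout: timedelta,
) -> list[str]:
    """Return the layers whose timeout does not strictly exceed job_timeout."""
    return sorted(
        name for name, timeout in layer_timeouts.items()
        if timeout <= job_timeout
    )


if __name__ == "__main__":
    # Hypothetical values: a job timeout just under a day, and request
    # timeouts for three illustrative layers.
    job_timeout = timedelta(hours=23)
    layers = {
        "lvs": timedelta(days=1),
        "apache": timedelta(days=1),
        "php-fpm": timedelta(hours=12),  # deliberately too short
    }
    print(layers_too_short(layers, job_timeout))
```

A layer with a shorter timeout than the job would cut the connection while ffmpeg is still running, so the check treats an equal timeout as too short as well.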

Event Timeline

Change 1003446 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] [WIP] create a shellbox deployment for videoscalers

https://gerrit.wikimedia.org/r/1003446

Change 1005139 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] shellbox: add PHP-FPM process_control_timeout setting

https://gerrit.wikimedia.org/r/1005139

This might become a very noisy neighbour in terms of I/O and CPU usage. It might be sensible to think of ways to reserve some k8s nodes for async payloads like this one, which don't need low latency.

In a similar vein, I've been wondering whether it makes sense to distinguish between prioritized and non-prioritized transcode jobs when allocating the available nodes. For prioritized jobs we want more immediacy and more standby 'idle' capacity than for non-prioritized ones, which are a more continual, grind-as-much-as-you-can load.

What we don't want is for the non-prio jobs to congest all the available nodes (say, 10 very long transcodes, each running for an hour or more, occupying all 10 k8s nodes reserved for transcoding at the same time), making it impossible to handle any prio transcode jobs for that duration.
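One way to picture this is an admission rule that caps non-prioritized jobs below the total capacity, so some slots always stay available for prioritized work. The sketch below is only an illustration of that idea with hypothetical names and numbers; it is not how the job queue or Kubernetes scheduling is actually configured.

```python
from dataclasses import dataclass


@dataclass
class TranscodePool:
    """Hypothetical slot pool with capacity held back for prioritized jobs."""

    total_slots: int
    reserved_for_prio: int
    running_prio: int = 0
    running_nonprio: int = 0

    def can_start(self, prioritized: bool) -> bool:
        free = self.total_slots - self.running_prio - self.running_nonprio
        if free <= 0:
            return False
        if prioritized:
            # Prioritized jobs may use any free slot.
            return True
        # Non-prioritized jobs may never occupy the reserved slots.
        return self.running_nonprio < self.total_slots - self.reserved_for_prio

    def start(self, prioritized: bool) -> bool:
        """Claim a slot if allowed; return whether the job was admitted."""
        if not self.can_start(prioritized):
            return False
        if prioritized:
            self.running_prio += 1
        else:
            self.running_nonprio += 1
        return True


if __name__ == "__main__":
    pool = TranscodePool(total_slots=10, reserved_for_prio=2)
    admitted = sum(pool.start(prioritized=False) for _ in range(10))
    print(admitted, pool.can_start(prioritized=True))
```

With 10 slots and 2 reserved, ten long non-prio transcodes arriving at once can take at most 8 slots, which leaves room for a prioritized job to start immediately.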

kamila changed the task status from Open to In Progress. Feb 21 2024, 10:42 AM

Change #1005139 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: add PHP + Apache timeout settings

https://gerrit.wikimedia.org/r/1005139

Change #1043724 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] service: add basic config for shellbox-video

https://gerrit.wikimedia.org/r/1043724

Change #1043812 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] DNM: Add shellbox-video vars/config

https://gerrit.wikimedia.org/r/1043812

Change #1043815 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/dns@master] Add records for shellbox-video service

https://gerrit.wikimedia.org/r/1043815

Change #1043817 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/dns@master] Add shellbox-video discovery

https://gerrit.wikimedia.org/r/1043817

Change #1043815 merged by Hnowlan:

[operations/dns@master] Add records for shellbox-video service

https://gerrit.wikimedia.org/r/1043815

Change #1047098 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] services_proxy: add shellbox-video listener

https://gerrit.wikimedia.org/r/1047098

Change #1043724 merged by Hnowlan:

[operations/puppet@production] service: add basic config for shellbox-video

https://gerrit.wikimedia.org/r/1043724

Change #1003446 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: initial helmfile configuration

https://gerrit.wikimedia.org/r/1003446

Change #1047124 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] admin_ng: bump limits for shellbox-video

https://gerrit.wikimedia.org/r/1047124

Change #1047124 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: bump limits for shellbox-video

https://gerrit.wikimedia.org/r/1047124

Change #1047491 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] shellbox-video: drop requests/replicas

https://gerrit.wikimedia.org/r/1047491

Change #1047491 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: drop requests/replicas

https://gerrit.wikimedia.org/r/1047491

Change #1047098 merged by Hnowlan:

[operations/puppet@production] services_proxy: add shellbox-video listener

https://gerrit.wikimedia.org/r/1047098

Change #1047523 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] shellbox-video: set timeout to one day

https://gerrit.wikimedia.org/r/1047523

Change #1047537 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] shellbox-video: drop timeout slightly

https://gerrit.wikimedia.org/r/1047537

Change #1047537 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: drop timeout slightly

https://gerrit.wikimedia.org/r/1047537

Change #1047976 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] service: set shellbox-video to lvs_setup

https://gerrit.wikimedia.org/r/1047976

Change #1047976 merged by Hnowlan:

[operations/puppet@production] service: set shellbox-video to lvs_setup

https://gerrit.wikimedia.org/r/1047976

Mentioned in SAL (#wikimedia-operations) [2024-06-20T15:46:50Z] <hnowlan@cumin1002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T357309)

Mentioned in SAL (#wikimedia-operations) [2024-06-20T15:54:29Z] <hnowlan@cumin1002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T357309)

Mentioned in SAL (#wikimedia-operations) [2024-06-20T15:59:48Z] <hnowlan@cumin1002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T357309)

Mentioned in SAL (#wikimedia-operations) [2024-06-20T16:07:48Z] <hnowlan@cumin1002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T357309)

Change #1048387 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] service: set shellbox-video to production

https://gerrit.wikimedia.org/r/1048387

Change #1048387 merged by Hnowlan:

[operations/puppet@production] service: set shellbox-video to production

https://gerrit.wikimedia.org/r/1048387

Change #1043817 merged by Hnowlan:

[operations/dns@master] Add shellbox-video discovery

https://gerrit.wikimedia.org/r/1043817

Change #1047523 merged by Hnowlan:

[operations/puppet@production] shellbox-video: set timeout to one day

https://gerrit.wikimedia.org/r/1047523

Change #1049963 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] LabsServices: add port for shellbox-video

https://gerrit.wikimedia.org/r/1049963

Change #1049970 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] testwiki: use shellbox-video for scaling video

https://gerrit.wikimedia.org/r/1049970

hnowlan claimed this task.
hnowlan added a subscriber: kamila.

I'm sure there'll be some tweaks further down the road, but this deployment has been created. Tracking further work in T356241