
[video2commons] Find a more stable (or self-healing) VM/celery setup
Open, High, Public, BUG REPORT

Assigned To
Authored By
mikez-WMF
Oct 15 2025, 2:58 PM

Description

Steps to replicate the issue / What happens?:

The v2c* backend currently has 6 VMs, each running 4 celery workers. The setup works fine for a few weeks or months; then workers begin to fail one by one until only one or two VMs remain working and receive all the requests, causing them to overload and crash the entire system.

What should have happened instead?:

The setup should remain stable for far longer than a few weeks/months, without workers failing one by one and leaving the remaining few to become overloaded.

*https://github.com/toolforge/video2commons

Details

Other Assignee
Chicocvenancio

Event Timeline

Chicocvenancio added subscribers: DonVi, Chicocvenancio.

@DonVi We had a workaround for this in the old instances with a celery healthcheck. I can create it again
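For anyone unfamiliar with the old workaround, a healthcheck of this kind can be sketched in Python. This is only an illustration of the idea, not the actual script: the Celery app path (video2commons.backend.worker) and the systemd unit name (v2ccelery) are guesses and would need to match the real deployment; the real healthcheck lives in a shell script run from cron.

```python
import subprocess

def worker_is_alive(ping_output: str) -> bool:
    # `celery inspect ping` prints "pong" for each worker that replies.
    return "pong" in ping_output

def healthcheck() -> None:
    """Ping the local celery worker; restart its service if it is unresponsive."""
    try:
        result = subprocess.run(
            # App module path is a guess; adjust to the real Celery app.
            ["celery", "-A", "video2commons.backend.worker",
             "inspect", "ping", "--timeout", "10"],
            capture_output=True, text=True, timeout=30,
        )
        alive = result.returncode == 0 and worker_is_alive(result.stdout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        alive = False
    if not alive:
        # "v2ccelery" is a hypothetical unit name for the worker service.
        subprocess.run(["systemctl", "restart", "v2ccelery"], check=False)
```

Run from cron every few minutes, this catches a hung worker before the remaining instances absorb all the load.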

Fix applied on all encoder instances, let's see in the next months if it works: https://github.com/toolforge/video2commons/pull/252

Just because I'm new and unaware of who's who - are you folks working with Simone on this?

@Cparle I think we should arrange a presentation of everyone involved in v2c :)

I'm a new maintainer since last year, but unwilling to commit long term, which is why I and others asked the WMF for help.

@Don-vip thank you for providing context - I was unaware since I'm new at WMF. We appreciate the merge, and time will tell! We had contracted Simone to help out here, but maybe he can still take a look and provide feedback, or just move on to the other v2c issues.

@SimoneThisDot Apologies for the mix-up! Now we know an expert who may be able to advise on the other issues that haven't been started yet.

Also happy to connect you both 1:1.

The system is still unstable: three out of six instances became unresponsive this week and I had to restart them manually. I have no idea what happened; the instances stopped reporting metrics to prometheus/grafana and I couldn't ssh to them, so I restarted them from Horizon.

Don-vip triaged this task as High priority. Nov 19 2025, 6:14 PM

@Amdrel - is this something you can look into after @Don-vip grants you maintainer rights to investigate?

More context from @Don-vip as we just had a separate exchange about this:

"First, we have not yet stabilized the platform. I had to manually restart two instances that became unresponsive in the past two weeks; I don't know why. And right now v2c is severely degraded, on the verge of complete breakdown. This can be seen in Grafana:

https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&from=now-7d&to=now&timezone=utc&var-project=video&var-instance=$__all

Instances are failing one after the other, I have no idea why."

@Amdrel are you registered to https://toolsadmin.wikimedia.org ? I don't find you when I try to grant you maintainer role.

All instances crashed today, I'm restarting them manually in Horizon.

Instance encoding04 didn't restart; I'm trying again. The others have restarted, and I can see there are a lot of stale output files in /srv/v2c/output:

encoding01: 26G
encoding02: 41G
encoding03: 45G
encoding05: 10G
encoding06: 34G

I'm cleaning all of it to reclaim free space.

Everything should be up and running again.

And everything crashed again :( Looks like a memory issue:

{F70504144}

image.png (296×234 px, 12 KB)

{F70504179}

EDIT: restarted manually instances 2, 3, 4 and 5.

Has this gotten worse in the last couple of weeks? AV1 transcoding uses a lot more memory than VP9 encoding, so maybe that's causing this.

Yes, it was more stable before, so it's probably linked to this change. I'll try to reduce the number of workers per instance.

Hopefully that works. If it does, should we reconsider having AV1 as the default transcoding target in the longer term, at least while we only have 16 GiB of memory? Having to limit how many workers we have due to this change could back up the tool a lot. If we change the default or hide the option, AV1 videos in an MP4 container should still upload fine, since the video data can be remuxed by ffmpeg, which shouldn't take as much memory, though I'd like to verify that. I believe most recently uploaded YouTube videos at or above 1080p use AV1 in either an MP4 or WebM container.
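The remux idea above amounts to copying the AV1 stream into the target container without re-encoding. A rough sketch of the ffmpeg invocation (the function and the choice of Opus for audio are illustrative; WebM only allows Vorbis/Opus audio, so audio from an MP4 source typically still needs a cheap re-encode):

```python
def remux_args(src: str, dst: str) -> list:
    """Build an ffmpeg command that copies the AV1 video stream as-is
    (no re-encoding, so very little memory) and re-encodes only the
    audio to Opus for WebM compatibility."""
    return ["ffmpeg", "-i", src, "-c:v", "copy", "-c:a", "libopus", dst]
```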

I think the number of workers was above our needs. I doubled the number of instances when I performed the last Debian migration and kept that number since, as the platform was previously often overloaded. Let's see how it goes when reducing from 4 to 3 workers per instance: https://github.com/toolforge/video2commons/pull/267

(deploying the change right now)

One thing that should help a lot is a better dispatch of tasks across instances. It seems to me that the new tasks are often picked up by encoding01, even if one of its workers is currently transcoding a video while other instances are chilling. I'm not 100% sure about that, but we need to check that the load is evenly distributed across the instances.


I looked into this and I think this may be due to a default configuration value causing workers to accept more tasks than they can process. Setting worker_prefetch_multiplier to 1 (default 4) will probably solve this problem. Here's what the documentation says:

worker_prefetch_multiplier

Default: 4.

How many messages to prefetch at a time multiplied by the number of concurrent processes. The default is 4 (four messages for each process). The default setting is usually a good choice, however – if you have very long running tasks waiting in the queue and you have to start the workers, note that the first worker to start will receive four times the number of messages initially. Thus the tasks may not be fairly distributed to the workers.

To disable prefetching, set worker_prefetch_multiplier to 1. Changing that setting to 0 will allow the worker to keep consuming as many messages as it wants.
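The documented behavior can be sketched with a tiny simulation. This is a rough model of broker delivery, not Celery's actual logic: each worker may hold up to concurrency × multiplier unacknowledged messages, and workers drain the queue in connection order.

```python
from collections import deque

def distribute(tasks, workers, concurrency, multiplier):
    """Roughly simulate how many tasks each worker grabs when all tasks
    are enqueued at once and workers connect in order."""
    queue = deque(range(tasks))
    assigned = [0] * workers
    cap = concurrency * multiplier
    for w in range(workers):
        while queue and assigned[w] < cap:
            queue.popleft()
            assigned[w] += 1
    return assigned

# With the default multiplier of 4, the first worker grabs everything:
# distribute(6, 3, 2, 4) -> [6, 0, 0]
# With multiplier 1, the same 6 tasks spread evenly:
# distribute(6, 3, 2, 1) -> [2, 2, 2]
```

This matches the symptom of encoding01 hoarding tasks while other instances sit idle.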

I've opened a pull request on GitHub that sets the multiplier to 1: https://github.com/toolforge/video2commons/pull/268

Edit: In my local testing with 3 workers with a concurrency of 2, starting 6 tasks all at the same time resulted in an even spread across all of the workers.

@Amdrel are you registered to https://toolsadmin.wikimedia.org ? I don't find you when I try to grant you maintainer role.

My request to join toolsadmin has been approved. Could you please try again? Thank you!

I've gained access and I noticed a couple of things so far. The new cron job and old cron jobs aren't running due to the following errors:

Dec 04 21:18:04 encoding02 cron[826]: (CRON) INFO (Running @reboot jobs)
Dec 04 21:18:04 encoding02 cron[826]: (*system*v2chealthcheck) ERROR (Syntax error, this crontab file will be ignored)
Dec 04 21:18:04 encoding02 cron[826]: Error: bad username; while reading /etc/cron.d/v2chealthcheck
Dec 04 21:18:04 encoding02 cron[826]: (*system*v2ccleanup) ERROR (Syntax error, this crontab file will be ignored)
Dec 04 21:18:04 encoding02 cron[826]: Error: bad username; while reading /etc/cron.d/v2ccleanup
Dec 04 21:18:04 encoding02 cron[826]: (CRON) INFO (pidfile fd = 3)
Dec 04 21:18:04 encoding02 systemd[1]: Started cron.service - Regular background program processing daemon.
-- Boot 15260fb575cd45f8a2bbf562f71d7dc8 --

After restarting cron.service, the failed crontabs loaded without this error and executed again. I suspect cron.service is starting before the service that permits LDAP logins (sssd). The failed crontabs use the tools.video2commons user, which isn't a local user, so my suspicion is there's a startup race.
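The suspected race can be illustrated with Python's pwd module: cron's "bad username" error amounts to a failed user lookup, which for LDAP users only succeeds once sssd is up.

```python
import pwd

def user_exists(name: str) -> bool:
    """Equivalent of the lookup cron does when parsing a crontab.
    For an LDAP user like tools.video2commons, getpwnam raises KeyError
    until sssd is running, so a cron started earlier rejects the file."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False
```

An ordering fix (e.g. making cron.service start after sssd via a systemd drop-in) would close the window where the lookup fails at boot.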

Also, I noticed the path to the healthcheck script is incorrect and that specific crontab will fail regardless of the aforementioned issue:

tools.video2commons@encoding01:/srv/v2c$ /bin/sh /srv/v2c/video2commons/utils/healthcheck.sh
/bin/sh: 0: cannot open /srv/v2c/video2commons/utils/healthcheck.sh: No such file

It should be /srv/v2c/utils/healthcheck.sh instead.

After looking deeper into the logs I can see the OOM killer has been getting invoked around the times instances stop reporting to the dashboard. I also saw several logs indicating network errors afterwards, though the instance did remain up, and lots of prometheus write errors after the OOM killer got to work.

AV1 transcoding can be a bit of a memory hog. encoding04 is currently processing a single job and it's using 12.0G of memory (RES). The video being converted is a 4k Apple ProRes 422 file. Processing just two of these files concurrently would result in OOM errors.

There's no way to limit the memory usage to a predictable amount when calling ffmpeg?

Not directly, but there are a couple of things we can do. We can tweak settings sent to the encoder, such as look-ahead distance, and see if that results in less memory being used, but that may affect output quality and file size. We could also try lowering the number of threads for each ffmpeg process we spawn, and I think this is what we should try first. I noticed that we're currently creating 16 threads per process (matching the core count, the default for libsvtav1), which is a bit excessive when multiple tasks are processed at once: the current configuration results in 48 threads competing for 16 CPU cores on a worker processing 3 tasks. More threads means more memory, so we're using more memory than needed on worker nodes that process more than one video concurrently.
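As a sketch of the thread-limiting option: with a reasonably recent ffmpeg (5.1+), `-svtav1-params` forwards options to SVT-AV1, and its `lp` ("logical processors") option caps the encoder's thread count. The function name and the default of 4 threads are illustrative, not the tool's actual settings.

```python
def encode_args(src: str, dst: str, threads: int = 4) -> list:
    """Build an ffmpeg AV1 encode command with a capped thread count.
    Fewer encoder threads also means fewer per-thread buffers, i.e. less
    peak memory per job."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libsvtav1",
        # lp caps SVT-AV1's thread usage (default is the full core count).
        "-svtav1-params", f"lp={threads}",
        dst,
    ]
```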

It seems quite stable now, thank you @Amdrel !