[Maintenance] Resolve long launch times for canary events on Airflow (30mins in total)
Closed, Resolved | Public | 3 Estimated Story Points

Description

Canary events are now scheduled hourly on Airflow; however, a single event takes ~30s to launch due to Skein scheduling overhead. All events in a given hour take up to 30 mins to complete, blocking scheduling resources on the cluster.

Potential solution: Yarn queues with up to 20 parallel jobs at a time.
This won't reduce the individual launch time, but would bring the overall elapsed time significantly down.
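
A rough back-of-the-envelope sketch of why parallel launches help even though each launch stays slow. It assumes launches run in fixed-size "waves" and that every launch takes the same time, which is a simplification of how YARN and Airflow actually schedule. The function name and the job count of 60 are illustrative assumptions, not values from this task; only the ~30s launch time and the limit of 20 parallel jobs come from the description.

```python
import math


def estimated_elapsed_seconds(n_jobs: int, launch_seconds: float, parallelism: int) -> float:
    """Estimate total elapsed time when jobs launch in waves of `parallelism`.

    Hypothetical helper: assumes every launch takes `launch_seconds` and
    that a new wave only starts once the previous one has finished.
    """
    if n_jobs < 0:
        raise ValueError("n_jobs must be non-negative")
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    waves = math.ceil(n_jobs / parallelism)
    return waves * launch_seconds


# Illustrative numbers: 60 jobs is an assumed count, 30s is the launch time above.
sequential = estimated_elapsed_seconds(60, 30, parallelism=1)
parallel = estimated_elapsed_seconds(60, 30, parallelism=20)
print(f"sequential: {sequential / 60:.1f} min, 20 in parallel: {parallel / 60:.1f} min")
```

Under these assumptions the per-job launch time is unchanged, and only the number of waves shrinks.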

Event Timeline

Ahoelzl renamed this task from Resolve long launch times for canary events on Airflow (30mins in total) to [Maintenance] Resolve long launch times for canary events on Airflow (30mins in total). Apr 1 2024, 5:47 PM
Ahoelzl set the point value for this task to 3.

Change #1019683 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Update yarn scheduler's queues configuration

https://gerrit.wikimedia.org/r/1019683

Change #1019683 merged by Btullis:

[operations/puppet@production] Update yarn scheduler's queues configuration

https://gerrit.wikimedia.org/r/1019683

Mentioned in SAL (#wikimedia-analytics) [2024-04-18T11:41:33Z] <btullis> adding new 'launchers' yarn queue and renaming 'fifo' to 'gpus' for T361499

Mentioned in SAL (#wikimedia-analytics) [2024-04-18T14:07:28Z] <btullis> restarted the hadoop-yarn-resourcemanager.service on an-master100[3-4] to pick up new queue settings for T361499

Overall execution time has been divided by 3 (10 mins for 170 jobs). We are using the new 'launchers' queue to launch small jobs and have raised the Airflow parallelism to 10 tasks. We can replicate this model for other jobs :)