Currently some jobs are too limited by spark.dynamicAllocation.maxExecutors=16:
- the Spark task in the aqs hourly job takes ~1.5 min, whereas it took ~30 s in Hive
- the mediarequest hourly job takes 3.5 min
- app_session_metrics takes an hour, and this may be why the skein log collector is crashing. The current fix is here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commit/9fa5d7e003c86785ba5149642eaec9a0d5bee596
The maxExecutors configuration is defined here:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/config/dag_default_args.py#L100
It is correctly propagated to Skein, but we should adapt the value to each DAG instead of using a single global cap.
For the three jobs above, we could set the value to 64. Let's review the other jobs as well.
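A per-DAG override could be sketched like this. This is illustrative only: the `default_args` shape and the helper name `with_spark_conf_overrides` are assumptions, not the actual structure of `dag_default_args.py`.

```python
# Illustrative sketch only: the structure of default_args and the helper
# name are assumptions, not the real wmf_airflow_common API.

default_args = {
    "conf": {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": "16",  # current global cap
    },
}

def with_spark_conf_overrides(args: dict, overrides: dict) -> dict:
    """Return a copy of args with Spark conf entries overridden per DAG,
    leaving the shared defaults untouched."""
    return {**args, "conf": {**args.get("conf", {}), **overrides}}

# For the three jobs above, raise the executor cap to 64.
aqs_hourly_args = with_spark_conf_overrides(
    default_args,
    {"spark.dynamicAllocation.maxExecutors": "64"},
)
```

Keeping the override as a small merge over the shared defaults means each DAG states only what it changes, and the global defaults remain the single source of truth for everything else.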