Since October 2024 we have been seeing issues with the feature selection task.
The Spark job seems to get stuck in the approxQuantile stage, apparently doing nothing.
The cause is not clear at this point, and restarting the task might succeed. What is particularly problematic is that the job never ends and can keep running for several days until an operator kills it. Because it is configured to use the sequential pool, it blocks other unrelated tasks using this same pool in the search airflow instance.
AC:
- mitigate the issue by putting mjolnir in its own pool
- try to understand why it gets stuck and fix the cause
- make sure this task does not run forever: prefer failing if it has been running for more than X hours (X to be determined, but this task usually takes between 1h30 and 2h00 to run)
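
A rough sketch of how the pool and timeout items could look on the airflow side, assuming the task is declared with a standard operator that accepts the usual `pool` and `execution_timeout` arguments. The pool name `mjolnir` and the 4 hour value are placeholders, not decisions; the helper name is made up for illustration, and the pool would have to exist in the airflow instance before it can be used.

```python
from datetime import timedelta

# Hypothetical pool name: a dedicated pool so that a stuck run no longer
# blocks unrelated tasks in the sequential pool.
MJOLNIR_POOL = "mjolnir"

# Placeholder for X. The task normally takes 1h30 to 2h00, so the limit
# should leave some headroom above that.
FEATURE_SELECTION_TIMEOUT = timedelta(hours=4)

# Upper end of the usual runtime reported in this ticket.
USUAL_MAX_RUNTIME = timedelta(hours=2)


def feature_selection_task_args(timeout=FEATURE_SELECTION_TIMEOUT, pool=MJOLNIR_POOL):
    """Build keyword arguments for the operator running feature selection.

    `pool` and `execution_timeout` are standard airflow operator arguments;
    everything else here is illustrative.
    """
    if timeout <= USUAL_MAX_RUNTIME:
        # A limit at or below the normal runtime would fail healthy runs.
        raise ValueError("timeout must be longer than the usual runtime")
    return {
        "pool": pool,
        "execution_timeout": timeout,
    }


if __name__ == "__main__":
    # These would be passed to the operator, e.g. SomeOperator(..., **args).
    print(feature_selection_task_args())
```

With `execution_timeout` set, airflow fails the task once the limit is exceeded instead of letting it run for days. Whether the underlying Spark application is also stopped depends on the operator in use, so that would need to be checked separately.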