On 2025-01-09, we changed the Spark deploy mode of the Airflow Skein launcher operator from cluster to client.
Since then, the feature collection task has been flaky. Airflow appears to fail to retrieve the Skein launcher application state correctly: it believes the Skein launcher is down while it is actually still running. As a result, it repeatedly re-launches the application, so multiple instances of the same Skein and Spark application run simultaneously. Eventually, the task is marked as failed.
Acceptance Criteria (AC):
- Identify the root cause of the communication issues between Airflow and Skein.
- Ensure the feature collection task completes successfully.
- Guarantee that only one instance of the feature collection application is in the RUNNING state at any given time.
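
One possible way to verify the last criterion is to ask the YARN ResourceManager which applications are currently RUNNING and count those with the expected name. The sketch below assumes the applications are submitted to YARN and that the ResourceManager REST API is reachable. The `rm_url` value and the application name `feature_collection` are placeholders, not values taken from our setup.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen


def running_apps_named(rm_response: dict, app_name: str) -> List[str]:
    """Return the IDs of RUNNING applications whose name equals app_name."""
    # The ResourceManager returns {"apps": null} when no application matches.
    apps = (rm_response.get("apps") or {}).get("app") or []
    return [
        app["id"]
        for app in apps
        if app.get("name") == app_name and app.get("state") == "RUNNING"
    ]


def fetch_running_apps(rm_url: str) -> dict:
    """Query the ResourceManager Cluster Applications API for RUNNING apps."""
    # rm_url is an assumed base URL such as "http://<rm-host>:8088".
    query = urlencode({"states": "RUNNING"})
    with urlopen(f"{rm_url}/ws/v1/cluster/apps?{query}", timeout=10) as resp:
        return json.load(resp)
```

Calling `running_apps_named(fetch_running_apps(rm_url), "feature_collection")` and finding more than one ID would indicate duplicate launches. The same check could serve as a guard before any re-launch, although it does not by itself explain why Airflow misreads the Skein state.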