Mjolnir failures in feature collection task
Closed, Resolved | Public | 5 Estimated Story Points

Description

On 2025-01-09, we changed the Spark deployment mode for the Airflow Skein launcher operator from cluster to client.

Since then, we have observed flaky behavior in the feature collection task. It appears that Airflow fails to correctly retrieve the Skein launcher application state. Airflow mistakenly believes the Skein launcher is down when it is actually running. As a result, it repeatedly re-launches the application, leading to multiple instances of the same Skein and Spark application running simultaneously. Eventually, the task is marked as failed.
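
The failure mode described above comes down to treating "could not retrieve the state" as "the application is down". A minimal Python sketch of the safer decision follows. It is an illustration only, not the operator's actual code: `should_relaunch` and `fetch_state` are hypothetical names, and the state strings are the standard YARN application states.

```python
from typing import Callable

# YARN application states after which a relaunch is reasonable.
# FINISHED is deliberately excluded: a finished application needs no retry.
RELAUNCHABLE_STATES = {"FAILED", "KILLED"}


def should_relaunch(fetch_state: Callable[[], str], attempts: int = 3) -> bool:
    """Return True only when a lookup succeeds and reports a dead application.

    `fetch_state` is a hypothetical callable that asks the cluster for the
    launcher's state and raises ConnectionError or TimeoutError when the
    lookup itself fails.
    """
    for _ in range(attempts):
        try:
            state = fetch_state()
        except (ConnectionError, TimeoutError):
            # The lookup failed, so the state is unknown, not "down".
            continue
        return state in RELAUNCHABLE_STATES
    # The state was never confirmed: do not start a second copy.
    return False
```

With this shape, a transient lookup error never triggers a second launch on its own; only a confirmed FAILED or KILLED state does.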

Acceptance Criteria (AC):

  • Identify the root cause of the communication issues between Airflow and Skein.
  • Ensure the feature collection task completes successfully.
  • Guarantee that only one instance of the feature collection application is in the RUNNING state at any given time.
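
The single-instance criterion can be sketched as a pre-launch guard. This is a sketch under stated assumptions, not the fix that was rolled out: application reports are modeled as plain dicts with `name` and `state` keys, and the application name `feature-collection` and the functions `running_instances` and `may_launch` are hypothetical.

```python
from typing import Iterable, List, Mapping


def running_instances(
    reports: Iterable[Mapping[str, str]], app_name: str
) -> List[Mapping[str, str]]:
    """Return the reports for `app_name` that are in the RUNNING state."""
    return [
        report
        for report in reports
        if report.get("name") == app_name and report.get("state") == "RUNNING"
    ]


def may_launch(reports: Iterable[Mapping[str, str]], app_name: str) -> bool:
    """Allow a launch only when no instance of `app_name` is RUNNING."""
    return not running_instances(reports, app_name)


# Hypothetical snapshot of cluster applications.
snapshot = [
    {"name": "feature-collection", "state": "RUNNING"},
    {"name": "feature-collection", "state": "FINISHED"},
    {"name": "other-job", "state": "RUNNING"},
]
```

A check like this only narrows the window for duplicates. Two launchers that inspect the cluster at the same moment can still both pass it, so it complements correct state retrieval rather than replacing it.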

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
hooks: skein: log application final status. | repos/data-engineering/airflow-dags!1019 | gmodena | skein-log-status | main

Event Timeline

Gehel set the point value for this task to 5. (Jan 13 2025, 4:52 PM)

This issue was caused by multiple instances of the same Spark job running concurrently, which created a race condition in Kafka.

ACs have been indirectly met by fixes we rolled out during the Airflow scheduler migration to k8s.

Gehel claimed this task.
Gehel moved this task from Done to Reported on the Discovery-Search (2025.02.10 - 2025.02.28) board.