Migrate the airflow-search scheduler to Kubernetes
Closed, Resolved | Public

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title: Intial build of the refinery container image
Reference: repos/data-engineering/refinery!1
Author: btullis
Source Branch: add_refinery_blubber
Dest Branch: main

Event Timeline

Gehel triaged this task as High priority. Nov 25 2024, 1:33 PM

Issues were discovered post-migration, and are being worked on in this document.

By way of a quick update, there were 3 issues discovered post-migration.

  1. The search team were still using the HiveOperator for one of their DAGs, but our Airflow image does not include the hive CLI.
  2. Some of the DAGs (e.g. drop_old_data_daily) used the BashOperator to run refinery scripts, specifically refinery-drop-older-than. These scripts are not available in our Airflow image.
  3. Some of the DAGs that used the SparkOperator were failing, and the task logs showed only confusing errors saying that the jobs had been killed.

We have largely addressed all of these now, although issue 2 won't be fixed until next week at the earliest.

Fixes:

  1. @dcausse switched the popularity_score DAG to spark-sql
  2. We have decided to create a container image for analytics/refinery to be used with Airflow tasks (T383417) and to launch it with the KubernetesPodOperator to do these tasks.
  3. We ascertained that the issue was primarily caused by the use of cluster mode when launching the spark jobs. The recommended approach is to use client mode, which means that the spark driver log is retained inline with the skein job log and is therefore available in the Airflow UI. The jobs have been switched to client mode and we are now monitoring for stability.
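
To illustrate fix 3, here is a minimal sketch of the difference in plain Python. It only builds a spark-submit command line and is not the operator code our DAGs actually use; the helper name spark_submit_args, the application file name and the application arguments are hypothetical. The --master and --deploy-mode options are standard spark-submit options.

```python
import shlex


def spark_submit_args(application, deploy_mode="client", master="yarn", app_args=()):
    """Build a spark-submit argument list (illustrative helper, not our operator code).

    In client mode the driver runs in the process that calls spark-submit,
    so its log output ends up in that process's log. In cluster mode the
    driver runs inside the cluster, away from the submitting process.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode!r}")
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        application,
        *app_args,
    ]


# Hypothetical job file and arguments, for illustration only.
args = spark_submit_args("hypothetical_job.py", app_args=["--date", "2025-01-01"])
print(shlex.join(args))
```

Defaulting deploy_mode to client in a helper like this means a job only runs in cluster mode when a DAG asks for it explicitly.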

Change #1111230 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: Allow the scheduler to patch existing pods

https://gerrit.wikimedia.org/r/1111230

Change #1111230 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Allow the scheduler to patch existing pods

https://gerrit.wikimedia.org/r/1111230

Change #1111237 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: revert the change to the kube-api networkpolicy

https://gerrit.wikimedia.org/r/1111237

Change #1111237 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: revert the change to the kube-api networkpolicy

https://gerrit.wikimedia.org/r/1111237

The airflow-search scheduler is fully migrated to Kubernetes. Per IRC conversation with @dcausse, all outstanding issues have been resolved, so I'm closing this ticket. Feel free to re-open it if we missed anything.