
Refine jobs should be scheduled by Airflow
Open, Needs Triage, Public

Description

One of the systems that schedules DE's jobs today is the Refine pipeline.
Documentation: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Refine
Code: https://github.com/wikimedia/analytics-refinery-source/tree/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine

NOTE: We should probably do T341229: ProduceCanaryEvents job should be scheduled by Airflow and/or a k8s service first as it will help us learn more about how we'll dynamically schedule with Airflow.

Short context

A single Refine job processes a dynamic number of datasets, all located under the same HDFS base path. It has three distinct aspects:

  1. Identification of refine targets. Determining which datasets within the base path need to be refined.
  2. Schema evolution. Making changes to output Hive tables if schemas have changed.
  3. Data refinement. Actually processing the input data and writing to the output table.

Only the identification of refine targets (1) has to be migrated to Airflow; schema evolution (2) and data refinement (3) will still be executed by Refine.
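
For illustration only, here is a minimal sketch of how target identification (1) might be done from Airflow, assuming pyarrow's HadoopFileSystem is used to reach HDFS (the base path and layout below are made up, not the real Refine conventions):

```python
# Minimal sketch: discover the datasets sitting under a Refine base path.
# Assumes pyarrow can reach HDFS (HADOOP_CONF_DIR / CLASSPATH configured).
from pyarrow import fs


def list_refine_targets(base_path: str):
    """Return the dataset directories found directly under base_path."""
    hdfs = fs.HadoopFileSystem("default")  # "default" picks up fs.defaultFS
    selector = fs.FileSelector(base_path, recursive=False)
    return [
        info.path
        for info in hdfs.get_file_info(selector)
        if info.type == fs.FileType.Directory
    ]


# e.g. list_refine_targets("/wmf/data/raw/event")  # hypothetical base path
```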

Expected result

The ideal result of this task would be an Airflow factory (i.e. a TaskGroup factory) that dynamically generates a DAG for all the target datasets (or 1 DAG for each target dataset). For each hour and dataset, the DAG would execute a SparkSubmitOperator that would call Refine. This way, we could very easily migrate an existing Refine job, just by calling the TaskGroup factory with some configuration.
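
As a rough illustration of what such a factory could look like (the jar path, Java class name, Refine arguments and dataset names below are placeholders rather than the actual refinery-source interface, and the factory must be called inside a DAG context):

```python
from typing import List

import pendulum
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.utils.task_group import TaskGroup


def refine_task_group(group_id: str, datasets: List[str], hour_template: str) -> TaskGroup:
    """Create one SparkSubmitOperator per target dataset, grouped together."""
    with TaskGroup(group_id=group_id) as group:
        for dataset in datasets:
            SparkSubmitOperator(
                task_id=f"refine_{dataset}",
                application="hdfs:///path/to/refinery-job.jar",  # placeholder
                java_class="org.wikimedia.analytics.refinery.job.refine.Refine",
                application_args=[
                    f"--input_path=/wmf/data/raw/{dataset}/{hour_template}",  # placeholder layout
                    f"--output_table=event.{dataset}",  # placeholder
                ],
            )
    return group


with DAG(
    dag_id="refine_example",
    schedule="@hourly",
    start_date=pendulum.datetime(2023, 7, 1, tz="UTC"),
    catchup=False,
) as dag:
    refine_task_group(
        group_id="refine_event_datasets",
        datasets=["dataset_a", "dataset_b"],  # illustrative names
        hour_template="{{ data_interval_start.strftime('%Y/%m/%d/%H') }}",
    )
```

Since application_args is a templated field of SparkSubmitOperator, the hourly partition path can be rendered per DAG run.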

Gotchas
  • The main issue is that the source data for the Refine pipeline can be updated after it is created. For example: an hourly partition for a given source dataset might be created at 5pm, and at 6pm it might be rewritten (updated) to include some data that was initially missing. Refine works around this by checking the modification times of the source dataset and the output dataset, and it re-refines if the source mtime is greater than the destination mtime. One big part of this task is to figure out how to implement this in Airflow! (A rough sketch of one possible approach follows this list.)
  • A problem that has bugged us in the past is the pyarrow library (used to interact with HDFS from Airflow). Its older version was not thread-safe and caused us problems when creating dynamic DAGs. We upgraded to the newest pyarrow library (which is supposed to fix our issues), but have not yet extensively tested it. This might be another potential blocker of this task.
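
One possible shape for the mtime check described in the first gotcha, sketched as an Airflow short-circuit task; this is only an assumption about how it might be wired up, not a settled design, and the paths are illustrative:

```python
# Sketch: skip the downstream Refine task when the source partition is not
# newer than the refined output, mirroring what Refine does on its own today.
from airflow.operators.python import ShortCircuitOperator
from pyarrow import fs


def needs_refine(source_path: str, destination_path: str) -> bool:
    hdfs = fs.HadoopFileSystem("default")
    source_info, dest_info = hdfs.get_file_info([source_path, destination_path])
    if source_info.type == fs.FileType.NotFound:
        return False  # nothing to refine yet
    if dest_info.type == fs.FileType.NotFound:
        return True   # never refined
    return source_info.mtime > dest_info.mtime  # re-refine if the source was rewritten


# Inside a DAG:
# check = ShortCircuitOperator(
#     task_id="check_source_mtime",
#     python_callable=needs_refine,
#     op_kwargs={"source_path": "...", "destination_path": "..."},
# )
```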


Event Timeline

Aklapper added a subscriber: NOkafor-WMF.

Resetting inactive assignee. Please reassign tasks when offboarding - thanks.

Migrating Refine to Airflow may trigger upgrading the Refine jobs to Spark 3.

The latest version of refinery-source includes more error logs, which will ship at the same time.

Ottomata renamed this task from Migrate 1+ Refine jobs to Refine jobs should be scheduled by Airflow.Jul 6 2023, 1:55 PM
Ottomata updated the task description.
Ottomata edited subscribers, added: Milimetric, JAllemandou; removed: NOkafor-WMF.

We (Data-Platform-SRE) have been working on updating the alerting system so that all emails sent by automated monitoring systems use routable domains. This work is being carried out under T358675: Update the From: addresses of all email from DPE pipelines so that they use routable addresses

With regard to refinery, this is happening as a two-stage process:
First, I updated all of the references in puppet to the systemd timers that launch refinery jobs (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014001), adding options to override the default email address.
This is deployed and working.

I have also created a patch to refinery-source itself, here: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1014004
This should update the default email address to match the one I have set on every instantiated job via puppet.
That patch is still awaiting code review.

So in the meantime, if you start to run any refine jobs from Airflow, you may find that their emails come from refine@an-launcher1002.eqiad.wmnet instead of noreply@wikimedia.org.
You should be able to override the from_email for these jobs with configuration parameters, in the same way as the systemd timers currently do.
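
For example, assuming Refine exposes a from_email option as described above, an Airflow-launched job could pass the override along with its other arguments (the jar path and class are placeholders):

```python
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Inside an existing `with DAG(...):` block:
refine_with_email_override = SparkSubmitOperator(
    task_id="refine_example_dataset",
    application="hdfs:///path/to/refinery-job.jar",  # placeholder
    java_class="org.wikimedia.analytics.refinery.job.refine.Refine",
    application_args=[
        "--from_email=noreply@wikimedia.org",  # override the host-derived default sender
        # ... other Refine options ...
    ],
)
```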

@Antoine_Quhen can you remind me? How did we overcome the 'gotcha' described in the task description here?

The main issue is that the source data for the Refine pipeline can be updated after it is created. For example: an hourly partition for a given source dataset might be created at 5pm, and at 6pm it might be rewritten (updated) to include some data that was initially missing. Refine works around this by checking the modification times of the source dataset and the output dataset, and it re-refines if the source mtime is greater than the destination mtime. One big part of this task is to figure out how to implement this in Airflow!

How did we overcome the 'gotcha' described in the task description here?

The main issue is that the source data for the Refine pipeline can be updated after it is created

Answering for posterity: This is not yet overcome: T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine