Reportupdater is one of Data Engineering's scheduling tools that we want to migrate to Airflow.
It generates a set of reports for low-risk datasets, like the UserAgent breakdowns: https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os.
- Docs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater
- Code: https://github.com/wikimedia/analytics-reportupdater
- Queries: https://github.com/wikimedia/analytics-reportupdater-queries
In short, Reportupdater executes a given HQL query on a schedule and appends the results to a TSV report file.
Expected result
A HiveToTSVOperator (or another solution!) that, given an HQL query, updates a report file with its results. It's not clear yet whether this can be done with HQL alone (e.g. a temporary external table on top of the TSV report) or whether we'll need a Scala job that runs the given query into a DataFrame and then updates the report file. We'll need to test this operator with real data, so we should migrate one of the Reportupdater jobs as part of this task. After that, migrating the remaining ones should be trivial.
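A minimal sketch of what such an operator could look like, assuming we query Hive through the Hive provider's HiveServer2Hook. The class name matches the one proposed above, but the connection id, parameter names and report path handling are illustrative, not final:

```python
import csv

from airflow.models import BaseOperator
from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook


class HiveToTSVOperator(BaseOperator):
    """Runs an HQL query and appends the resulting rows to a TSV report file."""

    template_fields = ("hql", "report_path")

    def __init__(self, *, hql, report_path, hive_conn_id="hiveserver2_default", **kwargs):
        super().__init__(**kwargs)
        self.hql = hql
        self.report_path = report_path
        self.hive_conn_id = hive_conn_id

    def execute(self, context):
        hook = HiveServer2Hook(hiveserver2_conn_id=self.hive_conn_id)
        rows = hook.get_records(self.hql)  # list of result rows (tuples)
        # Append the new rows to the existing report file.
        with open(self.report_path, "a", newline="") as report:
            writer = csv.writer(report, delimiter="\t")
            writer.writerows(rows)
        self.log.info("Appended %d row(s) to %s", len(rows), self.report_path)
```

This sketch assumes we read query results into the Airflow worker and write the TSV locally; if we end up going the Spark/Scala route instead, the operator would wrap that job rather than a HiveServer2 query.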
Gotchas
- Some queries return a single row, whereas others can return multiple rows.
- We should implement the 'max_data_points' feature (see docs).
- We should not implement the 'explode_by' feature (see docs). Instead, it should be handled in the DAG file by looping over the list of values to explode and creating one HiveToTSVOperator per value (see the sketch after this list).
- We should not implement the 'graphite' feature. There's another Operator for that.
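For reference, handling 'explode_by' at the DAG level could look roughly like the following. The DAG id, wiki list, schedule, query and file paths are placeholders, and the import path for the operator is hypothetical:

```python
from datetime import datetime

from airflow import DAG

# Hypothetical import path for the operator sketched above.
from analytics.operators.hive_to_tsv import HiveToTSVOperator

# Values that Reportupdater would 'explode by' (illustrative list).
WIKIS = ["enwiki", "dewiki", "frwiki"]

with DAG(
    dag_id="browser_reports",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    for wiki in WIKIS:
        HiveToTSVOperator(
            task_id=f"report_{wiki}",
            # Illustrative query; the real HQL would come from the queries repo.
            hql=f"SELECT dt, COUNT(*) FROM events WHERE wiki_db = '{wiki}' GROUP BY dt",
            report_path=f"/srv/reports/browsers/{wiki}.tsv",
        )
```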