
[Airflow Migration] Migrate reportupdater jobs
Open, Needs Triage, Public

Description

Reportupdater is one of Data Engineering's scheduling tools that we want to migrate to Airflow.
It generates a set of reports for low-risk datasets, like the UserAgent breakdowns: https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os.
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater
https://github.com/wikimedia/analytics-reportupdater
https://github.com/wikimedia/analytics-reportupdater-queries
Basically, reportupdater executes a given HQL query on a schedule and appends the results to a TSV report file.
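The append behavior can be sketched as follows (a hypothetical helper, not reportupdater's actual code, to illustrate the write-header-once, append-rows pattern):

```python
import csv
from pathlib import Path


def append_results_to_tsv(report_path, header, rows):
    """Append query result rows to a TSV report, writing the header only once."""
    path = Path(report_path)
    write_header = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)
```

Each scheduled run would call this with the fresh query results, so the report grows by one batch of rows per run.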

Expected result

A HiveToTSVOperator (or another solution!) that, given an HQL query, updates a report file with its results. It's not clear whether this task can be accomplished using HQL only (e.g. a temporary external table on top of the TSV report) or whether we'll need a Scala job that transforms the given query into a DataFrame and then updates the report file. We'll need to test this operator with real data, so we can migrate one of the reportupdater jobs as part of this task. Migrating the remaining jobs afterwards should then be trivial.
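The core logic of the proposed operator might look roughly like this. This is a hypothetical sketch: a real implementation would subclass airflow.models.BaseOperator and fetch results through a Hive hook; here, run_query stands in for that hook so the sketch stays self-contained.

```python
import csv
from pathlib import Path
from typing import Callable, Sequence


class HiveToTSVOperatorSketch:
    """Sketch of the proposed operator's execute logic (not a real Airflow operator)."""

    def __init__(self, hql: str, report_path: str,
                 run_query: Callable[[str], Sequence[Sequence[str]]]):
        self.hql = hql
        self.report_path = report_path
        self.run_query = run_query  # stand-in for a Hive hook

    def execute(self) -> int:
        """Run the HQL query and append its rows to the TSV report."""
        rows = self.run_query(self.hql)
        with Path(self.report_path).open("a", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            writer.writerows(rows)
        return len(rows)
```

Whether the query runs through a Hive hook or a Scala/DataFrame job, the operator's contract stays the same: execute the query, append the rows to the report.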

Gotchas
  • Some queries return a single row, whereas others can return multiple rows.
  • We should implement the 'max_data_points' feature (see docs).
  • We should not implement the 'explode_by' feature (see docs). Instead, it should be implemented in the DAG file, by looping over the list of values to explode and creating a HiveToTSVOperator for each.
  • We should not implement the 'graphite' feature. There's another Operator for that.
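For the 'max_data_points' gotcha above, one possible shape is a trimming step after each append. This is a sketch only; the exact semantics should follow the reportupdater docs. It assumes the report keeps at most N data points, rows are appended in chronological order, and the oldest rows (at the top of the file) are the ones dropped.

```python
from pathlib import Path


def trim_to_max_data_points(report_path: str, max_data_points: int) -> None:
    """Keep the header line plus only the most recent max_data_points rows.

    Assumes rows are in chronological order, oldest first (an assumption
    of this sketch, matching reportupdater's append-only reports).
    """
    path = Path(report_path)
    lines = path.read_text().splitlines()
    header, rows = lines[0], lines[1:]
    if len(rows) > max_data_points:
        path.write_text("\n".join([header] + rows[-max_data_points:]) + "\n")
```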

Event Timeline

mforns moved this task from Backlog to Estimated on the Data Pipelines board.
Ahoelzl renamed this task from Migrate 1+ reportupdater jobs to [Airflow Migration] Migrate 1+ reportupdater jobs. Oct 20 2023, 5:07 PM
JAllemandou renamed this task from [Airflow Migration] Migrate 1+ reportupdater jobs to [Airflow Migration] Migrate reportupdater jobs. Fri, Apr 26, 10:20 AM

Change #1024614 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Absent all report-updater jobs

https://gerrit.wikimedia.org/r/1024614

Change #1024614 merged by Btullis:

[operations/puppet@production] Absent all report-updater jobs

https://gerrit.wikimedia.org/r/1024614

Mentioned in SAL (#wikimedia-analytics) [2024-04-26T11:08:21Z] <btullis> removed the symlink /srv/published/datasets/periodic/reports on an-launcher1002 to cease publishing reportupdata jobs from this host (T307540)

Reportupdater jobs have all been either deprecated or migrated to Airflow!
The report-updater jobs have been stopped, and data synchronization has been updated to pull from Hadoop folders (updated by Airflow jobs) instead of the report-updater folders.
We can call report-updater deprecated for real, even if we still need to do some code cleanup.
Also, if anything goes wrong with the new system, we still have the data generated by reportupdater stored on HDFS, so we can restore the old system.
This is a great step toward not using any scheduler other than Airflow. Lots of kudos to @amastilovic for migrating the jobs, and to @BTullis for finalizing the operations on deprecating the tool.

Change #1034477 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the last of the reportupdater resources in puppet

https://gerrit.wikimedia.org/r/1034477