
[Airflow Migration] Migrate 1+ reportupdater jobs
Open · Needs Triage · Public

Description

Reportupdater is one of Data Engineering's scheduling tools that we want to migrate to Airflow.
It generates a set of reports for low-risk datasets, like the UserAgent breakdowns: https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os.
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater
https://github.com/wikimedia/analytics-reportupdater
https://github.com/wikimedia/analytics-reportupdater-queries
Basically, reportupdater executes a given HQL query on a schedule and appends the results to a TSV report file.

Expected result

A HiveToTSVOperator (or another solution!) that, given an HQL query, updates a report file with its results. It's not clear whether this task can be accomplished using HQL only (e.g. a temporary external table on top of the TSV report), or whether we'll need a Scala job that transforms the given query's results into a DataFrame and then updates the report file. We'll need to test this operator with real data, so we can migrate one of the reportupdater jobs as part of this task. Afterwards it should be trivial to migrate the remaining ones.
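A minimal sketch of the append step the operator would perform, assuming the query results have already been fetched (e.g. via a Hive hook); the function name and TSV layout here are illustrative assumptions, not reportupdater's actual interface:

```python
import csv
import os


def append_rows_to_tsv(report_path, header, rows):
    """Append query result rows to a TSV report, writing the header
    only when the file is created for the first time."""
    file_exists = os.path.exists(report_path)
    with open(report_path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if not file_exists:
            writer.writerow(header)
        for row in rows:  # handles both single-row and multi-row results
            writer.writerow(row)


# Example: two scheduled runs appending to the same report.
append_rows_to_tsv("browsers.tsv", ["date", "os", "pageviews"],
                   [["2023-10-01", "Android", 1200]])
append_rows_to_tsv("browsers.tsv", ["date", "os", "pageviews"],
                   [["2023-10-02", "Android", 1300],
                    ["2023-10-02", "iOS", 900]])
```

A Scala alternative would do the same thing with a DataFrame write in append mode; either way the operator's contract is just "run query, append rows".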

Gotchas
  • Some queries return a single row, whereas others can return multiple rows.
  • We should implement the 'max_data_points' feature (see docs).
  • We should not implement the 'explode_by' feature (see docs). This should be implemented in the DAG file, by looping over the list of values to explode and creating a HiveToTSVOperator for each.
  • We should not implement the 'graphite' feature. There's another Operator for that.
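For the 'max_data_points' gotcha, one possible approach is to trim the oldest rows after each append so the report keeps only the N most recent data points. A sketch, assuming rows are stored oldest-first (the helper name and layout are assumptions, not reportupdater's implementation):

```python
def trim_to_max_data_points(lines, max_data_points):
    """Keep the header plus at most the last `max_data_points` data rows.
    Assumes data rows are in chronological order, oldest first."""
    header, data = lines[0], lines[1:]
    return [header] + data[-max_data_points:]


report = [
    "date\tpageviews",
    "2023-09-29\t100",
    "2023-09-30\t110",
    "2023-10-01\t120",
    "2023-10-02\t130",
]
# Keeps the header and the 3 newest rows, dropping 2023-09-29.
trimmed = trim_to_max_data_points(report, 3)
```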

Event Timeline

mforns moved this task from Backlog to Estimated on the Data Pipelines board.
Ahoelzl renamed this task from Migrate 1+ reportupdater jobs to [Airflow Migration] Migrate 1+ reportupdater jobs. Oct 20 2023, 5:07 PM