Reportupdater is one of Data Engineering's scheduling tools that we want to migrate to Airflow.
It generates a set of reports for low-risk datasets, like the UserAgent breakdowns: https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os.
- Docs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater
- Code: https://github.com/wikimedia/analytics-reportupdater
- Queries: https://github.com/wikimedia/analytics-reportupdater-queries
In short, Reportupdater executes a given HQL query on a schedule and appends the results to a TSV report file.
Expected result
A HiveToTSVOperator (or another solution!) that, given an HQL query, updates a report file with its results. It's not clear yet whether this can be done with HQL alone (e.g. a temporary external table on top of the TSV report) or whether we'll need a Scala job that runs the given query into a DataFrame and then updates the report file. We'll need to test this operator with real data, so we should migrate one of the Reportupdater jobs as part of this task. After that, migrating the remaining ones should be trivial.
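A minimal sketch of what such an operator could look like, assuming we query Hive through the Hive provider's HiveServer2Hook. The class name matches the one proposed above, but the connection id, parameter names and report path handling are illustrative, not final:

```python
import csv

from airflow.models import BaseOperator
from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook


class HiveToTSVOperator(BaseOperator):
    """Runs an HQL query and appends the resulting rows to a TSV report file."""

    template_fields = ("hql", "report_path")

    def __init__(self, *, hql, report_path, hive_conn_id="hiveserver2_default", **kwargs):
        super().__init__(**kwargs)
        self.hql = hql
        self.report_path = report_path
        self.hive_conn_id = hive_conn_id

    def execute(self, context):
        hook = HiveServer2Hook(hiveserver2_conn_id=self.hive_conn_id)
        rows = hook.get_records(self.hql)  # list of result rows (tuples)
        # Append the new rows to the existing report file.
        with open(self.report_path, "a", newline="") as report:
            writer = csv.writer(report, delimiter="\t")
            writer.writerows(rows)
        self.log.info("Appended %d row(s) to %s", len(rows), self.report_path)
```

This sketch assumes we read query results into the Airflow worker and write the TSV locally; if we end up going the Spark/Scala route instead, the operator would wrap that job rather than a HiveServer2 query.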
Gotchas
- Some queries return a single row, whereas others can return multiple rows.
- We should implement the 'max_data_points' feature (see docs).
- We should not implement the 'explode_by' feature (see docs). Instead, it should be handled in the DAG file by looping over the list of values to explode and creating one HiveToTSVOperator per value (see the sketch after this list).
- We should not implement the 'graphite' feature. There's another Operator for that.
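For reference, handling 'explode_by' at the DAG level could look roughly like the following. The DAG id, wiki list, schedule, query and file paths are placeholders, and the import path for the operator is hypothetical:

```python
from datetime import datetime

from airflow import DAG

# Hypothetical import path for the operator sketched above.
from analytics.operators.hive_to_tsv import HiveToTSVOperator

# Values that Reportupdater would 'explode by' (illustrative list).
WIKIS = ["enwiki", "dewiki", "frwiki"]

with DAG(
    dag_id="browser_reports",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    for wiki in WIKIS:
        HiveToTSVOperator(
            task_id=f"report_{wiki}",
            # Illustrative query; the real HQL would come from the queries repo.
            hql=f"SELECT dt, COUNT(*) FROM events WHERE wiki_db = '{wiki}' GROUP BY dt",
            report_path=f"/srv/reports/browsers/{wiki}.tsv",
        )
```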