Page MenuHomePhabricator

Refactor pingback reports pipelines using dbt
Open, Needs TriagePublic

Description

The current architecture of the pingback reports pipelines is in need of a proper refactor, in order to make maintenance and further development easier and less error-prone.

The implementation right now consists of a weekly Airflow DAG running HQL queries as tasks for each report. The HQL queries themselves work not on Hive/Iceberg tables but on CSV files - loading up the latest CSV report, adding the latest weekly results, and then writing back to the same CSV report.

This kind of implementation does work, but it is quite hard to maintain as CSV files aren't as flexible as DB tables. Some reports are based on software versions, so from time to time we need to add support for a new version that does not exist in the CSV report. Instead of simply adding the new version in the Airflow DAG, the maintainer has to:

  1. Add the new version in the Airflow DAG
  2. Download the existing CSV report, manually add a new empty CSV column with the appropriate header, and re-upload the CSV report back to HDFS

This complexity and brittleness of the code could well be avoided by simply moving the reports to Hive/Iceberg tables, letting the SQL code work on those tables (and add new columns if needed), and then exporting the table to a CSV report.

Since we are currently evaluating the use of the dbt tool, let's take this opportunity to refactor pingback reports pipelines into dbt DAGs run by Airflow.

Done is:

  • Pingback reports are converted into dbt models materialized as DB tables
  • Investigate whether automatically adding new columns to report tables is possible (on_schema_change setting append_new_columns or sync_new_columns)
  • dbt models are scheduled to run as Airflow dbt DAG/tasks
  • CSV exports of these report tables are also scheduled after the weekly runs

Event Timeline

Nice! Just wondering, do we know who uses that output CSV and where?

I'm asking for my own education (I don't know much about pingback), but I'm also wondering: do we actually want to produce a CSV, or can migrate its readers to reading a table at the end of this work?

Nice! Just wondering, do we know who uses that output CSV and where?

I'm asking for my own education (I don't know much about pingback), but I'm also wondering: do we actually want to produce a CSV, or can migrate its readers to reading a table at the end of this work?

I'm not quite sure, but I think it's being used by people outside the foundation. See https://phabricator.wikimedia.org/T365201 for example, that was a bug report that inspired the push to refactor the pipeline.