
[Commons Impact Metrics] Create Airflow job that generates the datasets in Iceberg
Open, Medium, Public, 5 Estimated Story Points

Description

Once we know where the allow-list lives (T358695) and the Spark SQL queries and the Spark Scala script have been productionized (T358681),
we should implement an Airflow DAG that uses them to populate the Commons Impact Metrics datasets as Iceberg tables.
The DAG should run at monthly granularity (unless we decide otherwise after Community feedback, T358688).
It should execute the pipeline steps in sequence, storing each step's intermediate results in temporary tables and passing them to the next operator.
The final results should populate the 5 Commons Impact Metrics datasets as Iceberg tables.
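
As a rough illustration of the hand-off described above, here is a minimal plain-Python sketch (no Airflow imports) of how one monthly run could derive its time interval and chain steps through temporary tables. Everything named here is an assumption for illustration only: the `Step` class, the `monthly_interval` and `plan_run` helpers, the step names, and the `tmp_db` / `final_db` database names are hypothetical and do not come from the actual queries or DAG.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Step:
    name: str                   # hypothetical step name, not a real query name
    writes_final: bool = False  # True if the step populates a final Iceberg table


def monthly_interval(run_date):
    """Return the [start, end) month interval containing run_date."""
    start = run_date.replace(day=1)
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    return start, end


def plan_run(steps, run_date, tmp_db="tmp_db", final_db="final_db"):
    """Plan one monthly run as (step name, input table, output table) rows.

    Each intermediate step writes to a month-suffixed temporary table, and
    the next step reads from it. Database names are placeholders.
    """
    start, _ = monthly_interval(run_date)
    suffix = start.strftime("%Y_%m")
    plan = []
    previous_output = None  # the first step has no upstream table
    for step in steps:
        if step.writes_final:
            output = f"{final_db}.{step.name}"
        else:
            output = f"{tmp_db}.{step.name}_{suffix}"
        plan.append((step.name, previous_output, output))
        previous_output = output
    return plan


if __name__ == "__main__":
    # Hypothetical three-step pipeline for the March 2024 run.
    example_steps = [
        Step("category_graph"),
        Step("intermediate_metrics"),
        Step("final_dataset", writes_final=True),
    ]
    for row in plan_run(example_steps, date(2024, 3, 15)):
        print(row)
```

In a real DAG each planned row would map to one operator, and the temporary tables would presumably be dropped once the final tables are written; that cleanup is not modeled here.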

Tasks:

  • Write the DAG
  • Test it in the development instance
  • Code review and deploy

Definition of done:

  • The DAG is running in production, and the datasets are being populated every month.

Event Timeline

mforns renamed this task from [Commons Impact Metrics] Create Airflow job that generates the Commons Impact Metrics datasets in Iceberg to [Commons Impact Metrics] Create Airflow job that generates the datasets in Iceberg. Feb 28 2024, 6:01 PM
mforns set the point value for this task to 5. Mar 21 2024, 2:15 PM

Change #1021365 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery@master] Commons Impact Metrics queries - Correct order of insert

https://gerrit.wikimedia.org/r/1021365

Change #1021365 merged by Mforns:

[analytics/refinery@master] Commons Impact Metrics queries - Correct order of insert

https://gerrit.wikimedia.org/r/1021365

Change #1023491 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery@master] Modify Commons Impact Metrics queries to ignore ancestor categories

https://gerrit.wikimedia.org/r/1023491

Change #1023492 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery/source@master] Correctly apply distanceToPrimary in CommonsCategoryGraphBuilder

https://gerrit.wikimedia.org/r/1023492

Change #1023492 merged by jenkins-bot:

[analytics/refinery/source@master] Correctly apply distanceToPrimary in CommonsCategoryGraphBuilder

https://gerrit.wikimedia.org/r/1023492

Change #1023491 merged by Mforns:

[analytics/refinery@master] Modify Commons Impact Metrics queries to ignore ancestor categories

https://gerrit.wikimedia.org/r/1023491