This epic includes all the work needed to develop and productionize the data pipeline for Commons Impact Metrics:
queries and Spark code, Airflow jobs, dumps, the public API, allow-list management, documentation, and applying insights from community feedback to the data model.
Step 0:
T358688: [Commons Impact Metrics] Understand feedback from Community and decide what changes to apply
T358695: [Commons Impact Metrics] Establish how we represent the allow-list
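For T358695, one possible representation (an assumption, not the decided format) is a plain text file of allowed Commons category names plus a small loader used by the pipeline jobs. The file name and helper below are hypothetical.

```python
# Hypothetical allow-list loader. Assumes the allow-list is kept as a plain
# text file with one Commons category name per line; this is only one of the
# representations T358695 might settle on.
from pathlib import Path


def load_allow_list(path: str) -> set[str]:
    """Return the set of allowed Commons category names, skipping blanks and comments."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {
        line.strip()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


if __name__ == "__main__":
    allowed = load_allow_list("commons_impact_metrics_allow_list.txt")
    print(f"{len(allowed)} allowed categories")
```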
Step 1:
T358681: [Commons Impact Metrics] Productionize SparkSQL and Spark-Scala -> T358699: [Commons Impact Metrics] Create Airflow job that generates the datasets in Iceberg
T358679: [Commons Impact Metrics] Design API endpoints and Cassandra/Druid datasources
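As a rough illustration of what the productionized job in T358681/T358699 does, the sketch below writes an aggregated dataset into an Iceberg table. It uses PySpark rather than the actual SparkSQL/Spark-Scala code, and the catalog, database, table, and column names are placeholders, not the real schema.

```python
# Minimal PySpark sketch of writing a computed dataset into an Iceberg table.
# Assumes the Iceberg runtime jar is on the classpath and that a catalog named
# "analytics_iceberg" is configured; all names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("commons_impact_metrics_example")
    .config("spark.sql.catalog.analytics_iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.analytics_iceberg.type", "hive")
    .getOrCreate()
)

# Placeholder aggregation: pageviews per allow-listed category and month.
monthly = spark.sql("""
    SELECT category, month, SUM(pageviews) AS pageviews
    FROM analytics_iceberg.wmf_example.commons_category_pageviews_source
    GROUP BY category, month
""")

(
    monthly.writeTo("analytics_iceberg.wmf_example.commons_impact_metrics_monthly")
    .overwritePartitions()
)
```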
Step 2:
T358718: [Commons Impact Metrics] Create a new AQS service with all the endpoints
T358707: [Commons Impact Metrics] Create Airflow job that formats and loads the data to Cassandra for AQS
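The DAG below sketches the shape of the loading job in T358707: a monthly run that formats the Iceberg datasets and loads them into Cassandra for AQS. WMF's production DAGs use their own operator wrappers; this sketch uses the stock SparkSubmitOperator from the Apache Spark provider, and every id and path is a placeholder.

```python
# Illustrative Airflow DAG for a monthly "format and load to Cassandra" job.
# Not the production DAG; dag_id, application path, and connection id are
# all placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="commons_impact_metrics_load_cassandra",
    schedule="@monthly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    load_to_cassandra = SparkSubmitOperator(
        task_id="format_and_load_to_cassandra",
        application="hdfs:///example/artifacts/commons_impact_metrics_cassandra_loader.py",
        conn_id="spark_default",
        application_args=["--month", "{{ ds }}"],
    )
```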
Step 3:
Continue T358718: [Commons Impact Metrics] Create a new AQS service with all the endpoints -> T358715: [Commons Impact Metrics] Add test data in AQS's test environments to back up new AQS service
T358719: [Commons Impact Metrics] Backfill datasets in Iceberg and Cassandra/Druid
T358722: [Commons Impact Metrics] Create API documentation
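To make the documentation work in T358722 concrete, a client request to the new AQS service might look like the snippet below. The endpoint path, parameters, and response shape are purely illustrative; the real contract is whatever T358679 defines and T358722 documents.

```python
# Hypothetical example of querying the new AQS service. Only the base URL is
# real (the existing AQS REST base); the endpoint path and parameters are
# placeholders pending the actual API design.
import requests

BASE = "https://wikimedia.org/api/rest_v1/metrics"

# Placeholder endpoint: monthly metrics for an allow-listed Commons category.
url = f"{BASE}/commons-analytics/EXAMPLE-ENDPOINT/Example_category/monthly/20240101/20240201"

response = requests.get(url, headers={"User-Agent": "commons-impact-metrics-docs-example"})
response.raise_for_status()
print(response.json())
```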
Step 4:
T358701: [Commons Impact Metrics] Create Airflow job that generates the public dumps -> T358710: [Commons Impact Metrics] Make dumps accessible from analytics.wikimedia.org
T358720: [Commons Impact Metrics] Create documentation of the main pipeline
T358712: [Commons Impact Metrics] Implement necessary tools and process to maintain the allow-list
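For the dumps work in T358701/T358710, the sketch below shows one way a monthly dump could be produced: read a month of data from the Iceberg table and write it as a gzipped TSV file for publication under analytics.wikimedia.org. The table name, output path, and file layout are assumptions, not the decided design.

```python
# Sketch of a dump-generation step: export one month from the Iceberg table
# as a gzipped TSV. Assumes the same hypothetical Iceberg catalog as the
# earlier sketch; paths and names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commons_impact_metrics_dumps_example").getOrCreate()

month = "2024-01"
monthly = (
    spark.table("analytics_iceberg.wmf_example.commons_impact_metrics_monthly")
    .where(f"month = '{month}'")
)

(
    monthly.coalesce(1)  # one file per monthly dump
    .write.mode("overwrite")
    .options(sep="\t", header=True, compression="gzip")
    .csv(f"hdfs:///example/dumps/commons_impact_metrics/{month}")
)
```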