Page MenuHomePhabricator

[Commons Impact Metrics] Create Airflow job that formats and loads the data to Cassandra for AQS
Closed, ResolvedPublic13 Estimated Story Points

Description

After the base pipeline is ready T358699, and the API and Cassandra/Druid datasource design are ready T358679,
we should create an Airflow DAG that formats and loads the monthly data to Cassandra.
This work includes writing SparkSql queries to format the data into the desired shape.
The DAG should execute those queries, and load the results to Cassandra.
This DAG should also be monthly, and run right after the base pipeline finishes (a sensor should wait for the base datasets to be present for the month in question).
This task will also require to coordinate with Eric Evans to create the tables in Cassandra.

Tasks:

  • Coordinate with Eric Evans to create the tables in Cassandra.
  • Write the queries that format the base Commons Impact Metrics datasets into the expected shape.
  • Write the Airflow DAG that waits for the base data to be present, executes the queries and loads the data to Cassandra.
  • Test in Airflow's dev instance
  • Vet the data in Cassandra
  • Code-review and deploy

Definition of done:

  • The queries work properly and are in the corresponding repo (probably refinery?)
  • The DAG is in production and running
  • The data is accessible in Cassandra as expected.

Event Timeline

mforns renamed this task from [Commons Impact Metrics] Create Airflow job that formats and loads the data to Cassandra (or Druid) for AQS to [Commons Impact Metrics] Create Airflow job that formats and loads the data to Cassandra for AQS.Mar 21 2024, 2:22 PM
mforns updated the task description. (Show Details)
mforns set the point value for this task to 13.

EDIT: Removed original CREATEs from this comment to avoid confusion and to have the CREATEs publicly documented in a single place: T362697

Wrote down the CREATE TABLES according to the spec, and validated them againts a local Cassandra instance.

CREATEs so far:

CREATE KEYSPACE commons
    WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' }; -- this needs to be changed to whatever prod settings we typically use.


CREATE TABLE commons.category_metrics_snapshot (
    category                   VARCHAR,
    media_file_count           BIGINT,
    media_file_count_deep      BIGINT,
    used_media_file_count      BIGINT,
    used_media_file_count_deep BIGINT,
    leveraging_wiki_count      BIGINT,
    leveraging_wiki_count_deep BIGINT,
    leveraging_page_count      BIGINT,
    leveraging_page_count_deep BIGINT,
    month                      TIMESTAMP,
    PRIMARY KEY (category, month)
);


...

Local cassandra was happy with the DDL:

cqlsh:commons> describe tables;

edits_by_category        category_metrics_snapshot    top_wikis_by_category  
pageviews_by_category    top_pages_by_media_file      top_edited_categories  
pageviews_by_media_file  edits_by_user                top_editors            
top_viewed_media_files   media_file_metrics_snapshot  top_wikis_by_media_file
top_pages_by_category    top_viewed_categories

Change #1023461 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery@master] WIP: SQL queries that format the base Commons Impact Metrics datasets into the expected shape for Cassandra.

https://gerrit.wikimedia.org/r/1023461

xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/681

Draft: Commons Impact Metrics job to ETL into Serving Layer in Cassandra.

Modified the CREATE TABLES according to latest changes to the spec, and validated them against a local Cassandra instance.

Local cassandra was happy with the DDL:

cqlsh:commons> desc tables;

edits_per_user_monthly            top_pages_per_media_file_monthly
top_editors_monthly               media_file_metrics_snapshot     
pageviews_per_media_file_monthly  top_edited_categories_monthly   
pageviews_per_category_monthly    top_pages_per_category_monthly  
category_metrics_snapshot         top_wikis_per_media_file_monthly
top_viewed_media_files_monthly    top_wikis_per_category_monthly  
edits_per_category_monthly        top_viewed_categories_monthly

Will now post this updated DDL at T362697.

Change #1023461 merged by Xcollazo:

[analytics/refinery@master] SQL queries that format the base Commons Impact Metrics datasets into the expected shape for the 14 Cassandra tables.

https://gerrit.wikimedia.org/r/1023461

VirginiaPoundstone raised the priority of this task from Medium to High.May 29 2024, 4:24 PM

Mentioned in SAL (#wikimedia-operations) [2024-06-06T19:30:49Z] <xcollazo@deploy1002> Started deploy [airflow-dags/analytics@a8843e6]: Deploying latest DAGs to the analytics Airflow instance. T358707.

Mentioned in SAL (#wikimedia-operations) [2024-06-06T19:31:16Z] <xcollazo@deploy1002> Finished deploy [airflow-dags/analytics@a8843e6]: Deploying latest DAGs to the analytics Airflow instance. T358707. (duration: 00m 26s)

Mentioned in SAL (#wikimedia-analytics) [2024-06-06T19:31:30Z] <xcollazo> Deploying latest DAGs to the analytics Airflow instance. T358707.

The DAG is now happily backfilling the last 7 months of data! 🎉 🎉 🎉

We are done here!