The Commons Impact Metrics prototype code was written quickly and would benefit from some grooming before we turn it into production jobs.
We should do this after we have gathered the community feedback from T358688 and decided how to represent the allow-list (T358695).
Tasks:
Go over the Scala-Spark script and each of the SparkSQL queries, and for each one:
- Make any obvious performance improvements (e.g. refactor a cross join into a regular equi-join).
- Standardize the code (naming, style, structure).
- Apply any changes coming from the community feedback.
- Parameterize it so it can be executed from the command line as well as from Airflow (via Spark ${params}).
- Add comments at the top (general explanation, usage example, etc.) and wherever else necessary.
- Test it and vet the resulting data.
- Make sure it is placed under the expected path in the expected repo.
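The cross-join refactor mentioned above can be sketched in plain Scala (all table and column names here are hypothetical, not the real query schema). A cross join followed by a filter examines every pair of rows, while an equi-join lets Spark's optimizer choose a hash or broadcast join; the same asymmetry shows up even on in-memory collections:

```scala
// Hypothetical data shapes, for illustration only.
case class Edit(file: String, category: String)
case class Cat(category: String, usage: Long)

object JoinSketch {
  val edits = Seq(Edit("File:A.jpg", "c1"), Edit("File:B.jpg", "c2"))
  val cats  = Seq(Cat("c1", 10L), Cat("c2", 20L), Cat("c3", 5L))

  // Cross join + filter: examines |edits| * |cats| pairs.
  def crossJoinFilter: Seq[(String, Long)] =
    for { e <- edits; c <- cats if e.category == c.category }
      yield (e.file, c.usage)

  // Equi-join via a hash map: one pass over each side, which is
  // analogous to the hash join Spark picks after the rewrite.
  def hashJoin: Seq[(String, Long)] = {
    val byCat = cats.map(c => c.category -> c.usage).toMap
    edits.flatMap(e => byCat.get(e.category).map(u => (e.file, u)))
  }

  def main(args: Array[String]): Unit = {
    // Both plans produce the same rows; only the cost differs.
    assert(crossJoinFilter.toSet == hashJoin.toSet)
    println(hashJoin)
  }
}
```

In DataFrame terms the same change is `a.crossJoin(b).where(cond)` becoming `a.join(b, cond)` with an equality condition.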
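For the parametrization item, one possible shape is a small helper that substitutes `${param}` placeholders in a SparkSQL file, so the same query can be rendered from CLI arguments or from Airflow-supplied parameters. This is a minimal sketch; the query text and parameter names are invented, and the real jobs may use Spark's own substitution instead:

```scala
object QueryParams {
  // Hypothetical query text; the real queries live in the repo.
  val template =
    """SELECT category, COUNT(*) AS edits
      |FROM ${source_table}
      |WHERE snapshot = '${snapshot}'
      |GROUP BY category""".stripMargin

  // Replace every ${name} placeholder with its value; fail fast on a
  // missing parameter so a bad Airflow config surfaces immediately.
  def render(sql: String, params: Map[String, String]): String =
    """\$\{(\w+)\}""".r.replaceAllIn(sql, m =>
      params.getOrElse(m.group(1),
        sys.error(s"missing parameter: ${m.group(1)}")))

  def main(args: Array[String]): Unit = {
    val rendered = render(template,
      Map("source_table" -> "wmf.commons_edits", "snapshot" -> "2024-03"))
    println(rendered)
  }
}
```

Failing fast on missing parameters keeps a misconfigured DAG run from silently executing an incomplete query.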
Definition of done:
- Each SparkSQL or Scala-Spark file is correct, optimized, standardized, in its rightful place, code-reviewed, and ready to be executed by Airflow.