[Commons Impact Metrics] Productionize SparkSQL and Spark-Scala
Open, High, Public, 13 Estimated Story Points

Description

The Commons Impact Metrics prototype code was written quickly and would benefit from some cleanup before it is turned into production jobs.
We should do this once we know what feedback the community gave in T358688 and how the allow-list will be represented (T358695).

Tasks:

This task is about going over the Scala-Spark script and each of the SparkSQL queries, and for each one:

  • Make any obvious performance improvements (e.g. refactor a cross join into a regular join).
  • Standardize the code (naming, style, structure).
  • Add any changes coming from the feedback provided by the community.
  • Parameterize it so that it can be run both from the command line and from Airflow (Spark ${params}).
  • Add comments at the top (general explanation, usage example, etc.) and also wherever necessary.
  • Test it and vet the resulting data.
  • Make sure it is placed under the expected path in the expected repo.
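
As an illustration of the cross-join refactor and of vetting the result, here is a minimal sketch. It uses Python's built-in sqlite3 instead of Spark so that it runs without a cluster, and the tables `category` and `media_file` and their columns are hypothetical stand-ins, not the real CIM schema. It only shows that a cross join filtered in WHERE and a regular join with the condition in ON return the same rows, which is the kind of before/after check that applies when refactoring a query.

```python
import sqlite3

# Hypothetical toy tables; the real CIM tables and columns are not shown in this task.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (category_id INTEGER, category_name TEXT);
    CREATE TABLE media_file (file_id INTEGER, category_id INTEGER);
    INSERT INTO category VALUES (1, 'Maps'), (2, 'Paintings'), (3, 'Empty');
    INSERT INTO media_file VALUES (10, 1), (11, 1), (12, 2);
""")

# Prototype style: cross join, with the join condition applied afterwards in WHERE.
CROSS_JOIN_QUERY = """
    SELECT c.category_name, COUNT(*) AS file_count
    FROM category c CROSS JOIN media_file f
    WHERE c.category_id = f.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
"""

# Refactored style: regular inner join, with the join condition stated in ON.
REGULAR_JOIN_QUERY = """
    SELECT c.category_name, COUNT(*) AS file_count
    FROM category c INNER JOIN media_file f ON c.category_id = f.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
"""

cross_rows = conn.execute(CROSS_JOIN_QUERY).fetchall()
regular_rows = conn.execute(REGULAR_JOIN_QUERY).fetchall()

# Vetting step: the refactored query must return exactly the same data.
assert cross_rows == regular_rows
print(regular_rows)
```

The same comparison could be run against the actual SparkSQL queries by diffing the output of the old and new versions on a small sample; whether the refactor improves performance in a given query depends on the join condition and on what the Spark optimizer already does with it.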

Definition of done:

  • Each SparkSQL or Spark-Scala file is correct, optimized, standardized, in its rightful place, code-reviewed, and ready to be executed by Airflow.

Event Timeline

Change 1008942 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery@master] WIP Latest modifications to Commons Impact Metrics code

https://gerrit.wikimedia.org/r/1008942

mforns set the point value for this task to 8. (Mar 21 2024, 2:14 PM)
mforns changed the point value for this task from 8 to 13.

Change #1015013 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery/source@master] Productionize CommonsCategoryGraphBuilder for CIM project

https://gerrit.wikimedia.org/r/1015013

Change #1016796 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery@master] WIP: Clean up and parameterize SQL code for Common Impact Metrics.

https://gerrit.wikimedia.org/r/1016796

Change #1015013 merged by Mforns:

[analytics/refinery/source@master] Productionize CommonsCategoryGraphBuilder for CIM project

https://gerrit.wikimedia.org/r/1015013

Change #1016796 merged by Mforns:

[analytics/refinery@master] Clean up and parameterize SQL code for Common Impact Metrics.

https://gerrit.wikimedia.org/r/1016796

Change #1023494 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery@master] Calculate deep counts for primary categories only.

https://gerrit.wikimedia.org/r/1023494

Change #1023494 abandoned by Xcollazo:

[analytics/refinery@master] Calculate deep counts for primary categories only.

Reason:

Superseded by 1023491

https://gerrit.wikimedia.org/r/1023494