The Commons Impact Metrics prototype code was written quickly and would benefit from some grooming before we turn it into production jobs.
We should do this after we have gathered the community feedback from T358688 and decided how to represent the allow-list (T358695).
Tasks:
Go over the Scala-Spark script and each of the SparkSQL queries, and for each one:
- Make any obvious performance improvements (e.g. refactor a cross join into a regular equi-join).
- Standardize the code (naming, style, structure).
- Apply any changes coming from the community feedback.
- Parameterize it so it can be executed from the command line as well as from Airflow (via Spark ${params}).
- Add comments at the top (general explanation, usage example, etc.) and wherever else necessary.
- Test it and vet the resulting data.
- Make sure it is placed under the expected path in the expected repo.
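The cross-join refactor mentioned above can be sketched in plain Scala (all table and column names here are hypothetical, not the real query schema). A cross join followed by a filter examines every pair of rows, while an equi-join lets Spark's optimizer choose a hash or broadcast join; the same asymmetry shows up even on in-memory collections:

```scala
// Hypothetical data shapes, for illustration only.
case class Edit(file: String, category: String)
case class Cat(category: String, usage: Long)

object JoinSketch {
  val edits = Seq(Edit("File:A.jpg", "c1"), Edit("File:B.jpg", "c2"))
  val cats  = Seq(Cat("c1", 10L), Cat("c2", 20L), Cat("c3", 5L))

  // Cross join + filter: examines |edits| * |cats| pairs.
  def crossJoinFilter: Seq[(String, Long)] =
    for { e <- edits; c <- cats if e.category == c.category }
      yield (e.file, c.usage)

  // Equi-join via a hash map: one pass over each side, which is
  // analogous to the hash join Spark picks after the rewrite.
  def hashJoin: Seq[(String, Long)] = {
    val byCat = cats.map(c => c.category -> c.usage).toMap
    edits.flatMap(e => byCat.get(e.category).map(u => (e.file, u)))
  }

  def main(args: Array[String]): Unit = {
    // Both plans produce the same rows; only the cost differs.
    assert(crossJoinFilter.toSet == hashJoin.toSet)
    println(hashJoin)
  }
}
```

In DataFrame terms the same change is `a.crossJoin(b).where(cond)` becoming `a.join(b, cond)` with an equality condition.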
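For the parametrization item, one possible shape is a small helper that substitutes `${param}` placeholders in a SparkSQL file, so the same query can be rendered from CLI arguments or from Airflow-supplied parameters. This is a minimal sketch; the query text and parameter names are invented, and the real jobs may use Spark's own substitution instead:

```scala
object QueryParams {
  // Hypothetical query text; the real queries live in the repo.
  val template =
    """SELECT category, COUNT(*) AS edits
      |FROM ${source_table}
      |WHERE snapshot = '${snapshot}'
      |GROUP BY category""".stripMargin

  // Replace every ${name} placeholder with its value; fail fast on a
  // missing parameter so a bad Airflow config surfaces immediately.
  def render(sql: String, params: Map[String, String]): String =
    """\$\{(\w+)\}""".r.replaceAllIn(sql, m =>
      params.getOrElse(m.group(1),
        sys.error(s"missing parameter: ${m.group(1)}")))

  def main(args: Array[String]): Unit = {
    val rendered = render(template,
      Map("source_table" -> "wmf.commons_edits", "snapshot" -> "2024-03"))
    println(rendered)
  }
}
```

Failing fast on missing parameters keeps a misconfigured DAG run from silently executing an incomplete query.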
Definition of done:
- Each SparkSQL or Scala-Spark file is correct, optimized, standardized, in its rightful place, code-reviewed, and ready to be executed by Airflow.