The Commons Impact Metrics prototype code was written quickly and would benefit from some clean-up before being turned into production jobs.
Tasks:
This task is about going over the Scala-Spark script and each of the SparkSQL queries, and for each one:
* Make any obvious performance improvements (e.g. refactor a cross-join into a regular join).
* Standardize the code (naming, style, structure).
* Incorporate any changes coming from the feedback provided by the community.
* Parametrize it so it can be executed both from the command line and from Airflow (via Spark `${params}` substitution).
* Add comments at the top (general explanation, usage example, etc.) and wherever else necessary.
* Test it and vet the resulting data.
* Make sure it is placed under the expected path in the expected repo.
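As a sketch of the cross-join refactor mentioned above (table and column names are illustrative, not taken from the actual prototype queries):

```sql
-- Before: a cross join with the join condition pushed into WHERE.
-- Depending on the plan, this can materialize a full Cartesian product
-- before filtering.
SELECT c.category_name, p.page_title
FROM categories c
CROSS JOIN pages p
WHERE c.page_id = p.page_id;

-- After: the same result expressed as a regular inner join,
-- so the equality predicate is used as the join key.
SELECT c.category_name, p.page_title
FROM categories c
JOIN pages p
  ON c.page_id = p.page_id;
```

Both queries return the same rows; the second makes the join key explicit so Spark can pick an appropriate join strategy instead of a Cartesian product.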
Definition of done:
[ ] Each SparkSQL or Scala-Spark file is correct, optimized, standardized, in its rightful place, code-reviewed, and ready to be executed by Airflow.
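A sketch of the `${params}` parametrization style for the SQL files (the table name, parameter name, and invocation flag below are assumptions for illustration; the exact flag depends on how the job is launched):

```sql
-- Hypothetical parameterized query. The ${...} placeholder is filled in
-- by Spark SQL variable substitution; e.g. from the command line:
--   spark-sql --hivevar snapshot=2024-01 -f commons_impact_metrics.sql
-- Airflow would pass the same key/value pairs when launching the job.
SELECT
  category_name,
  SUM(view_count) AS total_views
FROM commons_category_metrics  -- illustrative table name
WHERE snapshot = '${snapshot}'
GROUP BY category_name;
```

Keeping all run-specific values (snapshot, paths, database names) as `${...}` parameters lets the same file serve both ad-hoc command-line runs and scheduled Airflow runs.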