Spike!
Description
We can implement the Commons Impact Metrics calculations in two ways:
- Using a Commons category allow-list that reduces the number of categories we process (only GLAM and other institutions/affiliates)
- Not using any allow-list and calculating them for all Commons categories regardless of who created them.
Both would use practically the same code, and the "all categories" option would bring more value to the community.
However, "all categories" represents a bigger data engineering challenge: the source data is much larger,
so the data at each step of the pipeline would also be larger, including the resulting dumps and the base data for AQS.
In this task we aim to evaluate, as quickly as possible, how much more difficult it will be to go for all categories versus the allow-list.
This way we can decide which way to go with the time that we have for this project.
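
To illustrate why both options could share practically the same code, here is a minimal sketch in which the allow-list is an optional filter applied before the metrics calculation. The function name `select_categories` and its parameters are hypothetical and are not taken from the prototype. The real pipeline runs on Spark, so this plain-Python version only shows the shape of the switch.

```python
from typing import Iterable, List, Optional, Set


def select_categories(
    all_categories: Iterable[str],
    allow_list: Optional[Set[str]] = None,
) -> List[str]:
    """Return the categories the metrics calculation should process.

    Hypothetical helper: with allow_list=None every category is kept
    (the "all categories" option); otherwise only allow-listed ones are.
    """
    if allow_list is None:
        # "All categories" option: no filtering at all.
        return list(all_categories)
    # Allow-list option: keep only categories present in the allow-list.
    return [c for c in all_categories if c in allow_list]
```

Under this assumption, switching between the two options changes only the input to the rest of the pipeline. The difference in difficulty then comes from the data volume downstream, not from the calculation logic.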
Acceptance Criteria
- We have a sensible estimate of how much time and effort it would take to calculate the metrics for all categories
Required
- Review the prototype code and look for major inefficiencies (we wrote it in a hurry)
- Prepare (adapt) it to run against all categories
- Troubleshoot execution, tweak Spark parameters, follow execution logs/graphs
- Estimate how much time and effort it would take