Refactor the Wikidata Data Quality Report analytics procedures:
- refactor (most of) the data engineering code so that it runs in the analytics cluster;
- the post-processing is currently done in R on a single server, across the data sets produced by Pyspark,
- by a process that eats up to 50 GB of RAM on stat1007 - Analytics-Engineering keep killing it, and for good reason;
- everything must migrate to Pyspark and run in the analytics cluster (see the sketch after this list).
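A minimal sketch of what the migrated join step could look like, assuming hypothetical table and column names (`wd_items`, `wd_usage`, `entity_id`) and an HDFS output path; the real names come from the existing Pyspark jobs:

```python
# Sketch of the join step running in the analytics cluster instead of
# in R on stat1007. Table, column, and path names are hypothetical
# placeholders for whatever the existing pipeline actually produces.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("wd-data-quality-report-joins")
    .enableHiveSupport()
    .getOrCreate()
)

items = spark.table("wd_items")  # hypothetical: one row per Wikidata item
usage = spark.table("wd_usage")  # hypothetical: per-item usage statistics

# The join that currently runs in R on a single server; here it is
# distributed across the cluster executors instead of one machine's RAM.
joined = items.join(usage, on="entity_id", how="left")

# Persist the result to HDFS so the reporting step reads a ready-made set.
joined.write.mode("overwrite").parquet("/tmp/wd_quality_report/joined")
```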
Also:
- inspect what exactly is hammering the stat1007 resources: (a) the joins, or (b) rendering the {ggplot2} visualizations;
- if (a), we fix it simply by moving everything to the cluster; if (b), find a way to render the visualizations efficiently, or visualize aggregated data sets only (as sketched below).
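If (b) turns out to be the culprit, one mitigation is to pre-aggregate in Pyspark and hand {ggplot2} only a small summary table. A sketch under assumed names (the `joined` parquet path and the `quality_class`/`usage_count` columns are hypothetical):

```python
# Sketch: aggregate in the cluster so only a small summary table ever
# reaches the R/{ggplot2} rendering step. Column names are hypothetical
# placeholders for the real report dimensions and metrics.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wd-dq-report-aggregates").getOrCreate()

joined = spark.read.parquet("/tmp/wd_quality_report/joined")

summary = (
    joined
    .groupBy("quality_class")  # hypothetical grouping dimension
    .agg(
        F.count("*").alias("n_items"),
        F.avg("usage_count").alias("mean_usage"),  # hypothetical metric
    )
)

# At most a few hundred rows: safe to write as a single CSV for ggplot2.
(
    summary.coalesce(1)
    .write.mode("overwrite")
    .option("header", "true")
    .csv("/tmp/wd_quality_report/summary")
)
```

{ggplot2} then reads the small CSV, so rendering no longer needs to hold the full joined data set in stat1007's memory.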