Nothing pressing, but this task serves as a placeholder for a few possible improvements:
- Change the fixed start-date for data gathering to a sliding window that e.g., pulls just the past two years' worth of data
- Experiment with the right Spark settings (regular vs. large, # of partitions, etc.)
- Improve logging to watch for anomalies -- e.g., verifying that the # of users, size of files, etc. are within some tolerance of the previous month's run
- Explore whether the job can be expanded to all of Wikipedia (and possibly Commons/Wikidata) without breaking...
- Streamline data extraction: only extract the fields we use and filter out bots from the start
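
The sliding-window idea in the first bullet could be sketched as follows; the two-year default and snapping to the first of the month are assumptions, not decisions already made:

```python
from datetime import date

def sliding_window_start(today: date, years_back: int = 2) -> date:
    """Return the first day of the month `years_back` years before `today`."""
    # Clamp to the 1st so monthly partitions line up cleanly across runs.
    return date(today.year - years_back, today.month, 1)

# A run on 2024-06-15 would then pull data starting 2022-06-01.
print(sliding_window_start(date(2024, 6, 15)))
```

The start date would then be recomputed on each run instead of being hard-coded, so old partitions naturally age out of scope.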
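
For the anomaly-logging bullet, one possible shape of the check is below; the 10% tolerance and the metric names are placeholders for whatever the real run would track:

```python
def within_tolerance(current: float, previous: float, tolerance: float = 0.10) -> bool:
    """True if `current` is within +/- `tolerance` (as a fraction) of `previous`."""
    if previous == 0:
        return current == 0
    return abs(current - previous) / abs(previous) <= tolerance

def check_run_metrics(current: dict, previous: dict, tolerance: float = 0.10) -> list:
    """Return the names of metrics that drifted beyond tolerance vs. the last run."""
    return [name for name, value in current.items()
            if name in previous and not within_tolerance(value, previous[name], tolerance)]

# Hypothetical metrics: user count moved 5% (fine), file size doubled (flagged).
print(check_run_metrics({"num_users": 105, "bytes_out": 200},
                        {"num_users": 100, "bytes_out": 100}))
```

The flagged names could then be logged as warnings (or fail the job outright) rather than relying on someone eyeballing the output sizes each month.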
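
The last bullet (extract only the fields we use, drop bots up front) might look something like this. It is a plain-Python sketch of the idea rather than the actual Spark job, and the field names and `is_bot` flag are assumptions:

```python
# Hypothetical subset of fields the downstream job actually reads.
NEEDED_FIELDS = ("user_id", "timestamp", "page_id")

def extract(records):
    """Yield only the needed fields, skipping bot edits at the source."""
    for rec in records:
        if rec.get("is_bot"):  # filter bots before anything downstream sees them
            continue
        yield {field: rec.get(field) for field in NEEDED_FIELDS}

rows = [
    {"user_id": 1, "timestamp": "2024-01-01", "page_id": 9, "is_bot": False, "extra": 0},
    {"user_id": 2, "timestamp": "2024-01-02", "page_id": 7, "is_bot": True},
]
print(list(extract(rows)))
```

In the real job the same shape would presumably be a `select` plus a `filter` as early in the pipeline as possible, so the narrowed data is what gets shuffled and cached.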