Improve robustness of data processing pipeline
Open, Needs Triage, Public

Description

Nothing pressing, but this task serves as a placeholder for a few possible improvements:

  • Change the fixed start date for data gathering to a sliding window that pulls, e.g., just the past two years' worth of data
  • Experiment to find appropriate Spark settings (regular vs. large, # of partitions, etc.)
  • Improve logging to watch for anomalies, e.g., verifying that # of users, size of files, etc. are within some tolerance of the previous month's run
  • Explore whether the job can be expanded to all of Wikipedia (and possibly Commons/Wikidata) without breaking...
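
As a starting point for discussion, the sliding window and the tolerance check from the list above could be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the function names, the metric names (`num_users`, `file_size_bytes`), the snap to the first day of the month, and the 20% default tolerance are all assumptions made for the example.

```python
from datetime import date


def window_start(run_date: date, years: int = 2) -> date:
    """Start of a sliding window: first day of the month `years` years before run_date.

    Snapping to the first of the month is an assumption; it also avoids
    invalid dates such as Feb 29 in a non-leap year.
    """
    return date(run_date.year - years, run_date.month, 1)


def within_tolerance(current: float, previous: float, tolerance: float = 0.2) -> bool:
    """True if `current` differs from `previous` by at most `tolerance` (relative)."""
    if previous == 0:
        return current == 0
    return abs(current - previous) / abs(previous) <= tolerance


def check_metrics(current: dict, previous: dict, tolerance: float = 0.2) -> list:
    """Return the names of metrics that are missing or outside the tolerance."""
    flagged = []
    for name, prev_value in previous.items():
        cur_value = current.get(name)
        if cur_value is None or not within_tolerance(cur_value, prev_value, tolerance):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Hypothetical metric values, for illustration only.
    last_month = {"num_users": 1050, "file_size_bytes": 100}
    this_month = {"num_users": 1000, "file_size_bytes": 50}
    print(window_start(date(2024, 3, 15)))
    print(check_metrics(this_month, last_month))
```

The flagged metric names could be written to the job log or used to fail the run early. A relative tolerance is used here so the same threshold can apply to metrics of very different magnitudes; per-metric thresholds might turn out to be more appropriate.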

Event Timeline

All: feel free to add ideas / break this down into subtasks etc.