
Improve robustness of data processing pipeline
Open, Needs Triage, Public


Nothing pressing, but this task serves as a placeholder for a few possible improvements:

  • Change the fixed start date for data gathering to a sliding window, e.g., one that pulls just the past two years' worth of data
  • Experiment to find suitable Spark settings (regular vs. large, # of partitions, etc.)
  • Improve logging to watch for anomalies -- e.g., verifying that # of users, size of files, etc. are within some tolerance of previous month's run
  • Explore whether the job can be expanded to all of Wikipedia (and possibly Commons/Wikidata) without breaking...
  • Streamline data extraction: only extract the fields we use and filter out bots from the start
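
As a rough illustration of the sliding-window and anomaly-check ideas above, here is a minimal sketch in plain Python. It is not taken from the existing pipeline: the function names, the metric names (`num_users`, `file_size_bytes`), the choice to snap the window to the first of the month, and the 20% tolerance are all assumptions made for the example.

```python
from datetime import date


def window_start(run_date: date, years: int = 2) -> date:
    """Start of a sliding window: first day of the run month, `years` years back.

    Snapping to the first of the month is an assumption; it also avoids
    invalid dates such as Feb 29 in a non-leap year.
    """
    return date(run_date.year - years, run_date.month, 1)


def within_tolerance(current: float, previous: float, tolerance: float = 0.2) -> bool:
    """True if `current` differs from `previous` by at most `tolerance` (a fraction)."""
    if previous == 0:
        # No baseline to compare against; only an unchanged zero passes.
        return current == 0
    return abs(current - previous) / abs(previous) <= tolerance


def check_metrics(current: dict, previous: dict, tolerance: float = 0.2) -> list:
    """Return the names of metrics that are missing or outside the tolerance."""
    flagged = []
    for name, prev_value in previous.items():
        cur_value = current.get(name)
        if cur_value is None or not within_tolerance(cur_value, prev_value, tolerance):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Hypothetical metrics from the previous and current monthly runs.
    previous_run = {"num_users": 1100, "file_size_bytes": 1000}
    current_run = {"num_users": 1000, "file_size_bytes": 500}

    print("Gather data from:", window_start(date.today()))
    for metric in check_metrics(current_run, previous_run):
        print(f"WARNING: {metric} is outside tolerance of the previous run")
```

In a real job the flagged metrics would presumably go to the pipeline's log or an alert rather than to stdout, and the tolerance would likely need tuning per metric.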