
Improve robustness of data processing pipeline
Closed, Declined (Public)

Description

Nothing pressing but this task serves as a placeholder for a few possible improvements:

  • Change the fixed start date for data gathering to a sliding window that, e.g., pulls just the past two years' worth of data
  • Experiment to find the right Spark settings (regular vs. large, # of partitions, etc.)
  • Improve logging to watch for anomalies -- e.g., verifying that # of users, size of files, etc. are within some tolerance of the previous month's run
  • Explore whether the job can be expanded to all of Wikipedia (and possibly Commons/Wikidata) without breaking...
  • Streamlining of data extraction: only extract the fields we use and filter out bots from the start
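
A rough sketch of the first and third ideas above (sliding window and tolerance checks), in case it helps with scoping. Everything here is an assumption for illustration: the function names, the metric names (`users`, `file_bytes`), and the 20% default tolerance are hypothetical and not taken from the actual pipeline.

```python
from datetime import date


def sliding_window_start(run_date: date, years: int = 2) -> date:
    """Return the date `years` before `run_date` (Feb 29 is clamped to Feb 28)."""
    try:
        return run_date.replace(year=run_date.year - years)
    except ValueError:
        # run_date is Feb 29 and the target year is not a leap year.
        return run_date.replace(year=run_date.year - years, day=28)


def within_tolerance(current: float, previous: float, tolerance: float = 0.2) -> bool:
    """True if `current` differs from `previous` by at most `tolerance` (relative)."""
    if previous == 0:
        return current == 0
    return abs(current - previous) / abs(previous) <= tolerance


def find_anomalies(current: dict, previous: dict, tolerance: float = 0.2) -> list:
    """Names of metrics that are missing from this run or outside the tolerance."""
    return [
        name
        for name, prev in previous.items()
        if name not in current or not within_tolerance(current[name], prev, tolerance)
    ]


# Hypothetical metrics from last month's run and this month's run.
last_month = {"users": 1050, "file_bytes": 100}
this_month = {"users": 1000, "file_bytes": 50}
flagged = find_anomalies(this_month, last_month)
```

The start date would be computed from the run date at launch rather than hard-coded, and any flagged metric names could be logged as warnings or used to fail the run, depending on how strict the check should be.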

Event Timeline

All: feel free to add ideas / break this down into subtasks etc.

Adding Platform Engineering, since Platform Team Workboards (Green) was archived and open tasks should have an active project tag.

@fkaelin please review this task, work with Isaac to identify priority on our end, and submit the task(s) to the relevant teams as our request. I assign this task to you and feel free to change the assignment when you're done. Thanks!

After consulting with Isaac, I am closing this task as declined; we can reopen it in the future if needed.