Two of our (Discovery's Analysis team's) products are the Discovery Dashboards and the data that those dashboards use (e.g. https://datasets.wikimedia.org/aggregate-datasets/portal/portal_pageviews.tsv). Our "golden (retriever)" repository is a collection of R scripts that run MySQL and Hive queries to fetch data, perform additional post-processing steps, and write the data out to the various directories in aggregate-datasets (portal/, external_traffic/, search/, maps/, and wdqs/).
When we make patches to golden (e.g. changes to existing code or new scripts for fetching new data), we need a better system for checking and verifying those changes than manually running subsets of code in an SSH session. Even with manual checks, mistakes can slip through: for example, we expect dates formatted as "YYYY-MM-DD" but a missing step causes them to be written as "YYYYMMDD", or we store the fetched data in an object called "data" but then try to write a nonexistent object to disk. These are small mistakes that would be caught by continuous integration and unit tests for data.
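As a rough sketch of what such a data unit test could look like, the following uses the testthat package to catch exactly the two mistakes described above; the data frame and column names are illustrative stand-ins, not taken from golden itself:

```r
library(testthat)

# A toy stand-in for data fetched by a golden script; in practice this
# would be the output of a fetch + post-processing step.
fetched <- data.frame(
  date = c("2016-01-01", "2016-01-02"),
  pageviews = c(1200, 1350),
  stringsAsFactors = FALSE
)

test_that("dates are 'YYYY-MM-DD', not 'YYYYMMDD'", {
  expect_true(all(grepl("^\\d{4}-\\d{2}-\\d{2}$", fetched$date)))
})

test_that("pageview counts are numeric, not character", {
  expect_true(is.numeric(fetched$pageviews))
})
```

Tests like these could live alongside the fetch scripts and run automatically on every patch.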
The goal of this task is two-fold:
- To refactor golden to have data quality control that checks for:
- Formats and types (e.g. character when expecting numeric)
- Acceptable values
- Whether the data can be written to disk (e.g. are we trying to append data that has columns that the existing dataset file doesn't have?)
- Whether data can be backfilled (currently, there's an unknown issue that prevents us from backfilling the way we hope to, so we must do it manually)
- To implement CI so that patches to golden undergo an automated build like other WMF products have (e.g. MediaWiki, Wikipedia.org Portal, and Analytics Refinery Source).
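To make the first goal concrete, here is a minimal sketch of a pre-write validation function covering the first three checks (formats/types, acceptable values, and column compatibility with the existing file); the function name and column names are hypothetical, and the backfilling check is omitted since the underlying issue there is still unknown:

```r
# Hypothetical validation step run before appending fetched data to an
# aggregate-datasets TSV. Names are illustrative, not from golden.
validate_before_append <- function(new_data, path) {
  # Formats and types: dates must be "YYYY-MM-DD", counts numeric
  stopifnot(is.character(new_data$date),
            all(grepl("^\\d{4}-\\d{2}-\\d{2}$", new_data$date)),
            is.numeric(new_data$pageviews))
  # Acceptable values: pageview counts should be non-negative
  stopifnot(all(new_data$pageviews >= 0))
  # Writability: appended rows must not introduce columns the existing
  # dataset file doesn't have (or drop ones it does)
  if (file.exists(path)) {
    existing_cols <- names(read.delim(path, nrows = 1,
                                      stringsAsFactors = FALSE))
    if (!identical(sort(existing_cols), sort(names(new_data)))) {
      stop("column mismatch between new data and ", path)
    }
  }
  invisible(TRUE)
}
```

A script would call this immediately before its write step, so a bad fetch fails loudly instead of silently corrupting the dataset file.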
Note that I don't actually know whether it's possible for us to have a "jenkins-bot" that can execute R code on stat1002; if it's not possible, then perhaps we can hack together an alternative solution.
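Whatever machine ends up running the build, the CI step itself could be as simple as invoking the test suite from the shell; this assumes a tests/ directory using testthat, which is an assumption about how we'd lay out the repository, not how golden is structured today:

```shell
# Hypothetical CI build step (e.g. a Jenkins job), run from the
# repository root after a patch is uploaded:
Rscript -e 'quit(status = any(as.data.frame(testthat::test_dir("tests/"))$failed > 0))'
```

The exit status lets the build fail (and the patch be flagged) whenever any data test fails.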