Two of our (Discovery's Analysis team's) products are [[ http://discovery.wmflabs.org/ | the Discovery Dashboards ]] and [[ https://meta.wikimedia.org/wiki/Discovery/Analytics#Public_Datasets | the Data ]] that those dashboards use (e.g. https://datasets.wikimedia.org/aggregate-datasets/portal/portal_pageviews.tsv). Our "[[ https://phabricator.wikimedia.org/diffusion/WDGO/ | golden ]] (retriever)" repository is a collection of R scripts that run MySQL and Hive queries to fetch data, perform additional post-processing steps, and write the data out to the various directories in [[ https://datasets.wikimedia.org/aggregate-datasets/ | aggregate-datasets ]] ([[ https://datasets.wikimedia.org/aggregate-datasets/portal/ | portal/ ]], [[ https://datasets.wikimedia.org/aggregate-datasets/external_traffic/ | external_traffic/ ]], [[ https://datasets.wikimedia.org/aggregate-datasets/search/ | search/ ]], [[ https://datasets.wikimedia.org/aggregate-datasets/maps/ | maps/ ]], and [[ https://datasets.wikimedia.org/aggregate-datasets/wdqs/ | wdqs/ ]]).
When we make patches to golden (e.g. changes to existing code or new scripts that fetch new data), we should have a better system for checking and verifying those changes than manually running subsets of the code in an SSH session. Even then, mistakes can slip through: for example, expecting dates in "YYYY-MM-DD" format while [[ https://gerrit.wikimedia.org/r/#/c/310048/ | a missing step ]] causes them to be written as "YYYYMMDD", or storing the fetched data in an object called "data" but then trying to [[ https://gerrit.wikimedia.org/r/#/c/310044/ | write a nonexistent object to disk ]]. These are small mistakes that continuous integration and unit tests for data would catch.
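For instance, a data unit test along these lines would have flagged both of the patches above before they merged (a minimal sketch using the testthat package; the object and column names are hypothetical stand-ins for whatever a given fetch script actually produces):

```lang=r
library(testthat)

# Minimal sketch; `pageviews` and its `date` column are hypothetical
# stand-ins for a fetch script's actual output.
test_that("fetched data is sane before it is written out", {
  # Would catch trying to write an object that was never created:
  expect_true(exists("pageviews"))
  # Would catch "YYYYMMDD" slipping through where "YYYY-MM-DD" is expected:
  expect_true(all(grepl("^\\d{4}-\\d{2}-\\d{2}$", pageviews$date)))
})
```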
The goal of this task is twofold:
- [ ] To refactor **golden** to have data quality control (see the sketch after this list) that checks for:
- Formats and types (e.g. character when expecting numeric)
- Acceptable values
- Whether the data can be written to disk (e.g. are we trying to append data that has columns that the existing dataset file doesn't have?)
- Whether data can be backfilled (currently, there's an unknown issue that prevents us from backfilling the way we hope to, so we must do it manually)
- [ ] To implement CI so that patches to golden undergo an automated build and test run, as patches to other WMF products do (e.g. MediaWiki, Wikipedia.org Portal, and Analytics Refinery Source).
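As a rough idea of what that quality-control layer might look like (the function and its signature are hypothetical; nothing like this exists in golden yet), each fetch script could run its output through a shared checker before writing:

```lang=r
# Hypothetical shared checker; every fetch script would call this on its
# output before appending to a file under aggregate-datasets/.
check_dataset <- function(new_data, output_path, expected_types, acceptable_values = list()) {
  # Formats and types (e.g. character when we expect numeric):
  actual_types <- vapply(new_data, function(x) class(x)[1], character(1))
  if (!identical(unname(actual_types[names(expected_types)]), unname(expected_types))) {
    stop("column types do not match what this script expects")
  }
  # Acceptable values (for columns with an enumerable set of valid values):
  for (column in names(acceptable_values)) {
    if (!all(new_data[[column]] %in% acceptable_values[[column]])) {
      stop("unacceptable values in column '", column, "'")
    }
  }
  # Can the data actually be written/appended? Columns must match the existing file:
  if (file.exists(output_path)) {
    existing_columns <- names(read.delim(output_path, nrows = 1, stringsAsFactors = FALSE))
    if (!identical(names(new_data), existing_columns)) {
      stop("columns do not match the existing dataset file: ", output_path)
    }
  }
  invisible(TRUE)
}
```

A script would then call something like `check_dataset(pageviews, "portal/portal_pageviews.tsv", expected_types = c(date = "character", pageviews = "integer"))` right before writing (again, hypothetical names).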
**Note** that I don't actually know whether it's possible for us to have a "[[ https://wikitech.wikimedia.org/wiki/Jenkins | jenkins-bot ]]" that can execute R code on stat1002, but if it's not possible then perhaps we can hack together an alternative solution.
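Whatever ends up executing the build, the entry point could be as simple as the following (a hypothetical `run_checks.R`, assuming we adopt testthat; Jenkins or a cron-based fallback would run `Rscript run_checks.R` and treat a nonzero exit status as a failed build):

```lang=r
# Hypothetical CI entry point: run every test under tests/ and exit
# nonzero if anything failed, so the build is marked as broken.
results <- as.data.frame(testthat::test_dir("tests", reporter = "summary"))
broken <- sum(results$failed) + sum(results$error)
quit(save = "no", status = if (broken > 0) 1L else 0L)
```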