Unit testing (& CI) for dashboard data retrieval scripts
Closed, Declined · Public · 10 Story Points

Description

Two of our (Discovery's Analysis team's) products are the Discovery Dashboards and the data that those dashboards use (e.g. https://datasets.wikimedia.org/aggregate-datasets/portal/portal_pageviews.tsv). Our "golden (retriever)" repository is a collection of R scripts that run MySQL and Hive queries to fetch data, perform additional post-processing steps, and write the data out to the various directories in aggregate-datasets (portal/, external_traffic/, search/, maps/, and wdqs/).

When we make patches to golden (e.g. changes to existing code or new scripts for fetching new data), we should have a better system for checking and verifying those changes than manually running subsets of the code in an SSH session. Even then, there might be mistakes we don't catch: for example, when we expect the date format "YYYY-MM-DD" but a missing step causes dates to be written as "YYYYMMDD", or when we store the fetched data in an object called "data" but then try to write a nonexistent object to disk. These are small mistakes that would be caught by continuous integration and unit tests for data.

The goal of this task is two-fold:

  • To refactor golden to have data quality control that checks for:
    • Formats and types (e.g. character when expecting numeric)
    • Acceptable values
    • Whether the data can be written to disk (e.g. are we trying to append data that has columns that the existing dataset file doesn't have?)
    • Whether data can be backfilled (currently, there's an unknown issue that prevents us from backfilling the way we hope to, so we must do it manually)
  • To implement CI so that patches to golden undergo an automated build like other WMF products have (e.g. MediaWiki, Wikipedia.org Portal, and Analytics Refinery Source).

Note that I don't actually know whether it's possible for us to have a "jenkins-bot" that can execute R code on stat1002, but if it's not possible then perhaps we can hack together an alternative solution.
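
As a concrete illustration of the kind of check CI could run on each patch, here is a minimal testthat sketch that would catch the "YYYYMMDD" date-format mistake described above. The test data here is a stand-in vector, not output from the actual golden fetch scripts:

```r
# Hypothetical unit test; in golden, the dates would come from one of the
# actual fetch scripts rather than this stand-in vector.
library(testthat)

test_that("fetched dates are formatted as YYYY-MM-DD", {
  dates <- c("2016-09-01", "2016-09-02")  # stand-in for real fetched data
  expect_true(all(grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", dates)))
})
```

A Jenkins job (or whatever alternative we hack together) could then run something like `Rscript -e 'testthat::test_dir("tests")'` against each patch.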

mpopov created this task. · Sep 12 2016, 8:05 PM
Restricted Application added a subscriber: Aklapper. · Sep 12 2016, 8:05 PM
mpopov updated the task description. · Sep 12 2016, 8:09 PM
debt triaged this task as Normal priority. · Sep 22 2016, 8:19 PM
debt moved this task from Needs triage to Later on the Discovery-Analysis board.

Note for future self: assertr 2.0 was just released on CRAN. From http://www.onthelambda.com/2017/03/20/data-validation-with-the-assertr-package/:

  • For every element in a column, you want to make sure it fits certain criteria. Examples of this strain of error checking would be to make sure every element is a valid credit card number, or fits a certain regex pattern, or represents a date between two other dates. assertr calls this verb assert.
  • For every element in a column, you want to make sure certain criteria are met, but the criteria can only be decided after looking at the entire column as a whole. For example, testing whether each element is within n standard deviations of the mean requires computation on the elements beforehand to inform the criteria to check for. assertr calls this verb insist.
  • For every row of a dataset, you want to make sure certain assumptions hold. Examples include ensuring that no row has more than n number of missing values, or that a group of columns are jointly unique and never duplicated. assertr calls this verb assert_rows.
  • For every row of a dataset, you want to make sure certain assumptions hold, but the criteria can only be decided after looking at the entire dataset as a whole. This closely mirrors the distinction between assert and insist, but for entire rows (not individual elements). An example would be checking that the Mahalanobis distance between each row and all other rows is within n standard deviations of the mean distance. assertr calls this verb insist_rows.
  • You want to check some property of the dataset as a whole object. Examples include making sure the dataset has more than n columns, making sure the dataset has some specified column names, etc… assertr calls this last verb verify.
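
The five verbs above could be chained into one of golden's fetch scripts before the write step. Below is a minimal sketch, assuming a hypothetical pageviews data frame; the column names and the 3-standard-deviation threshold are illustrative, not taken from the actual repository:

```r
# Hypothetical sketch of assertr-based quality control for a fetched
# dataset; column names and thresholds are illustrative only.
library(assertr)
library(magrittr)

pageviews <- data.frame(
  date = as.Date("2016-09-01") + 0:9,
  pageviews = c(1200, 1350, 1100, 1500, 1250, 1300, 1450, 1400, 1150, 1280)
)

pageviews %>%
  verify(has_all_names("date", "pageviews")) %>%  # dataset-level property
  assert(not_na, date, pageviews) %>%             # per-element: no missing values
  assert(within_bounds(0, Inf), pageviews) %>%    # per-element: counts are non-negative
  insist(within_n_sds(3), pageviews) %>%          # column-informed: no extreme outliers
  assert_rows(num_row_NAs, within_bounds(0, 0), date, pageviews)  # per-row: no NAs
```

On success, each verb passes the data through unchanged (so the chain would continue on to the write step); on failure, assertr halts with a description of the offending rows — exactly the class of error we currently only catch by manually inspecting output on stat1002.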

Change 362107 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/polloi@master] Fix spline smoothing and add tests

https://gerrit.wikimedia.org/r/362107

debt closed this task as Declined. · Sep 21 2017, 9:55 PM
debt added a subscriber: debt.

I think we've gotten to where we need to go for this ticket. It'd be great to continue but it's not a priority at this time.