As a consumer of analytics datasets, I want invalid data to be recognized quickly so that deployed features are backed by good data.
We have a number of datasets, both intermediate and final, being generated in the analytics cluster. Today we have no systematic way of verifying that something dramatic hasn't changed week to week. We should review best practices around validating data correctness and come up with a process that can be applied more or less generically to our pipelines.
General idea:
- Apply heuristics such as comparing data size against last week's run. The heuristic should vary by data source: something like glent.m0prep should grow every week, while something like mjolnir.query_clicks_ltr should be within a few percent of the previous week's run.
- Have the ability to check data size across multiple dimensions, for example counts per wiki.
- Should be configuration-driven, so that a single script can be configured to support the needs of different datasets (see the sketch after this list).
- Figure out how to attach these validation jobs at each step of the Airflow DAG (see the second sketch below).
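
A minimal sketch of what a configuration-driven check might look like. The dataset names mirror the examples above; the check kinds, thresholds, and the `row_count()` helper are placeholders for whatever we actually use to count rows (Hive, Spark, HDFS sizes):

```python
# Hypothetical: a single script driven by per-dataset configuration.
CHECKS = {
    # glent.m0prep accumulates history, so it should only ever grow.
    'glent.m0prep': {'kind': 'monotonic_growth'},
    # mjolnir.query_clicks_ltr should be roughly stable week to week.
    'mjolnir.query_clicks_ltr': {'kind': 'relative_change', 'max_pct': 5.0},
}


def row_count(table: str, partition: str) -> int:
    """Placeholder: return the row count for one partition of a table."""
    raise NotImplementedError


def check_dataset(table: str, this_week: str, last_week: str) -> bool:
    """Apply the configured heuristic to the current vs. previous run."""
    conf = CHECKS[table]
    current = row_count(table, this_week)
    previous = row_count(table, last_week)
    if conf['kind'] == 'monotonic_growth':
        return current >= previous
    if conf['kind'] == 'relative_change':
        pct = abs(current - previous) / max(previous, 1) * 100
        return pct <= conf['max_pct']
    raise ValueError(f'unknown check kind: {conf["kind"]}')
```

The per-wiki dimension could reuse the same shape, keying the configuration and counts by (table, wiki) instead of table alone.

And a rough idea of how such a check could be attached to the Airflow DAG, as a validation task between the step that produces a dataset and its downstream consumers (assumes Airflow 2.x; `attach_validation` and its wiring are only illustrative, and `check_dataset()` comes from the sketch above):

```python
# Hypothetical wiring: produce >> validate >> consume, so a failed heuristic
# check stops bad data from propagating downstream.
from airflow.operators.python import PythonOperator


def _validate(table, this_week, last_week):
    # Raise so Airflow marks the task (and the run) as failed.
    if not check_dataset(table, this_week, last_week):
        raise ValueError(f'{table}: heuristic check failed for {this_week}')


def attach_validation(dag, produce_task, consume_task, table):
    """Insert a validation task between producer and consumer for one dataset."""
    validate = PythonOperator(
        task_id=f'validate_{table.replace(".", "_")}',
        python_callable=_validate,
        op_kwargs={
            'table': table,
            'this_week': '{{ ds }}',
            'last_week': '{{ macros.ds_add(ds, -7) }}',
        },
        dag=dag,
    )
    produce_task >> validate >> consume_task
    return validate
```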
Bonus points:
For privacy reasons we don't keep old datasets around, so it would be nice if, in addition to the automated checks, the various metrics were recorded somewhere. We could split the checking process into two parts: the first part creates a document describing the dataset and inserts it into elasticsearch (soon™ relforge will live in the analytics network); the second part could either query elasticsearch or pull the document(s) for the most recent run(s) and evaluate its heuristics against them. This has the side benefit that we can build kibana dashboards on relforge that display these dataset heuristics.
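
A rough sketch of that two-part split, assuming the elasticsearch-py 8.x client, a placeholder relforge URL, and a hypothetical `dataset-metrics` index; the field names are illustrative:

```python
# Hypothetical two-part flow: (1) describe a dataset run and index the document,
# (2) pull the most recent prior document(s) to evaluate heuristics against.
from elasticsearch import Elasticsearch

es = Elasticsearch(['https://relforge.example.org:9243'])  # placeholder URL
INDEX = 'dataset-metrics'  # hypothetical index name


def record_metrics(table: str, run_date: str, metrics: dict) -> None:
    """Part 1: store one document per dataset run."""
    es.index(index=INDEX, document={
        'table': table,
        'run_date': run_date,
        **metrics,  # e.g. total rows, rows per wiki
    })


def previous_metrics(table: str, n: int = 1) -> list:
    """Part 2: fetch the most recent prior run(s) for comparison."""
    resp = es.search(index=INDEX,
                     query={'term': {'table': table}},
                     sort=[{'run_date': 'desc'}],
                     size=n)
    return [hit['_source'] for hit in resp['hits']['hits']]
```

The same documents would back the kibana dashboards mentioned above, since each run's metrics stay queryable even after the underlying dataset is dropped.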