Coarse alarm on data quality for refined data based on entropy calculations
Open, NormalPublic

Description

While we have alarms on the volume of data processed by EventLogging refine, we would fail to notice the case in which the data has been processed but set to bogus values due to a bug in the refine process. For example: all pageviews have the same page title, or see https://phabricator.wikimedia.org/T211833 (the user agent fields for all requests are set to null).

Detecting issues such as these (without introspecting every schema) could, I think, be done by alarming on the entropy of a given column, especially columns that are always present, like userAgent or Country, which have a bounded set of possible values. Intuitively, these alarms would measure the "information" in a column using a measure of randomness. In the case of all userAgents being null there is no randomness, and that would be flagged as a problem. It would be a more sophisticated view of the variety of values than a "select distinct userAgent" could provide, but the idea is similar.

Entropy for a variable X that can take N values, where P(i) is the probability of value i:

E = - Sum(i = 1 to N) P(i) * log2(P(i))
https://en.wikipedia.org/wiki/Entropy_(information_theory)
Entropy calculation: https://gist.github.com/nuria/3204691aea95b2e6f3c97e3a593dee69
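The linked gist has the full script; as a rough sketch (not the gist itself), the calculation over a column's raw value counts is just:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw value counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([1000]))    # a single value for every row: ~0 bits, no information
print(entropy([10] * 8))  # uniform over 8 values: log2(8) = 3.0 bits, the maximum
```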

As an example, I calculated the entropy of geocoded countries in NavigationTiming for three different hours, on different days in 2019.

Data:
select geocoded_data["country_code"], count(*) from navigationtiming where year=2019 and day=21 and hour=01 and month=01 group by geocoded_data["country_code"] limit 10000000;

Produces a series like:

AE 9
AG 1
AL 4
AO 1
AR 189
AT 20
AU 227
AW 1
... etc

So every hour will have a different series.
Entropy for the three hours is pretty constant (entropy is bounded above by log2 of the number of distinct values):
nuria@stat1007:~/workplace/entrophy$ python calculate_entropy.py data1.txt
Entrophy: 4.32724679877 Upper bound 7.20945336563 :
nuria@stat1007:~/workplace/entrophy$ python calculate_entropy.py data2.txt
Entrophy: 4.49219034087 Upper bound 7.08746284125 :
nuria@stat1007:~/workplace/entrophy$ python calculate_entropy.py data3.txt
Entrophy: 4.06342383136 Upper bound 7.09803208296 :

So an alarm that looks for a deviation from 4 plus or minus some amount (maybe one standard deviation; we will need to determine it empirically) would detect, for example, the issue of us failing to geolocate requests from a number of countries.
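Such an alarm could be sketched as below, using the three hourly entropies above as the history (the function name and the one-standard-deviation tolerance are placeholders; the tolerance is exactly what we'd need to tune empirically):

```python
import statistics

def entropy_alarm(current, history, n_std=1.0):
    """Flag if the current hour's entropy deviates from the historical mean
    by more than n_std standard deviations. The tolerance is a guess and
    would be determined empirically from past data."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return abs(current - mean) > n_std * std

# Hourly entropies of the country column from the three runs above.
history = [4.327, 4.492, 4.063]
print(entropy_alarm(4.2, history))  # within tolerance -> False
print(entropy_alarm(0.0, history))  # all values identical -> True
```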

Nuria created this task.Tue, Feb 12, 5:06 AM
Nuria triaged this task as Normal priority.
elukey added a subscriber: elukey.Tue, Feb 12, 7:55 AM

I understand it, yay! And I like it. We could even compute the tolerance from past data once in a while, and use that instead of our guess. That way this approach could grow organically with the data. We should always have some absolute alarms, like: if entropy is ever 0, something went wrong. So we could put an entropy(min=0) check on pretty much every column.

Also, what I was talking about is completely orthogonal to this. I wasn't suggesting we introspect schemas, but that we organize how this logic is applied. So if, for example, we implement your entropy calculation as a UDF, then there should be a config file somewhere that maps columns to the quality checker(s) being applied. For example:

QualityChecks:
    - *All*:
        - country:
            - Entropy:
                min: 2
                max: 6
        - userAgent:
            - Distinct
    - NavigationTiming:
        - country:
            - Entropy:
                min: 3
                max: 5
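Purely as an illustration of that mapping (the structure mirrors the config above, but the lookup function and the override semantics, schema-specific entries beating the *All* defaults, are my assumption, not anything decided):

```python
# Hypothetical in-memory form of the quality-check config above.
CHECKS = {
    "*All*": {
        "country": [("Entropy", {"min": 2, "max": 6})],
        "userAgent": [("Distinct", {})],
    },
    "NavigationTiming": {
        "country": [("Entropy", {"min": 3, "max": 5})],
    },
}

def checks_for(schema, column):
    """Return the checks to apply to a column; a schema-specific entry
    overrides the *All* default for that column."""
    specific = CHECKS.get(schema, {})
    if column in specific:
        return specific[column]
    return CHECKS.get("*All*", {}).get(column, [])

print(checks_for("NavigationTiming", "country"))    # [('Entropy', {'min': 3, 'max': 5})]
print(checks_for("NavigationTiming", "userAgent"))  # falls back to [('Distinct', {})]
```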
Tbayer added a subscriber: Tbayer.Thu, Feb 14, 1:28 AM

I imagine we would add entropy-stats tables generated hourly (for hourly datasets). The entropy-generation code could (and should!) be generic and reusable, and the alarming mechanism as well I guess.