While we have alarms on the volume of data processed by eventlogging refine, we would fail to notice the case in which the data has been processed but refined to bogus values due to a bug in the refine process. For example: all pageviews have the same page title, or see https://phabricator.wikimedia.org/T211833 (all user agent fields are set to null for all requests).
Detecting issues such as these (without introspecting every schema) could, I think, be done by alarming on the entropy of a given column, especially columns that are always present, like userAgent or Country, and that take values from a known set. Intuitively, these alarms would measure the "information" in a column using a measure of randomness. If all userAgents are null there is no randomness (the entropy is 0), and that would be flagged as a problem. This is a more sophisticated view of the variety of values than a "select distinct userAgent" could provide, but the idea is similar.
Entropy for a variable X that can take N values, where P(i) is the probability of value i:
E = - Sum(i = 1 to N) P(i) * log2 P(i)
Entropy calculation: https://gist.github.com/nuria/3204691aea95b2e6f3c97e3a593dee69
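For illustration, here is a minimal sketch of the formula above. It is not the code in the gist; the function names and the choice to take per-value counts as input (for example, the count(*) column of a group by query) are assumptions made for this example.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy, in bits, of a column described by per-value counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:  # values with zero probability contribute nothing
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def entropy_upper_bound(counts):
    """Largest possible entropy: log2 of the number of distinct values seen."""
    distinct = sum(1 for count in counts if count > 0)
    return math.log2(distinct) if distinct > 0 else 0.0


# Hypothetical counts per country code: one value seen twice, two seen once.
print(shannon_entropy([2, 1, 1]))      # 1.5
print(entropy_upper_bound([2, 1, 1]))  # log2(3), about 1.585

# A column where every row has the same value (e.g. all null) has no randomness.
print(shannon_entropy([4]))            # 0.0
```

The second case is the one the alarm is meant to catch: a single repeated value gives an entropy of 0 regardless of how many rows were processed.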
As an example, I calculated entropy for 3 different hours of geocoded countries in navigation timing, on different days in 2019, using a query like:
select geocoded_data["country_code"], count(*) from navigationtiming where year=2019 and day=21 and hour=01 and month=01 group by geocoded_data["country_code"] limit 10000000;
This produces a series of (country code, count) pairs, so every hour will have a different series.
Entropy for the three hours is pretty constant (entropy is bounded above by the log base 2 of the number of distinct values in the series):
nuria@stat1007:~/workplace/entrophy$ python calculate_entropy.py data1.txt
Entrophy: 4.32724679877 Upper bound 7.20945336563 :
nuria@stat1007:~/workplace/entrophy$ python calculate_entropy.py data2.txt
Entrophy: 4.49219034087 Upper bound 7.08746284125 :
nuria@stat1007:~/workplace/entrophy$ python calculate_entropy.py data3.txt
Entrophy: 4.06342383136 Upper bound 7.09803208296 :
So an alarm that looks for a deviation from roughly 4, plus or minus some amount (might be one standard deviation; we will need to determine it empirically), would detect, for example, the issue of us failing to geolocate a number of countries.
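A rough sketch of what such a check could look like, assuming we keep a history of entropy values for the column from previous hours. The function name, the num_stddevs parameter and its values are hypothetical; the real threshold would have to be determined empirically.

```python
import statistics


def entropy_is_anomalous(current, history, num_stddevs=1.0):
    """Flag `current` if it is more than `num_stddevs` standard deviations
    away from the mean of past entropy values.

    `history` needs at least two values for the sample standard deviation.
    """
    mean = statistics.mean(history)
    stddev = statistics.stdev(history)
    return abs(current - mean) > num_stddevs * stddev


# The three hourly entropy values measured above, used here as the baseline.
history = [4.32724679877, 4.49219034087, 4.06342383136]

# A column where every value is null has entropy 0 and is flagged.
print(entropy_is_anomalous(0.0, history, num_stddevs=3.0))  # True

# A value close to the usual ~4.3 is not flagged.
print(entropy_is_anomalous(4.3, history, num_stddevs=3.0))  # False
```

With only three samples the standard deviation is about 0.22, so a band of one standard deviation is tight enough to flag the third measurement itself (4.06). That suggests the baseline needs more history, or a wider band, before it is used for alarming.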