# Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics

Closed, Resolved | Public | 8 Estimated Story Points

Assigned To
 mforns
Authored By
 • Nuria Feb 12 2019, 5:06 AM (UTC+0)

# Description

While we have alarms on the volume of data processed by eventlogging refine, we would fail to notice the case in which the data has been processed but refined to bogus values due to a bug in the refine process. For example: all pageviews have the same page title, or see https://phabricator.wikimedia.org/T211833 (all user agents, for all requests and all fields, are set to null).

I think issues such as these could be detected (without introspecting every schema) by alarming on the entropy of a given column, especially columns that are always present, like userAgent or country, and that have a known set of possible values. Intuitively, these alarms would measure the "information" in a column using a measure of randomness. In the case of all userAgents being null there is no randomness, and that would be flagged as a problem. It would be a more sophisticated view of the variety of values than the one a "select distinct userAgent" could provide, but the idea is similar.

Entropy for a variable X that can take N distinct values, where P(i) is the probability of value i:

E = - Sum(i from 1 to N) P(i) * log2 P(i)
https://en.wikipedia.org/wiki/Entropy_(information_theory)
Entropy calculation: https://gist.github.com/nuria/3204691aea95b2e6f3c97e3a593dee69
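
To make the formula concrete, here is a minimal Python sketch of the calculation. It is not the code from the gist above; the function names and the "VALUE COUNT" input format (matching lines such as `AR 189`) are assumptions made for illustration.

```python
import math


def parse_counts(lines):
    """Parse 'VALUE COUNT' lines (e.g. 'AR 189') into a dict of value -> count.

    Lines that do not match this shape are skipped (assumed input format).
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = counts.get(parts[0], 0) + int(parts[1])
    return counts


def shannon_entropy(counts):
    """Shannon entropy in bits: -sum(p * log2(p)) over all observed values."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def upper_bound(counts):
    """Largest possible entropy: log2 of the number of distinct values seen."""
    distinct = sum(1 for count in counts.values() if count > 0)
    return math.log2(distinct) if distinct > 0 else 0.0
```

A column where every row holds the same value (for example, all nulls) has entropy 0, while N equally likely values give the maximum of log2(N) bits.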

As an example, I calculated the entropy of geocoded countries on navigation timing for 3 different hours, on different days in 2019.

Data:
select geocoded_data["country_code"], count(*) from navigationtiming where year=2019 and day=21 and hour=01 and month=01 group by geocoded_data["country_code"] limit 10000000;

Produces a series like:

AE 9
AG 1
AL 4
AO 1
AR 189
AT 20
AU 227
AW 1
... etc

So every hour will have a different series.
Entropy for the three hours is pretty constant (entropy is bounded above by the log, base 2, of the number of distinct values):
nuria@stat1007:~/workplace/entrophy$ python calculate_entropy.py data1.txt
Entropy: 4.32724679877 Upper bound 7.20945336563 :
nuria@stat1007:~/workplace/entrophy$ python calculate_entropy.py data2.txt
Entropy: 4.49219034087 Upper bound 7.08746284125 :
nuria@stat1007:~/workplace/entrophy$ python calculate_entropy.py data3.txt
Entropy: 4.06342383136 Upper bound 7.09803208296 :

So an alarm that looks for a deviation from about 4, plus or minus some amount (maybe one standard deviation; we will need to determine it empirically), would detect, for example, the issue of us failing to geolocate a number of countries.
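
A rough sketch of such an alarm, assuming the tolerance is taken as a multiple of the standard deviation of past entropy values. The function name, the `num_stdevs` parameter, and the choice to always flag an entropy of zero are illustrative assumptions, not an agreed design.

```python
import statistics


def entropy_is_anomalous(history, latest, num_stdevs=1.0):
    """Return True if `latest` looks anomalous compared to past entropy values.

    An entropy of zero (a single value in the whole column) is always flagged.
    Otherwise the value is flagged when it is more than `num_stdevs` sample
    standard deviations away from the historical mean.
    """
    if latest <= 0.0:
        return True
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    mean = statistics.mean(history)
    tolerance = num_stdevs * statistics.stdev(history)
    return abs(latest - mean) > tolerance


# The three hourly entropy values reported above.
HISTORY = [4.32724679877, 4.49219034087, 4.06342383136]

print(entropy_is_anomalous(HISTORY, 4.3))  # close to the mean
print(entropy_is_anomalous(HISTORY, 2.0))  # far below the usual range
```

With only three data points the band is very narrow: at one standard deviation, the 4.06 value from the history itself would fall just outside it, which is one reason the width has to be determined empirically from more data.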

### Event Timeline

Nuria triaged this task as Medium priority. Feb 12 2019, 5:06 AM
Nuria created this task.

I understand it, yay! And I like it. We could even compute the tolerance from past data once in a while, and use that instead of our guess. That way this approach could grow organically with the data. We should always have some absolute alarms too: for example, if entropy is ever 0, something went wrong. So we could put an entropy(min=0) check on pretty much every column.

Also, what I was talking about is completely orthogonal to this. I wasn't suggesting we introspect schemas, but that we organize how this logic is applied. So if, for example, we implement your entropy calculation as a UDF, then there should be a config file somewhere that maps columns to the quality checker(s) being applied. For example:

```
QualityChecks:
- *All*:
- country:
- Entropy:
min: 2
max: 6
- userAgent:
- Distinct
- country:
- Entropy:
min: 3
max: 5
```

I imagine we would add entropy-stats tables generated hourly (for hourly datasets). The entropy-generation code could (and should!) be generic and reusable, and the alarming mechanism as well I guess.
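
As a sketch of how a column-to-checker mapping could drive generic checking code, here is a small plain-Python illustration (not a Hive UDF). Everything in it is hypothetical: the `QUALITY_CHECKS` structure, the parameter names, and the meaning of the `Distinct` check, which is assumed here to require at least two distinct values.

```python
import math


def entropy(values):
    """Shannon entropy in bits of a list of column values."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_in_range(values, min_bits=0.0, max_bits=math.inf):
    """Pass when the column's entropy lies within [min_bits, max_bits]."""
    return min_bits <= entropy(values) <= max_bits


def enough_distinct(values, min_distinct=2):
    """Pass when the column holds at least `min_distinct` different values."""
    return len(set(values)) >= min_distinct


# Registry of available checkers, keyed by the name used in the config.
CHECKERS = {"Entropy": entropy_in_range, "Distinct": enough_distinct}

# Hypothetical config: column name -> list of (checker name, parameters).
QUALITY_CHECKS = {
    "country": [("Entropy", {"min_bits": 3, "max_bits": 5})],
    "userAgent": [("Distinct", {})],
}


def run_checks(rows, quality_checks=QUALITY_CHECKS):
    """Return the (column, checker name) pairs that failed for these rows."""
    failures = []
    for column, checks in quality_checks.items():
        values = [row.get(column) for row in rows]
        for name, params in checks:
            if not CHECKERS[name](values, **params):
                failures.append((column, name))
    return failures
```

Keeping the checkers in a registry means a new kind of check only needs a function and a config entry, without touching the loop that applies them.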

Milimetric renamed this task from Coarse alarm on data quality for refined data based on entrophy calculations to Coarse alarm on data quality for refined data based on entropy calculations. Feb 21 2019, 9:50 PM
Nuria renamed this task from Coarse alarm on data quality for refined data based on entropy calculations to Coarse alarm on data quality for refined data based on entrophy calculations. Feb 22 2019, 4:52 AM
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 516647 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] Add entropy UDAF to refinery-hive

https://gerrit.wikimedia.org/r/516647

I added a design document here: https://docs.google.com/document/d/1gL7igq1AtsbZZL_5lQrAE7ak30lYrhXPPz1s-fdZREM
The questions that I think are still open are marked in orange.
Please, feel free to comment and modify!

Change 517620 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add oozie code for data quality metrics

https://gerrit.wikimedia.org/r/517620

Change 516647 merged by Nuria:
[analytics/refinery/source@master] Add entropy UDAF to refinery-hive

https://gerrit.wikimedia.org/r/516647

Change 518069 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::data_purge add deletion for data_quality_hourly

https://gerrit.wikimedia.org/r/518069

@fgiunchedi hi!

We are considering using graphite or prometheus (or any other WMF monitoring system) for this task.
We have a set of data quality timeline measurements in a table in Hive, and would like to send them to graphite (or make prometheus pull them).
This way we could have dashboards on data quality measurements.
The ultimate goal of this is to apply some forecasting/anomaly detection algorithms available in those monitoring systems (e.g. Holt-Winters)
so that we can set up alarms (icinga?) that notify us whenever the last data point of such measurements is outside the expected normality range.

But we are not sure if that's feasible. What do you think? If yes, how should we approach it?

Thanks!

Change 517620 merged by Nuria:
[analytics/refinery@master] Add oozie code to calculate entropy values for data quality alarms

https://gerrit.wikimedia.org/r/517620

Hi @mforns, thanks for reaching out!
From the design document my understanding is that these metrics will be calculated periodically by a job, IOW there isn't a daemon that always runs and could present the metrics when Prometheus wants to pull them? For short lived jobs/processes the easiest strategy at the moment is still to go with statsd (and thus graphite, after statsd aggregation occurs), additionally if you'd like to have the metrics in Prometheus as well there is a "bridge" between statsd and Prometheus we've been deploying (statsd_exporter).

For additional context, the "Prometheus native" solution would be to use a Prometheus client but push metrics to Prometheus pushgateway, there's some caveats as explained in the README.md and we (SRE) haven't deployed the pushgateway yet, though I think we'll have to in the future exactly for use cases like you described.

HTH!

@fgiunchedi thanks a lot for the help!

> From the design document my understanding is that these metrics will be calculated periodically by a job, IOW there isn't a daemon that always runs and could present the metrics when Prometheus wants to pull them?

Correct, the metrics are calculated periodically by a hadoop job, and there's no daemon that can serve prometheus pulls.

> For short lived jobs/processes the easiest strategy at the moment is still to go with statsd (and thus graphite, after statsd aggregation occurs), additionally if you'd like to have the metrics in Prometheus as well there is a "bridge" between statsd and Prometheus we've been deploying (statsd_exporter).

Cool! That seems like an option.

> For additional context, the "Prometheus native" solution would be to use a Prometheus client but push metrics to Prometheus pushgateway, there's some caveats as explained in the README.md and we (SRE) haven't deployed the pushgateway yet, though I think we'll have to in the future exactly for use cases like you described.

I read the README caveats that you mention, and I think there could be problems with this approach, because:

• We calculate the metrics in an hourly (or daily) resolution, so one measurement per hour (day). And the README says if 5 minutes pass without any push, then Prometheus will think the metric doesn't exist any more. Would that also happen with graphite?
• We calculate the metrics with a couple-hours lag because of data processing time. Say, at 16:15h we calculate the metrics for 12:00-13:00, and we'd like the metric time-series to be coherent with the database times. AFAICS, you cannot push measurements to prometheus pushgateway with timestamps, so I believe that would be an issue. Right? Does statsd allow pushing measurements with lagged timestamps?

Maybe, if you have time, we could discuss this quickly via google meet? In a 20/30-min meeting? We could invite Luca to that.

Change 518069 merged by Elukey:
[operations/puppet@production] analytics::refinery::job::data_purge add deletion for data_quality_hourly

https://gerrit.wikimedia.org/r/518069

> @fgiunchedi thanks a lot for the help!
>
> > From the design document my understanding is that these metrics will be calculated periodically by a job, IOW there isn't a daemon that always runs and could present the metrics when Prometheus wants to pull them?
>
> Correct, the metrics are calculated periodically by a hadoop job, and there's no daemon that can serve prometheus pulls.

Ok!

> > For short lived jobs/processes the easiest strategy at the moment is still to go with statsd (and thus graphite, after statsd aggregation occurs), additionally if you'd like to have the metrics in Prometheus as well there is a "bridge" between statsd and Prometheus we've been deploying (statsd_exporter).
>
> Cool! That seems like an option.
>
> > For additional context, the "Prometheus native" solution would be to use a Prometheus client but push metrics to Prometheus pushgateway, there's some caveats as explained in the README.md and we (SRE) haven't deployed the pushgateway yet, though I think we'll have to in the future exactly for use cases like you described.
>
> I read the README caveats that you mention, and I think there could be problems with this approach, because:
>
> • We calculate the metrics in an hourly (or daily) resolution, so one measurement per hour (day). And the README says if 5 minutes pass without any push, then Prometheus will think the metric doesn't exist any more. Would that also happen with graphite?

The README says that if Prometheus itself doesn't see a metric for 5 minutes it'll think it is stale; however, a metric pushed to the pushgateway will stay there until deleted, so Prometheus will never think the metric is stale when it pulls metrics from the pushgateway. With Graphite / statsd you push the metric and that's it; if there are no datapoints, the metric will have holes where there haven't been pushes.

> • We calculate the metrics with a couple-hours lag because of data processing time. Say, at 16:15h we calculate the metrics for 12:00-13:00, and we'd like the metric time-series to be coherent with the database times. AFAICS, you cannot push measurements to prometheus pushgateway with timestamps, so I believe that would be an issue. Right? Does statsd allow pushing measurements with lagged timestamps?

statsd itself doesn't allow timestamps, however the graphite protocol does. The difference is that statsd metrics get aggregated and flushed to graphite periodically (every 60s), so if, say, there were 100 statsd measurements for the same metric in a minute, that would result in a single aggregated datapoint written to graphite. I can see how you'd like to have data points aligned with database times, but yeah, that's not possible with Prometheus (i.e. attaching timestamps, with or without pushgateway).

> Maybe, if you have time, we could discuss this quickly via google meet? In a 20/30-min meeting? We could invite Luca to that.

For sure, feel free to send an invite next week!

> The README says that if Prometheus itself doesn't see a metric for 5 minutes it'll think it is stale; however, a metric pushed to the pushgateway will stay there until deleted, so Prometheus will never think the metric is stale when it pulls metrics from the pushgateway. With Graphite / statsd you push the metric and that's it; if there are no datapoints, the metric will have holes where there haven't been pushes.

Oh, I see! Thanks for the clarification.

> statsd itself doesn't allow timestamps, however the graphite protocol does. The difference is that statsd metrics get aggregated and flushed to graphite periodically (every 60s), so if, say, there were 100 statsd measurements for the same metric in a minute, that would result in a single aggregated datapoint written to graphite. I can see how you'd like to have data points aligned with database times, but yeah, that's not possible with Prometheus (i.e. attaching timestamps, with or without pushgateway).

As neither Prometheus nor statsd accepts timestamps, it seems that the best option is to push directly to graphite.
We don't need statsd's ability to aggregate by the minute, because the measurements are so sparse. OK, cool!

> > Maybe, if you have time, we could discuss this quickly via google meet? In a 20/30-min meeting? We could invite Luca to that.
>
> For sure, feel free to send an invite next week!

I'll be on vacation for the coming weeks. But if necessary I will set up a meeting when I'm back :]
Thanks a lot!
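
For reference, pushing directly to graphite with an explicit timestamp could look roughly like the sketch below. It uses Graphite's plaintext protocol, where each line is `<metric path> <value> <epoch seconds>`. The metric path is made up for illustration, the host must be supplied by the caller, and port 2003 is only Carbon's usual default for the plaintext listener, so the actual WMF setup may differ.

```python
import socket
from datetime import datetime, timezone


def format_graphite_line(metric_path, value, measured_at):
    """Build one plaintext-protocol line: '<path> <value> <epoch seconds>\\n'.

    `measured_at` should be a timezone-aware datetime for the hour the
    measurement describes, not the time the job happened to run.
    """
    epoch_seconds = int(measured_at.timestamp())
    return f"{metric_path} {value} {epoch_seconds}\n"


def send_to_graphite(line, host, port=2003, timeout=5.0):
    """Send one line to a Carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("ascii"))


# Example: an entropy value computed at 16:15 for the 12:00-13:00 hour is
# stamped with 12:00 so the series lines up with the database times.
line = format_graphite_line(
    "data_quality.navigationtiming.country.entropy",  # hypothetical path
    4.33,
    datetime(2019, 6, 20, 12, 0, tzinfo=timezone.utc),
)
# send_to_graphite(line, host="graphite.example.org")  # hypothetical host
```

Because each data point carries its own timestamp, the couple-of-hours processing lag does not shift the series.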

Change 541557 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] Add spark job to generate a data quality report

https://gerrit.wikimedia.org/r/541557

Nuria raised the priority of this task from Medium to High. Oct 9 2019, 11:27 PM
mforns renamed this task from Coarse alarm on data quality for refined data based on entrophy calculations to Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics. Oct 15 2019, 10:19 AM

I still need to post the results of the proof of concept to Wikitech.

Nuria set the point value for this task to 8.

Change 541557 abandoned by Mforns:
Add spark job to generate a data quality report

Reason:
This is no longer valid after data quality alarms discussions and new direction.

https://gerrit.wikimedia.org/r/541557