
Make data quality stats alert only if anomalous metrics change
Open, High, Public

Description

To avoid repeated data quality alarms (the ones we have from Oozie right now),
we should make the data quality job only alert when the list of anomalous metrics changes, for example, when it changes from:

[no anomalous metrics]

to

metric1 45
metric2 23

or when it changes from

metric1 45
metric2 23

to

metric1 45
metric2 23
metric3 65
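Concretely, the decision boils down to comparing the previous and current sets of anomalous metric names and alerting only when they differ. A minimal sketch of that comparison in Python (the actual job is a Spark/Scala pipeline; `should_alert` and the metric names here are hypothetical):

```python
def should_alert(previous, current):
    """Alert only when the set of anomalous metric names changes.

    previous/current map metric name -> deviation, e.g.
    {"metric1": 45, "metric2": 23}. Only the names decide whether to
    alert; the deviations are what gets reported in the email.
    """
    return set(previous) != set(current)

# No anomalies -> two anomalies: alert.
assert should_alert({}, {"metric1": 45, "metric2": 23})
# The same two anomalies persist: stay quiet.
assert not should_alert({"metric1": 45, "metric2": 23},
                        {"metric1": 45, "metric2": 23})
# A third metric becomes anomalous: alert again.
assert should_alert({"metric1": 45, "metric2": 23},
                    {"metric1": 45, "metric2": 23, "metric3": 65})
```

Whether a change in deviation alone (same metrics, different values) should re-alert is an open design choice; the sketch above ignores deviations for the alert decision.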

Related Objects

Event Timeline

@ssingh Hey! Do you have bandwidth to work on this in the end? We have had more ideas that might turn this into an easier task.

Yes, sure! I think I can take care of the systemd timer part.

@ssingh Oh, cool. :]
Maybe we can even leave the spark job running as is now (with very few changes on our side),
and in the systemd timer job, just check for the existence of an anomaly file.
The spark job writes the anomalies file under the directory: hdfs://analytics-hadoop/tmp/analytics/anomalies/
Now we'll have to think of a way to parse the file name, because the file name itself encodes which metric was anomalous, i.e.:
${source_table}-${query_name}-${granularity}-${year}-${month}-${day}-${hour}
Feel free to ping me whenever you tackle this!
Thanks!
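Parsing that file name pattern could be as simple as splitting on the hyphens. A hedged sketch in Python, assuming the individual fields contain no hyphens themselves (the example file name below is made up):

```python
def parse_anomaly_filename(name):
    """Split an anomaly file name of the form
    ${source_table}-${query_name}-${granularity}-${year}-${month}-${day}-${hour}
    into its fields. Assumes the individual fields contain no hyphens;
    if they can, the four date parts should be split off from the
    right first (str.rsplit) before splitting the rest.
    """
    parts = name.split("-")
    if len(parts) != 7:
        raise ValueError("unexpected anomaly file name: %r" % name)
    keys = ("source_table", "query_name", "granularity",
            "year", "month", "day", "hour")
    return dict(zip(keys, parts))

# Hypothetical example name:
info = parse_anomaly_filename("webrequest-traffic_by_country-hourly-2021-02-25-13")
assert info["granularity"] == "hourly"
assert (info["year"], info["hour"]) == ("2021", "13")
```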

@ssingh @elukey

I've been looking into this a bit and have had some second thoughts.

Current approach:

  • Oozie sends emails whenever the script runs and finds anomalies. That means, more often than not, that alerts are repeated every hour/day that the anomaly is ongoing. <- Annoying
  • The anomalous metric (in our case the country whose traffic is anomalous) and the corresponding deviation are present in an attached file in the email. <- Nice

Icinga approach:

  • Icinga sends emails and/or pages when the script's exit status changes from 0 to non-0, or vice versa (recovery). Would not repeat unnecessary alerts. <- Nice
  • The anomalous metric and the corresponding deviation cannot be attached to the email. The "alertee" needs to track down the alert (possibly ssh-ing into a stats machine) to get that data. <- Annoying
  • After a deeper look, this idea would need some significant changes in the data quality pipeline. <- Inconvenient at the moment

So, what if we use the current anomaly file as state? Currently, every time the script detects an anomaly, it writes a file to HDFS. With some changes to the job, we could have the same script read the previous file, if it exists, and alarm only when the anomaly is new. With this approach:

  • Oozie would send emails only for the new anomalies (no repeated alerts). <- Nice
  • The anomalous metric (in our case the country whose traffic is anomalous) and the corresponding deviation would still be present in an attached file in the email. <- Nice
  • A little less work than the Icinga option <- Nice :]

Notes:

  • Not sure the Icinga PROBLEM vs RECOVERY approach makes sense with time series anomaly detection. I believe the RECOVERY alert would carry no valuable information, since it will usually happen when the RSVD algorithm "gets used" to the new time series tendency. Also, Icinga might sometimes skip true positives: e.g. there's a slightly high value one day (which gives a non-0 exit code), and the next day there's a very low value (also non-0 exit), which would be ignored IIUC.
  • Icinga's ability to send pages is desirable. However, if the "alertee" receives a page, but can not know what metric or what deviation it has until they have access to a computer, the page becomes less useful. Plus, I believe for now, these alarms are not that critical.
  • If we stick with Oozie, I think there's a way to add the affected metrics and their deviation in the email text.

Thoughts?

@ssingh @elukey

I've been looking into this a bit and have had some second thoughts.

Thanks for summarizing this, @mforns!

Current approach:

  • Oozie sends emails whenever the script runs and finds anomalies. That means, more often than not, that alerts are repeated every hour/day that the anomaly is ongoing. <- Annoying
  • The anomalous metric (in our case the country whose traffic is anomalous) and the corresponding deviation are present in an attached file in the email. <- Nice

Yes I agree, I feel the current setup is nice and is serving our purpose well, other than the repeated alarms. The other benefit of email here is that you don't have to read it in real-time and the results are shared with multiple people.

Icinga approach:

  • Icinga sends emails and/or pages when the script's exit status changes from 0 to non-0, or vice versa (recovery). Would not repeat unnecessary alerts. <- Nice

That's true.

  • The anomalous metric and the corresponding deviation cannot be attached to the email. The "alertee" needs to track down the alert (possibly ssh-ing into a stats machine) to get that data. <- Annoying

Can we customize the Icinga notification so that this information can be added there and therefore be part of the email as well (through VictorOps)? I am guessing @elukey knows about the Icinga part.

  • After a deeper look, this idea would need some significant changes in the data quality pipeline. <- Inconvenient at the moment

Yes, this requires more work than our current setup, which just works :) I do see the benefit of the Icinga approach that the alarms would resolve themselves, but given the nature of how we investigate this stuff, an email is probably better. This means that if we do go with Icinga, we would in addition need an email report, unless I am missing something about the current setup where this would happen automatically.

So, what if we use the current anomaly file as state? Currently, every time the script detects an anomaly, it writes a file to HDFS. With some changes to the job, we could have the same script read the previous file, if it exists, and alarm only when the anomaly is new. With this approach:

  • Oozie would send emails only for the new anomalies (no repeated alerts). <- Nice
  • The anomalous metric (in our case the country whose traffic is anomalous) and the corresponding deviation would still be present in an attached file in the email. <- Nice
  • A little less work than the Icinga option <- Nice :]

How much work do you think this is, given that this will fix this one issue we have with the otherwise acceptable current approach? Would something like: "if a report has been generated for country X in the past 12 hours, ignore" be enough?
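That per-country cooldown could be kept as a timestamp of the last report per metric. A sketch in Python (the 12-hour window is taken from the question above; `filter_new_anomalies` and the in-memory state dict are hypothetical, as the real state would live in the anomaly file on HDFS):

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=12)

def filter_new_anomalies(anomalies, last_reported, now):
    """Drop anomalies already reported within the cooldown window.

    anomalies maps metric (e.g. country) -> deviation;
    last_reported maps metric -> datetime of its last alert.
    Returns the anomalies worth alerting on, updating last_reported.
    """
    fresh = {}
    for metric, deviation in anomalies.items():
        last = last_reported.get(metric)
        if last is None or now - last >= COOLDOWN:
            fresh[metric] = deviation
            last_reported[metric] = now
    return fresh

state = {"X": datetime(2021, 2, 25, 0)}
now = datetime(2021, 2, 25, 6)
# Country X was reported 6 hours ago -> suppressed; Y is new -> alert.
assert filter_new_anomalies({"X": 45, "Y": 23}, state, now) == {"Y": 23}
```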

  • Not sure the Icinga PROBLEM vs RECOVERY approach makes sense with time series anomaly detection. I believe the RECOVERY alert would carry no valuable information, since it will usually happen when the RSVD algorithm "gets used" to the new time series tendency. Also, Icinga might sometimes skip true positives: e.g. there's a slightly high value one day (which gives a non-0 exit code), and the next day there's a very low value (also non-0 exit), which would be ignored IIUC.

I think it's fine to just have "problem" and not have "recovery". I say this because if we do have a censorship or shutdown event, the recovery of that event has to be manually followed up in any case so Icinga alerting us is not beneficial. (That also gets into the other problem of defining a "recovery" event.)

  • Icinga's ability to send pages is desirable. However, if the "alertee" receives a page, but can not know what metric or what deviation it has until they have access to a computer, the page becomes less useful. Plus, I believe for now, these alarms are not that critical.
  • If we stick with Oozie, I think there's a way to add the affected metrics and their deviation in the email text.

This is on my wishlist but I never mentioned it because I don't want to be that person that has lots of feature requests but doesn't actually submit a pull-request with the code! But yes, if we do this it would be nice though I think this is not a priority, unlike fixing the repeated alarms.

Please let me know how I can help. I will restart the conversation about the Icinga thing and confirm if we can hook it up with VictorOps/email/text but I feel if fixing the current system is easier, that's the path we should take.

Speaking strictly for myself here, I don't mind the repeated emails, as I look at them and quickly ignore them if I have already seen a report about a given country. I think putting the text in the email body could make that process even faster, so we can think about doing that and not worry about fixing the other problem for now. I do think it is annoying, but given how grateful I am for this new system, it's a small problem to have :)

The anomalous metric and the corresponding deviation cannot be attached to the email. The "alertee" needs to track down the alert (possibly ssh-ing into a stats machine) to get that data. <- Annoying

Can we customize the Icinga notification so that this information can be added there and therefore be part of the email as well (through VictorOps)? I am guessing @elukey knows about the Icinga part.

I don't know about VictorOps! But from what @elukey and I discussed, it seems an Icinga alarm can only contain one custom field, which should be a link to a troubleshooting runbook wiki page. Thus, the "alertee" would have to get the affected country and deviation from another source.

So, what if we use the current anomaly file as state? Currently, every time the script detects an anomaly, it writes a file to HDFS. With some changes to the job, we could have the same script read the previous file, if it exists, and alarm only when the anomaly is new. With this approach:

  • Oozie would send emails only for the new anomalies (no repeated alerts). <- Nice
  • The anomalous metric (in our case the country whose traffic is anomalous) and the corresponding deviation would still be present in an attached file in the email. <- Nice
  • A little less work than the Icinga option <- Nice :]

How much work do you think this is, given that this will fix this one issue we have with the otherwise acceptable current approach? Would something like: "if a report has been generated for country X in the past 12 hours, ignore" be enough?

I think we could implement this as part of the Spark Scala code, with some changes in the Oozie job workflow. The Spark job would write an anomaly file whenever the deviation for a given metric is outstanding, but it would only alert if no anomaly file exists for the previous run, or if one exists but the new anomalies are not "a repetition" of the former.

  • The anomalous metric and the corresponding deviation cannot be attached to the email. The "alertee" needs to track down the alert (possibly ssh-ing into a stats machine) to get that data. <- Annoying

Can we customize the Icinga notification so that this information can be added there and therefore be part of the email as well (through VictorOps)? I am guessing @elukey knows about the Icinga part.

If possible I'd avoid going down this road: Icinga afaik doesn't support it, and I would be worried about people reviewing the alert without having received the email with the details (say it ended up in the spam filter, or another SRE reviews it, etc.). I proposed to Marcel to add a link to Wikitech with a runbook, to allow anybody to check the status of the alert if needed (for example, checking on HDFS etc.). Let me know if it makes sense :)

I think that Marcel's proposal is good. Even if I don't love alarming via email, for the aforementioned issues, this is really a special case, so I would suggest trying it and seeing how it goes (maybe we can review it after some months).

Another thing that might be useful: do we need to create a new mailing list for this kind of alert, so all interested people can be subscribed? (Not a public list, but something that we control, like analytics-alert@ etc.)

mforns renamed this task from Separate RSVD anomaly detection into a systemd timer for better alarming with Icinga to Make data quality stats alert only if anomalous metrics change.Feb 25 2021, 1:02 AM
mforns updated the task description.
Milimetric subscribed.

We'll look at possible ways to improve this as we move data quality jobs to Airflow.