
Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager
Open, Medium, Public

Description

The Data-Engineering team currently has some monitoring workflows that are at odds with the way we alert about systemd unit failures.

There's certainly room for improvement in the ops week processes that we use when responding to these failures, but I'm wondering whether there is any other solution that could help us reduce the amount of IRC spam in particular that we see from these monitors.

For context, we have a number of systemd timers that run on an-launcher1002 which are used to detect data quality issues on the HDFS file system.

btullis@an-launcher1002:~$ systemctl list-units monitor*
UNIT                                                     LOAD   ACTIVE SUB     DESCRIPTION                                                                     
monitor_refine_event.timer                               loaded active waiting Periodic execution of monitor_refine_event.service                              
monitor_refine_event_sanitized_analytics_delayed.timer   loaded active waiting Periodic execution of monitor_refine_event_sanitized_analytics_delayed.service  
monitor_refine_event_sanitized_analytics_immediate.timer loaded active waiting Periodic execution of monitor_refine_event_sanitized_analytics_immediate.service
monitor_refine_event_sanitized_main_delayed.timer        loaded active waiting Periodic execution of monitor_refine_event_sanitized_main_delayed.service       
monitor_refine_event_sanitized_main_immediate.timer      loaded active waiting Periodic execution of monitor_refine_event_sanitized_main_immediate.service     
monitor_refine_eventlogging_analytics.timer              loaded active waiting Periodic execution of monitor_refine_eventlogging_analytics.service             
monitor_refine_eventlogging_legacy.timer                 loaded active waiting Periodic execution of monitor_refine_eventlogging_legacy.service                
monitor_refine_netflow.timer                             loaded active waiting Periodic execution of monitor_refine_netflow.service

These monitor jobs are created by the Puppet defined type profile::analytics::refinery::job::refine_job when refine_monitor_enabled is true.

Most of these timers run on a daily schedule, although the start time is usually specified as per this example.

Our current procedures state that whichever member of our team is on Ops Week should respond to these alerts every day. We get an email from a scheduled monitor, which goes to data-engineering-alerts@lists.wikimedia.org.

However, what we have seen recently is that when we have a failed monitor like this, we also get IRC alerts from Alertmanager to the #wikimedia-analytics IRC channel every 5 or 10 minutes, e.g.

[Attached screenshot: image.png, 498 KB]

We have struggled to find a good alerting mechanism for the use case of:
"Does everything look correct for the last 48 hours' worth of data?"
...as opposed to:
"Is everything OK right now?"

Longer term, we plan to migrate our refinery jobs and some other miscellaneous systemd timers to Airflow, where we have greater responsibility for the alerting mechanism (currently email as well).
Until we manage to do that, we're seeking some kind of solution that will help keep our IRC channel manageable.

Event Timeline

FWIW, the monitor_refine_* jobs are a safety step for noticing that something is wrong. The refine_* jobs themselves email a more specific alert about something going wrong when they run, and we usually use those to resolve a specific problem.

For IRC spam, I think it would be fine to turn off IRC reporting of monitor_refine_* alerts, if that is possible.

For IRC spam, I think it would be fine to turn off IRC reporting of monitor_refine_* alerts, if that is possible.

OK, thanks. I'll have a look and see what that would take.

I believe when we migrate Refine to Airflow the number of alerts will diminish.
One reason is that with Airflow we can configure a more robust retry approach, which should hopefully(1) reduce the number of alerts by a lot, since most of them are nowadays fixed by a manual retry.
Another reason is that we hopefully(2) won't need to execute Refine on a window of hours, like we do now. And this will hopefully(3) better pinpoint/track the refine errors and eliminate the need for the monitor completely.
I feel like migrating Refine is probably the most valuable next step in the Airflow migration project.
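
To illustrate the retry point above, here is a minimal sketch of what an Airflow task with automatic retries might look like. The DAG name, schedule, and command are hypothetical rather than the actual refinery configuration; the point is only the general shape: retry a few times before emailing anyone.

```python
# Hypothetical sketch: Airflow retries absorbing transient Refine failures
# before an email alert is sent. All names and values are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                          # retry transient failures automatically
    "retry_delay": timedelta(minutes=30),  # wait between attempts
    "email_on_failure": True,              # alert only once all retries are exhausted
    "email": ["data-engineering-alerts@lists.wikimedia.org"],
}

with DAG(
    dag_id="refine_event_example",         # hypothetical DAG name
    start_date=datetime(2023, 6, 1),
    schedule_interval="@hourly",
    default_args=default_args,
    catchup=False,
) as dag:
    refine = BashOperator(
        task_id="refine_hourly_partition",
        # Placeholder command; the real task would submit the Refine Spark job.
        bash_command="echo 'refine partition for {{ ds }}'",
    )
```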

Another reason is that we hopefully(2) won't need to execute Refine on a window of hours, like we do now.

Thanks @mforns. Could you elaborate on this a little for me, please?

Will the refine job be running once per hour (per partition?), with a sensor for when the source data is ready?

I feel like migrating Refine is probably the most valuable next step in the Airflow migration project.

Sounds good to me. 👍

Will the refine job be running once per hour (per partition?), with a sensor for when the source data is ready?

Yup, that's what it does now, except the custom scheduling bit (RefineTarget) is used to launch all the necessary hourly dataset Refine jobs all at once. The scheduling bit is what we want to move to airflow.
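
For illustration, here is a minimal sketch of the "sensor, then refine" shape described above: one hourly DAG run per partition, which only starts refining once the source data for that hour is present. The sensor callable, paths, and commands are placeholders, not the actual migration design.

```python
# Hypothetical sketch: an hourly Airflow DAG that waits for the raw data of
# its hour before refining it. All names and paths are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.python import PythonSensor


def raw_hour_is_ready(**context) -> bool:
    # Placeholder readiness check. In practice this would look in HDFS for
    # the raw event data (or an import success flag) for context["logical_date"].
    return True


with DAG(
    dag_id="refine_event_hourly_example",   # hypothetical name
    start_date=datetime(2023, 6, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    wait_for_raw_data = PythonSensor(
        task_id="wait_for_raw_data",
        python_callable=raw_hour_is_ready,
        poke_interval=300,        # check every 5 minutes
        timeout=6 * 60 * 60,      # give up after 6 hours of waiting
        mode="reschedule",        # free the worker slot between pokes
    )

    refine_partition = BashOperator(
        task_id="refine_partition",
        # Placeholder; the real task would refine the single hourly partition
        # belonging to this DAG run.
        bash_command="echo 'refine hour {{ ts }}'",
    )

    wait_for_raw_data >> refine_partition
```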

In T337052, @BTullis wrote:

Until we manage to do that, we're seeking some kind of solution that will help keep our IRC channel manageable.

With my Observability hat on, I recommend not sending warnings on IRC (or at least not on a main channel with a wide audience). The ops week person will be able to check warnings at https://alerts.wikimedia.org at their leisure though!

I realize that's a bigger-scope solution to this task's problem; however, my rationale is the following:

  • Folks' attention is precious, and alert fatigue is real
  • If the alert is a warning, then it can wait; if it can't wait, then it should be a critical or a page

hope that helps!

In T337052#8865357, @mforns wrote:
Another reason is that we hopefully(2) won't need to execute Refine on a window of hours, like we do now.

Thanks @mforns. Could you elaborate on this a little for me, please?

Of course (sorry I let this one slip).

Currently, Refine runs every hour over a window of 44 hours before *now*, to account for the late arrival of data.
But it also means that when an hourly partition fails refinement, it fails every time the partition falls within the Refine window.
So it can be that we get lots of alerts for the same partition.
I'm not completely sure that's the case, because I don't know what happens when Refine runs and there is already a failure flag on a given partition.
I don't know whether it tries to re-run it anyway, or skips it.

In Airflow, we should be able to avoid the window concept and deal with late arrival in some other way.
This means each failure will correspond to a red box in the Airflow UI and only alert once.

But it also means that when an hourly partition fails refinement, it fails every time the partition falls within the Refine window.

Not quite!

There is a _FAILURE flag that gets written, and by default previous failures are excluded from the next refinement. That's why we have to manually rerun those with --ignore-failure-flag=true. We only get one alert per Refine attempt of an hour (unless someone reruns and it fails again).

What does keep alerting is the RefineMonitor. RefineMonitor also uses a past window search, but looks for anything that hasn't been refined, or has failed to be refined, and alerts on that. This is a safeguard we added to make sure that we don't miss an individual email alert.

Once we have airflow we won't need RefineMonitor, because Airflow will keep state for each individual dataset that needs to be refined, and we can alert on that.
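
As an illustration of the kind of check RefineMonitor performs (per the description above), here is a rough Python sketch of a lookback-window scan over hourly partitions: anything carrying a _FAILURE flag, or with no success flag at all, gets reported. The _REFINED flag name, the partition layout, and the local-filesystem walk are assumptions for illustration; the real job inspects HDFS.

```python
# Rough sketch of a RefineMonitor-style check: look back over a window of
# hourly partitions and report anything not refined or marked as failed.
# The _REFINED flag name and local paths are assumptions for illustration.
from datetime import datetime, timedelta
from pathlib import Path

LOOKBACK_HOURS = 44  # same idea as the Refine window discussed above


def unrefined_or_failed_hours(base: Path, now: datetime) -> list[str]:
    """Return hourly partitions in the lookback window that need attention."""
    problems = []
    for offset in range(1, LOOKBACK_HOURS + 1):
        hour = now - timedelta(hours=offset)
        partition = base / (
            f"year={hour.year}/month={hour.month:02d}"
            f"/day={hour.day:02d}/hour={hour.hour:02d}"
        )
        if (partition / "_FAILURE").exists():
            problems.append(f"{partition} failed refinement")
        elif not (partition / "_REFINED").exists():
            problems.append(f"{partition} has not been refined yet")
    return problems


if __name__ == "__main__":
    # Hypothetical dataset path, for demonstration only.
    for line in unrefined_or_failed_hours(Path("/tmp/refined/event/example_table"),
                                          datetime.utcnow()):
        print(line)
```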

There is a _FAILURE flag that gets written, and by default previous failures are excluded from the next refinement. That's why we have to manually rerun those with --ignore-failure-flag=true. We only get one alert per Refine attempt of an hour (unless someone reruns and it fails again).

Gotcha! sorry for misleading...

Gehel triaged this task as Medium priority. Oct 18 2023, 8:56 AM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.