
Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager
Open, Medium, Public

Description

The Data-Engineering team currently has some monitoring workflows that are at odds with the way we alert about systemd unit failures.

There's certainly room for improvement in the ops week processes that we use when responding to these failures, but I'm wondering whether there is any other solution that could help us reduce the amount of IRC spam in particular that we see from these monitors.

For context, we have a number of systemd timers that run on an-launcher1002 which are used to detect data quality issues on the HDFS file system.

btullis@an-launcher1002:~$ systemctl list-units monitor*
UNIT                                                     LOAD   ACTIVE SUB     DESCRIPTION                                                                     
monitor_refine_event.timer                               loaded active waiting Periodic execution of monitor_refine_event.service                              
monitor_refine_event_sanitized_analytics_delayed.timer   loaded active waiting Periodic execution of monitor_refine_event_sanitized_analytics_delayed.service  
monitor_refine_event_sanitized_analytics_immediate.timer loaded active waiting Periodic execution of monitor_refine_event_sanitized_analytics_immediate.service
monitor_refine_event_sanitized_main_delayed.timer        loaded active waiting Periodic execution of monitor_refine_event_sanitized_main_delayed.service       
monitor_refine_event_sanitized_main_immediate.timer      loaded active waiting Periodic execution of monitor_refine_event_sanitized_main_immediate.service     
monitor_refine_eventlogging_analytics.timer              loaded active waiting Periodic execution of monitor_refine_eventlogging_analytics.service             
monitor_refine_eventlogging_legacy.timer                 loaded active waiting Periodic execution of monitor_refine_eventlogging_legacy.service                
monitor_refine_netflow.timer                             loaded active waiting Periodic execution of monitor_refine_netflow.service

These monitor jobs are created by the Puppet defined type profile::analytics::refinery::job::refine_job when refine_monitor_enabled is true.

Most of these timers run on a daily schedule, although the start time is usually specified as per this example.

Our current procedures state that whichever member of our team is on Ops Week should respond to these alerts every day. We get an email from a scheduled monitor, which goes to data-engineering-alerts@lists.wikimedia.org.

However, what we have seen recently is that when we have a failed monitor like this, we also get IRC alerts from Alertmanager to the #wikimedia-analytics IRC channel every 5 or 10 minutes, e.g.

[Attached screenshot: image.png, 498 KB]

We have struggled to find a good alerting mechanism for the use case of:
"Does everything look correct for the last 48 hours' worth of data?"
...as opposed to:
"Is everything OK right now?"

Longer term, we plan to migrate our refinery jobs and some other miscellaneous systemd timers to Airflow, where we have greater responsibility for the alerting mechanism (currently email as well).
Until we manage to do that, we're seeking some kind of solution that will help keep our IRC channel manageable.

Event Timeline

FWIW, the monitor_refine_* jobs are a safety step for noticing that something is wrong. The refine_* jobs themselves email a more specific alert about something going wrong when they run, and we usually use those to resolve a specific problem.

For IRC spam, I think it would be fine to turn off IRC reporting of monitor_refine_* alerts, if that is possible.

For IRC spam, I think it would be fine to turn off IRC reporting of monitor_refine_* alerts, if that is possible.

OK, thanks. I'll have a look and see what that would take.

I believe when we migrate Refine to Airflow the number of alerts will diminish.
One reason is that with Airflow we can configure a more robust retry approach, which should hopefully(1) reduce the number of alerts by a lot, since most of them are nowadays fixed by a manual retry.
Another reason is that we hopefully(2) won't need to execute Refine on a window of hours, like we do now. And this will hopefully(3) better pinpoint/track the refine errors and eliminate the need for the monitor completely.
I feel like migrating Refine is probably the most valuable next step in the Airflow migration project.
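
To illustrate the retry point above, here is a minimal sketch of what an Airflow task with automatic retries might look like. The DAG name, schedule, and command are hypothetical rather than the actual refinery configuration; the point is only the general shape: retry a few times before emailing anyone.

```python
# Hypothetical sketch: Airflow retries absorbing transient Refine failures
# before an email alert is sent. All names and values are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                          # retry transient failures automatically
    "retry_delay": timedelta(minutes=30),  # wait between attempts
    "email_on_failure": True,              # alert only once all retries are exhausted
    "email": ["data-engineering-alerts@lists.wikimedia.org"],
}

with DAG(
    dag_id="refine_event_example",         # hypothetical DAG name
    start_date=datetime(2023, 6, 1),
    schedule_interval="@hourly",
    default_args=default_args,
    catchup=False,
) as dag:
    refine = BashOperator(
        task_id="refine_hourly_partition",
        # Placeholder command; the real task would submit the Refine Spark job.
        bash_command="echo 'refine partition for {{ ds }}'",
    )
```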

Another reason is that we hopefully(2) won't need to execute Refine on a window of hours, like we do now.

Thanks @mforns. Could you elaborate on this a little for me, please?

Will the refine job be running once per hour (per partition?), with a sensor for when the source data is ready?

I feel like migrating Refine is probably the most valuable next step in the Airflow migration project.

Sounds good to me. 👍

Will the refine job be running once per hour (per partition?), with a sensor for when the source data is ready?

Yup, that's what it does now, except the custom scheduling bit (RefineTarget) is used to launch all the necessary hourly dataset Refine jobs all at once. The scheduling bit is what we want to move to airflow.
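
For illustration, here is a minimal sketch of the "sensor, then refine" shape described above: one hourly DAG run per partition, which only starts refining once the source data for that hour is present. The sensor callable, paths, and commands are placeholders, not the actual migration design.

```python
# Hypothetical sketch: an hourly Airflow DAG that waits for the raw data of
# its hour before refining it. All names and paths are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.python import PythonSensor


def raw_hour_is_ready(**context) -> bool:
    # Placeholder readiness check. In practice this would look in HDFS for
    # the raw event data (or an import success flag) for context["logical_date"].
    return True


with DAG(
    dag_id="refine_event_hourly_example",   # hypothetical name
    start_date=datetime(2023, 6, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    wait_for_raw_data = PythonSensor(
        task_id="wait_for_raw_data",
        python_callable=raw_hour_is_ready,
        poke_interval=300,        # check every 5 minutes
        timeout=6 * 60 * 60,      # give up after 6 hours of waiting
        mode="reschedule",        # free the worker slot between pokes
    )

    refine_partition = BashOperator(
        task_id="refine_partition",
        # Placeholder; the real task would refine the single hourly partition
        # belonging to this DAG run.
        bash_command="echo 'refine hour {{ ts }}'",
    )

    wait_for_raw_data >> refine_partition
```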

In T337052, @BTullis wrote:

Until we manage to do that, we're seeking some kind of solution that will help keep our IRC channel manageable.

With my Observability hat on, I recommend not sending warnings on IRC (or at least not on a main channel with a wide audience). The ops week person will be able to check warnings at https://alerts.wikimedia.org at their leisure though!

I realize that's a bigger-scope solution to this task's problem; however, my rationale is the following:

  • Folks' attention is precious, and alert fatigue is real
  • If the alert is a warning, then it can wait; if it can't wait, then it should be a critical or a page

hope that helps!

In T337052#8865357, @mforns wrote:
Another reason is that we hopefully(2) won't need to execute Refine on a window of hours, like we do now.

Thanks @mforns. Could you elaborate on this a little for me, please?

Of course (sorry I let this one slip).

Currently, Refine runs every hour over a window of 44 hours before *now*, to account for the late arrival of data.
But it also means that when an hourly partition fails refinement, it fails every time the partition falls within the Refine window.
So it can be that we get lots of alerts for the same partition.
I'm not completely sure that's the case, because I don't know what happens when Refine runs and there is already a failure flag on a given partition.
I don't know whether it tries to re-run it anyway, or skips it.

In Airflow, we should be able to avoid the window concept and deal with late arrival in some other way.
This means each failure will correspond to a red box in the Airflow UI and only alert once.

But it also means that when an hourly partition fails refinement, it fails every time the partition falls within the Refine window.

Not quite!

There is a _FAILURE flag that gets written, and by default previous failures are excluded from the next refinement. That's why we have to manually rerun those with --ignore-failure-flag=true. We only get one alert per Refine attempt of an hour (unless someone reruns and it fails again).

What does keep alerting is the RefineMonitor. RefineMonitor also uses a past window search, but looks for anything that hasn't been refined, or has failed to be refined, and alerts on that. This is a safeguard we added to make sure that we don't miss an individual email alert.

Once we have airflow we won't need RefineMonitor, because Airflow will keep state for each individual dataset that needs to be refined, and we can alert on that.
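
As an illustration of the kind of check RefineMonitor performs (per the description above), here is a rough Python sketch of a lookback-window scan over hourly partitions: anything carrying a _FAILURE flag, or with no success flag at all, gets reported. The _REFINED flag name, the partition layout, and the local-filesystem walk are assumptions for illustration; the real job inspects HDFS.

```python
# Rough sketch of a RefineMonitor-style check: look back over a window of
# hourly partitions and report anything not refined or marked as failed.
# The _REFINED flag name and local paths are assumptions for illustration.
from datetime import datetime, timedelta
from pathlib import Path

LOOKBACK_HOURS = 44  # same idea as the Refine window discussed above


def unrefined_or_failed_hours(base: Path, now: datetime) -> list[str]:
    """Return hourly partitions in the lookback window that need attention."""
    problems = []
    for offset in range(1, LOOKBACK_HOURS + 1):
        hour = now - timedelta(hours=offset)
        partition = base / (
            f"year={hour.year}/month={hour.month:02d}"
            f"/day={hour.day:02d}/hour={hour.hour:02d}"
        )
        if (partition / "_FAILURE").exists():
            problems.append(f"{partition} failed refinement")
        elif not (partition / "_REFINED").exists():
            problems.append(f"{partition} has not been refined yet")
    return problems


if __name__ == "__main__":
    # Hypothetical dataset path, for demonstration only.
    for line in unrefined_or_failed_hours(Path("/tmp/refined/event/example_table"),
                                          datetime.utcnow()):
        print(line)
```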

There is a _FAILURE flag that gets written, and by default previous failures are excluded from the next refinement. That's why we have to manually rerun those with --ignore-failure-flag=true. We only get one alert per Refine attempt of an hour (unless someone reruns and it fails again).

Gotcha! sorry for misleading...

Gehel triaged this task as Medium priority. Oct 18 2023, 8:56 AM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.