The Data Engineering team has some monitoring workflows that are at odds with the way we alert about systemd unit failures.
There's certainly room for improvement in the Ops Week processes we use when responding to these failures, but I'm wondering whether there is another solution that could help us reduce the amount of IRC spam in particular that we see from these monitors.
For context, we have a number of systemd timers running on an-launcher1002 that detect data quality issues in HDFS.
```
btullis@an-launcher1002:~$ systemctl list-units monitor*
UNIT                                                      LOAD   ACTIVE SUB     DESCRIPTION
monitor_refine_event.timer                                loaded active waiting Periodic execution of monitor_refine_event.service
monitor_refine_event_sanitized_analytics_delayed.timer    loaded active waiting Periodic execution of monitor_refine_event_sanitized_analytics_delayed.service
monitor_refine_event_sanitized_analytics_immediate.timer  loaded active waiting Periodic execution of monitor_refine_event_sanitized_analytics_immediate.service
monitor_refine_event_sanitized_main_delayed.timer         loaded active waiting Periodic execution of monitor_refine_event_sanitized_main_delayed.service
monitor_refine_event_sanitized_main_immediate.timer       loaded active waiting Periodic execution of monitor_refine_event_sanitized_main_immediate.service
monitor_refine_eventlogging_analytics.timer               loaded active waiting Periodic execution of monitor_refine_eventlogging_analytics.service
monitor_refine_eventlogging_legacy.timer                  loaded active waiting Periodic execution of monitor_refine_eventlogging_legacy.service
monitor_refine_netflow.timer                              loaded active waiting Periodic execution of monitor_refine_netflow.service
```
These monitor jobs are created by the Puppet defined type `profile::analytics::refinery::job::refine_job` when `refine_monitor_enabled` is true.
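By Puppet's standard autoloading rules, that defined type lives at modules/profile/manifests/analytics/refinery/job/refine_job.pp in the operations/puppet repository. To see which refine jobs enable the monitor, one can grep a checkout of the repo (the checkout path below is a hypothetical example):

```
# Find every declaration that sets refine_monitor_enabled, to list the
# refine jobs that generate these monitor units.
# (~/operations-puppet is an assumed checkout location.)
git -C ~/operations-puppet grep -n 'refine_monitor_enabled'
```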
Most of these timers run on a daily schedule, although the start time is usually specified explicitly in each timer unit.
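The schedule of any of these timers can be checked directly on the host with standard systemctl commands, e.g.:

```
# List the monitor timers together with their last and next
# activation times:
systemctl list-timers 'monitor_*'

# Print the unit file for one timer, including its OnCalendar
# schedule line:
systemctl cat monitor_refine_event.timer
```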
Our current procedures state that whichever member of our team is on Ops Week should respond to these alerts every day. We get an email from a scheduled monitor, which goes to data-engineering-alerts@lists.wikimedia.org.
However, what we have seen recently is that when a monitor fails like this, we also get IRC alerts from Alertmanager in the #wikimedia-analytics channel every 5 or 10 minutes, e.g.
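Part of that repetition, we assume, comes from the unit remaining in the failed state, so Alertmanager keeps re-notifying until that state is cleared. A minimal sketch of clearing it by hand once the failure has been triaged, using monitor_refine_event as the example unit:

```
# Check whether the unit is currently in the failed state:
systemctl is-failed monitor_refine_event.service

# Clear the failed state. Note that this only acknowledges the failure
# on the host; it does nothing about the underlying data quality issue:
sudo systemctl reset-failed monitor_refine_event.service
```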
We have struggled to find a good alerting mechanism for the use-case of:
does everything look correct for the last 48 hours' worth of data?
... as opposed to:
is everything OK right now?
Longer term, we plan to migrate our refinery jobs and some other miscellaneous systemd timers to Airflow, where we take on greater responsibility for the alerting mechanism (currently email as well).
Until we manage to do that, we're seeking some kind of solution that will help keep our IRC channel manageable.
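For illustration, one possible stopgap would be to add a time-boxed silence in Alertmanager once the daily monitor email has been triaged. A sketch using amtool; the alert name and label values here are assumptions and would need to match the labels on the real alert:

```
# Silence the repeating IRC notifications for 24 hours while the
# underlying issue is investigated. The matchers (alertname, instance)
# are assumptions; copy the labels from the actual alert. Depending on
# local amtool configuration, --alertmanager.url may also be needed.
amtool silence add \
    alertname='SystemdUnitFailed' \
    instance='an-launcher1002' \
    --comment='Known monitor failure; being handled as part of Ops Week' \
    --duration='24h'
```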