We have had a couple of incidents recently in which the Data Engineering team missed critical service failures reported by Icinga, because the alerts went only to the #wikimedia-operations IRC channel.
Examples include:
- disk space checks (e.g. /var/lib/hadoop/journal on some hadoop worker nodes and / on kafka-test brokers)
- kafka broker checks (under-replicated partitions, broker process, etc.)
There will be many more like these.
At present the only contact group associated with these checks is admins, and the only contact in that group is IRC.
We need to configure the hosts and services correctly so that the analytics contact group is also added where appropriate.
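As a rough illustration of the intended end state, here is a minimal sketch of one affected check with both contact groups attached. It assumes Nagios-style object syntax as used by Icinga 1. The host name and check command are hypothetical placeholders, the other required service directives are omitted, and the real change may well be made in the configuration management layer rather than in a hand-written object like this.

```
# Hypothetical sketch only: host_name and check_command are placeholders,
# and directives normally inherited from a template are omitted.
define service {
    host_name               example-hadoop-worker        ; placeholder host
    service_description     Disk space on /var/lib/hadoop/journal
    check_command           example_check_disk           ; placeholder command
    contact_groups          admins,analytics             ; previously: admins
}
```

With contact_groups listing both groups, notifications for this check would go to the contacts in admins and in analytics, rather than to IRC alone.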
We should also be mindful of the migration to Alertmanager and the desire to create tickets from some alerts, e.g. T225140.