
Ensure that the data-engineering team is alerted to all relevant host and service checks from Icinga
Closed, Resolved (Public)

Description

We have had a couple of incidents recently whereby the Data Engineering team missed critical service failures from Icinga, because the alerts only went to the #wikimedia-operations IRC channel.

Examples include:

  • disk space checks (e.g. /var/lib/hadoop/journal on some Hadoop worker nodes and / on kafka-test brokers)
  • Kafka broker checks (under-replicated partitions, broker process, etc.)

There will be many more like these.

At present, the only contact group associated with these checks is admins, for which the only contact is IRC.

We need to configure the hosts and services correctly so that the analytics contact group is also added where appropriate.
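As a rough illustration of the audit this implies, the sketch below lists checks on Data Engineering hosts that do not yet notify the analytics contact group. It is not taken from the puppet change: the `Check` class, the `checks_missing_group` helper, the host-name prefixes and the sample data are all hypothetical, and a real audit would read the contact groups from Icinga itself.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """One Icinga host or service check (hypothetical model, not Icinga's API)."""
    host: str
    description: str
    contact_groups: set = field(default_factory=set)


def checks_missing_group(checks, group="analytics",
                         host_prefixes=("an-worker", "kafka-test")):
    """Return checks on matching hosts whose contact groups lack `group`.

    The default prefixes are assumptions based on the host names
    mentioned in this task; adjust them to the real inventory.
    """
    return [
        c for c in checks
        if c.host.startswith(host_prefixes) and group not in c.contact_groups
    ]


# Made-up sample data for illustration only.
sample = [
    Check("an-worker1081", "Disk space", {"admins"}),
    Check("kafka-test1006", "Kafka broker process", {"admins", "analytics"}),
    Check("mw1234", "Disk space", {"admins"}),
]

for check in checks_missing_group(sample):
    print(f"{check.host}: {check.description} does not notify analytics")
```

Only the first sample entry is reported: the second already has the analytics group and the third is not on a matching host.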

We should also be mindful of the migration to Alertmanager and the desire to create tickets from some alerts, e.g. T225140.

Event Timeline

Change 804593 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the analytics contact group to all relevant hosts in icinga

https://gerrit.wikimedia.org/r/804593

BTullis triaged this task as Medium priority. Jun 10 2022, 2:32 PM
BTullis moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board.
BTullis moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.

Change 804593 merged by Btullis:

[operations/puppet@production] Add the analytics contact group to all relevant hosts in icinga

https://gerrit.wikimedia.org/r/804593

This change is now merged, and I have tested that it has resulted in both hosts and services having the analytics contact group applied.
For example, here is the disk space check of an-worker1081, which is one of our journalnode hosts.

image.png (342×1 px, 69 KB)