
[wikireplicas] Route alerts to WMCS team
Open, High, Public

Description

Alerts for Wiki Replicas hosts (clouddb*) are currently routed to the data-persistence team together with the alerts for production hosts (db*).

Alerts are defined in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/master/team-data-persistence

Replication lag alerts were disabled for clouddb hosts in https://gerrit.wikimedia.org/r/c/operations/alerts/+/835117, by filtering based on job: mysql-core.

We should instead route all alerts, including the replication lag ones, to team=wmcs when the job label is mysql-labs.
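The proposed rule can be sketched roughly as follows. This is Python for illustration only: the real routing would live in the alert definitions or the Alertmanager configuration, and the function name `route_team`, the example instance name, and the fallback team string `data-persistence` are assumptions rather than values taken from the repository.

```python
# Hypothetical sketch: pick the receiving team from an alert's labels.
def route_team(labels: dict[str, str]) -> str:
    """Return the team that should receive an alert with these labels."""
    # Proposed rule from this task: anything scraped by the mysql-labs job goes to WMCS.
    if labels.get("job") == "mysql-labs":
        return "wmcs"
    # Assumed fallback: every other alert stays with the data-persistence team.
    return "data-persistence"


# "clouddb-example" is a made-up instance name used only for this demo.
print(route_team({"job": "mysql-labs", "instance": "clouddb-example"}))
```

Note that any host tagged with job: mysql-labs matches this rule, not only clouddb* hosts, which is why the job tagging of other hosts matters.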

Open questions:

  • Sanitarium hosts are also tagged with job: mysql-labs. Should they be tagged with job: mysql-core instead?
  • clouddb hosts are currently tagged with cluster: mysql in hieradata/regex.yaml. Should they be tagged with cluster: wmcs instead? This would include them in the WMCS NodeDown alerts.

Event Timeline

fnegri triaged this task as High priority. Dec 5 2024, 4:00 PM

> clouddbs are currently getting tagged with cluster: mysql

There are actually four different Hiera keys we should check for clouddb hosts (see T375673: Define single hiera key to identify WMCS-managed bare metal hosts):

  • profile::admin::groups: this is set to wikireplicas-roots and it can stay like that (see T344599: wikireplicas root access)
  • contactgroups: I suggest we set it to wmcs-team-email
  • cluster: I suggest we set it to wmcs
  • profile::contacts::role_contacts: this is currently set to ['Data Platform','WMCS']; we should set it to WMCS only

I'm not entirely sure about Icinga alerts for clouddb* hosts. As far as I understand it:

  • emails and pages sent by Icinga are routed based on the value of contactgroups (current value: admins, I think)
  • alerts that are forwarded from Icinga to Alertmanager are tagged with team=wmcs based on the regex in icinga_exporter.yaml
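As a rough illustration of the second point, hostname-based tagging could look like the sketch below. The pattern `^clouddb` and the function name `team_for_icinga_alert` are hypothetical; the actual regex is whatever icinga_exporter.yaml contains, which is not reproduced here.

```python
import re
from typing import Optional

# Hypothetical pattern: stands in for the real regex in icinga_exporter.yaml.
WMCS_HOST_RE = re.compile(r"^clouddb")


def team_for_icinga_alert(hostname: str) -> Optional[str]:
    """Return "wmcs" if the hostname matches the assumed pattern, else None."""
    if WMCS_HOST_RE.search(hostname):
        return "wmcs"
    # No match: this sketch leaves the alert without a team label.
    return None
```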

We discussed this today with @Marostegui and @joanna_borun and we agreed that the following alerts should exist and should be routed to the WMCS team for clouddb* hosts:

  • mysql down
  • replication broken
  • replication lag greater than x

All these alerts should be critical but not paging.
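The agreed alert set could be modelled roughly like this. It is a sketch in Python rather than actual alert rules: the `ReplicaState` fields, the `page` label, and the `lag_threshold_seconds` parameter are all assumptions, and the threshold is left as a required argument because the value of x has not been decided.

```python
from dataclasses import dataclass
from typing import Optional

# Labels shared by all three alerts: routed to WMCS, critical, not paging.
# The "page" label name is an assumption made for this sketch.
COMMON_LABELS = {"team": "wmcs", "severity": "critical", "page": "no"}


@dataclass
class ReplicaState:
    """Hypothetical snapshot of a clouddb host."""
    mysql_up: bool
    replication_running: bool
    lag_seconds: Optional[float]  # None when lag cannot be measured


def firing_alerts(state: ReplicaState, lag_threshold_seconds: float) -> list[dict]:
    """Return the alerts that would fire for this state, each checked independently."""
    names = []
    if not state.mysql_up:
        names.append("mysql down")
    if not state.replication_running:
        names.append("replication broken")
    if state.lag_seconds is not None and state.lag_seconds > lag_threshold_seconds:
        names.append("replication lag")
    return [{"alertname": name, **COMMON_LABELS} for name in names]
```

Each condition is evaluated on its own here; whether "mysql down" should suppress the other two alerts is a separate design decision this sketch does not make.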