Page MenuHomePhabricator

Database read_only alert has a changing service description
Closed, ResolvedPublic

Description

In https://gerrit.wikimedia.org/r/718936 we tried to automatically downtime the read-only checks as part of the DC switchover, by matching the service name.

Quoting @Kormat:

This will only match the icinga 'service' in the active DC (i.e. the one we're switching away from). The inactive DC doesn't have #page in the service name; that changes when we run puppet.

Quoting @Volans (emphasis added):

If the check changes name they are two different checks of which only one exists at any given time. If that's the case we can't downtime the non-existing one (#page in the passive datacenter) in advance [...]

Having checks that change description (that is the unique identifier for icinga) based on puppet runs based on some live-state status should be something to avoid.

I can see a few ways forward:

  • Keep everything as-is, i.e. with #page in the active DC and without #page in the passive. The advantage is that a read-write state in the passive DC is arguably less urgent than the other way around (since MediaWiki wouldn't send updates there anyway) so we avoid unnecessary paging alerts.
  • Use a consistent service description with #page; i.e. have the alert page in both DCs. The advantage is that we don't change the description during the switchover (effectively creating two new services). In addition to being a little more conceptually reasonable, it gives us functional improvements like the ability to maintain a service downtime during the switchover.
  • Keep both services in both DCs: that is, each DC has a service like {section} should be read-only but is read-write and a service like {section} should be read-write but is read-only #page. That has the advantages of both approaches above, but maybe it looks a little strange to have an extra service that can't alert. (I'm assuming that's feasible -- I haven't actually looked at the implementation of the check.)

I think I favor option 3 but maybe there's a better approach.

Event Timeline

I like option #3 too, but I also have to say that option #1 is also quite important. Having the passive DC in read-write is very dangerous and could cause lots of headaches if for some reason something gets inserted there. I consider a split brain worse than having the active DC as read-only for a period of time, in fact, I'd have the active DC as read-only by mistake than the passive DC as read-write.

Luckily, in general MW would not try to insert on the passive DC but we've seen attempts to write on the passive, especially from wikitech maintenance scripts or webhosts, which are "expected" apparently and only stopped by the read-only flag set on mysql itself.

I would love to hear other opinions!

Ad for the feasibility, I thought about some ways to achieve more or less what you want without having alerts that change name based on puppet runs. None of those is clean or pretty given the current limitations.

In no particular order:

  • setup both checks always, with and without #page, the one with #page executes a small wrapper that reads the expected replication configuration, runs the check and then adapt the exit code based on the active DC and replication config, making the #page be a WARNING instead of a CRITICAL where it should not page.
  • setup only the non-#page alerts as active alert in Icinga (like all the alerts) and then have the #page version of them setup as passive alerts (like the fr-tech ones) that will be populated either by a script that runs somewhere or more simply by an icinga event handler triggered by the non-#page alert so that it can decide when the #page one should go critical and actually page.
  • [very hacky] use a dedicated notify_by_* commands in modules/nagios_common/templates/notification_commands.cfg.erb that runs a script that knows about the replication logic and does notify by IRC or page and could add/remove the #page tag. It's very hacky because totally magic and transparent for the user and will create confusion in the future.
  • investigate with o11y if alertmanager might be of any help here

This is totally independent on where the replication expected status lives, it can be puppet or elsewhere. To me it seems that it's more of a live-state that should probably not be hardcoded in puppet, but that might be out of scope for this task.

LSobanski triaged this task as Medium priority.Sep 9 2021, 7:30 AM
LSobanski moved this task from Triage to Refine on the DBA board.

Note that this isn't the only alert that works like this for DBs. We also have:

  • MariaDB Replica IO:
  • MariaDB Replica Lag:
  • MariaDB Replica SQL:
  • mysqld processes

These do not have #p.age on the inactive DC.

I like option #3 too, but I also have to say that option #1 is also quite important. Having the passive DC in read-write is very dangerous and could cause lots of headaches if for some reason something gets inserted there. I consider a split brain worse than having the active DC as read-only for a period of time, in fact, I'd have the active DC as read-only by mistake than the passive DC as read-write.

Luckily, in general MW would not try to insert on the passive DC but we've seen attempts to write on the passive, especially from wikitech maintenance scripts or webhosts, which are "expected" apparently and only stopped by the read-only flag set on mysql itself.

Hmm. Ok, it sounds like for this specific (read-only) alert, we should page in both DCs. I'll send a CR.

The other 4 alerts though currently have the required paging/non-paging behaviour.

Change 719948 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] mariadb: Page for read-only status issues in both DCs

https://gerrit.wikimedia.org/r/719948

Change 719948 merged by Kormat:

[operations/puppet@production] mariadb: Page for read-only status issues in both DCs

https://gerrit.wikimedia.org/r/719948

Kormat claimed this task.

Ok, the read_only alert now happens to have #p.age for both DCs, so the initial reason for raising this task is now addressed. As for the general approach of having alerts that change name depending on whether they should page or not, that's an artifact of our icinga setup & conventions. Prometheus doesn't have this issue, so it becomes moot as we (slowly) migrate over. As such, i'm going to close this task now.