switchdc should automatically downtime "Read only" checks on DB masters being switched
Closed, Resolved · Public

Description

During the DC switchover, we downtime the "read only" alerts on the DB masters so they don't page. This step is currently manual, but given that @Kormat used a cookbook to set the downtimes, it seems like something that should be automated.

Event Timeline

To clarify, there's no way to downtime only a specific Icinga check across multiple machines (that I know of). I used sre.hosts.downtime against A:db-role-master for 1h, and then kept an eye on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^\(db\|es\|pc\)[12]&style=detail&servicestatustypes=29 so that any unexpected alerts would be noticed.

Marostegui triaged this task as Medium priority. Jul 5 2021, 7:37 AM
Marostegui moved this task from Triage to Refine on the DBA board.

The downtime cookbook uses icinga-downtime under the hood to write to Icinga's external commands file. If we wanted, we could add an optional -s flag to that script, to have it write a SCHEDULE_SVC_DOWNTIME command downtiming a specific service. Then the downtime cookbook can fire it off for each host in a cumin query.

That would give us a tidy way to downtime only the read_only service, without having to fully downtime all the DB sources during a tricky maneuver. The only downside is that it's an icinga-specific solution, so we may have to solve it again at some point during the alertmanager transition -- not sure of the timeline. I'm inclined to go ahead, though -- thoughts?
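
As a rough illustration of the proposal above, here is a minimal Python sketch that formats one SCHEDULE_SVC_DOWNTIME line per host in the Nagios / Icinga 1.x external-command syntax. It is not the icinga-downtime script or the cookbook code. The function names, host names, service description, and author are placeholders invented for the example, and the command stream is passed in as an open file-like object because the command file location depends on the installation.

```python
import io
import time


def schedule_svc_downtime(host, service, duration_s, author, comment, now=None):
    """Build one SCHEDULE_SVC_DOWNTIME external command line (hypothetical helper)."""
    for value in (host, service):
        # Fields are ';'-separated and each command ends with a newline.
        if ";" in value or "\n" in value:
            raise ValueError("invalid character in {!r}".format(value))
    start = int(time.time()) if now is None else int(now)
    end = start + int(duration_s)
    fields = [
        "SCHEDULE_SVC_DOWNTIME",
        host,
        service,
        str(start),
        str(end),
        "1",  # fixed: the downtime runs exactly from start to end
        "0",  # trigger_id: not triggered by another downtime
        str(int(duration_s)),
        author,
        comment,
    ]
    return "[{}] {}".format(start, ";".join(fields))


def downtime_service_on_hosts(hosts, service, duration_s, author, comment, stream, now=None):
    """Write one command per host to an already-open command stream."""
    for host in hosts:
        line = schedule_svc_downtime(host, service, duration_s, author, comment, now)
        stream.write(line + "\n")
    stream.flush()


# Demo against an in-memory buffer instead of a real command file.
buf = io.StringIO()
downtime_service_on_hosts(
    ["db0001", "db0002"], "read_only", 3600, "example-user", "DC switchover", buf, now=1000
)
print(buf.getvalue(), end="")
```

Because Icinga gives no feedback on what is written to the command file, a real caller would still want to confirm separately that each host and service exists.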

That sounds really useful! That's functionality that would be great to have for other cases, such as schema changes etc.

If I may add my 2 cents: icinga-downtime is a very simple, old bash script. All it does is ensure that a given host is defined in Icinga before trying to downtime it, so that it fails if the host isn't defined. That check matters because the Icinga "API" of writing to the command file gives you no feedback; it is a fire-and-forget approach. The script predates spicerack and the cookbooks.

Spicerack's Icinga module uses icinga-downtime only to fully downtime a host; for all other actions it writes directly to the command file.

We currently have a better interface to the current Icinga status, namely icinga-status, so my suggestion would be:

  • add a flag to icinga_status.py to return all services for the given hosts (not only the failed ones, as it does right now)
  • add to IcingaHosts a couple of methods like downtime_services(services: List[str]) and the related context manager services_downtimed.
    • they will get the current status of the hosts
    • consider whether we need to support regexes in the above methods to match the services and/or allow for some dynamic placeholders, like {host} for the hostname and {site} for the host's site (we don't have the latter at the moment, but it could be added)
    • fail if the service name is not present in the host's services, replicating the current functionality of icinga-downtime for hosts
    • write directly to the Icinga command file like we already do for the other actions
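
The shape suggested in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Spicerack's implementation: HostsSketch and ServiceNotFoundError are hypothetical names, the method signatures are guesses, the host status is a plain in-memory dict standing in for what icinga-status would return, and commands are emitted as (action, host, service) tuples rather than real Icinga command lines.

```python
import re
from contextlib import contextmanager


class ServiceNotFoundError(Exception):
    """Raised when a host has no service matching the requested pattern."""


class HostsSketch:
    """Hypothetical stand-in for an IcingaHosts-like object."""

    def __init__(self, services_by_host, send):
        # {hostname: [service names]}, as a status query might report them.
        self._services_by_host = services_by_host
        # Callable that receives one command; a real one would write to the command file.
        self._send = send

    def _resolve(self, pattern):
        """Match services on every host; fail before anything is sent if a host has none."""
        regex = re.compile(pattern)
        matched = {}
        for host, services in self._services_by_host.items():
            names = [name for name in services if regex.search(name)]
            if not names:
                raise ServiceNotFoundError("{}: no service matches {!r}".format(host, pattern))
            matched[host] = names
        return matched

    def _send_all(self, action, matched):
        for host, names in matched.items():
            for name in names:
                self._send((action, host, name))

    def downtime_services(self, pattern):
        matched = self._resolve(pattern)
        self._send_all("schedule", matched)
        return matched

    def remove_service_downtimes(self, pattern):
        self._send_all("remove", self._resolve(pattern))

    @contextmanager
    def services_downtimed(self, pattern):
        """Downtime matching services for the duration of the with-block."""
        self.downtime_services(pattern)
        try:
            yield
        finally:
            # Runs even if the body raises, so downtimes are not left behind.
            self.remove_service_downtimes(pattern)


sent = []
hosts = HostsSketch({"db0001": ["read_only", "disk"], "db0002": ["read_only"]}, sent.append)
with hosts.services_downtimed(r"^read_only$"):
    pass  # the risky operation would go here
print(sent)
```

Resolving every host before sending anything means a typo in the service pattern fails the whole call up front, rather than downtiming some hosts and silently skipping others.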

Adding @jbond for additional insights as he was heavily involved in the icinga-status development.

Oh, that does sound better! I didn't realize icinga-status was out there, thanks for the pointer. Your plan sounds good to me, especially if @jbond is happy with it too. I'll start by refactoring the icinga module to write to the command file directly (for host downtimes, no service functionality yet) and we can go from there.

Claiming this for now, but more than happy to hand it off, if anyone wants it.

I'll start by refactoring the icinga module to write to the command file directly (for host downtimes, no service functionality yet)

I'm not sure that bit is actually needed. The current implementation, however inelegant, is much quicker than parsing the Icinga status file, and it gives quick feedback that the host we're downtiming exists.

The icinga module is already writing directly to the command file for all other actions, so all the machinery is already there for you ;)

Change 704422 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] icinga: Performance improvements to icinga-status

https://gerrit.wikimedia.org/r/704422

Change 704422 merged by RLazarus:

[operations/puppet@production] icinga: Performance improvements to icinga-status

https://gerrit.wikimedia.org/r/704422

Change 705500 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/software/spicerack@master] icinga: Write to Icinga command file instead of calling icinga-downtime

https://gerrit.wikimedia.org/r/705500

Change 705500 merged by RLazarus:

[operations/software/spicerack@master] icinga: Write to Icinga command file instead of calling icinga-downtime

https://gerrit.wikimedia.org/r/705500

Change 708384 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] icinga: Add --services flag to icinga-status

https://gerrit.wikimedia.org/r/708384

Change 708384 merged by RLazarus:

[operations/puppet@production] icinga: Add --services flag to icinga-status

https://gerrit.wikimedia.org/r/708384

Change 710121 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] icinga: Tweak --services API

https://gerrit.wikimedia.org/r/710121

Change 710121 merged by RLazarus:

[operations/puppet@production] icinga: Tweak --services API

https://gerrit.wikimedia.org/r/710121

Change 718935 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/software/spicerack@master] icinga: Add downtime_services and remove_service_downtimes

https://gerrit.wikimedia.org/r/718935

Change 718936 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/cookbooks@master] sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries

https://gerrit.wikimedia.org/r/718936

Change 718935 merged by jenkins-bot:

[operations/software/spicerack@master] icinga: Add downtime_services and remove_service_downtimes

https://gerrit.wikimedia.org/r/718935

Is this still relevant? Does it need to be finished for T327920: March 2023 Datacenter Switchover, or can it be closed?

Yes, we really need this to be completed. I don't know what state it is in at the moment.

I've rebased the CR that @RLazarus had already created and implemented one of @Volans' recommendations on it.
Would love to have your eyeballs on it, Data-Persistence.

Change 718936 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries

https://gerrit.wikimedia.org/r/718936

Downtime part dry-runs correctly. I will reopen if I hit issues in the live-test.