switchdc should automatically downtime "Read only" checks on DB masters being switched
Closed, Resolved (Public)

Assigned To

Authored By

	Legoktm
	Jun 29 2021, 7:44 PM

Description

During the DC switchover, we downtime the "read only" alerts on the DB masters so they don't page. This step is currently manual, but given that @Kormat used a cookbook to set the downtimes, it seems like something that should be automated instead.

Details

Subject	Repo	Branch	Lines +/-
sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries	operations/cookbooks	master	+49 -1
icinga: Add downtime_services and remove_service_downtimes	operations/software/spicerack	master	+315 -7
icinga: Tweak --services API	operations/puppet	production	+6 -6
icinga: Add --services flag to icinga-status	operations/puppet	production	+27 -5
icinga: Write to Icinga command file instead of calling icinga-downtime	operations/software/spicerack	master	+132 -40
icinga: Performance improvements to icinga-status	operations/puppet	production	+64 -52


Related Objects

Status	Assigned	Task
Resolved	Clement_Goubert	T327920 March 2023 Datacenter Switchover
Resolved	Clement_Goubert	T328770 March 2023 Datacenter Switchover Blockers
Resolved	Legoktm	T287539 September 2021 Datacenter switchover (codfw -> eqiad)
Resolved	RLazarus	T285803 switchdc should automatically downtime "Read only" checks on DB masters being switched

Event Timeline

Legoktm created this task.Jun 29 2021, 7:44 PM

Restricted Application added a subscriber: Aklapper.Jun 29 2021, 7:44 PM

Legoktm mentioned this in T281515: June 2021 Datacenter switchover.Jun 30 2021, 1:52 AM

Volans subscribed.Jun 30 2021, 7:47 AM

To clarify, there's no way to downtime only a specific Icinga check across multiple machines (that I know of). I used sre.hosts.downtime against A:db-role-master for 1h, and then kept an eye on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^\(db\|es\|pc\)[12]&style=detail&servicestatustypes=29 so that any unexpected alerts would be noticed.

Marostegui triaged this task as Medium priority.Jul 5 2021, 7:37 AM

Marostegui moved this task from Triage to Refine on the DBA board.

The downtime cookbook uses icinga-downtime under the hood to write to Icinga's external commands file. If we wanted, we could add an optional -s flag to that script, to have it write a SCHEDULE_SVC_DOWNTIME command downtiming a specific service. Then the downtime cookbook can fire it off for each host in a cumin query.

That would give us a tidy way to downtime only the read_only service, without having to fully downtime all the DB sources during a tricky maneuver. The only downside is that it's an icinga-specific solution, so we may have to solve it again at some point during the alertmanager transition -- not sure of the timeline. I'm inclined to go ahead, though -- thoughts?
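As a rough illustration of the idea above, here is a minimal sketch of building SCHEDULE_SVC_DOWNTIME lines for one service across several hosts. The field order follows the standard Nagios/Icinga 1.x external command format. The function names, host names, and the service description "read_only" are assumptions for the example and are not taken from icinga-downtime or the cookbook. Actually writing the lines to the command file is left out.

```python
import time


def schedule_svc_downtime(host, service, duration_s, author, comment, now=None):
    """Build one SCHEDULE_SVC_DOWNTIME external command line (hypothetical helper)."""
    start = int(time.time()) if now is None else int(now)
    end = start + duration_s
    # Field order: host; service; start; end; fixed (1); trigger_id (0);
    # duration; author; comment. Fields must not contain ';' themselves.
    fields = [host, service, start, end, 1, 0, duration_s, author, comment]
    return "[{}] SCHEDULE_SVC_DOWNTIME;{}".format(
        start, ";".join(str(field) for field in fields)
    )


def downtime_service_on_hosts(hosts, service, duration_s, author, comment, now=None):
    """Return one command line per host, e.g. for the hosts matched by a query."""
    return [
        schedule_svc_downtime(host, service, duration_s, author, comment, now)
        for host in hosts
    ]
```

Because the command file gives no feedback, a caller would still need some separate way to confirm that each host and service exists before relying on the downtime.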

In T285803#7203653, @RLazarus wrote:

The downtime cookbook uses icinga-downtime under the hood to write to Icinga's external commands file. If we wanted, we could add an optional -s flag to that script, to have it write a SCHEDULE_SVC_DOWNTIME command downtiming a specific service. Then the downtime cookbook can fire it off for each host in a cumin query.

That would give us a tidy way to downtime only the read_only service, without having to fully downtime all the DB sources during a tricky maneuver. The only downside is that it's an icinga-specific solution, so we may have to solve it again at some point during the alertmanager transition -- not sure of the timeline. I'm inclined to go ahead, though -- thoughts?

That sounds really useful! That's functionality that would be great to have for other cases, such as schema changes etc.

If I may add my 2 cents: icinga-downtime is a very simple, old bash script. Its only added value is checking that the given host is defined in Icinga before trying to downtime it, so that it fails if the host isn't. That check matters because the Icinga "API" of writing to the command file gives no feedback; it is a fire-and-forget approach. The script predates spicerack and the cookbooks.

Spicerack's Icinga module uses icinga-downtime only to fully downtime a host; for all other actions it writes directly to the command file.

We now have a better interface to Icinga's current status, icinga-status, hence my suggestion would be:

- add to icinga_status.py a flag to return all services for the given hosts (not only the failed ones, as it does right now)
- add to IcingaHosts a couple of methods like downtime_services(services: List[str]) and the related context manager services_downtimed. They would:
  - get the current status of the hosts
  - possibly support regexes to match the services and/or allow some dynamic placeholders, like {host} for the hostname and {site} for the host's site (we don't have the latter at the moment, but it could be added)
  - fail if the service name is not present in the host's services, replicating what icinga-downtime currently does for hosts
  - write directly to the Icinga command file, like we already do for the other actions
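The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: HostsSketch and ServiceNotFound are hypothetical names, the status lookup is replaced by a plain dict of host to service names, and commands are passed to a caller-supplied function as abstract tuples rather than real Icinga command strings. It is not spicerack's actual IcingaHosts implementation.

```python
import re
from contextlib import contextmanager


class ServiceNotFound(Exception):
    """Raised when a host has no service matching the requested pattern."""


class HostsSketch:
    """Hypothetical stand-in for a class like IcingaHosts."""

    def __init__(self, services_by_host, send):
        self._services_by_host = services_by_host  # host -> list of service names
        self._send = send  # callable that receives one command at a time

    def _resolve(self, pattern):
        """Match the regex on every host; fail if any host has no match."""
        targets = []
        for host, services in self._services_by_host.items():
            # Substitute the {host} placeholder before compiling the regex.
            regex = re.compile(pattern.replace("{host}", re.escape(host)))
            matched = [name for name in services if regex.search(name)]
            if not matched:
                raise ServiceNotFound(f"no service matching {pattern!r} on {host}")
            targets.extend((host, name) for name in matched)
        return targets

    def downtime_services(self, pattern):
        # Resolve everything first so nothing is sent if any host fails the check.
        for host, name in self._resolve(pattern):
            self._send(("downtime", host, name))

    def remove_service_downtimes(self, pattern):
        for host, name in self._resolve(pattern):
            self._send(("remove", host, name))

    @contextmanager
    def services_downtimed(self, pattern):
        """Downtime matching services for the duration of the with-block."""
        self.downtime_services(pattern)
        try:
            yield
        finally:
            self.remove_service_downtimes(pattern)
```

Resolving all hosts before sending anything is a design choice in this sketch: since the command file is fire-and-forget, checking up front is the only point where a missing service can be reported.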

Adding @jbond for additional insights as he was heavily involved in the icinga-status development.

LSobanski subscribed.Jul 12 2021, 11:16 AM

Oh, that does sound better! I didn't realize icinga-status was out there, thanks for the pointer. Your plan sounds good to me, especially if @jbond is happy with it too. I'll start by refactoring the icinga module to write to the command file directly (for host downtimes, no service functionality yet) and we can go from there.

Claiming this for now, but more than happy to hand it off, if anyone wants it.

In T285803#7206669, @RLazarus wrote:

I'll start by refactoring the icinga module to write to the command file directly (for host downtimes, no service functionality yet)

I'm not sure that bit is actually needed: the current implementation, however inelegant, is much quicker than parsing the Icinga status file and gives quick feedback that the host we're downtiming exists.

The icinga module is already writing directly to the command file for all other actions, so all the machinery is already there for you ;)

Change 704422 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] icinga: Performance improvements to icinga-status

https://gerrit.wikimedia.org/r/704422

gerritbot added a project: Patch-For-Review.Jul 13 2021, 8:59 PM

Change 704422 merged by RLazarus:

[operations/puppet@production] icinga: Performance improvements to icinga-status

https://gerrit.wikimedia.org/r/704422

Maintenance_bot removed a project: Patch-For-Review.Jul 15 2021, 12:10 AM

Change 705500 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/software/spicerack@master] icinga: Write to Icinga command file instead of calling icinga-downtime

https://gerrit.wikimedia.org/r/705500

gerritbot added a project: Patch-For-Review.Jul 19 2021, 8:40 PM

Change 705500 merged by RLazarus:

[operations/software/spicerack@master] icinga: Write to Icinga command file instead of calling icinga-downtime

https://gerrit.wikimedia.org/r/705500

Maintenance_bot removed a project: Patch-For-Review.Jul 26 2021, 3:11 PM

Change 708384 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] icinga: Add --services flag to icinga-status

https://gerrit.wikimedia.org/r/708384

gerritbot added a project: Patch-For-Review.Jul 28 2021, 12:30 AM

Change 708384 merged by RLazarus:

[operations/puppet@production] icinga: Add --services flag to icinga-status

https://gerrit.wikimedia.org/r/708384

Maintenance_bot removed a project: Patch-For-Review.Jul 28 2021, 4:11 PM

Legoktm added a parent task: T287539: September 2021 Datacenter switchover (codfw -> eqiad).Aug 4 2021, 9:24 PM

Change 710121 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] icinga: Tweak --services API

https://gerrit.wikimedia.org/r/710121

gerritbot added a project: Patch-For-Review.Aug 4 2021, 10:55 PM

Change 710121 merged by RLazarus:

[operations/puppet@production] icinga: Tweak --services API

https://gerrit.wikimedia.org/r/710121

Maintenance_bot removed a project: Patch-For-Review.Aug 13 2021, 8:10 PM

LSobanski moved this task from Refine to Blocked external/Not db team on the DBA board.Aug 30 2021, 2:27 PM

Change 718935 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/software/spicerack@master] icinga: Add downtime_services and remove_service_downtimes

https://gerrit.wikimedia.org/r/718935

gerritbot added a project: Patch-For-Review.Sep 6 2021, 1:26 AM

Change 718936 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/cookbooks@master] sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries

https://gerrit.wikimedia.org/r/718936

Change 718935 merged by jenkins-bot:

[operations/software/spicerack@master] icinga: Add downtime_services and remove_service_downtimes

https://gerrit.wikimedia.org/r/718935

Is this still relevant? Does it need to be finished for T327920: March 2023 Datacenter Switchover, or can it be closed?

Yes, we really need this to be completed. I don't know what state it is in at the moment.

Clement_Goubert mentioned this in T328770: March 2023 Datacenter Switchover Blockers.Feb 3 2023, 1:29 PM

I've rebased the CR that @RLazarus had already created and implemented one of @Volans's recommendations on it.
Would love to have your eyeballs on it, Data-Persistence.

Aklapper added a parent task: T328770: March 2023 Datacenter Switchover Blockers.Feb 6 2023, 1:14 AM

Clement_Goubert moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.Feb 6 2023, 10:11 AM

Change 718936 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries

https://gerrit.wikimedia.org/r/718936

Maintenance_bot removed a project: Patch-For-Review.Feb 6 2023, 12:30 PM

Downtime part dry-runs correctly. I will reopen if I hit issues in the live-test.

Clement_Goubert closed this task as Resolved.Feb 6 2023, 12:38 PM
