
spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters
Open, Medium, Public

Description

spicerack.dnsdisc.Discovery.pool() does not check if the service is actually poolable in the specified datacenter. Active/passive services are pooled even if they are already pooled in the other DC.
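A hedged sketch of the asymmetry from a cookbook's point of view (the service name and datacenter names are placeholders, and the availability of a Spicerack instance is assumed):

```python
# Minimal sketch, assuming a Spicerack instance "spicerack" and a
# hypothetical active/passive service "svc"; DC names are examples.
discovery = spicerack.discovery('svc')

discovery.pool('eqiad')  # pooled in eqiad
discovery.pool('codfw')  # also succeeds: no check that "svc" is a/p
                         # and already pooled in the other DC
# depool(), by contrast, runs sanity checks and fails if it can't depool.
```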

Event Timeline

JMeybohm triaged this task as Medium priority. Aug 18 2022, 1:47 PM
JMeybohm created this task.
Restricted Application added a subscriber: Aklapper.

Wasn't that the required behaviour to allow failing over an active/passive service without downtime?
At least that's what I recall, and it also seems to be what the cookbook does:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/77d2482175f73d2e821e93e6fa796b72dd764f26/cookbooks/sre/switchdc/mediawiki/04-switch-mediawiki.py#25

Ah, okay. That makes sense. I think I assumed pool() would check because depool() does so as well (and fails if it can't depool). For the case you mentioned I would have expected something like a force=True parameter to override the sanity checks. But that's probably not something that can easily be changed now.

For context: this is me accidentally pooling an active/passive service with sre.discovery.service-route, because I assumed this (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/discovery/service-route.py#105) would just DTRT (do the right thing).

I had a chat with @JMeybohm on IRC and we went over some of the options we have here.

On one side we need to keep the current use case of failing over an a/p service without downtime. The current gdnsd/confd/conftool setup more or less requires making the service active in both DCs (a no-op) and then depooling one DC (at which point the failover is performed).
This is possible because the current confd setup blocks the second DC from actually being pooled in DNS, with a failure that is also exposed as a generated file that is then checked by Icinga. So the current workflow is the following (sketched in code right after the list):

  • pool the inactive DC
  • depool the active DC
  • delete the failed confd file on all the A:dns-auth hosts
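In Spicerack terms, that workflow would look roughly like this. A hedged sketch: the record and DC names are placeholders, the path of the failed confd file is hypothetical, and spicerack.discovery()/remote() are used as the entry points:

```python
# Hedged sketch of the no-downtime failover of an a/p service from
# eqiad to codfw; names and the file path are placeholders.
discovery = spicerack.discovery('svc')

# 1. Pool the inactive DC: confd refuses the both-pooled state, so this
#    is a no-op for gdnsd, but it leaves a failure file behind.
discovery.pool('codfw')

# 2. Depool the active DC: only codfw stays pooled, the failover happens.
discovery.depool('eqiad')

# 3. Delete the failed confd file on all the A:dns-auth hosts.
dns_auth = spicerack.remote().query('A:dns-auth')
dns_auth.run_sync('rm -fv /path/to/failed/confd/file')  # placeholder path
```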

That is of course sub-optimal, as it still allows an a/p service to be pooled in both DCs in confctl (etcd).

I think that we have three options here (in order of complexity):

  1. Leave things as they are
  2. Modify Spicerack to prevent pooling an a/p service unless some force=True parameter is passed explicitly (a sketch follows this list). At the same time we could refactor some of the API to make it consistent and more strict. For example, I can also see the use case of wanting to force-depool a service from all DCs to make it fail hard and go to failoid (to prevent data corruption, or during a severe security incident). Basically, revisiting all the pool/depool behaviours.
  3. Modify the whole workflow to make it more strict and explicit, preventing inconsistent data from living in confctl, and modifying Spicerack accordingly. This probably means moving the current stopgap from confd to somewhere else.
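For option 2, the guard could look roughly like the following subclass sketch. Discovery and DiscoveryError come from spicerack.dnsdisc; active_active and _pooled_datacenters() are hypothetical helpers standing in for however the a/a-vs-a/p flag and the currently pooled DCs would actually be looked up:

```python
from spicerack.dnsdisc import Discovery, DiscoveryError


class CheckedDiscovery(Discovery):
    """Sketch: refuse to pool an a/p service in a second DC without force."""

    def pool(self, name: str, *, force: bool = False) -> None:
        if not force and not self.active_active:  # hypothetical a/a flag
            # _pooled_datacenters() is a hypothetical lookup of the DCs
            # where the records are currently pooled.
            others = [dc for dc in self._pooled_datacenters() if dc != name]
            if others:
                raise DiscoveryError(
                    f'Active/passive service already pooled in {others}; '
                    'pass force=True to pool anyway (e.g. for a failover).'
                )
        super().pool(name)  # fall through to the current behaviour
```

The same force=True escape hatch would keep the switchdc use case working, while a plain pool() call (e.g. from sre.discovery.service-route) would fail loudly instead of silently pooling both DCs.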

Thoughts?

@JMeybohm Is this something still needed?

Not ultimately required, as nothing depends on this. IIRC it was just an observation of mine that felt odd/dangerous at the time of writing a cookbook (or maybe I broke something). I'd opt for 2) (as that's what I would have expected from the API, I guess), but I can also live with 1). Sorry for not making this clear.