Page MenuHomePhabricator

Move EXCLUDED_SERVICES attribute from sre.discovery.datacentre to service catalog
Open, LowPublic

Description

We use service.yaml in Puppet as a top-level definition of all of our major services, including their discovery status and other attributes like monitoring and paging. In the sre.discovery.datacentre cookbook we have a hand-maintained list of services that we should not switch over for various reasons.

This list should probably be moved out of the cookbook to somewhere more clearly maintained, and also in a manner that service owners can clearly understand and control themselves.

To do this we will need to

  • Add support for the attribute to service.yaml
  • Copy the existing excluded services over, along with the comments
  • Add support for the attribute to spicerack.service.ServiceDiscoveryRecord
  • Replace EXCLUDED_SERVICES in datacenter.py

Event Timeline

Change #1217189 had a related patch set uploaded (by Blake; author: Blake):

[operations/puppet@production] service: add exclude_from_switchover field.

https://gerrit.wikimedia.org/r/1217189

Change #1217189 merged by Blake:

[operations/puppet@production] service: add exclude_from_switchover field.

https://gerrit.wikimedia.org/r/1217189

The change which adds the attribute and copies the extant services over (minus ones which no longer exist) has been submitted and merged. Now I'll start the change(s) for Spicerack.

Change #1217520 had a related patch set uploaded (by Blake; author: Blake):

[operations/software/spicerack@master] service: add exclude_from_switchover field.

https://gerrit.wikimedia.org/r/1217520

Change #1217520 merged by Blake:

[operations/software/spicerack@master] service: add exclude_from_switchover field.

https://gerrit.wikimedia.org/r/1217520

I've reverted the change to the service catalog, and will wait to re-submit until the change is in Spicerack. It would be good to enforce this requirement somehow, so we're not relying on people reading a comment or determining via Phabricator archaeology that this ordering is critical.

Luca deployed a new version of Spicerack, and I redeployed the change to the service catalog - things look good. The remaining work is to switch to using the new field, rather than the old python dict, and then to remove the dict.

Hm, it looks like we're also going to need to update https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/services/__init__.py, because it currently relies on the EXCLUDED_SERVICES module constant.

I'm going to spend some time reading to see how it might make sense to use the service catalog here. (Or, if someone has strong opinions about how this ought to be done, please let me know!)

Change #1224041 had a related patch set uploaded (by Blake; author: Blake):

[operations/software/spicerack@master] service: add excluded_services helper function

https://gerrit.wikimedia.org/r/1224041

After some discussion on https://gerrit.wikimedia.org/r/1224041 and a chat earlier on Thursday with @Blake, I wanted to follow up here about the state of the sre.switchdc.services cookbook and deprecation of the EXCLUDED_SERVICES constant.

Background

My understanding is that sre.switchdc.services was historically part of the switchover procedure, but has now been superseded by sre.discovery.datacenter. The latter differs in a number of ways, but is in general a more "batteries included" tool for automating switchover Day 1 (e.g., manages both A/A and A/P services, while the former only handles A/A).

At present, sre.switchdc.services is still referenced in Switch_Datacenter#Special_cases, in order to (1) provide historical context and (2) document its ability to target or exclude individual (A/A) services (which sre.discovery.datacenter lacks).

Deprecation

If we're going to get rid of EXCLUDED_SERVICES, then we need to figure out how to deal with sre.switchdc.services - i.e., either delete the cookbook outright if we no longer think it's useful to maintain or migrate it to using exclude_from_switchover from the catalog in some way.

In a world where it no longer exists, confctl still gives us the ability to target individual services, albeit with more ceremony if we want approximate feature parity with the cookbook - e.g., one could manually execute the shorten TTL -> sleep -> switch -> relax TTL sequence provided by the cookbook, if indeed that's needed.

In a world where it continues to exist, we also need to address the fact that it's subtly incorrect when applying exclusions: it conflates the service-catalog service names in EXCLUDED_SERVICES with DNS discovery service names it reads from /etc/spicerack/cookbooks/sre.switchdc.services.yaml (these are not guaranteed to match, nor even be 1:1). This is perhaps an argument for deletion - i.e., unused code that drifts into an incorrect state is a liability.

My vote is probably to delete, unless anyone happens to recall recently using sre.switchdc.services specifically for the additional functionality it adds around TTL changes.

Based on the discussion in IRC about Scott's analysis above, it seems like we're good to remove this cookbook.

I'll put together a patch to remove it, then update the switchover wiki to no longer reference the removed steps.

Change #1224041 abandoned by Blake:

[operations/software/spicerack@master] service: add excluded_services helper function

Reason:

Will deprecate this cookbook instead.

https://gerrit.wikimedia.org/r/1224041

Change #1224945 had a related patch set uploaded (by Blake; author: Blake):

[operations/cookbooks@master] sre.discovery.datacenter: use service registry for exclusions

https://gerrit.wikimedia.org/r/1224945

Change #1224945 merged by Blake:

[operations/cookbooks@master] sre.discovery.datacenter: use service registry for exclusions

https://gerrit.wikimedia.org/r/1224945

Change #1225500 had a related patch set uploaded (by Blake; author: Blake):

[operations/cookbooks@master] switchdc: Delete services cookbook.

https://gerrit.wikimedia.org/r/1225500

Change #1225500 merged by jenkins-bot:

[operations/cookbooks@master] switchdc: Delete services cookbook.

https://gerrit.wikimedia.org/r/1225500