Create a cookbook for depooling one or all services from one kubernetes cluster
Open, Medium, Public

Description

We want to have a cookbook we can use to completely depool all services running on a kubernetes cluster from the discovery system, or to depool just one.

A cli I could imagine would be:

# Depool a cluster from traffic for one or more services
# This will need to check that the service is available in some other cluster before acting
$ cookbook sre.k8s.depool-service <k8s-cluster> [SVC1,SVC2,...]
# Pool a cluster for one or more services
$ cookbook sre.k8s.pool-service <k8s-cluster> [SVC1,...]
# Check status of traffic to services
$ cookbook sre.k8s.check-service-route [SVC1,...]
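
The "available in some other cluster" pre-check mentioned in the comments above could look roughly like the sketch below. It is only an illustration: the `pooled` mapping (service -> cluster -> pooled flag) and the function name `blocking_services` are hypothetical, and how the state is actually read (confctl, etcd, monitoring) is left open.

```python
from typing import Dict, Iterable, List


def blocking_services(
    pooled: Dict[str, Dict[str, bool]], cluster: str, services: Iterable[str]
) -> List[str]:
    """Return the services that would have no pooled cluster left.

    `pooled` is a hypothetical snapshot: service -> cluster -> pooled flag.
    A service blocks the depool if no cluster other than `cluster` is pooled.
    """
    blocked = []
    for svc in services:
        state = pooled.get(svc, {})
        others = [dc for dc, is_pooled in state.items() if dc != cluster and is_pooled]
        if not others:
            blocked.append(svc)
    return blocked
```

A depool cookbook would refuse to act (or ask for confirmation) whenever this list is non-empty.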

Event Timeline

fgiunchedi triaged this task as Medium priority.Aug 19 2020, 9:05 AM

The pool/depool logic is quite the same as in sre.discovery.pool/depool I guess.

For checking the service availability on some other cluster, we could probably query the LVS monitoring commands from Icinga. Or can we just rely on the fact that the other DC/cluster is pooled in confctl?

What do you expect sre.k8s.check-service-route output to be? Something like:

codfw: true
eqiad: false

What we should actually do is something similar to what switchdc does when it changes discovery records.

Specifically,

  • check-route should read an expected state from etcd, then query the authdns servers and report discrepancies.
  • pool/depool should change the value in etcd, verify it's changed on the authdns servers, then (optionally) purge the resolver caches.
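
A minimal sketch of those two steps, assuming the etcd write, the authdns queries and the cache purge are injected as plain callables. All names here (`find_discrepancies`, `change_route`, the shape of the state dicts) are hypothetical and are not the spicerack or switchdc API.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, bool]  # datacenter -> pooled flag (assumed representation)


def find_discrepancies(expected: State, observed: Dict[str, State]) -> List[str]:
    """Compare the expected state with what each authdns server reports."""
    problems = []
    for server, state in sorted(observed.items()):
        for dc, want in sorted(expected.items()):
            got = state.get(dc)
            if got != want:
                problems.append(
                    f"{server}: {dc} expected pooled={want}, got pooled={got}"
                )
    return problems


def change_route(
    set_expected: Callable[[State], None],
    read_observed: Callable[[], Dict[str, State]],
    expected: State,
    purge_caches: Optional[Callable[[], None]] = None,
) -> None:
    """Write the new state, verify it on the authdns servers, then purge."""
    set_expected(expected)  # e.g. the etcd/confctl update
    problems = find_discrepancies(expected, read_observed())
    if problems:
        raise RuntimeError("authdns out of sync: " + "; ".join(problems))
    if purge_caches is not None:  # the purge step is optional
        purge_caches()
```

Keeping the comparison in its own function means the check action can reuse it without writing anything.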

btw, given some of the functionality is common, I would say we should revisit the CLI I proposed above:

cookbook sre.k8s.service-route ACTION [ACTION_ARGS]

Available actions:

  • pool all|eqiad|codfw SVC1,...: pool the listed services in all datacenters, or in a single one
  • check [SVC1,...]: return the current routes for the listed services
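
For illustration, the action-based CLI could be parsed with argparse subcommands along these lines. This is a sketch, not the merged cookbook: the `depool` action is an assumption (only pool and check are listed above), and services are taken as one comma-separated argument, following the `SVC1,...` notation.

```python
import argparse
from typing import List


def split_services(value: str) -> List[str]:
    """Turn 'svc1,svc2' into ['svc1', 'svc2'], dropping empty entries."""
    return [name for name in value.split(",") if name]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sre.k8s.service-route")
    actions = parser.add_subparsers(dest="action", required=True)
    # 'depool' is assumed to mirror 'pool'; it is not in the list above.
    for name in ("pool", "depool"):
        sub = actions.add_parser(name)
        sub.add_argument("datacenter", choices=("all", "eqiad", "codfw"))
        sub.add_argument("services", type=split_services)
    check = actions.add_parser("check")
    # The service list is optional for 'check'; empty means no filter.
    check.add_argument("services", nargs="?", type=split_services, default=[])
    return parser
```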

example of output for check:

$ cookbook sre.k8s.service-route check mobileapps mathoid
Expected routes:
mathoid: all
mobileapps: eqiad

In case of problems (i.e. what is in etcd is not in all the authdns servers), an error should be reported.
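
The check output shown above could be rendered from the expected state roughly as follows. This is a sketch under assumptions: the input shape (service -> datacenter -> pooled flag) and the "none" label for a fully depooled service are hypothetical and not part of the proposal.

```python
from typing import Dict


def route_label(pooled: Dict[str, bool]) -> str:
    """Summarise one service: 'all', the pooled datacenter(s), or 'none'."""
    active = sorted(dc for dc, is_pooled in pooled.items() if is_pooled)
    if active and len(active) == len(pooled):
        return "all"
    return ",".join(active) if active else "none"


def render_routes(expected: Dict[str, Dict[str, bool]]) -> str:
    """Build the 'Expected routes:' report, one line per service."""
    lines = ["Expected routes:"]
    for svc in sorted(expected):
        lines.append(f"{svc}: {route_label(expected[svc])}")
    return "\n".join(lines)
```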

Claiming as I would like us to have this for helmfile migration.

Change 621721 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/cookbooks@master] sre.k8s.service-route: New cookbook to pool/depool k8s servies

https://gerrit.wikimedia.org/r/621721

As this is not k8s-specific, I decided to refactor sre.discovery instead of creating a new cookbook.
We can create an alias (via symlink) if we want to have that as sre.k8s.service-route as well.

The EDNS client subnet aware DNS functions are currently part of the cookbook but should be refactored into spicerack dns and dnsdisc after the DC switchover.

Change 621721 merged by jenkins-bot:
[operations/cookbooks@master] sre.discovery: Refactor

https://gerrit.wikimedia.org/r/621721

Merged the current version as is, but the cookbook should be updated in the short term with:

[03.09.20 14:05] <_joe_> in the meantime I cooked up a general solution to the dns name -> discovery record correspondence for another cookbook
[03.09.20 14:05] <_joe_> we can just adopt it here in a followup patch
[03.09.20 14:05] <jayme> sweet. Is it already in spicerack then?
[03.09.20 14:09] <_joe_> yes
[03.09.20 14:09] <_joe_> sre.switchdc.services

The cookbook does not seem to work (tried during the kubernetes codfw reinit):

  • It did not allow multiple services as arguments
  • It did not actually depool services from a DC