
Write a cookbook to set a k8s cluster in maintenance mode
Open, Medium, Public

Description

This will come in handy for future Kubernetes updates where we might just go the "reinit" route again.

This cookbook should:

  • downtime all services (in the cluster, for its DC)
  • route all service traffic to "the other DC"
  • downtime all masters and nodes
  • take care of the "other" downtimes if possible (BGP, Prometheus, etc.)
  • depool nodes from pybal

This is a follow-up of T277191 and T277741

Event Timeline

JMeybohm triaged this task as Medium priority · Mar 17 2021, 4:58 PM
JMeybohm created this task.

Question about the scope of the cookbook - do we want to aggregate functionalities already present in other cookbooks into a single one, or is it ok to just implement what's left out?

I can see from the description multiple things already done by cookbooks:

  • downtime all services (in the cluster, for its DC)
  • route all service traffic to "the other DC"
  • downtime all masters and nodes

The remaining would be:

  • take care of the "other" downtimes if possible (BGP, Prometheus, etc.)
  • depool nodes from pybal (I guess this means the kubemaster and kubesvc endpoints?) - is it really needed?

Anything else that people think is worth adding?

> Question about the scope of the cookbook - do we want to aggregate functionalities already present in other cookbooks into a single one, or is it ok to just implement what's left out?
>
> I can see from the description multiple things already done by cookbooks:
>
>   • downtime all services (in the cluster, for its DC)
>   • route all service traffic to "the other DC"
>   • downtime all masters and nodes

I think the goal is to only have to call this one cookbook, but it can (and should) call the others wherever it makes sense.
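
A minimal sketch of that shape, assuming Spicerack's CookbookBase/CookbookRunnerBase classes and its run_cookbook() helper; the cookbook names and arguments passed below are illustrative, not the final ones:

```python
import argparse

from spicerack.cookbook import CookbookBase, CookbookRunnerBase


class K8sMaintenance(CookbookBase):
    """Set a Kubernetes cluster in maintenance mode."""

    def argument_parser(self):
        parser = argparse.ArgumentParser(description=self.__doc__)
        parser.add_argument("cluster", help="Name of the k8s cluster.")
        return parser

    def get_runner(self, args):
        return K8sMaintenanceRunner(args, self.spicerack)


class K8sMaintenanceRunner(CookbookRunnerBase):
    def __init__(self, args, spicerack):
        self.args = args
        self.spicerack = spicerack

    def run(self):
        # Delegate to the existing cookbooks instead of re-implementing them.
        ret = self.spicerack.run_cookbook(
            "sre.k8s.pool-depool-cluster", ["depool", self.args.cluster]
        )
        if ret != 0:
            raise RuntimeError("Failed to depool the cluster")
```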

> The remaining would be:
>
>   • take care of the "other" downtimes if possible (BGP, Prometheus, etc.)
>   • depool nodes from pybal (I guess this means the kubemaster and kubesvc endpoints?) - is it really needed?

I don't recall exactly, but as I put it here I think it was firing the last time we did this. :-)

> Anything else that people think is worth adding?

We now have a couple of Prometheus alert rules for k8s stuff and it would of course be nice if those were not firing as well (e.g. for Calico, cert-manager, API latency). We can probably issue a silence for everything in/on that cluster.
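
As a sketch of that blanket-silence idea, here is what issuing one via the standard Alertmanager v2 API could look like; the Alertmanager URL and the cluster label used to scope the silence are assumptions, not the production values:

```python
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER = "http://alertmanager.example.org:9093"  # hypothetical host


def silence_cluster(cluster: str, hours: int = 4) -> str:
    """Silence every alert carrying the given cluster label; return the silence ID."""
    now = datetime.now(timezone.utc)
    silence = {
        "matchers": [{"name": "cluster", "value": cluster, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "sre.k8s.maintenance",
        "comment": "k8s cluster maintenance",
    }
    resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=silence, timeout=10)
    resp.raise_for_status()
    return resp.json()["silenceID"]
```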

Change 869182 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.k8s.maintenance: add missing admin reason

https://gerrit.wikimedia.org/r/869182

Change 869182 merged by Elukey:

[operations/cookbooks@master] sre.k8s.maintenance: add missing admin reason

https://gerrit.wikimedia.org/r/869182

Change 869236 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.k8s.pool-depool-cluster: update SAL/log description and add comments

https://gerrit.wikimedia.org/r/869236

Change 869269 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.discovery.service-route: refactor to base/runner classes

https://gerrit.wikimedia.org/r/869269

I had a chat with Janis, and this is what I am going to do:

  1. Refactor sre.k8s.pool-depool-cluster and sre.discovery.service-route where possible to add better logging etc. (especially in SAL). We should also try to figure out what to do when depooling active/passive services (they are tricky and there are some corner cases that we don't want to get into). One idea is to depool/pool active/active services freely, and to emit a warning for the operator (with commands ready to use) when encountering active/passive services (see the sketch after this list).
  2. Focus the maintenance cookbook on silencing as many alarms as possible, rather than on other actions (depooling nodes from kubesvc seems unnecessary if we silence the PyBal alert, etc.).
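
As an illustration of the active/active vs. active/passive idea in point 1, a hypothetical sketch; the Service object, confctl_set(), and the confctl command printed for the operator are stand-ins for the real service-catalog data and tooling:

```python
import logging

logger = logging.getLogger(__name__)


def depool_service(service, datacenter: str) -> None:
    """Depool active/active services; only warn for active/passive ones."""
    if service.active_active:
        # Safe to depool: traffic fails over to the other DC automatically.
        confctl_set(service.name, datacenter, pooled=False)  # hypothetical helper
    else:
        # Active/passive services have tricky corner cases (e.g. depooling
        # the only active DC), so leave the decision to the operator and
        # print a ready-to-use command instead.
        logger.warning(
            "Skipping active/passive service %s; to depool it manually run:\n"
            "  confctl --object-type discovery select "
            "'dnsdisc=%s,name=%s' set/pooled=false",
            service.name,
            service.name,
            datacenter,
        )
```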

Change 869771 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.k8s.pool-depool-cluster: handle active/passive services

https://gerrit.wikimedia.org/r/869771

Change 869236 merged by jenkins-bot:

[operations/cookbooks@master] sre.k8s.pool-depool-cluster: update SAL/log description and add comments

https://gerrit.wikimedia.org/r/869236

Change 869269 merged by Elukey:

[operations/cookbooks@master] sre.discovery.service-route: refactor to base/runner classes

https://gerrit.wikimedia.org/r/869269

Change 870926 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.discovery.service-route: fix bugs

https://gerrit.wikimedia.org/r/870926

Change 870926 merged by Elukey:

[operations/cookbooks@master] sre.discovery.service-route: fix bugs

https://gerrit.wikimedia.org/r/870926

Current status:

  • sre.discovery.service-route (used by sre.k8s.pool-depool-cluster) has been moved to the class architecture and tested (I tried the check, depool, and pool actions for one DC of inference.discovery.wmnet). Note: if somebody tries to depool the only active DC of a discovery record, the cookbook will raise an exception for safety (see the sketch after this list).
  • https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/869771 is still pending for sre.k8s.pool-depool-cluster, to have a safer handling of active/passive datacenters.
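
The safety check could look roughly like this; get_pooled_datacenters() is a hypothetical accessor for the discovery/conftool state, not the cookbook's actual code:

```python
def check_safe_to_depool(record: str, datacenter: str) -> None:
    """Refuse to depool the only pooled DC of a discovery record."""
    pooled = get_pooled_datacenters(record)  # hypothetical, e.g. {"eqiad", "codfw"}
    if pooled == {datacenter}:
        raise RuntimeError(
            f"{datacenter} is the only pooled DC for {record}, refusing to depool"
        )
```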

Next steps: merge https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/869771 and then work on silencing alerts in the maintenance cookbook.

Change 869771 merged by Elukey:

[operations/cookbooks@master] sre.k8s.pool-depool-cluster: handle active/passive services

https://gerrit.wikimedia.org/r/869771

In the meantime we have created two cookbooks:

  • sre.k8s.upgrade-cluster.py
  • sre.k8s.wipe-cluster.py

Following up on silences, especially the ones paging in production (ProbeDown).

  • ProbeDown: the most effective way to silence is to get a list of service IPs and then issue silences for alertname=ProbeDown and address=<ip> (see the sketch after this list).
  • JobUnavailable: a bunch of swagger_check alerts fired, though I'm not sure we can do much for those at the moment.
  • PyBal backends health check: this is trickier, also because it is an Icinga alert. I'm hoping we can replace this with a per-backend alert so we're able to issue individual silences (T320627).
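
A sketch of that per-IP approach; post_silence() is a hypothetical helper wrapping the Alertmanager silence API (as in the earlier sketch), and the label names mirror the comment above:

```python
def silence_probedown(service_ips, hours=4):
    """Issue one silence per service IP for the ProbeDown alert."""
    silence_ids = []
    for ip in service_ips:
        matchers = [
            {"name": "alertname", "value": "ProbeDown", "isRegex": False},
            {"name": "address", "value": str(ip), "isRegex": False},
        ]
        silence_ids.append(post_silence(matchers, hours))  # hypothetical helper
    return silence_ids
```
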
akosiaris subscribed.

Removing SRE; this has already been triaged to a more specific SRE subteam (two of them, in fact).

Volans subscribed.

I've spoken with the people involved, and the original request has been merged into the upgrade-cluster cookbook. What's left is to improve the silencing in the cookbook to avoid unnecessary alerts/pages during upgrades.
The work on this will most likely be resumed at the next round of k8s cluster upgrades.