
Write a cookbook to set a k8s cluster in maintenance mode
Open, Medium, Public

Description

This will come in handy for future Kubernetes updates where we might just go the "reinit" route again.

This cookbook should:

  • downtime all services (in the cluster, for its DC)
  • route all service traffic to "the other DC"
  • downtime all masters and nodes
  • take care of the "other" downtimes if possible (BGP, Prometheus, etc.)
  • depool nodes from pybal

This is a follow-up of T277191 and T277741

Event Timeline

JMeybohm triaged this task as Medium priority · Mar 17 2021, 4:58 PM
JMeybohm created this task.

Question about the scope of the cookbook - do we want to aggregate functionalities already present in other cookbooks into a single one, or is it ok to just implement what's left out?

I can see from the description multiple things already done by cookbooks:

  • downtime all services (in the cluster, for its DC)
  • route all service traffic to "the other DC"
  • downtime all masters and nodes

The remaining would be:

  • take care of the "other" downtimes if possible (BGP, Prometheus, etc.)
  • depool nodes from pybal (I guess this means the kubemaster and kubesvc endpoints?) - is it really needed?

Anything else that people think is worth adding?

> Question about the scope of the cookbook - do we want to aggregate functionalities already present in other cookbooks into a single one, or is it ok to just implement what's left out?
>
> I can see from the description multiple things already done by cookbooks:
>
>   • downtime all services (in the cluster, for its DC)
>   • route all service traffic to "the other DC"
>   • downtime all masters and nodes

I think the goal is to only have to call this one cookbook, but it can (and should) call the others wherever it makes sense.
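
A minimal sketch of that shape, assuming Spicerack's CookbookBase/CookbookRunnerBase classes and its run_cookbook() helper; the cookbook names and arguments passed below are illustrative, not the final ones:

```python
import argparse

from spicerack.cookbook import CookbookBase, CookbookRunnerBase


class K8sMaintenance(CookbookBase):
    """Set a Kubernetes cluster in maintenance mode."""

    def argument_parser(self):
        parser = argparse.ArgumentParser(description=self.__doc__)
        parser.add_argument("cluster", help="Name of the k8s cluster.")
        return parser

    def get_runner(self, args):
        return K8sMaintenanceRunner(args, self.spicerack)


class K8sMaintenanceRunner(CookbookRunnerBase):
    def __init__(self, args, spicerack):
        self.args = args
        self.spicerack = spicerack

    def run(self):
        # Delegate to the existing cookbooks instead of re-implementing them.
        ret = self.spicerack.run_cookbook(
            "sre.k8s.pool-depool-cluster", ["depool", self.args.cluster]
        )
        if ret != 0:
            raise RuntimeError("Failed to depool the cluster")
```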

> The remaining would be:
>
>   • take care of the "other" downtimes if possible (BGP, Prometheus, etc.)
>   • depool nodes from pybal (I guess this means the kubemaster and kubesvc endpoints?) - is it really needed?

I don't recall exactly, but as I put it here I think it was firing the last time we did this. :-)

> Anything else that people think is worth adding?

We now have a couple of Prometheus alert rules for k8s stuff and it would of course be nice if those were not firing as well (e.g. for Calico, cert-manager, API latency). We can probably issue a silence for everything in/on that cluster.
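
As a sketch of that blanket-silence idea, here is what issuing one via the standard Alertmanager v2 API could look like; the Alertmanager URL and the cluster label used to scope the silence are assumptions, not the production values:

```python
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER = "http://alertmanager.example.org:9093"  # hypothetical host


def silence_cluster(cluster: str, hours: int = 4) -> str:
    """Silence every alert carrying the given cluster label; return the silence ID."""
    now = datetime.now(timezone.utc)
    silence = {
        "matchers": [{"name": "cluster", "value": cluster, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "sre.k8s.maintenance",
        "comment": "k8s cluster maintenance",
    }
    resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=silence, timeout=10)
    resp.raise_for_status()
    return resp.json()["silenceID"]
```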

Change 869182 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.k8s.maintenance: add missing admin reason

https://gerrit.wikimedia.org/r/869182

Change 869182 merged by Elukey:

[operations/cookbooks@master] sre.k8s.maintenance: add missing admin reason

https://gerrit.wikimedia.org/r/869182

Change 869236 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.k8s.pool-depool-cluster: update SAL/log description and add comments

https://gerrit.wikimedia.org/r/869236

Change 869269 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.discovery.service-route: refactor to base/runner classes

https://gerrit.wikimedia.org/r/869269

I had a chat with Janis, and this is what I am going to do:

  1. Refactor sre.k8s.pool-depool-cluster and sre.discovery.service-route where possible to add better logging etc. (especially in SAL). We should also try to figure out what to do when depooling active/passive services (they are tricky and there are some corner cases that we don't want to get into). One idea is to depool/pool active/active services freely, and to emit a warning for the operator (with commands ready to use) when encountering active/passive services (see the sketch after this list).
  2. Focus the maintenance cookbook on silencing as many alarms as possible, rather than on other actions (depooling nodes from kubesvc seems unnecessary if we silence the PyBal alert, etc.).
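
As an illustration of the active/active vs. active/passive idea in point 1, a hypothetical sketch; the Service object, confctl_set(), and the confctl command printed for the operator are stand-ins for the real service-catalog data and tooling:

```python
import logging

logger = logging.getLogger(__name__)


def depool_service(service, datacenter: str) -> None:
    """Depool active/active services; only warn for active/passive ones."""
    if service.active_active:
        # Safe to depool: traffic fails over to the other DC automatically.
        confctl_set(service.name, datacenter, pooled=False)  # hypothetical helper
    else:
        # Active/passive services have tricky corner cases (e.g. depooling
        # the only active DC), so leave the decision to the operator and
        # print a ready-to-use command instead.
        logger.warning(
            "Skipping active/passive service %s; to depool it manually run:\n"
            "  confctl --object-type discovery select "
            "'dnsdisc=%s,name=%s' set/pooled=false",
            service.name,
            service.name,
            datacenter,
        )
```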

Change 869771 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.k8s.pool-depool-cluster: handle active/passive services

https://gerrit.wikimedia.org/r/869771

Change 869236 merged by jenkins-bot:

[operations/cookbooks@master] sre.k8s.pool-depool-cluster: update SAL/log description and add comments

https://gerrit.wikimedia.org/r/869236

Change 869269 merged by Elukey:

[operations/cookbooks@master] sre.discovery.service-route: refactor to base/runner classes

https://gerrit.wikimedia.org/r/869269

Change 870926 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.discovery.service-route: fix bugs

https://gerrit.wikimedia.org/r/870926

Change 870926 merged by Elukey:

[operations/cookbooks@master] sre.discovery.service-route: fix bugs

https://gerrit.wikimedia.org/r/870926

Current status:

  • sre.discovery.service-route (used by sre.k8s.pool-depool-cluster) has been moved to the class architecture and tested (I tried the check, depool, and pool actions for one DC of inference.discovery.wmnet). Note: if somebody tries to depool the only active DC of a discovery record, the cookbook will raise an exception for safety (see the sketch after this list).
  • https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/869771 is still pending for sre.k8s.pool-depool-cluster, to have a safer handling of active/passive datacenters.
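
The safety check could look roughly like this; get_pooled_datacenters() is a hypothetical accessor for the discovery/conftool state, not the cookbook's actual code:

```python
def check_safe_to_depool(record: str, datacenter: str) -> None:
    """Refuse to depool the only pooled DC of a discovery record."""
    pooled = get_pooled_datacenters(record)  # hypothetical, e.g. {"eqiad", "codfw"}
    if pooled == {datacenter}:
        raise RuntimeError(
            f"{datacenter} is the only pooled DC for {record}, refusing to depool"
        )
```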

Next steps: merge https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/869771 and then work on silencing alerts in the maintenance cookbook.

Change 869771 merged by Elukey:

[operations/cookbooks@master] sre.k8s.pool-depool-cluster: handle active/passive services

https://gerrit.wikimedia.org/r/869771

In the meantime we have created two cookbooks:

  • sre.k8s.upgrade-cluster.py
  • sre.k8s.wipe-cluster.py

Following up on silences, especially the ones paging in production (ProbeDown).

  • ProbeDown: the most effective way to silence is to get a list of service IPs and then issue silences for alertname=ProbeDown and address=<ip> (see the sketch after this list).
  • JobUnavailable: a bunch of swagger_check alerts fired, though I'm not sure we can do much for those at the moment.
  • PyBal backends health check: this is trickier, also because it is an Icinga alert. I'm hoping we can replace this with a per-backend alert so we're able to issue individual silences (T320627).
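
A sketch of that per-IP approach; post_silence() is a hypothetical helper wrapping the Alertmanager silence API (as in the earlier sketch), and the label names mirror the comment above:

```python
def silence_probedown(service_ips, hours=4):
    """Issue one silence per service IP for the ProbeDown alert."""
    silence_ids = []
    for ip in service_ips:
        matchers = [
            {"name": "alertname", "value": "ProbeDown", "isRegex": False},
            {"name": "address", "value": str(ip), "isRegex": False},
        ]
        silence_ids.append(post_silence(matchers, hours))  # hypothetical helper
    return silence_ids
```
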
akosiaris subscribed.

Removing SRE; this has already been triaged to a more specific SRE subteam (two of them, in fact).

Volans subscribed.

I've spoken with the people involved, and the original request has been merged into the upgrade-cluster cookbook. What's left is to improve the silencing in the cookbook to avoid unnecessary alerts/pages during upgrades.
The work on this will most likely be resumed at the next round of k8s cluster upgrades.