
Get aux-k8s cluster row-redundant and with more workers
Closed, Resolved · Public

Description

Currently all cluster components (etcd, control planes, workers) are in the same Ganeti group, which is very suboptimal:

$ sudo gnt-instance list -o name,pnode.group | grep aux
aux-k8s-ctrl1001.eqiad.wmnet        A
aux-k8s-ctrl1002.eqiad.wmnet        A
aux-k8s-etcd1001.eqiad.wmnet        A
aux-k8s-etcd1002.eqiad.wmnet        A
aux-k8s-etcd1003.eqiad.wmnet        A
aux-k8s-worker1001.eqiad.wmnet      A
aux-k8s-worker1002.eqiad.wmnet      A

This obviously hurts the availability of the cluster, and it can break (and already has broken!) deployments with podAntiAffinity rules that require row-diverse pod placement (like calico-typha).
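
For context: with every node in the same row, a podAntiAffinity rule keyed on a row/zone topology label can only ever place one replica, so multi-replica deployments end up stuck in Pending. The rule in question can be inspected with something like this (namespace and deployment name are a guess, adjust as needed):

$ kubectl -n kube-system get deployment calico-typha -o yaml | grep -A 10 podAntiAffinity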

Event Timeline

fgiunchedi renamed this task from "Get aux-k8s cluster row-redundant" to "Get aux-k8s cluster row-redundant and with more workers". Aug 15 2023, 9:37 AM
fgiunchedi updated the task description.

Change 949002 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Remove podAntiAffinity for calico-typha on aux

https://gerrit.wikimedia.org/r/949002

Change 949002 merged by Filippo Giunchedi:

[operations/deployment-charts@master] Remove podAntiAffinity for calico-typha on aux

https://gerrit.wikimedia.org/r/949002

To be able to deploy calico changes we dropped the anti-affinity rule from the typha deployment (T333302); this should be undone once the aux cluster is row-redundant.

joanna_borun triaged this task as Low priority.

Came across this task while working on some K8s tasks. I know this is discouraged, but could we try https://wikitech.wikimedia.org/wiki/Ganeti#Renumber_(aka_change_network)_a_VM with one VM (maybe one of the control plane nodes or one of the etcd ones) and see how it goes?
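
For the Ganeti side alone, moving an instance between node groups is a single command; the catch is that each row is a separate subnet, so the VM also needs renumbering per the linked procedure. A sketch of just the group move (instance picked for illustration):

$ sudo gnt-instance change-group --to=D aux-k8s-etcd1001.eqiad.wmnet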

Proposal for a first step that doesn't involve renumbering:

  1. Create aux-k8s-ctrl1003.eqiad.wmnet (same config as the other two ctrl nodes).
  2. Create aux-k8s-worker1003.eqiad.wmnet (same config as the other two worker nodes).
  3. Add both of them to the cluster, and make sure everything works as expected.
  4. Drop aux-k8s-ctrl1001.eqiad.wmnet and aux-k8s-worker1001.eqiad.wmnet from the k8s config, and delete any trace of those VMs.
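
For steps 1 and 2, a rough sketch of the VM creation via the makevm cookbook (flags and argument order quoted from memory, double-check against the cookbook's --help before running):

# sizes below are placeholders, match the existing nodes' specs
$ sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 40 eqiad_D aux-k8s-ctrl1003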

This should let us restore the typha affinity rules and give the cluster better redundancy. The etcd cluster may stay as it is, to be re-created when we upgrade to a newer version of k8s. Thoughts?

Change #1076679 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add aux-k8s-{ctrl,worker}1003 to AUX K8s

https://gerrit.wikimedia.org/r/1076679

Change #1076681 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add aux-k8s-ctrl1003 to admin_ng's config for AUX

https://gerrit.wikimedia.org/r/1076681

Change #1076681 merged by Elukey:

[operations/deployment-charts@master] admin_ng: drop unused aux-k8s control plane IPs

https://gerrit.wikimedia.org/r/1076681

Change #1076679 merged by Elukey:

[operations/puppet@production] Add aux-k8s-{ctrl,worker}1003 to AUX K8s

https://gerrit.wikimedia.org/r/1076679

New status:

elukey@ganeti1028:~$ sudo gnt-instance list -o name,pnode.group | grep aux
aux-k8s-ctrl1001.eqiad.wmnet        A
aux-k8s-ctrl1002.eqiad.wmnet        A
aux-k8s-ctrl1003.eqiad.wmnet        D
aux-k8s-etcd1001.eqiad.wmnet        A
aux-k8s-etcd1002.eqiad.wmnet        A
aux-k8s-etcd1003.eqiad.wmnet        A
aux-k8s-worker1001.eqiad.wmnet      A
aux-k8s-worker1002.eqiad.wmnet      A
aux-k8s-worker1003.eqiad.wmnet      B

@CDanis I'd remove aux-k8s-ctrl1001.eqiad.wmnet and aux-k8s-worker1001.eqiad.wmnet if you're OK with it, and then possibly try to figure out how to add two more etcd VMs (to expand the cluster to 5 and then drop aux-k8s-etcd1001.eqiad.wmnet and aux-k8s-etcd1002.eqiad.wmnet, for example). Does that make sense? Do we also want to add more workers?

depool host aux-k8s-worker1001.eqiad.wmnet by elukey@cumin1002 with reason: None

Cookbook cookbooks.sre.k8s.pool-depool-node started by elukey@cumin1002 depool for host aux-k8s-worker1001.eqiad.wmnet completed:

  • aux-k8s-worker1001.eqiad.wmnet (PASS)
    • Host aux-k8s-worker1001.eqiad.wmnet depooled from aux-k8s-eqiad

depool host aux-k8s-ctrl1001.eqiad.wmnet by elukey@cumin1002 with reason: None

Cookbook cookbooks.sre.k8s.pool-depool-node started by elukey@cumin1002 depool for host aux-k8s-ctrl1001.eqiad.wmnet completed:

  • aux-k8s-ctrl1001.eqiad.wmnet (PASS)
    • Host aux-k8s-ctrl1001.eqiad.wmnet depooled from aux-k8s-eqiad

Drained the 1001 nodes in row A; current status:

root@deploy2002:~# kubectl get nodes
NAME                             STATUS                     ROLES           AGE    VERSION
aux-k8s-ctrl1001.eqiad.wmnet     Ready,SchedulingDisabled   control-plane   592d   v1.23.14
aux-k8s-ctrl1002.eqiad.wmnet     Ready                      control-plane   592d   v1.23.14
aux-k8s-ctrl1003.eqiad.wmnet     Ready                      control-plane   19h    v1.23.14
aux-k8s-worker1001.eqiad.wmnet   Ready,SchedulingDisabled   <none>          592d   v1.23.14
aux-k8s-worker1002.eqiad.wmnet   Ready                      <none>          592d   v1.23.14
aux-k8s-worker1003.eqiad.wmnet   Ready                      <none>          19h    v1.23.14

If nothing pops up, I'll just decom those VMs tomorrow.
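
(For reference, the depool cookbook above should be doing the equivalent of a standard kubectl drain, roughly:)

$ kubectl drain aux-k8s-ctrl1001.eqiad.wmnet --ignore-daemonsets
$ kubectl drain aux-k8s-worker1001.eqiad.wmnet --ignore-daemonsets --delete-emptydir-data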

For etcd, I think we can just add one node at a time via https://wikitech.wikimedia.org/wiki/Etcd#Adding_a_new_member_to_the_cluster, make sure the cluster health is good, and then drop aux-k8s-etcd100[1,2] via the delete-member API.
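
Roughly, for each new node (TLS/auth flags omitted, default etcd ports assumed):

$ etcdctl member add aux-k8s-etcd1004 --peer-urls=https://aux-k8s-etcd1004.eqiad.wmnet:2380
# once puppet has configured and started etcd on the new node:
$ etcdctl endpoint health --cluster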

Change #1079534 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add aux-k8s-etcd1004 in service

https://gerrit.wikimedia.org/r/1079534

Change #1079535 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add aux-k8s-etcd1005 in service

https://gerrit.wikimedia.org/r/1079535

Change #1079539 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Add aux-k8s-etcd1004 to the aux-k8s SRV records

https://gerrit.wikimedia.org/r/1079539

Change #1079540 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Add aux-k8s-etcd1005 to the Aux k8s SRV records

https://gerrit.wikimedia.org/r/1079540
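
Once the DNS changes are merged, the records can be sanity-checked with dig; a sketch assuming etcd's standard discovery record names, with <discovery-domain> standing in for whatever domain the patches touch:

$ dig +short SRV _etcd-server-ssl._tcp.<discovery-domain>
$ dig +short SRV _etcd-client-ssl._tcp.<discovery-domain>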

Procedure to expand etcd from 3 to 5: merge the puppet patches above to bring aux-k8s-etcd1004 and aux-k8s-etcd1005 into service, merge the DNS changes to add them to the SRV records, and join each node to the cluster one at a time as described above.

At this point we should have a five-node cluster, fully working and healthy. If all goes well, we'll be able to drop 1001 and 1002 via https://wikitech.wikimedia.org/wiki/Etcd#Removing_a_member_from_the_cluster
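
The removal side should be symmetric (member IDs come from member list):

$ etcdctl member list
$ etcdctl member remove <member-id>
$ etcdctl endpoint health --cluster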

I've tried to add some documentation on how to add/remove control planes at https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_control-planes; please extend it if you feel anything is missing there ❤️

Change #1079534 merged by Elukey:

[operations/puppet@production] Add aux-k8s-etcd1004 in service

https://gerrit.wikimedia.org/r/1079534

Change #1079539 merged by Elukey:

[operations/dns@master] Add aux-k8s-etcd1004 to the aux-k8s SRV records

https://gerrit.wikimedia.org/r/1079539

Change #1079540 merged by Elukey:

[operations/dns@master] Add aux-k8s-etcd1005 to the Aux k8s SRV records

https://gerrit.wikimedia.org/r/1079540

Change #1079535 merged by Elukey:

[operations/puppet@production] Add aux-k8s-etcd1005 in service

https://gerrit.wikimedia.org/r/1079535

Mentioned in SAL (#wikimedia-operations) [2024-10-14T12:09:48Z] <elukey> increase etcd k8s aux cluster from 3 -> 5 - T344230

Change #1079999 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Remove aux-k8x-{ctrl,worker}1001 from production

https://gerrit.wikimedia.org/r/1079999

cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: aux-k8s-ctrl1001.eqiad.wmnet

  • aux-k8s-ctrl1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: aux-k8s-worker1001.eqiad.wmnet

  • aux-k8s-worker1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Change #1080011 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] kubernetes: change the AUX etcd urls nodes

https://gerrit.wikimedia.org/r/1080011

Change #1079999 merged by Elukey:

[operations/puppet@production] Remove aux-k8x-{ctrl,worker}1001 from production

https://gerrit.wikimedia.org/r/1079999

Change #1080011 merged by Elukey:

[operations/puppet@production] kubernetes: change the AUX etcd urls nodes

https://gerrit.wikimedia.org/r/1080011
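
For completeness, the net effect of that puppet change is to repoint the apiserver at the new etcd nodes, i.e. an endpoint list along these lines (illustrative; the real list is whatever the patch sets):

--etcd-servers=https://aux-k8s-etcd1003.eqiad.wmnet:2379,https://aux-k8s-etcd1004.eqiad.wmnet:2379,https://aux-k8s-etcd1005.eqiad.wmnet:2379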

Next steps:

  • Reduce the etcd cluster from 5 to 3
  • Undo the following workaround:

To be able to deploy calico changes we dropped the anti-affinity rule from the typha deployment (T333302); this should be undone once the aux cluster is row-redundant.

Change #1080016 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Remove aux-k8s-etcd100[1,2] from the AUX client SRV records

https://gerrit.wikimedia.org/r/1080016

Change #1080017 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Remove aux-k8s-etcd1001 from the AUX cluster's SRV records

https://gerrit.wikimedia.org/r/1080017

Change #1080018 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Remove aux-k8s-etcd1002 from the AUX cluster's SRV records

https://gerrit.wikimedia.org/r/1080018

Change #1080016 merged by Elukey:

[operations/dns@master] Remove aux-k8s-etcd100[1,2] from the AUX client SRV records

https://gerrit.wikimedia.org/r/1080016

Change #1080017 merged by Elukey:

[operations/dns@master] Remove aux-k8s-etcd1001 from the AUX cluster's SRV records

https://gerrit.wikimedia.org/r/1080017

Change #1080018 merged by Elukey:

[operations/dns@master] Remove aux-k8s-etcd1002 from the AUX cluster's SRV records

https://gerrit.wikimedia.org/r/1080018

Change #1080022 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Remove aux-k8s-etcd100[1,2] from production

https://gerrit.wikimedia.org/r/1080022

cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: aux-k8s-etcd1001.eqiad.wmnet

  • aux-k8s-etcd1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: aux-k8s-etcd1002.eqiad.wmnet

  • aux-k8s-etcd1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Way better now!

elukey@ganeti1028:~$ sudo gnt-instance list -o name,pnode.group | grep aux
aux-k8s-ctrl1002.eqiad.wmnet        A
aux-k8s-ctrl1003.eqiad.wmnet        D
aux-k8s-etcd1003.eqiad.wmnet        A
aux-k8s-etcd1004.eqiad.wmnet        D
aux-k8s-etcd1005.eqiad.wmnet        C
aux-k8s-worker1002.eqiad.wmnet      A
aux-k8s-worker1003.eqiad.wmnet      B

Change #1080022 merged by Elukey:

[operations/puppet@production] Remove aux-k8s-etcd100[1,2] from production

https://gerrit.wikimedia.org/r/1080022