
setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster
Closed, ResolvedPublic

Description

The pending migration to etcd3 blocks Kubernetes upgrades, as from 1.13 onwards the etcd2 protocol is no longer supported. Upgrading etcd in place is not really an option: it is a practically impossible process, as it requires upgrading from 2.2 to 2.3, then to 3.0 and possibly 3.1, before finally upgrading to 3.2 (which is what buster and stretch ship).

The alternative is to set up 3 VMs, bring up a new, empty etcd cluster on them, and then reinitialize all Kubernetes clusters against it.
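
For reference, bootstrapping an empty three-member etcd v3 cluster mostly comes down to a handful of ETCD_* settings, e.g. in /etc/default/etcd. This is only a minimal sketch with hypothetical hostnames; in production the configuration is rendered by puppet:

# /etc/default/etcd on the first member (hostnames are hypothetical placeholders)
ETCD_NAME="etcd-a"
ETCD_INITIAL_CLUSTER="etcd-a=https://etcd-a.example.wmnet:2380,etcd-b=https://etcd-b.example.wmnet:2380,etcd-c=https://etcd-c.example.wmnet:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="k8s-etcd"
ETCD_LISTEN_PEER_URLS="https://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="https://0.0.0.0:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-a.example.wmnet:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://etcd-a.example.wmnet:2379"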

Details

Repo                          Branch       Lines +/-
operations/dns                master       +0 -12
operations/puppet             production   +1 -4
operations/puppet             production   +1 -36
operations/puppet             production   +4 -1
operations/deployment-charts  master       +1 -1
operations/puppet             production   +9 -7
operations/puppet             production   +1 -32
operations/puppet             production   +2 -22
operations/puppet             production   +3 -3
operations/puppet             production   +9 -6
operations/deployment-charts  master       +2 -2
operations/deployment-charts  master       +5 -2
operations/deployment-charts  master       +4 -0
operations/deployment-charts  master       +15 -0
operations/deployment-charts  master       +20 -0
operations/puppet             production   +7 -2
operations/deployment-charts  master       +2 -2
operations/puppet             production   +9 -5
operations/puppet             production   +4 -1
operations/deployment-charts  master       +2 -2
operations/puppet             production   +32 -33

Event Timeline


Change 556713 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] etcd: Align better etcdv2 and etcdv3 profiles

https://gerrit.wikimedia.org/r/556713

Change 556713 merged by Alexandros Kosiaris:
[operations/puppet@production] etcd: Align better etcdv2 and etcdv3 profiles

https://gerrit.wikimedia.org/r/556713

All 3 clusters are up and running and healthy. I managed to mistag some changes for another task, here they are:

Reopening; this is more than just getting the VMs and etcd up and running. I'll track the migration in this task as well.

Change 558353 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] k8s: Migrate staging to the new etcd cluster

https://gerrit.wikimedia.org/r/558353

Change 558354 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] k8s: Migrate codfw to the new etcd cluster

https://gerrit.wikimedia.org/r/558354

Change 558355 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] k8s: Migrate eqiad to the new etcd cluster

https://gerrit.wikimedia.org/r/558355

Change 558365 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] calico: Parameterize calico datastore type

https://gerrit.wikimedia.org/r/558365

Process for the migration

  1. Upload puppet changes
  2. Populate IPPool and BGP nodes in the new calico etcd backend
  3. Schedule downtime for
    • apiserver
    • calico-node
    • services
  4. Depool services from discovery/edge caches
  5. Delete all helmfile managed namespaces
  6. Disable puppet on master and nodes
  7. Stop apiserver and calico node on nodes
  8. Merge changes
  9. Start API server
  10. Start calico-node
  11. helmfile init

The calico step above is done as follows: place a modified calicoctl.cfg on the Kubernetes nodes that references the new etcd servers (a sketch of such a config follows the commands below), then run:

  • calicoctl config set asNumber 64603 --config=calicoctl.cfg
  • calicoctl config set nodeToNodeMesh off --config=calicoctl.cfg
  • calicoctl get -o yaml bgppeer | calicoctl create -f - --config=calicoctl.cfg
  • calicoctl get -o yaml ippool | calicoctl create -f - --config=calicoctl.cfg
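
A sketch of what such a modified calicoctl.cfg could look like, assuming the legacy etcdv2 datastore type that this calico version still speaks; the endpoint names and the CA path are hypothetical placeholders:

# calicoctl.cfg pointing at the new etcd servers (hostnames are placeholders)
apiVersion: v1
kind: calicoApiConfig
spec:
  datastoreType: "etcdv2"
  etcdEndpoints: "https://new-etcd-a.example.wmnet:2379,https://new-etcd-b.example.wmnet:2379,https://new-etcd-c.example.wmnet:2379"
  etcdCACertFile: "/etc/ssl/certs/ca-certificates.crt"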

Mentioned in SAL (#wikimedia-operations) [2019-12-17T09:15:11Z] <akosiaris> delete all namespaces in kubernetes staging cluster for initialization with etcd3 backing datastore. T239835

Change 558471 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] Switch staging calico controller to the new etcd cluster

https://gerrit.wikimedia.org/r/558471

Change 558472 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] Switch codfw calico controller to the new etcd cluster

https://gerrit.wikimedia.org/r/558472

Change 558473 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] Switch eqiad calico controller to the new etcd cluster

https://gerrit.wikimedia.org/r/558473

Change 558471 merged by jenkins-bot:
[operations/deployment-charts@master] Switch staging calico controller to the new etcd cluster

https://gerrit.wikimedia.org/r/558471

Change 558365 merged by Alexandros Kosiaris:
[operations/puppet@production] calico: Parameterize calico datastore type

https://gerrit.wikimedia.org/r/558365

Change 558353 merged by Alexandros Kosiaris:
[operations/puppet@production] k8s: Migrate staging to the new etcd cluster

https://gerrit.wikimedia.org/r/558353

The process worked up to a point, but then I hit a snag: I had to edit the system:node clusterrolebinding and add

subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes

in order to allow nodes to register with the API. This was a known issue; it is documented in https://kubernetes.io/docs/reference/access-authn-authz/node/#migration-considerations.
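
For reference, the resulting binding looks roughly like the following (a sketch based on the upstream defaults; the manifest later added to deployment-charts may differ in metadata):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes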

We are currently unable to use the Node authorization mode, as we don't have a decent way to populate per-node tokens.

One more snag.

The "master" tiller, the one that exists in kube-system could not be started because of

Error creating: pods "tiller-deploy-7488d85d85-" is forbidden: no providers available to validate pod request

This is a chicken-and-egg problem. The solution is to restart the apiserver without the PodSecurityPolicy admission controller, apply the policies via helmfile, and then restart kube-apiserver normally. The only sane way out of this is documentation.
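
A rough sketch of that workaround on the apiserver host; the file layout and the admission flag edit are assumptions (the real arguments are puppet-managed and the flag name depends on the kube-apiserver version):

# with puppet disabled, temporarily drop PodSecurityPolicy from the admission plugins
sudo sed -i 's/,PodSecurityPolicy//' /etc/default/kube-apiserver
sudo systemctl restart kube-apiserver
# ... apply the PodSecurityPolicy manifests via helmfile ...
# re-enable puppet; the next run restores the flag and restarts kube-apiserver
sudo puppet agent --enable && sudo puppet agent --test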

Change 558546 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] admin: Set dnsPolicy: Default for the calico controller

https://gerrit.wikimedia.org/r/558546

Change 558546 merged by jenkins-bot:
[operations/deployment-charts@master] admin: Set dnsPolicy: Default for the calico controller

https://gerrit.wikimedia.org/r/558546

Change 558547 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] calico::cni: Pass datastore_type as well

https://gerrit.wikimedia.org/r/558547

Change 558689 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] RBAC: Add the tiller cluster role

https://gerrit.wikimedia.org/r/558689

Change 558547 merged by Alexandros Kosiaris:
[operations/puppet@production] calico::cni: Pass datastore_type as well

https://gerrit.wikimedia.org/r/558547

Change 558689 merged by jenkins-bot:
[operations/deployment-charts@master] RBAC: Add the tiller cluster role

https://gerrit.wikimedia.org/r/558689

Change 558705 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] RBAC: Add system:nodes group to system:node

https://gerrit.wikimedia.org/r/558705

Mentioned in SAL (#wikimedia-operations) [2019-12-18T07:53:47Z] <akosiaris> run helmfile sync for all staging deployments T239835

Change 558705 merged by jenkins-bot:
[operations/deployment-charts@master] RBAC: Add system:nodes group to system:node

https://gerrit.wikimedia.org/r/558705

Change 558972 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] rbac: Add default metadata to system:node RBAC

https://gerrit.wikimedia.org/r/558972

Change 558972 merged by jenkins-bot:
[operations/deployment-charts@master] rbac: Add default metadata to system:node RBAC

https://gerrit.wikimedia.org/r/558972

Change 558973 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] admin: don't rely on coredns for kube-system tiller

https://gerrit.wikimedia.org/r/558973

Change 558973 merged by jenkins-bot:
[operations/deployment-charts@master] admin: don't rely on coredns for kube-system tiller

https://gerrit.wikimedia.org/r/558973

Mentioned in SAL (#wikimedia-operations) [2019-12-18T10:00:07Z] <akosiaris> populate new calico stores for codfw T239835

Icinga downtime for 1 day, 0:00:00 set by akosiaris@cumin1001 on 6 host(s) and their services with reason: alex reinit kubernetes cluster

kubernetes[2001-2006].codfw.wmnet

Icinga downtime for 1 day, 0:00:00 set by akosiaris@cumin1001 on 6 host(s) and their services with reason: alex reinit kubernetes cluster

kubetcd[2001-2006].codfw.wmnet

Icinga downtime for 1 day, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: alex reinit kubernetes cluster

acrab.codfw.wmnet

Icinga downtime for 1 day, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: alex reinit kubernetes cluster

acrux.codfw.wmnet

Change 559002 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] cache::text: Depool k8s services

https://gerrit.wikimedia.org/r/559002

Change 558472 merged by jenkins-bot:
[operations/deployment-charts@master] Switch codfw calico controller to the new etcd cluster

https://gerrit.wikimedia.org/r/558472

Change 559002 merged by Alexandros Kosiaris:
[operations/puppet@production] cache::text: Depool k8s services

https://gerrit.wikimedia.org/r/559002

Change 558354 merged by Alexandros Kosiaris:
[operations/puppet@production] k8s: Migrate codfw to the new etcd cluster

https://gerrit.wikimedia.org/r/558354

Mentioned in SAL (#wikimedia-operations) [2019-12-18T15:07:01Z] <akosiaris> repool all codfw k8s services. T239835

staging and codfw have been successfully migrated to etcd3. calico is still on the v2 protocol (albeit on the same set of hosts); we need a calico upgrade before that can be migrated as well. eqiad will be done after the holidays to avoid any weird repercussions.

Change 559113 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove roles from old k8s etcd hosts

https://gerrit.wikimedia.org/r/559113

Change 559113 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove roles from old k8s etcd hosts

https://gerrit.wikimedia.org/r/559113

Mentioned in SAL (#wikimedia-operations) [2019-12-18T16:21:41Z] <akosiaris> remove kubestagetcd100{1,2,3} from the fleet T239835

Change 565253 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove etcd100[456] from site.pp

https://gerrit.wikimedia.org/r/565253

cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: etcd[1004-1006].eqiad.wmnet

  • etcd1004.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • No management interface found (likely a VM)
    • Wiped bootloaders
    • Shutdown issued. Verify it manually, verification not yet supported
    • Set Netbox status on VM not yet supported: manual intervention required
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • etcd1005.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • No management interface found (likely a VM)
    • Wiped bootloaders
    • Shutdown issued. Verify it manually, verification not yet supported
    • Set Netbox status on VM not yet supported: manual intervention required
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • etcd1006.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • No management interface found (likely a VM)
    • Wiped bootloaders
    • Shutdown issued. Verify it manually, verification not yet supported
    • Set Netbox status on VM not yet supported: manual intervention required
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Change 565253 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove etcd100[456]

https://gerrit.wikimedia.org/r/565253

Mentioned in SAL (#wikimedia-operations) [2020-01-16T11:27:33Z] <akosiaris> delete etcd100{4,5,6} from ganeti01.svc.eqiad.wmnet. T239835

Mentioned in SAL (#wikimedia-operations) [2020-01-16T11:27:41Z] <akosiaris> delete etcd100{4,5,6} from netbox. T239835

cookbooks.sre.hosts.decommission executed by volans@cumin2001 for hosts: kubetcd2001.codfw.wmnet

  • kubetcd2001.codfw.wmnet (FAIL)
    • Host steps raised exception: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by volans@cumin2001 for hosts: kubetcd2001.codfw.wmnet

  • kubetcd2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed

cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: kubetcd[2002-2003].codfw.wmnet

  • kubetcd2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
  • kubetcd2003.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed

@akosiaris and I put together a more precise step-by-step plan for the outstanding eqiad migration.

Current plan is to do the migration on 2020-09-08 at 08:00 UTC

  1. Upload puppet changes
    1. Done already:
      1. https://gerrit.wikimedia.org/r/c/operations/puppet/+/558355
      2. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/558473
  2. Populate IPPool and BGP nodes in the new calico etcd backend
    1. Verify if this is already done
    2. calicoctl config set asNumber 64603 --config=calicoctl.cfg
    3. calicoctl config set nodeToNodeMesh off --config=calicoctl.cfg
    4. calicoctl get -o yaml bgppeer | calicoctl create -f - --config=calicoctl.cfg
    5. calicoctl get -o yaml ippool | calicoctl create -f - --config=calicoctl.cfg
  3. Schedule downtime for
    1. apiserver
    2. calico-node
    3. services
  4. Depool services from discovery/edge caches
    1. Should already be done due to the switchover; double-check
  5. Delete all helmfile managed namespaces (to be sure we see errors/missing things early)
  6. Disable puppet on master and k8s nodes
  7. Stop apiserver and calico node on k8s nodes
  8. Merge puppet changes
  9. Enable and run puppet on the k8s nodes
  10. Enable puppet on 1 apiserver and run it
  11. Disable puppet on apiserver again
  12. Edit /etc/default/kube-apiserver to disable PodSecurityPolicy controller {snag2}
  13. Start API server (running without PodSecurityPolicy controller now)
  14. Run deployment-charts/helmfile.d/admin/initialize_cluster.sh for eqiad
  15. Edit system:node clusterrolebinding {snag1}
    1. We can probably add that to initialize_cluster.sh
  16. Enable puppet again and run it. This will restart API server with PodSecurityPolicy controller
  17. Run helmfile.d/admin/eqiad/cluster-helmfile.sh
  18. Deploy all services via a for loop over helmfile sync commands (see the sketch after this list)
  19. Clean up / decommission the old etcd cluster
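
A sketch of the loop in step 18, assuming the usual helmfile.d/services/<service> layout of deployment-charts checked out on the deployment host (path is an assumption) and the eqiad environment:

# from a checkout of operations/deployment-charts
cd /srv/deployment-charts/helmfile.d/services
for svc in */; do
  ( cd "$svc" && helmfile -e eqiad sync )
done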

Mentioned in SAL (#wikimedia-operations) [2020-09-08T09:20:40Z] <jayme> disabling puppted on argon.eqiad.wmnet,chlorine.eqiad.wmnet,kubernetes[1001-1016].eqiad.wmnet - Reinitialize eqiad k8s cluster with new etcd - T239835

Icinga downtime for 4:00:00 set by jayme@cumin1001 on 18 host(s) and their services with reason: Reinitialize eqiad k8s cluster with new etcd

argon.eqiad.wmnet,chlorine.eqiad.wmnet,kubernetes[1001-1016].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2020-09-08T09:43:32Z] <akosiaris> stopped calico-node and kube-apiserver on k8s nodes/masters T239835

Change 558355 merged by JMeybohm:
[operations/puppet@production] k8s: Migrate eqiad to the new etcd cluster

https://gerrit.wikimedia.org/r/558355

Change 558473 merged by jenkins-bot:
[operations/deployment-charts@master] Switch eqiad calico controller to the new etcd cluster

https://gerrit.wikimedia.org/r/558473

Mentioned in SAL (#wikimedia-operations) [2020-09-08T09:52:56Z] <akosiaris> enable puppet, run it on all k8s eqiad nodes and double check that calico-node is fine T239835

The process completed yesterday and the (new) cluster looks fine.
I moved the documentation of this process to https://wikitech.wikimedia.org/wiki/Kubernetes#Reinitialize_a_complete_cluster (it is currently missing the catch-22 fix @akosiaris applied to be able to do the first deploys via tiller).

What's left, IMHO, is the decommissioning of etcd100[1-3].

Change 626274 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Remove etcd100[123] hosts

https://gerrit.wikimedia.org/r/626274

Change 626337 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/dns@master] Remove etcd100[123] hosts

https://gerrit.wikimedia.org/r/626337

Icinga downtime for 7 days, 0:00:00 set by jayme@cumin1001 on 3 host(s) and their services with reason: shutting down, host sheduled for decommission

etcd[1001-1003].eqiad.wmnet

VMs have been shut down as of now. Will decommission on Wednesday if nothing pops up.

FYI, backups for these hosts (etcd[1001-1003]) are still configured in Bacula and are failing to run, as the hosts are down (until the puppet facts are cleared).

I don't see backups configured for an immediate replacement, in case they are necessary or desired for the service in its new location.

Change 627647 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Ignore failures on backing up etcd1* hosts

https://gerrit.wikimedia.org/r/627647

Change 627647 merged by Jcrespo:
[operations/puppet@production] backups: Ignore failures on backing up etcd1* hosts

https://gerrit.wikimedia.org/r/627647

Change 627619 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Revert "backups: Ignore failures on backing up etcd1* hosts"

https://gerrit.wikimedia.org/r/627619

^this should be merged before closing this ticket :-)

cookbooks.sre.hosts.decommission executed by jayme@cumin1001 for hosts: etcd[1001-1003].eqiad.wmnet

  • etcd1001.eqiad.wmnet (WARN)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Site eqiad DNS records not yet migrated to the automatic system, manual patch required
  • etcd1002.eqiad.wmnet (WARN)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Site eqiad DNS records not yet migrated to the automatic system, manual patch required
  • etcd1003.eqiad.wmnet (WARN)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Site eqiad DNS records not yet migrated to the automatic system, manual patch required

Change 626274 merged by JMeybohm:
[operations/puppet@production] Remove etcd100[123] hosts

https://gerrit.wikimedia.org/r/626274

Change 627619 merged by JMeybohm:
[operations/puppet@production] Revert "backups: Ignore failures on backing up etcd1* hosts"

https://gerrit.wikimedia.org/r/627619

Change 626337 merged by JMeybohm:
[operations/dns@master] Remove etcd100[123] hosts

https://gerrit.wikimedia.org/r/626337

JMeybohm added a subscriber: Volans.

Hosts are decommissioned; puppet, the ignore_list, and DNS are clean.
Thanks @jcrespo and @Volans for support!