
Co-locate kube-apiserver and etcd on new staging control plane nodes
Closed, Resolved | Public

Description

To test migrating the wikikube control planes to hardware nodes co-located with etcd, we should migrate the wikikube staging clusters from the current setup of 2 control-plane VMs + 3 etcd VMs to 3 VMs that co-locate kube-apiserver and etcd.

This will be useful for testing the new puppet roles as well as the procedure, and will ensure that the staging clusters are set up the same way as prod (which is required so we can catch errors early during future k8s upgrades etc.).

  • Add two additional new stacked masters to staging-codfw
  • Remove the old control-planes and etcd nodes (delete VMs etc)
  • Repeat the process for staging-eqiad

Details

Repo                          Branch      Lines +/-
operations/puppet             production  +6 -4
operations/puppet             production  +1 -37
operations/puppet             production  +2 -20
operations/puppet             production  +0 -5
operations/dns                master      +0 -6
operations/puppet             production  +0 -15
operations/dns                master      +1 -0
operations/dns                master      +1 -0
operations/dns                master      +0 -2
operations/puppet             production  +11 -6
operations/dns                master      +3 -0
operations/puppet             production  +0 -5
operations/puppet             production  +1 -0
operations/puppet             production  +31 -46
operations/deployment-charts  master      +0 -7
operations/puppet             production  +14 -13
operations/dns                master      +0 -8
operations/puppet             production  +7 -12
operations/puppet             production  +6 -0
operations/puppet             production  +4 -5
operations/dns                master      +2 -0
operations/puppet             production  +5 -2
operations/dns                master      +2 -0
operations/puppet             production  +5 -5
operations/puppet             production  +4 -0
operations/puppet             production  +20 -6
operations/puppet             production  +2 -1
operations/puppet             production  +1 -1
operations/puppet             production  +3 -0
operations/dns                master      +1 -0
operations/puppet             production  +81 -4
operations/puppet             production  +30 -7
operations/puppet             production  +4 -0

Event Timeline

Change #1024543 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubestagemaster2003: Add as insetup::serviceops

https://gerrit.wikimedia.org/r/1024543

Change #1024543 merged by JMeybohm:

[operations/puppet@production] kubestagemaster2003: Add as insetup::serviceops

https://gerrit.wikimedia.org/r/1024543

Change #1025278 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master: Add stacked control plane option

https://gerrit.wikimedia.org/r/1025278

Change #1025295 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master: Add stacked control plane option

https://gerrit.wikimedia.org/r/1025295

Change #1025278 abandoned by JMeybohm:

[operations/puppet@production] kubernetes::master: Add stacked control plane option

Reason:

in favor of Ie59a97fd1934e219c5d1f20fe1186b8135f7118d

https://gerrit.wikimedia.org/r/1025278

JMeybohm updated the task description.

Change #1025278 restored by JMeybohm:

[operations/puppet@production] kubernetes::master: Add stacked control plane option

https://gerrit.wikimedia.org/r/1025278

Change #1025295 abandoned by JMeybohm:

[operations/puppet@production] kubernetes::master: Add stacked control plane option

Reason:

In favor of I758b43e5e523b7f258a59a3bbbeb92a19c2850f0

https://gerrit.wikimedia.org/r/1025295

Change #1025278 merged by JMeybohm:

[operations/puppet@production] kubernetes::master: Add stacked control plane option

https://gerrit.wikimedia.org/r/1025278

Change #1025397 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add kubestage2003 to staging-codfw and conftool

https://gerrit.wikimedia.org/r/1025397

Change #1025399 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add kubestagemaster2003 to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1025399

Change #1025399 merged by JMeybohm:

[operations/dns@master] Add kubestagemaster2003 to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1025399

Change #1025397 merged by JMeybohm:

[operations/puppet@production] Add kubestage2003 to staging-codfw and conftool

https://gerrit.wikimedia.org/r/1025397

Change #1025404 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Fix copy/paste error for kubestagemaster2003

https://gerrit.wikimedia.org/r/1025404

Change #1025404 merged by JMeybohm:

[operations/puppet@production] Fix copy/paste error for kubestagemaster2003

https://gerrit.wikimedia.org/r/1025404

I've added the new, stacked control-plane with some manual intervention, as etcd did not come up initially, which makes kube-apiserver-safe-restart wait forever to acquire a lock.

Seems like etcd got started before the certificates were available (and was not restarted afterwards):

Apr 29 15:29:08 kubestagemaster2003 etcd[33145]: open /var/lib/etcd/ssl/etcd___etcd-server-ssl__tcp_k8s3-staging_codfw_wmnet.chained.pem: no such file or directory  
Apr 29 15:29:08 kubestagemaster2003 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE  
Apr 29 15:29:08 kubestagemaster2003 systemd[1]: etcd.service: Failed with result 'exit-code'.  
Apr 29 15:29:08 kubestagemaster2003 systemd[1]: Failed to start etcd - highly-available key value store.
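
A preflight check along these lines could surface the missing certificate before etcd is started. This is only a minimal sketch: the helper name `missing_files` is hypothetical, and the actual ordering between certificate generation and the service is handled in puppet, not in a script like this.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return those paths that do not exist as regular files.

    Hypothetical helper: an empty result means every file the
    service needs (e.g. its TLS certificate chain) is in place.
    """
    return [p for p in paths if not Path(p).is_file()]
```

Called with the path from the log above (`/var/lib/etcd/ssl/etcd___etcd-server-ssl__tcp_k8s3-staging_codfw_wmnet.chained.pem`), a non-empty result would indicate that starting etcd is going to fail in the same way.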

After a manual restart I figured out that I should have added the new control-plane to the etcd-server SRV record before running puppet:

Apr 29 15:44:46 kubestagemaster2003 etcd[45706]: error setting up initial cluster: cannot find local etcd member "kubestagemaster2003" in SRV records  
Apr 29 15:44:46 kubestagemaster2003 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE  
Apr 29 15:44:46 kubestagemaster2003 systemd[1]: etcd.service: Failed with result 'exit-code'.  
Apr 29 15:44:46 kubestagemaster2003 systemd[1]: Failed to start etcd - highly-available key value store.

After doing that, the waiting puppet process finished successfully.
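
To avoid this on the next node, one could check that the host is already listed in the SRV record before the first puppet run. The sketch below parses the output of `dig +short SRV <name>` (one `priority weight port target.` line per record). The function names are hypothetical, and the record name `_etcd-server-ssl._tcp.k8s3-staging.codfw.wmnet` is an assumption inferred from the certificate file name in the log above.

```python
from typing import List


def srv_targets(dig_short_output: str) -> List[str]:
    """Extract target hostnames from `dig +short SRV` output.

    Each record line has the form "<priority> <weight> <port> <target>.".
    """
    targets = []
    for line in dig_short_output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        targets.append(fields[3].rstrip(".").lower())
    return targets


def is_srv_member(dig_short_output: str, hostname: str) -> bool:
    """True if hostname (short or fully qualified) is an SRV target."""
    wanted = hostname.rstrip(".").lower()
    for target in srv_targets(dig_short_output):
        if target == wanted or target.split(".")[0] == wanted:
            return True
    return False
```

If `is_srv_member` returns False for the new control-plane, the DNS change should be merged and deployed before puppet is run on that host.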

Now I'm blocked by T345823: Wikikube staging clusters are out of IPv4 Pod IP's as there is no free IPv4 pool for the new control-plane to claim, so I need to remove one of the existing ones.
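
The constraint can be illustrated with a small sketch, assuming the pod network is carved into fixed-size per-node blocks. The CIDRs and block size used here are purely hypothetical and are not the real staging values; `free_blocks` is an illustrative helper, not part of any tooling mentioned in this task.

```python
import ipaddress
from typing import Iterable, List


def free_blocks(pool: str, block_prefix: int, claimed: Iterable[str]) -> List[str]:
    """List the per-node blocks of `pool` that no node has claimed yet."""
    pool_net = ipaddress.ip_network(pool)
    claimed_nets = {ipaddress.ip_network(c) for c in claimed}
    return [
        str(block)
        for block in pool_net.subnets(new_prefix=block_prefix)
        if block not in claimed_nets
    ]
```

With a hypothetical `/24` pool split into `/26` blocks there are only four blocks, so once four nodes have each claimed one, a fifth node has nothing left to claim until an existing node is removed.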

Change #1025422 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] etcd: Notify etcd on PKI cert generation and reneval

https://gerrit.wikimedia.org/r/1025422

Change #1025422 abandoned by JMeybohm:

[operations/puppet@production] etcd: Notify etcd on PKI cert generation and reneval

Reason:

AIUI etcd reloads the cert from disk on every client connection, so there is no need to restart it (https://github.com/etcd-io/etcd/commit/4e21f87e3d014d606bb3ba2a89731a7d24806611). But we still need a way to describe the relation between the PKI cert and the service.

https://gerrit.wikimedia.org/r/1025422

Change #1025690 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] cfssl::cert: Add before_services parameter

https://gerrit.wikimedia.org/r/1025690

Change #1025690 merged by JMeybohm:

[operations/puppet@production] cfssl::cert: Add before_services parameter

https://gerrit.wikimedia.org/r/1025690

Change #1030955 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add kubestagemaster200[45] as insetup::serviceops

https://gerrit.wikimedia.org/r/1030955

Change #1030957 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add kubestagemaster2004 to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1030957

Change #1030958 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add kubestagemaster2004 to staging-codfw conftool

https://gerrit.wikimedia.org/r/1030958

Change #1030955 merged by JMeybohm:

[operations/puppet@production] Add kubestagemaster200[45] as insetup::serviceops

https://gerrit.wikimedia.org/r/1030955

Change #1030995 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add stacked kubernetes masters to appropriate aliases

https://gerrit.wikimedia.org/r/1030995

Change #1030995 merged by JMeybohm:

[operations/puppet@production] Fix all-etcd, wikikube-master and wikikube-etcd aliases

https://gerrit.wikimedia.org/r/1030995

Change #1030957 merged by JMeybohm:

[operations/dns@master] Add kubestagemaster2004 to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1030957

Change #1030958 merged by JMeybohm:

[operations/puppet@production] Add kubestagemaster2004 as master_stacked

https://gerrit.wikimedia.org/r/1030958

Change #1031457 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add kubestagemaster2005 to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1031457

Change #1031460 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add kubestagemaster2005 as master_stacked

https://gerrit.wikimedia.org/r/1031460

Change #1031457 merged by JMeybohm:

[operations/dns@master] Add kubestagemaster2005 to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1031457

Change #1031460 merged by JMeybohm:

[operations/puppet@production] Add kubestagemaster2005 as master_stacked

https://gerrit.wikimedia.org/r/1031460

Change #1031507 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master: Retry kube-publish-sa-certs 5 times

https://gerrit.wikimedia.org/r/1031507

Change #1031816 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Remove kubestagetcd200[123] from etcd SRV records

https://gerrit.wikimedia.org/r/1031816

Change #1031507 merged by JMeybohm:

[operations/puppet@production] kubernetes::master: Retry kube-publish-sa-certs 5 times

https://gerrit.wikimedia.org/r/1031507

Change #1031825 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Decom kubestagemaster200[12]

https://gerrit.wikimedia.org/r/1031825

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagemaster[2001-2002].codfw.wmnet

  • kubestagemaster2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • kubestagemaster2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Change #1031825 merged by JMeybohm:

[operations/puppet@production] Decom kubestagemaster200[12]

https://gerrit.wikimedia.org/r/1031825

Change #1031882 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master: Make etcd_urls optional

https://gerrit.wikimedia.org/r/1031882

Change #1031816 merged by JMeybohm:

[operations/dns@master] Remove kubestagetcd200[123] from etcd SRV records

https://gerrit.wikimedia.org/r/1031816

Icinga downtime and Alertmanager silence (ID=5c048aeb-57ce-4f8d-8159-53dcf8b5fb78) set by jayme@cumin1002 for 2 days, 0:00:00 on 3 host(s) and their services with reason: decom

kubestagetcd[2001-2003].codfw.wmnet

Change #1031882 merged by JMeybohm:

[operations/puppet@production] kubernetes::master: Make etcd_urls optional

https://gerrit.wikimedia.org/r/1031882

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagetcd[2001-2003].codfw.wmnet

  • kubestagetcd2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • kubestagetcd2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • kubestagetcd2003.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Change #1031901 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Remove kubernetesMasters definition from staging-codfw

https://gerrit.wikimedia.org/r/1031901

Change #1031901 merged by JMeybohm:

[operations/deployment-charts@master] Remove kubernetesMasters definition from staging-codfw

https://gerrit.wikimedia.org/r/1031901

Change #1032394 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] prometheus/ops: Refactor etcd scraping

https://gerrit.wikimedia.org/r/1032394

Change #1032394 merged by JMeybohm:

[operations/puppet@production] prometheus/ops: Refactor etcd scraping

https://gerrit.wikimedia.org/r/1032394

Change #1032453 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] prometheus/ops: Refactor etcd scraping, fix hostnames

https://gerrit.wikimedia.org/r/1032453

Change #1032453 merged by JMeybohm:

[operations/puppet@production] prometheus/ops: Refactor etcd scraping, fix hostnames

https://gerrit.wikimedia.org/r/1032453

Change #1032633 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add kubestagemaster100[345] to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1032633

Change #1032634 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add kubestagemaster100[345] as master_stacked

https://gerrit.wikimedia.org/r/1032634

Change #1032706 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Decom kubestagemaster100[12]

https://gerrit.wikimedia.org/r/1032706

Change #1032707 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Decom kubestagetcd200[123]

https://gerrit.wikimedia.org/r/1032707

Change #1032708 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Remove kubestagetcd100[123] from etcd SRV records

https://gerrit.wikimedia.org/r/1032708

Change #1032709 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Decom kubestagetcd100[123]

https://gerrit.wikimedia.org/r/1032709

Change #1032707 merged by JMeybohm:

[operations/puppet@production] Decom kubestagetcd200[123]

https://gerrit.wikimedia.org/r/1032707

Change #1032633 merged by JMeybohm:

[operations/dns@master] Add kubestagemaster100[345] to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1032633

Change #1032634 merged by JMeybohm:

[operations/puppet@production] Add kubestagemaster100[345] as master_stacked

https://gerrit.wikimedia.org/r/1032634

Change #1032746 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Remove kubestagemaster100[45] from server SRV record

https://gerrit.wikimedia.org/r/1032746

Change #1032746 merged by JMeybohm:

[operations/dns@master] Remove kubestagemaster100[45] from server SRV record

https://gerrit.wikimedia.org/r/1032746

Change #1032748 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add kubestagemaster1004 to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1032748

Change #1032749 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add kubestagemaster1005 to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1032749

Change #1032748 merged by JMeybohm:

[operations/dns@master] Add kubestagemaster1004 to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1032748

Change #1032749 merged by JMeybohm:

[operations/dns@master] Add kubestagemaster1005 to the etcd-server SRV record

https://gerrit.wikimedia.org/r/1032749

Icinga downtime and Alertmanager silence (ID=d858a874-17ca-4ab5-8c9c-7fea35f1c823) set by jayme@cumin1002 for 2 days, 0:00:00 on 2 host(s) and their services with reason: decom

kubestagemaster[1001-1002].eqiad.wmnet

Change #1032706 merged by JMeybohm:

[operations/puppet@production] Decom kubestagemaster100[12]

https://gerrit.wikimedia.org/r/1032706

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagemaster[1001-1002].eqiad.wmnet

  • kubestagemaster1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
  • kubestagemaster1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Change #1032763 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] cumin/aliases: Remove role kubernetes::staging::master

https://gerrit.wikimedia.org/r/1032763

Icinga downtime and Alertmanager silence (ID=dd087345-70da-428c-8704-76433fe47872) set by jayme@cumin1002 for 2 days, 0:00:00 on 3 host(s) and their services with reason: decom

kubestagetcd[1004-1006].eqiad.wmnet

Change #1032708 merged by JMeybohm:

[operations/dns@master] Remove kubestagetcd100[123] from etcd SRV records

https://gerrit.wikimedia.org/r/1032708

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagetcd[1004-1006].eqiad.wmnet

  • kubestagetcd1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
  • kubestagetcd1005.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
  • kubestagetcd1006.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Change #1032709 merged by JMeybohm:

[operations/puppet@production] Decom kubestagetcd100[123]

https://gerrit.wikimedia.org/r/1032709

Change #1032763 merged by JMeybohm:

[operations/puppet@production] Remove remaining occurrences of kubernetes::staging::master role

https://gerrit.wikimedia.org/r/1032763

Both staging clusters have been migrated to stacked control-planes.

Change #1034193 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] cumin: Remove etcd::v3::kubernetes::staging from A:wikikube-staging-etcd

https://gerrit.wikimedia.org/r/1034193

Change #1034193 merged by JMeybohm:

[operations/puppet@production] Remove role etcd::v3::kubernetes::staging

https://gerrit.wikimedia.org/r/1034193

Change #1034956 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add wikikube-worker config

https://gerrit.wikimedia.org/r/1034956

Change #1034956 merged by JMeybohm:

[operations/puppet@production] Add wikikube-worker config

https://gerrit.wikimedia.org/r/1034956