
Migrate wikikube control planes to hardware nodes
Closed, Resolved · Public

Assigned To
Authored By
JMeybohm
Dec 14 2023, 4:09 PM
Referenced Files
F48799280: etcd-benchmark-output-kubestagemaster2003.txt
Apr 26 2024, 2:58 PM
F48799281: etcd-benchmark-output-kubestagemaster2003_isolated.txt
Apr 26 2024, 2:58 PM
F48799282: etcd-benchmark-output-ganeti-test2003.txt
Apr 26 2024, 2:58 PM
F48799283: etcd-benchmark-output-mw2391.txt
Apr 26 2024, 2:58 PM
F41610964: image.png
Dec 18 2023, 1:37 PM
F41610962: image.png
Dec 18 2023, 1:37 PM
F41610960: image.png
Dec 18 2023, 1:37 PM
F41610958: image.png
Dec 18 2023, 1:37 PM

Description

Currently we run 2 control planes as well as 3 etcd nodes per DC as Ganeti VMs. We have already hit IOPS limits on the etcd instances, and the control planes are scratching the practical upper limit for memory on Ganeti (currently 12GB).

We should draft a plan to migrate from the 2+3 Ganeti instances to 3 hardware nodes (repurposing mw appservers) and co-locate a Kubernetes master and an etcd server on each of them.

It should be possible to do this by adding the new control-plane/etcd nodes first and removing the Ganeti ones afterwards.
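Co-locating etcd on the three control-plane hosts keeps the same failure tolerance as the three dedicated VMs, since etcd's quorum depends only on member count. A quick check of the arithmetic:

```python
def quorum(n: int) -> int:
    """Votes an n-member etcd cluster needs to commit a write."""
    return n // 2 + 1

def failure_tolerance(n: int) -> int:
    """Members that can fail while the cluster keeps quorum."""
    return n - quorum(n)

for n in (3, 4, 5, 6):
    print(f"{n} members: quorum {quorum(n)}, tolerates {failure_tolerance(n)} failure(s)")
```

Note that during an add-before-remove migration the member count transiently grows, which raises the quorum size without improving tolerance (e.g. 4 members need 3 votes but still tolerate only 1 failure), so swapping members one at a time keeps the window small.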

In the spreadsheet at T351074: Move servers from the appserver/api cluster to kubernetes I've reserved 3 R440 nodes per DC to be used as apiservers:

  • mw2331 => wikikube-ctrl2001 => To be refreshed FY2425 Q3
  • mw2361 => wikikube-ctrl2002 => To be refreshed FY2425 Q3
  • mw2391 => wikikube-ctrl2003
  • mw1372 => wikikube-ctrl1001
  • mw1429 => wikikube-ctrl1002
  • mw1436 => wikikube-ctrl1003

These should be renamed during reimage because of their special role in the cluster.

I wrote documentation on how to add stacked control planes, and how to remove them as well as etcd nodes, at: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_control-planes
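The documented procedure boils down to "add each new stacked member before removing an old one", so the cluster never shrinks below its starting size. A toy sketch of that ordering invariant (host names from this task; the real steps are etcdctl member add/remove plus Puppet and DNS changes, which are not shown here):

```python
def rolling_replace(old, new):
    """Replace `old` etcd members with `new` ones, adding each new
    member before removing an old one. Returns the ordered
    (action, member, cluster_size) steps and the final member list."""
    members = list(old)
    steps = []
    for o, n in zip(old, new):
        members.append(n)
        steps.append(("add", n, len(members)))
        members.remove(o)
        steps.append(("remove", o, len(members)))
    return steps, members

old = ["kubetcd2004", "kubetcd2005", "kubetcd2006"]
new = ["wikikube-ctrl2001", "wikikube-ctrl2002", "wikikube-ctrl2003"]
steps, final = rolling_replace(old, new)
for action, member, size in steps:
    print(f"{action:6} {member:20} cluster size now {size}")
```

The cluster size oscillates between 3 and 4 throughout, so quorum is never at risk from the reconfiguration itself.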

For preparation we should reimage the above appservers to the insetup role, using the same partition layout as we use for Kubernetes workers.

What I totally failed to think about while doing staging is the opportunity to align wikikube control-plane names with the other clusters, which use names like ml-serve-ctrlXXXX/aux-k8s-ctrlXXXX. So maybe we could rename to wikikube-ctrlXXXX (I really don't like the "k8s" that dse and aux threw into the mix) to come one step closer to T336861: Fix naming confusion around main/wikikube kubernetes clusters.
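If the rename goes ahead, the proposed scheme is easy to pin down mechanically. A hypothetical validator (the regex and the 1xxx-eqiad/2xxx-codfw reading are my interpretation of the host list above, not an established convention):

```python
import re

# wikikube-ctrl + 4-digit host number; first digit encodes the site
# (1xxx = eqiad, 2xxx = codfw), as in the host list in this task.
CTRL_RE = re.compile(r"^wikikube-ctrl[12]\d{3}$")

def is_wikikube_ctrl(hostname: str) -> bool:
    """True if `hostname` follows the proposed wikikube-ctrlXXXX scheme."""
    return CTRL_RE.fullmatch(hostname) is not None

print(is_wikikube_ctrl("wikikube-ctrl1001"))   # new scheme
print(is_wikikube_ctrl("kubemaster1001"))      # legacy name
```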

Details

Other Assignee
jijiki
Related Changes in Gerrit:
  • operations/puppet @ production: +1 -1
  • operations/puppet @ production: +0 -39
  • operations/puppet @ production: +1 -80
  • operations/puppet @ production: +0 -8
  • operations/puppet @ production: +1 -24
  • operations/dns @ master: +0 -6
  • operations/puppet @ production: +1 -1
  • operations/dns @ master: +0 -12
  • operations/puppet @ production: +0 -11
  • operations/puppet @ production: +1 -7
  • operations/dns @ master: +1 -0
  • operations/dns @ master: +1 -0
  • operations/dns @ master: +1 -0
  • operations/dns @ master: +1 -0
  • operations/dns @ master: +1 -0
  • operations/puppet @ production: +10 -11
  • operations/dns @ master: +1 -0
  • operations/puppet @ production: +1 -6
  • operations/puppet @ production: +1 -6
  • operations/software/homer/deploy @ master: +3 -1
  • operations/puppet @ production: +13 -5
  • operations/puppet @ production: +3 -3
  • operations/puppet @ production: +4 -0
  • operations/puppet @ production: +32 -8
  • operations/puppet @ production: +67 -0
  • operations/puppet @ production: +0 -6

Related Objects

Event Timeline


Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-ctrl2003.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console wikikube-ctrl2002.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-ctrl2001.codfw.wmnet with OS bullseye completed:

  • wikikube-ctrl2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405211613_hnowlan_1680662_wikikube-ctrl2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-ctrl2003.codfw.wmnet with OS bullseye completed:

  • wikikube-ctrl2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405211616_hnowlan_1680751_wikikube-ctrl2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console wikikube-ctrl2002.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console wikikube-ctrl2002.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console wikikube-ctrl1001.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-ctrl2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console wikikube-ctrl2002.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye completed:

  • wikikube-ctrl1002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405211741_hnowlan_1687876_wikikube-ctrl1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-ctrl2002.codfw.wmnet with OS bullseye completed:

  • wikikube-ctrl2002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405211743_hnowlan_1699808_wikikube-ctrl2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye completed:

  • wikikube-ctrl1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405211757_hnowlan_1692042_wikikube-ctrl1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye completed:

  • wikikube-ctrl1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405211822_hnowlan_1709925_wikikube-ctrl1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Change #1034444 merged by Hnowlan:

[operations/dns@master] Add wikikube-ctrl2001 to server SRV record for etcd

https://gerrit.wikimedia.org/r/1034444

Change #1034449 merged by Hnowlan:

[operations/puppet@production] Add wikikube-ctrl200[1-3] as master_stacked

https://gerrit.wikimedia.org/r/1034449

Change #1034849 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add wikikube-ctrl2002 as master_stacked

https://gerrit.wikimedia.org/r/1034849

Change #1034850 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add wikikube-ctrl2003 as master_stacked

https://gerrit.wikimedia.org/r/1034850

Change #1034853 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/homer/deploy@master] Add wikikube-ctrl to Homer wmf plugin to assign to k8s BGP group

https://gerrit.wikimedia.org/r/1034853

Change #1034853 merged by Cathal Mooney:

[operations/software/homer/deploy@master] Add wikikube-ctrl to Homer wmf plugin to assign to k8s BGP group

https://gerrit.wikimedia.org/r/1034853

Mentioned in SAL (#wikimedia-operations) [2024-05-22T10:00:28Z] <cmooney@cumin1002> START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to hostname to bgp group mappings - cmooney@cumin1002 - T353464

Deployed homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to hostname to bgp group mappings - cmooney@cumin1002 - T353464

Mentioned in SAL (#wikimedia-operations) [2024-05-22T10:02:08Z] <cmooney@cumin1002> END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to hostname to bgp group mappings - cmooney@cumin1002 - T353464

Change #1034445 merged by Hnowlan:

[operations/dns@master] Add wikikube-ctrl2002 to server SRV record for etcd

https://gerrit.wikimedia.org/r/1034445

Change #1034849 merged by Hnowlan:

[operations/puppet@production] Add wikikube-ctrl2002 as master_stacked

https://gerrit.wikimedia.org/r/1034849

Change #1034446 merged by Hnowlan:

[operations/dns@master] Add wikikube-ctrl2003 to server SRV record for etcd

https://gerrit.wikimedia.org/r/1034446

Change #1034850 merged by Hnowlan:

[operations/puppet@production] Add wikikube-ctrl2003 as master_stacked

https://gerrit.wikimedia.org/r/1034850

All codfw wikikube-ctrl nodes are operational

Change #1036615 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] Add wikikube-ctrl100[1-3] as master_stacked 2

https://gerrit.wikimedia.org/r/1036615

Change #1036621 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/dns@master] Add wikikube-ctrl1001 to server SRV record for etcd 1

https://gerrit.wikimedia.org/r/1036621

Change #1036622 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/dns@master] Add wikikube-ctrl1002 to server SRV record for etcd 3

https://gerrit.wikimedia.org/r/1036622

Change #1036623 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/dns@master] Add wikikube-ctrl1003 to server SRV record for etcd 4

https://gerrit.wikimedia.org/r/1036623

Change #1036621 merged by Effie Mouzeli:

[operations/dns@master] Add wikikube-ctrl1001 to server SRV record for etcd 1

https://gerrit.wikimedia.org/r/1036621

Change #1036615 merged by Effie Mouzeli:

[operations/puppet@production] Add wikikube-ctrl100[1-3] as master_stacked 2

https://gerrit.wikimedia.org/r/1036615

Change #1036622 merged by Effie Mouzeli:

[operations/dns@master] Add wikikube-ctrl1002 to server SRV record for etcd 3

https://gerrit.wikimedia.org/r/1036622

Change #1036623 merged by Effie Mouzeli:

[operations/dns@master] Add wikikube-ctrl1003 to server SRV record for etcd 4

https://gerrit.wikimedia.org/r/1036623
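For reference, the "server SRV record for etcd" changes above add one SRV entry per member for etcd's DNS-based discovery. In generic zone-file form such records look roughly like this (the service label is etcd's default for TLS peers and the zone name is a placeholder, not the actual WMF records; 2380 is etcd's peer port):

```
; _service._proto.zone          TTL  class  SRV prio weight port target
_etcd-server-ssl._tcp.example.  300  IN     SRV 0    1      2380 wikikube-ctrl2001.codfw.wmnet.
_etcd-server-ssl._tcp.example.  300  IN     SRV 0    1      2380 wikikube-ctrl2002.codfw.wmnet.
_etcd-server-ssl._tcp.example.  300  IN     SRV 0    1      2380 wikikube-ctrl2003.codfw.wmnet.
```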

Mentioned in SAL (#wikimedia-operations) [2024-05-30T10:07:57Z] <effie> add wikikube-ctrl1003 to etcd and run puppet - T353464

Current status:

eqiad and codfw:

  • (baremetal) wikikube-ctrl hosts are in production as stacked masters
    • have joined the etcd cluster
    • are labeled as kubernetes masters
    • have BGP enabled
    • pending 10G NIC upgrade T366204 and T366205
  • (VMs) kubemasters are still in production
    • will be decommissioned after T366204 and T366205, once we are sure everything is stable
jijiki changed the task status from Open to Stalled. May 30 2024, 10:50 AM
jijiki reassigned this task from JMeybohm to hnowlan.
jijiki updated the task description.
jijiki updated Other Assignee, added: jijiki.

> Current status: […]

As I see it, we're currently also still running the Ganeti etcd instances in codfw and eqiad, which I think limits the performance of the etcd cluster by quite a bit. Was it a deliberate decision not to remove them?

> Was it a deliberate decision not to remove them?

I think it's more that we ran out of time to make changes last week. Removing them from the etcd cluster ahead of time seems fine to me, at least.

Any objections to waiting for T366204 and T366205 to be completed before we remove the Ganeti VMs?

> Any objections to waiting for T366204 and T366205 to be completed before we remove the Ganeti VMs?

While you're probably right, I think it feels slightly easier to just get rid of the old stuff all at once, logistically.

Yeah, right. As the old control planes won't be able to connect to the etcd instances on wikikube-ctrl* in their current state, we would need to reconfigure the firewall at least. Also, using the etcd instances on wikikube-ctrl* from the kubemaster VMs would add additional TX (network transmit load). So it's fine by me to keep them until the 10G NICs are ready.
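Once the kubemaster VMs are gone, only the co-located kube-apiserver needs the etcd client port, which is what the later "limit etcd access to localhost" change captures. As a generic sketch of that firewall posture (plain iptables for illustration only; the actual change lives in Puppet, and 2379/2380 are etcd's default client/peer ports):

```
# allow etcd client traffic (2379) from localhost only;
# peer traffic (2380) must stay reachable from the other members
iptables -A INPUT -p tcp --dport 2379 -s 127.0.0.1 -j ACCEPT
iptables -A INPUT -p tcp --dport 2379 -j REJECT --reject-with tcp-reset
```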

jijiki changed the task status from Stalled to In Progress. Jun 19 2024, 1:36 PM

Unless something else pops up, we shall be retiring the old hosts (aka the VMs) next week

> Unless something else pops up, we shall be retiring the old hosts (aka the VMs) next week

If something else pops up, please let me know and I can take care of this.

Icinga downtime and Alertmanager silence (ID=59cc19c2-5b6b-4d0a-81ef-1bd409efc10c) set by jiji@cumin1002 for 2 days, 0:00:00 on 2 host(s) and their services with reason: decom

kubemaster[2001-2002].codfw.wmnet

Change #1051321 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] kubernetes: retire kubemaster200[1-2]

https://gerrit.wikimedia.org/r/1051321

Change #1051323 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] kubernetes: retire kubemaster100[1-2] in eqiad

https://gerrit.wikimedia.org/r/1051323

Change #1051321 merged by Effie Mouzeli:

[operations/puppet@production] kubernetes: retire kubemaster200[1-2] in codfw

https://gerrit.wikimedia.org/r/1051321

cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: kubemaster[2001-2002].codfw.wmnet

  • kubemaster2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • kubemaster2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Mentioned in SAL (#wikimedia-operations) [2024-07-02T12:44:20Z] <effie> decom eqiad old kubemasters - T353464

Icinga downtime and Alertmanager silence (ID=2aa77d1f-f420-475b-8769-b2c46d51c3fe) set by jiji@cumin1002 for 2 days, 0:00:00 on 2 host(s) and their services with reason: decom

kubemaster[1001-1002].eqiad.wmnet

cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: kubemaster[1001-1002].eqiad.wmnet

  • kubemaster1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
  • kubemaster1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Change #1051323 merged by Effie Mouzeli:

[operations/puppet@production] kubernetes: retire kubemaster100[1-2] in eqiad

https://gerrit.wikimedia.org/r/1051323

Change #1051365 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] cumin: fix kube-master aliases

https://gerrit.wikimedia.org/r/1051365

Icinga downtime and Alertmanager silence (ID=53db6080-f00f-4a86-ae49-cafba7047a9d) set by jiji@cumin1002 for 2 days, 0:00:00 on 6 host(s) and their services with reason: decom

kubetcd[2004-2006].codfw.wmnet,kubetcd[1004-1006].eqiad.wmnet

Change #1051380 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/dns@master] Remove kubetcd100 from etcd SRV records

https://gerrit.wikimedia.org/r/1051380

Change #1051380 merged by Effie Mouzeli:

[operations/dns@master] Remove kubetcd* from etcd SRV records (eqiad+codfw)

https://gerrit.wikimedia.org/r/1051380

cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: kubetcd[1004-1006].eqiad.wmnet

  • kubetcd1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
  • kubetcd1005.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
  • kubetcd1006.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: kubetcd[2004-2006].codfw.wmnet

  • kubetcd2004.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • kubetcd2005.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • kubetcd2006.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

This is done; please reopen should something go wrong. Thanks @JMeybohm for the excellent docs.

Change #1051678 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes: Remove etcd_urls from wikikube clusters

https://gerrit.wikimedia.org/r/1051678

Change #1051365 merged by Effie Mouzeli:

[operations/puppet@production] cumin: fix kube-master aliases

https://gerrit.wikimedia.org/r/1051365

Change #1034447 abandoned by JMeybohm:

[operations/dns@master] Remove kubetcd200[4-6] from etcd SRV records

Reason:

has been done in a different CR

https://gerrit.wikimedia.org/r/1034447

Change #1052933 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Remove role::etcd::v3::kubernetes and hosts

https://gerrit.wikimedia.org/r/1052933

Change #1052933 merged by Alexandros Kosiaris:

[operations/puppet@production] Remove role::etcd::v3::kubernetes and hosts

https://gerrit.wikimedia.org/r/1052933

Change #1051678 abandoned by JMeybohm:

[operations/puppet@production] kubernetes: Remove etcd_urls from wikikube clusters

Reason:

now done in I6ae65c2ba04f4ea50d60f2314c7b6727c855d987

https://gerrit.wikimedia.org/r/1051678

Change #1073857 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube: Remove remaining hiera files and role for non stacked masters

https://gerrit.wikimedia.org/r/1073857

Change #1073857 merged by JMeybohm:

[operations/puppet@production] wikikube: Remove remaining hiera files and role for non stacked masters

https://gerrit.wikimedia.org/r/1073857

Noting that in Q3 FY24-25 (i.e. the quarter starting in January 2025) we'll be refreshing mw[2291-2376], which includes wikikube-ctrl2001 and wikikube-ctrl2002

Change #1080584 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] role::etcd::v3::kubernetes is no more

https://gerrit.wikimedia.org/r/1080584

Change #1080585 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master_stacked: Limit etcd access to localhost

https://gerrit.wikimedia.org/r/1080585

Change #1080584 merged by JMeybohm:

[operations/puppet@production] role::etcd::v3::kubernetes is no more

https://gerrit.wikimedia.org/r/1080584

Change #1080585 merged by JMeybohm:

[operations/puppet@production] kubernetes::master_stacked: Limit etcd access to localhost

https://gerrit.wikimedia.org/r/1080585