
Move servers from the appserver/api cluster to kubernetes
Closed, Resolved (Public)

Description

For every 5% of external traffic we move, we've needed to bump mw-web by 12-13 replicas and mw-api-ext by 10 replicas.

This means that every 5% increase in traffic requires 22-23 additional replicas. Since every pod requires 5.6 CPUs, each traffic bump needs about 123-129 cores, or roughly 3 servers at 48 cores each.

The above calculation is per-datacenter, of course.
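
The arithmetic above can be reproduced with a minimal sketch. The constants come from the description; the function name `servers_per_bump` is hypothetical, and the sketch assumes all 48 cores of a server are schedulable, which ignores any capacity reserved for system daemons.

```python
import math

CPUS_PER_POD = 5.6      # per-pod CPU requirement, from the description
CORES_PER_SERVER = 48   # cores per server, from the description


def servers_per_bump(replicas: int) -> tuple[float, int]:
    """Return (cores needed, whole servers needed) for one 5% traffic bump.

    Assumption: every core on a server is available to pods.
    """
    cores = replicas * CPUS_PER_POD
    return cores, math.ceil(cores / CORES_PER_SERVER)


# 12-13 mw-web replicas plus 10 mw-api-ext replicas gives 22-23 in total.
for replicas in (12 + 10, 13 + 10):
    cores, servers = servers_per_bump(replicas)
    print(f"{replicas} replicas -> {cores:.1f} cores -> {servers} servers")
```

Both ends of the replica range round up to 3 servers per datacenter, which matches the estimate above.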

My proposal is to start converting servers: first bring the appservers cluster down to the same size as the api one, then chip away 2 servers per api group from there on.

I suggest reaching parity first because we will chip into the api cluster first to move mobileapps over to k8s.

Current state of the clusters: https://docs.google.com/spreadsheets/d/1VqgWZxmP6LqUgFChIvV5BYvHqr1ZhUh17iXgJ26_1UM/edit#gid=1295795675

This script can be used to automate patch creation a bit: https://gitlab.wikimedia.org/repos/sre/serviceops-kitchensink/-/blob/main/add_k8s_node/add_k8s_node.py?ref_type=heads

With the rename cookbook being finalized, we should/could start using it when moving the remaining servers (T365571: Rename wikikube worker nodes during OS reimage). But please avoid using wikikube-worker[12]0[01][56] for anything other than the dedicated sessionstore nodes (just to keep the numbering identical when changing the name).
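
A quick way to check a candidate name against the reserved range is sketched below. It assumes the bracket notation is read as regular-expression character classes (so it covers 1005, 1006, 1015, 1016, 2005, 2006, 2015 and 2016); the helper name `is_reserved` is hypothetical and not part of any existing tooling.

```python
import re

# Assumption: wikikube-worker[12]0[01][56] is interpreted as regex character classes.
RESERVED = re.compile(r"wikikube-worker[12]0[01][56]")


def is_reserved(hostname: str) -> bool:
    """True if the short hostname is in the range kept for sessionstore nodes."""
    short = hostname.split(".", 1)[0]  # drop the domain part, if any
    return RESERVED.fullmatch(short) is not None


for name in ("wikikube-worker2005", "wikikube-worker2030.codfw.wmnet"):
    print(name, "reserved" if is_reserved(name) else "free to use")
```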

Details

Related Changes in Gerrit:
Repo                     Branch      Lines +/-
operations/puppet        production  +4 -1
operations/puppet        production  +3 -9
operations/puppet        production  +3 -6
operations/puppet        production  +3 -6
operations/puppet        production  +4 -4
operations/puppet        production  +9 -15
operations/puppet        production  +11 -21
operations/puppet        production  +13 -21
operations/puppet        production  +11 -32
operations/puppet        production  +15 -22
operations/puppet        production  +14 -20
operations/puppet        production  +13 -20
operations/puppet        production  +6 -17
operations/puppet        production  +9 -21
operations/puppet        production  +21 -26
operations/puppet        production  +11 -19
operations/puppet        production  +16 -10
operations/puppet        production  +26 -20
operations/puppet        production  +12 -23
operations/puppet        production  +13 -17
operations/puppet        production  +19 -13
operations/puppet        production  +16 -17
operations/puppet        production  +15 -15
operations/puppet        production  +18 -12
operations/puppet        production  +17 -12
operations/puppet        production  +14 -18
operations/puppet        production  +1 -1
operations/puppet        production  +20 -21
operations/puppet        production  +16 -7
operations/puppet        production  +18 -14
operations/puppet        production  +11 -12
operations/puppet        production  +16 -15
operations/puppet        production  +20 -14
operations/puppet        production  +27 -11
operations/puppet        production  +22 -14
operations/puppet        production  +20 -12
operations/puppet        production  +25 -14
operations/puppet        production  +14 -13
operations/puppet        production  +8 -14
operations/puppet        production  +11 -5
operations/puppet        production  +14 -5
operations/puppet        production  +43 -28
operations/puppet        production  +5 -1
operations/puppet        production  +2 -2
operations/puppet        production  +44 -30
operations/puppet        production  +23 -36
operations/homer/public  master      +9 -0
operations/puppet        production  +11 -6
operations/homer/public  master      +2 -0
operations/homer/public  master      +8 -0
operations/cookbooks     master      +3 -1
operations/puppet        production  +35 -21
operations/puppet        production  +72 -72
operations/homer/public  master      +39 -39

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2024-07-02T11:12:29Z] <claime> pooling and uncordoning wikikube-worker2025.codfw.wmnet|wikikube-worker2026.codfw.wmnet|wikikube-worker2027.codfw.wmnet|wikikube-worker2028.codfw.wmnet|wikikube-worker2029.codfw.wmnet - T351074

Change #1051328 merged by Clément Goubert:

[operations/puppet@production] kubernetes: move 5 appservers to kubernetes

https://gerrit.wikimedia.org/r/1051328

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2307 to wikikube-worker2030 completed:

  • mw2307 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2309 to wikikube-worker2031 completed:

  • mw2309 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2365 to wikikube-worker2032 completed:

  • mw2365 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2392 to wikikube-worker2033 completed:

  • mw2392 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2393 to wikikube-worker2034 completed:

  • mw2393 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new
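
The old-to-new name mapping can be pulled out of these rename comments with a short sketch. The regular expression is written against the comment format as it appears in this task; the function name `rename_map` is hypothetical, and this is not part of the cookbook itself.

```python
import re

# Matches the header line the rename cookbook posts on the task.
RENAME_RE = re.compile(
    r"Cookbook cookbooks\.sre\.hosts\.rename started by \S+ "
    r"from (?P<old>\S+) to (?P<new>\S+) completed:"
)


def rename_map(log_text: str) -> dict[str, str]:
    """Map each old hostname to its new name, in the order the renames appear."""
    return {m["old"]: m["new"] for m in RENAME_RE.finditer(log_text)}


sample = """\
Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2307 to wikikube-worker2030 completed:
Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2309 to wikikube-worker2031 completed:
"""

for old, new in rename_map(sample).items():
    print(f"{old} -> {new}")
```

The resulting mapping gives the new names on which to run the re-image cookbook, as the rename output requests.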

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2030.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2031.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2032.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2033.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2034.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2030.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2030 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407021236_cgoubert_2351469_wikikube-worker2030.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2031.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2031 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407021239_cgoubert_2351495_wikikube-worker2031.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2034.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2034 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407021242_cgoubert_2351682_wikikube-worker2034.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2033.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2033 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407021245_cgoubert_2351640_wikikube-worker2033.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2032.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2032 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407021249_cgoubert_2351589_wikikube-worker2032.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-07-02T13:35:29Z] <claime> Pooling and uncordoning wikikube-worker2030.codfw.wmnet wikikube-worker2031.codfw.wmnet wikikube-worker2032.codfw.wmnet wikikube-worker2033.codfw.wmnet wikikube-worker2034.codfw.wmnet - T351074

Only hosts left are:

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1349.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1350.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1351.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1349.eqiad.wmnet with OS buster completed:

  • mw1349 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121523_cgoubert_88189_mw1349.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1350.eqiad.wmnet with OS buster completed:

  • mw1350 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121526_cgoubert_88236_mw1350.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1351.eqiad.wmnet with OS buster completed:

  • mw1351 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121533_cgoubert_88255_mw1351.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1349.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1350.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1351.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1349.eqiad.wmnet with OS buster completed:

  • mw1349 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121627_cgoubert_101741_mw1349.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1350.eqiad.wmnet with OS buster completed:

  • mw1350 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121629_cgoubert_101834_mw1350.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1351.eqiad.wmnet with OS buster completed:

  • mw1351 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121632_cgoubert_101912_mw1351.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-07-15T09:29:05Z] <claime> manually removing mw1349.eqiad.wmnet mw1350.eqiad.wmnet mw1351.eqiad.wmnet from k8s following reimage to videoscalers - T351074

Change #1055237 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: rename 4 appservers to k8s workers

https://gerrit.wikimedia.org/r/1055237

Change #1055237 merged by Clément Goubert:

[operations/puppet@production] kubernetes: rename 4 appservers to k8s workers

https://gerrit.wikimedia.org/r/1055237

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2432 to wikikube-worker2035 completed:

  • mw2432 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2433 to wikikube-worker2036 completed:

  • mw2433 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2438 to wikikube-worker2037 completed:

  • mw2438 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2439 to wikikube-worker2038 completed:

  • mw2439 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2035.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2036.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2037.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2038.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2037.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-worker2037 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-worker2037.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2038.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-worker2038 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-worker2038.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2038.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2036.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2036 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407191431_cgoubert_1391849_wikikube-worker2036.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2035.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2035 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407191434_cgoubert_1391793_wikikube-worker2035.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2037.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2038.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-worker2038 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-worker2038.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2038.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2037.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2037 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407191520_cgoubert_1402512_wikikube-worker2037.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2038.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2038 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407191532_cgoubert_1403270_wikikube-worker2038.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-07-22T16:31:57Z] <claime> Pooling and uncordoning wikikube-worker2035.codfw.wmnet wikikube-worker2036.codfw.wmnet wikikube-worker2037.codfw.wmnet wikikube-worker2038.codfw.wmnet - T351074

Change #1056974 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] Rename / reimage one appserver to k8s worker

https://gerrit.wikimedia.org/r/1056974

Change #1056974 merged by Scott French:

[operations/puppet@production] Rename / reimage one appserver to k8s worker

https://gerrit.wikimedia.org/r/1056974

Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin1002 from mw1364 to wikikube-worker1032 completed:

  • mw1364 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin1002 for host wikikube-worker1032.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2024-07-25T17:06:52Z] <swfrench-wmf> running homer 'cr*eqiad*' commit 'T351074' for k8s worker reimage

Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin1002 for host wikikube-worker1032.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1032 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407251648_swfrench_2477944_wikikube-worker1032.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-07-25T17:20:19Z] <swfrench@cumin1002> conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker1032.eqiad.wmnet),cluster=kubernetes,service=kubesvc [reason: T351074 - pooling after reimage]
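
For anyone auditing which workers were pooled from these SAL entries, the conftool lines can be split into the action, the selector and the reason. This is a minimal sketch that assumes the line layout seen in this task (`conftool action : <verb>/<k=v:k=v>; selector: <k=v,k=v> [reason: ...]`). The pattern is inferred from the entries here, not from any conftool log specification, and `parse_conftool_action` is a hypothetical helper name.

```python
import re

# Layout inferred from the SAL entries in this task (an assumption, not a spec).
CONFTOOL_LINE = re.compile(
    r"conftool action : (?P<action>\S+); "
    r"selector: (?P<selector>\S+)"
    r"(?: \[reason: (?P<reason>.*)\])?"
)


def parse_conftool_action(line):
    """Return the verb, fields, selector and reason of a SAL conftool line, or None."""
    match = CONFTOOL_LINE.search(line)
    if match is None:
        return None
    # "set/weight=10:pooled=yes" -> verb "set", fields {"weight": "10", "pooled": "yes"}
    verb, _, assignments = match.group("action").partition("/")
    fields = dict(pair.split("=", 1) for pair in assignments.split(":") if pair)
    # "name=(host),cluster=kubernetes,service=kubesvc" -> dict of selector keys.
    # Assumes the name selector itself contains no commas.
    selector = dict(pair.split("=", 1) for pair in match.group("selector").split(","))
    if "name" in selector:
        selector["name"] = selector["name"].strip("()")
    return {
        "verb": verb,
        "fields": fields,
        "selector": selector,
        "reason": match.group("reason"),
    }
```

Feeding it the entry above yields the verb `set`, the fields `weight=10` and `pooled=yes`, and a selector naming wikikube-worker1032.eqiad.wmnet in cluster `kubernetes`, service `kubesvc`.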

Change #1057829 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: reimage 1 appserver to kubernetes

https://gerrit.wikimedia.org/r/1057829

Change #1057829 merged by Clément Goubert:

[operations/puppet@production] kubernetes: reimage 1 appserver to kubernetes

https://gerrit.wikimedia.org/r/1057829

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2441 to wikikube-worker2039 completed:

  • mw2441 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2039.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2039.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2039 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291232_cgoubert_3148847_wikikube-worker2039.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
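
The first-Puppet-run log paths reported by the reimage cookbook follow a visible pattern, which helps when looking up a particular run on the cumin host. The sketch below splits such a filename into its parts. It assumes the layout `<YYYYMMDDHHMM>_<user>_<number>_<host>.out` as seen in this task; the meaning of the numeric field (possibly a process ID) and the timezone of the timestamp are not stated here, so both are left uninterpreted. `parse_reimage_log_path` is a hypothetical helper name.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Filename layout inferred from the paths in this task (an assumption).
# Usernames containing underscores would not match this pattern.
LOG_NAME = re.compile(
    r"^(?P<stamp>\d{12})_(?P<user>[^_]+)_(?P<number>\d+)_(?P<host>.+)\.out$"
)


def parse_reimage_log_path(path):
    """Split a reimage log path into start time, user, numeric field and host."""
    match = LOG_NAME.match(PurePosixPath(path).name)
    if match is None:
        return None
    return {
        # Naive datetime: the log line does not say which timezone is used.
        "started": datetime.strptime(match.group("stamp"), "%Y%m%d%H%M"),
        "user": match.group("user"),
        "number": int(match.group("number")),
        "host": match.group("host"),
    }
```

Applied to the path above, it returns a start time of 2024-07-29 12:32, the user `cgoubert` and the host `wikikube-worker2039`.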

Mentioned in SAL (#wikimedia-operations) [2024-07-29T14:07:48Z] <cgoubert@cumin1002> conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker2039.codfw.wmnet),cluster=kubernetes,service=kubesvc [reason: Pooling and uncordoning - T351074]

Mentioned in SAL (#wikimedia-operations) [2024-08-20T16:38:43Z] <claime> Running homer 'lsw1-a3-codfw*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-08-20T16:41:30Z] <claime> Pooling wikikube-worker2040.codfw.wmnet - T351074

Change #1069223 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: Rename last appserver in codfw

https://gerrit.wikimedia.org/r/1069223

Change #1069223 merged by Clément Goubert:

[operations/puppet@production] kubernetes: Rename last appserver in codfw

https://gerrit.wikimedia.org/r/1069223

Change #1069227 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: Rename last appserver in codfw

https://gerrit.wikimedia.org/r/1069227

Change #1069227 merged by Clément Goubert:

[operations/puppet@production] kubernetes: Rename last appserver in eqiad

https://gerrit.wikimedia.org/r/1069227

Mentioned in SAL (#wikimedia-operations) [2024-08-30T15:41:35Z] <claime> homer 'lsw1-a3-codfw*' commit 'T351074'

Only hosts left are:

Clement_Goubert claimed this task.

Change #1021495 abandoned by Clément Goubert:

[operations/puppet@production] site.pp: Switch mw1365 to canary_appserver

https://gerrit.wikimedia.org/r/1021495