Move servers from the appserver/api cluster to kubernetes
Open, High, Public

Description

For every 5% of external traffic we move, we've needed to bump mw-web by 12-13 replicas and mw-api-ext by 10 replicas.

This means that every 5% increase in traffic requires 22-23 additional replicas. Since every pod requires 5.6 CPUs, we will need about 123-129 cores per traffic bump, or roughly 3 servers, as our servers have 48 cores each.

The above calculation is per-datacenter, of course.
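
The arithmetic above can be reproduced with a minimal sketch. The figures (5.6 CPUs per pod, 48 cores per server, 22-23 replicas per 5% bump) come from the description; the function name `servers_per_bump` is hypothetical, and the sketch assumes all 48 cores of a node are schedulable, ignoring any per-node overhead, so the real requirement may be somewhat higher.

```python
import math

CPUS_PER_POD = 5.6       # CPUs required by each pod (from the description)
CORES_PER_SERVER = 48    # cores per server (from the description)


def servers_per_bump(replicas: int) -> tuple[float, int]:
    """Return (cores needed, whole servers needed) for a number of new replicas.

    Assumption: every core on a server can be allocated to pods.
    """
    cores = replicas * CPUS_PER_POD
    return cores, math.ceil(cores / CORES_PER_SERVER)


# 12-13 mw-web replicas plus 10 mw-api-ext replicas per 5% of traffic
for replicas in (12 + 10, 13 + 10):
    cores, servers = servers_per_bump(replicas)
    print(f"{replicas} replicas -> {cores:.1f} cores -> {servers} servers")
```

Both ends of the replica range round up to the same whole number of servers, which is why the estimate is stated as a single figure per datacenter.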

My proposal is to start converting servers, first bringing the appservers cluster down to the same size as the api one, then chipping 2 servers per api group from there on.

I suggest reaching parity first because we will chip into the api cluster first to move mobileapps over to k8s.

Current state of the clusters: https://docs.google.com/spreadsheets/d/1VqgWZxmP6LqUgFChIvV5BYvHqr1ZhUh17iXgJ26_1UM/edit#gid=1295795675

This script can be used to automate patch creation a bit: https://gitlab.wikimedia.org/repos/sre/serviceops-kitchensink/-/blob/main/add_k8s_node/add_k8s_node.py?ref_type=heads

Details

Repo | Branch | Lines +/-
operations/puppet | production | +4 -1
operations/puppet | production | +12 -23
operations/puppet | production | +13 -17
operations/puppet | production | +19 -13
operations/puppet | production | +16 -17
operations/puppet | production | +15 -15
operations/puppet | production | +18 -12
operations/puppet | production | +17 -12
operations/puppet | production | +14 -18
operations/puppet | production | +1 -1
operations/puppet | production | +20 -21
operations/puppet | production | +16 -7
operations/puppet | production | +18 -14
operations/puppet | production | +11 -12
operations/puppet | production | +16 -15
operations/puppet | production | +20 -14
operations/puppet | production | +27 -11
operations/puppet | production | +22 -14
operations/puppet | production | +20 -12
operations/puppet | production | +25 -14
operations/puppet | production | +14 -13
operations/puppet | production | +8 -14
operations/puppet | production | +11 -5
operations/puppet | production | +14 -5
operations/puppet | production | +43 -28
operations/puppet | production | +5 -1
operations/puppet | production | +2 -2
operations/puppet | production | +44 -30
operations/puppet | production | +23 -36
operations/homer/public | master | +9 -0
operations/puppet | production | +11 -6
operations/homer/public | master | +2 -0
operations/homer/public | master | +8 -0
operations/cookbooks | master | +3 -1
operations/puppet | production | +35 -21
operations/puppet | production | +72 -72
operations/homer/public | master | +39 -39
Event Timeline

Change #1013536 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: move 6 codfw appservers to kubernetes

https://gerrit.wikimedia.org/r/1013536

Mentioned in SAL (#wikimedia-operations) [2024-03-25T10:25:15Z] <claime> Depooling mw2336.codfw.wmnet,mw2337.codfw.wmnet,mw2386.codfw.wmnet,mw2387.codfw.wmnet,mw2388.codfw.wmnet,mw2389.codfw.wmnet - T351074

Change #1013536 merged by Clément Goubert:

[operations/puppet@production] kubernetes: move 6 codfw appservers to kubernetes

https://gerrit.wikimedia.org/r/1013536

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2336.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2337.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2386.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2387.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2388.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2389.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2386.codfw.wmnet with OS bullseye completed:

  • mw2386 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251100_cgoubert_2983726_mw2386.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2388.codfw.wmnet with OS bullseye completed:

  • mw2388 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251102_cgoubert_2983844_mw2388.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2336.codfw.wmnet with OS bullseye completed:

  • mw2336 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251104_cgoubert_2983615_mw2336.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2387.codfw.wmnet with OS bullseye completed:

  • mw2387 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251107_cgoubert_2983773_mw2387.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2389.codfw.wmnet with OS bullseye completed:

  • mw2389 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251109_cgoubert_2983933_mw2389.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2337.codfw.wmnet with OS bullseye completed:

  • mw2337 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251111_cgoubert_2983632_mw2337.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-03-25T12:25:35Z] <claime> Running homer 'cr*codfw*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-03-25T12:34:41Z] <claime> Pooling and uncordoning mw2336.codfw.wmnet,mw2337.codfw.wmnet,mw2386.codfw.wmnet,mw2387.codfw.wmnet,mw2388.codfw.wmnet,mw2389.codfw.wmnet - T351074

Change #1018655 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: move 6 eqiad api_appservers to kubernetes

https://gerrit.wikimedia.org/r/1018655

Mentioned in SAL (#wikimedia-operations) [2024-04-10T10:59:42Z] <claime> Depooling mw1421.eqiad.wmnet,mw1422.eqiad.wmnet,mw1491.eqiad.wmnet,mw1492.eqiad.wmnet,mw1493.eqiad.wmnet - T351074

Change #1018655 merged by Clément Goubert:

[operations/puppet@production] kubernetes: move 6 eqiad api_appservers to kubernetes

https://gerrit.wikimedia.org/r/1018655

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1421.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1422.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1491.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1492.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1493.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1421.eqiad.wmnet with OS bullseye completed:

  • mw1421 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101124_cgoubert_1784876_mw1421.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1493.eqiad.wmnet with OS bullseye completed:

  • mw1493 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101127_cgoubert_1785103_mw1493.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1422.eqiad.wmnet with OS bullseye completed:

  • mw1422 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101131_cgoubert_1784925_mw1422.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1491.eqiad.wmnet with OS bullseye completed:

  • mw1491 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101134_cgoubert_1784990_mw1491.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1492.eqiad.wmnet with OS bullseye completed:

  • mw1492 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101138_cgoubert_1785051_mw1492.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2024-04-10T12:01:31Z] <claime> Running homer 'cr*eqiad*' commit 'T351074' and homer 'lsw1-e3-eqiad*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-04-10T12:11:53Z] <claime> Pooling and uncordoning mw1421.eqiad.wmnet,mw1422.eqiad.wmnet,mw1491.eqiad.wmnet,mw1492.eqiad.wmnet,mw1493.eqiad.wmnet - T351074

Change #1018719 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: Move 7 codfw appservers to kubernetes

https://gerrit.wikimedia.org/r/1018719

Change #1018719 merged by Clément Goubert:

[operations/puppet@production] kubernetes: Move 7 codfw appservers to kubernetes

https://gerrit.wikimedia.org/r/1018719

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2412.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2413.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2414.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2415.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2416.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2417.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2418.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2414.codfw.wmnet with OS bullseye completed:

  • mw2414 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404110957_cgoubert_1997382_mw2414.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2417.codfw.wmnet with OS bullseye completed:

  • mw2417 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111000_cgoubert_1997604_mw2417.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2416.codfw.wmnet with OS bullseye completed:

  • mw2416 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111003_cgoubert_1997528_mw2416.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2415.codfw.wmnet with OS bullseye completed:

  • mw2415 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111006_cgoubert_1997473_mw2415.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2412.codfw.wmnet with OS bullseye completed:

  • mw2412 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111009_cgoubert_1997269_mw2412.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2418.codfw.wmnet with OS bullseye completed:

  • mw2418 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111013_cgoubert_1997681_mw2418.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2413.codfw.wmnet with OS bullseye completed:

  • mw2413 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111017_cgoubert_1997332_mw2413.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-11T10:37:25Z] <claime> Running homer 'cr*codfw*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-04-11T10:52:52Z] <claime> Pooling and uncordoning mw2412.codfw.wmnet,mw2413.codfw.wmnet,mw2414.codfw.wmnet,mw2415.codfw.wmnet,mw2416.codfw.wmnet,mw2417.codfw.wmnet,mw2418.codfw.wmnet - T351074

Change #1020852 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: move 6 appservers from codfw

https://gerrit.wikimedia.org/r/1020852

Mentioned in SAL (#wikimedia-operations) [2024-04-18T10:25:38Z] <claime> Depooling mw2302.codfw.wmnet,mw2303.codfw.wmnet,mw2304.codfw.wmnet,mw2332.codfw.wmnet,mw2333.codfw.wmnet,mw2334.codfw.wmnet for reimage - T351074

Change #1020852 merged by Clément Goubert:

[operations/puppet@production] kubernetes: move 6 appservers from codfw

https://gerrit.wikimedia.org/r/1020852

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2302.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2303.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2304.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2332.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2333.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2334.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2303.codfw.wmnet with OS bullseye completed:

  • mw2303 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181057_cgoubert_3396044_mw2303.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2332.codfw.wmnet with OS bullseye completed:

  • mw2332 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181100_cgoubert_3396165_mw2332.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2334.codfw.wmnet with OS bullseye completed:

  • mw2334 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181102_cgoubert_3396281_mw2334.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2304.codfw.wmnet with OS bullseye completed:

  • mw2304 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181105_cgoubert_3396102_mw2304.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2302.codfw.wmnet with OS bullseye completed:

  • mw2302 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181110_cgoubert_3396020_mw2302.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2333.codfw.wmnet with OS bullseye completed:

  • mw2333 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181108_cgoubert_3396227_mw2333.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2024-04-18T11:42:04Z] <claime> Running homer 'cr*codfw*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-04-18T11:52:03Z] <claime> Pooling and uncordoning mw2302.codfw.wmnet,mw2303.codfw.wmnet,mw2304.codfw.wmnet,mw2332.codfw.wmnet,mw2333.codfw.wmnet,mw2334.codfw.wmnet - T351074

Change #1021478 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: move 6 api_appservers from eqiad

https://gerrit.wikimedia.org/r/1021478

Change #1021478 abandoned by Clément Goubert:

[operations/puppet@production] kubernetes: move 5 api_appservers from eqiad

Reason:

That would only leave 15 api_appservers in eqiad.

https://gerrit.wikimedia.org/r/1021478

Change #1021482 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: move 4 appservers to kubernetes

https://gerrit.wikimedia.org/r/1021482

Mentioned in SAL (#wikimedia-operations) [2024-04-18T14:12:40Z] <claime> Depooling mw1355.eqiad.wmnet,mw1480.eqiad.wmnet,mw1481.eqiad.wmnet,mw1487.eqiad.wmnet - T351074

Change #1021482 merged by Clément Goubert:

[operations/puppet@production] kubernetes: move 4 appservers to kubernetes

https://gerrit.wikimedia.org/r/1021482

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1355.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1480.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1481.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1487.eqiad.wmnet with OS bullseye

Change #1021495 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] site.pp: Switch mw1365 to canary_appserver

https://gerrit.wikimedia.org/r/1021495

I abandoned the CR to move more eqiad api_appservers because it would leave only 15 of them, 4 of which are canaries. We still have a bit more margin on the appserver side in eqiad.

Something to note regarding canaries:

  • eqiad api_appserver canaries: 4/20 (20%), 3 in row A, 1 in row D
  • eqiad appserver canaries: 5/24 (21%), all in row A
  • codfw api_appserver canaries: 2/34 (6%)
  • codfw appserver canaries: 2/35 (6%)
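
For reference, the percentages above can be reproduced with a short sketch. The counts are taken from the list; the dictionary layout and the `canary_share` helper are only illustrative and not part of any existing tooling.

```python
# Canary counts as listed above: (canaries, total hosts in the cluster).
clusters = {
    "eqiad api_appserver": (4, 20),
    "eqiad appserver": (5, 24),
    "codfw api_appserver": (2, 34),
    "codfw appserver": (2, 35),
}


def canary_share(canaries, total):
    """Return the canary share of a cluster as a whole-number percentage."""
    return round(100 * canaries / total)


# Hypothetical helper output: one rounded percentage per cluster.
shares = {name: canary_share(c, t) for name, (c, t) in clusters.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The eqiad clusters sit at roughly a fifth of their hosts being canaries, against about 6% in codfw.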

All the canaries in eqiad are in row A except for one in row D. I propose going down to two canaries per cluster, re-imaging the others directly to k8s nodes. For api_appserver, we would keep 1 in row A and 1 in row D. For appserver, we would move one of the canaries to another row so we can go down to 1 in row A and 1 somewhere else. This should make it easier to keep row-level availability.
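
The row-level reasoning behind this proposal can be sketched as a small check. This is only an illustration: the row lists mirror the counts given in this comment, `"other"` stands in for a row that has not been chosen yet, and `survives_row_loss` is a hypothetical helper, not existing tooling.

```python
from collections import Counter

# Current eqiad canary rows, one entry per canary, per the counts above.
current = {
    "api_appserver": ["A", "A", "A", "D"],
    "appserver": ["A", "A", "A", "A", "A"],
}

# Proposed placement; "other" is a placeholder for a row not yet chosen.
proposed = {
    "api_appserver": ["A", "D"],
    "appserver": ["A", "other"],
}


def survives_row_loss(rows):
    """Return True if at least one canary remains after losing any single row."""
    counts = Counter(rows)
    return all(len(rows) - lost > 0 for lost in counts.values())


for label, placement in (("current", current), ("proposed", proposed)):
    for cluster, rows in placement.items():
        print(f"{label} {cluster}: {survives_row_loss(rows)}")
```

Under these assumptions, the current appserver canaries would all be lost with row A, while both proposed placements keep one canary through the loss of any single row.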

All the canaries in codfw are up for decommission, so we don't need to do anything about them.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1481.eqiad.wmnet with OS bullseye completed:

  • mw1481 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181437_cgoubert_3445367_mw1481.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1487.eqiad.wmnet with OS bullseye completed:

  • mw1487 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181440_cgoubert_3445469_mw1487.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1355.eqiad.wmnet with OS bullseye completed:

  • mw1355 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181442_cgoubert_3445214_mw1355.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1480.eqiad.wmnet with OS bullseye completed:

  • mw1480 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181445_cgoubert_3445283_mw1480.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-18T15:04:39Z] <claime> Running homer 'cr*eqiad*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-04-18T15:12:56Z] <claime> Pooling and uncordoning mw1355.eqiad.wmnet,mw1480.eqiad.wmnet,mw1481.eqiad.wmnet,mw1487.eqiad.wmnet - T351074

Mentioned in SAL (#wikimedia-operations) [2024-04-23T10:45:26Z] <claime> Depooling mw1414.eqiad.wmnet,mw1415.eqiad.wmnet,mw1416.eqiad.wmnet,mw1448.eqiad.wmnet,mw1449.eqiad.wmnet for reimage to kubernetes - T351074

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1414.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1415.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1416.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1448.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1449.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1414.eqiad.wmnet with OS bullseye completed:

  • mw1414 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231110_cgoubert_106747_mw1414.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1449.eqiad.wmnet with OS bullseye completed:

  • mw1449 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231112_cgoubert_107034_mw1449.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1416.eqiad.wmnet with OS bullseye completed:

  • mw1416 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231115_cgoubert_106887_mw1416.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1448.eqiad.wmnet with OS bullseye completed:

  • mw1448 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231117_cgoubert_106976_mw1448.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1415.eqiad.wmnet with OS bullseye completed:

  • mw1415 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231121_cgoubert_106784_mw1415.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-23T11:39:26Z] <claime> Running homer 'cr*eqiad*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-04-23T11:47:26Z] <claime> Pooling and uncordoning mw1414.eqiad.wmnet,mw1415.eqiad.wmnet,mw1416.eqiad.wmnet,mw1448.eqiad.wmnet,mw1449.eqiad.wmnet - T351074