Move servers from the appserver/api cluster to kubernetes
Open, High, Public
Assigned To

None

Authored By

	Joe
	Nov 13 2023, 10:37 AM

Description

For every 5% of external traffic we move, we've needed to bump mw-web by 12-13 replicas and mw-api-ext by 10 replicas.

This means that every 5% increase in traffic requires 22-23 additional replicas. Given that every pod requires 5.6 CPUs, we're going to need about 123-129 cores per traffic bump, or roughly 3 servers, as our servers have 48 cores each.

The above calculation is per-datacenter, of course.
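
As a rough sanity check of the arithmetic above, here is a minimal sketch in Python. The replica counts, the 5.6 CPUs per pod and the 48 cores per server come from the description; the function names are hypothetical, and the sketch assumes all 48 cores of a server are available to pods, which ignores any system reservation.

```python
import math

CPUS_PER_POD = 5.6      # per-pod CPU requirement quoted in the description
CORES_PER_SERVER = 48   # cores per server quoted in the description


def cores_needed(replicas: int, cpus_per_pod: float = CPUS_PER_POD) -> float:
    """Total cores required by a given number of additional replicas."""
    return replicas * cpus_per_pod


def servers_needed(cores: float, cores_per_server: int = CORES_PER_SERVER) -> int:
    """Whole servers required, assuming every core is usable by pods."""
    return math.ceil(cores / cores_per_server)


# mw-web grows by 12-13 replicas and mw-api-ext by 10 for each 5% of traffic.
for mw_web in (12, 13):
    replicas = mw_web + 10
    cores = cores_needed(replicas)
    print(f"{replicas} replicas: {cores:.1f} cores, {servers_needed(cores)} servers")
```

Both ends of the range round up to 3 servers per datacenter for each 5% step, which matches the estimate above.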

My proposal is to start converting servers, first bringing the appservers cluster down to the same size as the api one, then chipping 2 servers per api group from there on.

I suggest aiming for parity first, since the api cluster is where we will take servers from first in order to move mobileapps over to k8s.

Current state of the clusters: https://docs.google.com/spreadsheets/d/1VqgWZxmP6LqUgFChIvV5BYvHqr1ZhUh17iXgJ26_1UM/edit#gid=1295795675

This script can be used to automate patch creation a bit: https://gitlab.wikimedia.org/repos/sre/serviceops-kitchensink/-/blob/main/add_k8s_node/add_k8s_node.py?ref_type=heads

Details

Subject	Repo	Branch	Lines +/-
site.pp: Switch mw1365 to canary_appserver	operations/puppet	production	+4 -1
kubernetes: move 4 appservers to kubernetes	operations/puppet	production	+12 -23
kubernetes: move 5 api_appservers from eqiad	operations/puppet	production	+13 -17
kubernetes: move 6 appservers from codfw	operations/puppet	production	+19 -13
kubernetes: Move 7 codfw appservers to kubernetes	operations/puppet	production	+16 -17
kubernetes: move 6 eqiad api_appservers to kubernetes	operations/puppet	production	+15 -15
kubernetes: move 6 codfw appservers to kubernetes	operations/puppet	production	+18 -12
kubernetes: migrate 5 eqiad appservers to k8s workers	operations/puppet	production	+17 -12
Move 6 eqiad appservers to kubernetes	operations/puppet	production	+14 -18
Add missing node definition	operations/puppet	production	+1 -1
Move 6 codfw appservers to kubernetes	operations/puppet	production	+20 -21
Move 5 eqiad appservers to kubernetes	operations/puppet	production	+16 -7
Move 6 codfw appservers to kubernetes	operations/puppet	production	+18 -14
Move 3 appservers to kubernetes	operations/puppet	production	+11 -12
Move 6 eqiad appservers to kubernetes	operations/puppet	production	+16 -15
kubernetes: migrate 5 appservers to k8s workers	operations/puppet	production	+20 -14
kubernetes: make 4 eqiad appservers k8s workers	operations/puppet	production	+27 -11
kubernetes: make 5 codfw appservers kubernetes workers	operations/puppet	production	+22 -14
kubernetes: move 5 mw hosts to kubernetes workers	operations/puppet	production	+20 -12
kubernetes: make 5 appservers k8s workers	operations/puppet	production	+25 -14
kubernetes: make 3 appservers kubernetes workers	operations/puppet	production	+14 -13
kubernetes: make 3 appservers kubernetes workers	operations/puppet	production	+8 -14
kubernetes: make 3 api_appservers kubernetes workers	operations/puppet	production	+11 -5
kubernetes: make 3 api_appservers kubernetes workers	operations/puppet	production	+14 -5
Move mw api servers to kubernetes workers	operations/puppet	production	+43 -28
mw1377: change role to insetup for debugging	operations/puppet	production	+5 -1
Set MW API servers to insetup to fix failed reimage	operations/puppet	production	+2 -2
Move mw api servers to kubernetes workers	operations/puppet	production	+44 -30
Move mw api servers to kubernetes workers	operations/puppet	production	+23 -36
Move mw api servers to kubernetes workers	operations/homer/public	master	+9 -0
Move mw api servers to kubernetes workers	operations/puppet	production	+11 -6
Move mw api servers to kubernetes workers	operations/homer/public	master	+2 -0
Move mw appservers to kubernetes workers	operations/homer/public	master	+8 -0
sre.hosts.reimage: Allow to skip puppet migration	operations/cookbooks	master	+3 -1
Move mw appservers to kubernetes workers	operations/puppet	production	+35 -21
Normalize conftool-data/node/{eqiad,codfw}.yaml to be machine editable	operations/puppet	production	+72 -72
Normalize config/sites.yaml to be machine editable	operations/homer/public	master	+39 -39

Related Objects

Status	Assigned	Task
Stalled	None	T255792 Quibble runs core:unit tests twice!
Open	None	T328919 Upgrade to PHPUnit 10
Open	None	T338103 Micro-optimize ApiResult::isMetadataKey with str_starts_with once we support PHP8+
Open	None	T328921 Drop PHP 7.4 support from MediaWiki
Stalled	None	T334726 Use return type `never` in Wikibase
Open	None	T328922 Drop PHP 8.0 support from MediaWiki
Stalled	None	T319055 Upgrade to psr/container 2.x
Stalled	Krinkle	T319432 Migrate WMF production from PHP 7.4 to PHP 8.1
Open	None	T291916 Tracking task for Bullseye migrations in production
Stalled	None	T356293 Migrate MW appservers' base images to bullseye
Open	None	T290536 Serve production traffic via Kubernetes
Open	None	T351074 Move servers from the appserver/api cluster to kubernetes
Resolved	kamila	T354413 Reboot issues for mw13[77-83].eqiad.wmnet

Event Timeline

Clement_Goubert mentioned this in T360763: Move 70% of mediawiki external requests to mw on k8s. Mar 22 2024, 11:21 AM

Change #1013536 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: move 6 codfw appservers to kubernetes

https://gerrit.wikimedia.org/r/1013536

Mentioned in SAL (#wikimedia-operations) [2024-03-25T10:25:15Z] <claime> Depooling mw2336.codfw.wmnet,mw2337.codfw.wmnet,mw2386.codfw.wmnet,mw2387.codfw.wmnet,mw2388.codfw.wmnet,mw2389.codfw.wmnet - T351074

Change #1013536 merged by Clément Goubert:

[operations/puppet@production] kubernetes: move 6 codfw appservers to kubernetes

https://gerrit.wikimedia.org/r/1013536

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2336.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2337.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2386.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2387.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2388.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2389.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2386.codfw.wmnet with OS bullseye completed:

mw2386 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251100_cgoubert_2983726_mw2386.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2388.codfw.wmnet with OS bullseye completed:

mw2388 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251102_cgoubert_2983844_mw2388.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2336.codfw.wmnet with OS bullseye completed:

mw2336 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251104_cgoubert_2983615_mw2336.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2387.codfw.wmnet with OS bullseye completed:

mw2387 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251107_cgoubert_2983773_mw2387.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2389.codfw.wmnet with OS bullseye completed:

mw2389 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251109_cgoubert_2983933_mw2389.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2337.codfw.wmnet with OS bullseye completed:

mw2337 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403251111_cgoubert_2983632_mw2337.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-03-25T12:25:35Z] <claime> Running homer 'cr*codfw*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-03-25T12:34:41Z] <claime> Pooling and uncordoning mw2336.codfw.wmnet,mw2337.codfw.wmnet,mw2386.codfw.wmnet,mw2387.codfw.wmnet,mw2388.codfw.wmnet,mw2389.codfw.wmnet - T351074

Change #1018655 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: move 6 eqiad api_appservers to kubernetes

https://gerrit.wikimedia.org/r/1018655

Mentioned in SAL (#wikimedia-operations) [2024-04-10T10:59:42Z] <claime> Depooling mw1421.eqiad.wmnet,mw1422.eqiad.wmnet,mw1491.eqiad.wmnet,mw1492.eqiad.wmnet,mw1493.eqiad.wmnet - T351074

Change #1018655 merged by Clément Goubert:

[operations/puppet@production] kubernetes: move 6 eqiad api_appservers to kubernetes

https://gerrit.wikimedia.org/r/1018655

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1421.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1422.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1491.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1492.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1493.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1421.eqiad.wmnet with OS bullseye completed:

mw1421 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101124_cgoubert_1784876_mw1421.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1493.eqiad.wmnet with OS bullseye completed:

mw1493 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101127_cgoubert_1785103_mw1493.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1422.eqiad.wmnet with OS bullseye completed:

mw1422 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101131_cgoubert_1784925_mw1422.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1491.eqiad.wmnet with OS bullseye completed:

mw1491 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101134_cgoubert_1784990_mw1491.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1492.eqiad.wmnet with OS bullseye completed:

mw1492 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101138_cgoubert_1785051_mw1492.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2024-04-10T12:01:31Z] <claime> Running homer 'cr*eqiad*' commit 'T351074' and homer 'lsw1-e3-eqiad*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-04-10T12:11:53Z] <claime> Pooling and uncordoning mw1421.eqiad.wmnet,mw1422.eqiad.wmnet,mw1491.eqiad.wmnet,mw1492.eqiad.wmnet,mw1493.eqiad.wmnet - T351074

Change #1018719 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: Move 7 codfw appservers to kubernetes

https://gerrit.wikimedia.org/r/1018719

Change #1018719 merged by Clément Goubert:

[operations/puppet@production] kubernetes: Move 7 codfw appservers to kubernetes

https://gerrit.wikimedia.org/r/1018719

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2412.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2413.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2414.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2415.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2416.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2417.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2418.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2414.codfw.wmnet with OS bullseye completed:

mw2414 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404110957_cgoubert_1997382_mw2414.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2417.codfw.wmnet with OS bullseye completed:

mw2417 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111000_cgoubert_1997604_mw2417.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2416.codfw.wmnet with OS bullseye completed:

mw2416 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111003_cgoubert_1997528_mw2416.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2415.codfw.wmnet with OS bullseye completed:

mw2415 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111006_cgoubert_1997473_mw2415.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2412.codfw.wmnet with OS bullseye completed:

mw2412 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111009_cgoubert_1997269_mw2412.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2418.codfw.wmnet with OS bullseye completed:

mw2418 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111013_cgoubert_1997681_mw2418.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2413.codfw.wmnet with OS bullseye completed:

mw2413 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111017_cgoubert_1997332_mw2413.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-11T10:37:25Z] <claime> Running homer 'cr*codfw*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-04-11T10:52:52Z] <claime> Pooling and uncordoning mw2412.codfw.wmnet,mw2413.codfw.wmnet,mw2414.codfw.wmnet,mw2415.codfw.wmnet,mw2416.codfw.wmnet,mw2417.codfw.wmnet,mw2418.codfw.wmnet - T351074

Clement_Goubert mentioned this in T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons). Thu, Apr 11, 12:14 PM

JMeybohm mentioned this in T353464: Migrate wikikube control planes to hardware nodes. Tue, Apr 16, 11:17 AM

Change #1020852 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: move 6 appservers from codfw

https://gerrit.wikimedia.org/r/1020852

Mentioned in SAL (#wikimedia-operations) [2024-04-18T10:25:38Z] <claime> Depooling mw2302.codfw.wmnet,mw2303.codfw.wmnet,mw2304.codfw.wmnet,mw2332.codfw.wmnet,mw2333.codfw.wmnet,mw2334.codfw.wmnet for reimage - T351074

Change #1020852 merged by Clément Goubert:

[operations/puppet@production] kubernetes: move 6 appservers from codfw

https://gerrit.wikimedia.org/r/1020852

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2302.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2303.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2304.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2332.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2333.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2334.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2303.codfw.wmnet with OS bullseye completed:

mw2303 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181057_cgoubert_3396044_mw2303.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2332.codfw.wmnet with OS bullseye completed:

mw2332 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181100_cgoubert_3396165_mw2332.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2334.codfw.wmnet with OS bullseye completed:

mw2334 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181102_cgoubert_3396281_mw2334.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2304.codfw.wmnet with OS bullseye completed:

mw2304 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181105_cgoubert_3396102_mw2304.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2302.codfw.wmnet with OS bullseye completed:

mw2302 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181110_cgoubert_3396020_mw2302.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2333.codfw.wmnet with OS bullseye completed:

mw2333 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181108_cgoubert_3396227_mw2333.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2024-04-18T11:42:04Z] <claime> Running homer 'cr*codfw*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-04-18T11:52:03Z] <claime> Pooling and uncordoning mw2302.codfw.wmnet,mw2303.codfw.wmnet,mw2304.codfw.wmnet,mw2332.codfw.wmnet,mw2333.codfw.wmnet,mw2334.codfw.wmnet - T351074

Change #1021478 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: move 6 api_appservers from eqiad

https://gerrit.wikimedia.org/r/1021478

Change #1021478 abandoned by Clément Goubert:

[operations/puppet@production] kubernetes: move 5 api_appservers from eqiad

Reason:

That would only leave 15 api_appservers in eqiad.

https://gerrit.wikimedia.org/r/1021478

Change #1021482 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: move 4 appservers to kubernetes

https://gerrit.wikimedia.org/r/1021482

Mentioned in SAL (#wikimedia-operations) [2024-04-18T14:12:40Z] <claime> Depooling mw1355.eqiad.wmnet,mw1480.eqiad.wmnet,mw1481.eqiad.wmnet,mw1487.eqiad.wmnet - T351074

Change #1021482 merged by Clément Goubert:

[operations/puppet@production] kubernetes: move 4 appservers to kubernetes

https://gerrit.wikimedia.org/r/1021482

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1355.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1480.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1481.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1487.eqiad.wmnet with OS bullseye

Change #1021495 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] site.pp: Switch mw1365 to canary_appserver

https://gerrit.wikimedia.org/r/1021495

I abandoned the CR to move more eqiad api_appservers because it would leave only 15, 4 of them canaries. We still have a bit more margin on the appserver side in eqiad.

Something to note regarding canaries:

eqiad api_appserver canaries: 4/20 (20%), 3 in row A, 1 in row D
eqiad appserver canaries: 5/24 (21%), all in row A
codfw api_appserver canaries: 2/34 (6%)
codfw appserver canaries: 2/35 (6%)

All the canaries in eqiad are in row A except for one api_appserver canary in row D. I propose going down to two canaries per cluster, re-imaging the surplus ones directly to k8s nodes. For api_appserver, we would keep 1 in row A and 1 in row D. For appserver, we would first move one of the canaries to another row, so we can go down to 1 in row A and 1 somewhere else. That should make it a bit easier to keep row-level availability.

All the canaries in codfw are up for decommission, so we don't need to do anything about them.
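
The canary shares quoted above can be recomputed with a short sketch. The counts and cluster sizes are taken from the list in this comment; the `canary_share` helper and the "two canaries per cluster" projection are illustrative assumptions, and the projection keeps cluster sizes fixed even though they will keep shrinking as servers move to k8s.

```python
# Canary counts and cluster sizes as quoted in the comment above:
# cluster name -> (canaries, total servers).
clusters = {
    "eqiad api_appserver": (4, 20),
    "eqiad appserver": (5, 24),
    "codfw api_appserver": (2, 34),
    "codfw appserver": (2, 35),
}


def canary_share(canaries: int, total: int) -> int:
    """Return the canary share of a cluster as a whole-number percentage."""
    return round(100 * canaries / total)


for name, (canaries, total) in clusters.items():
    print(f"{name}: {canaries}/{total} ({canary_share(canaries, total)}%)")

# Hypothetical projection: two canaries per eqiad cluster, with the cluster
# sizes left unchanged (an assumption made only for illustration).
for name in ("eqiad api_appserver", "eqiad appserver"):
    _, total = clusters[name]
    print(f"{name} with 2 canaries: 2/{total} ({canary_share(2, total)}%)")
```

Rounding to whole percentages matches the figures in the list; the projection only shows the order of magnitude the proposal aims for, not a planned target.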

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1481.eqiad.wmnet with OS bullseye completed:

mw1481 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181437_cgoubert_3445367_mw1481.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1487.eqiad.wmnet with OS bullseye completed:

mw1487 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181440_cgoubert_3445469_mw1487.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1355.eqiad.wmnet with OS bullseye completed:

mw1355 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181442_cgoubert_3445214_mw1355.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1480.eqiad.wmnet with OS bullseye completed:

mw1480 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404181445_cgoubert_3445283_mw1480.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-18T15:04:39Z] <claime> Running homer 'cr*eqiad*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-04-18T15:12:56Z] <claime> Pooling and uncordoning mw1355.eqiad.wmnet,mw1480.eqiad.wmnet,mw1481.eqiad.wmnet,mw1487.eqiad.wmnet - T351074

Mentioned in SAL (#wikimedia-operations) [2024-04-23T10:45:26Z] <claime> Depooling mw1414.eqiad.wmnet,mw1415.eqiad.wmnet,mw1416.eqiad.wmnet,mw1448.eqiad.wmnet,mw1449.eqiad.wmnet for reimage to kubernetes - T351074

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1414.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1415.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1416.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1448.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1449.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1414.eqiad.wmnet with OS bullseye completed:

mw1414 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231110_cgoubert_106747_mw1414.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1449.eqiad.wmnet with OS bullseye completed:

mw1449 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231112_cgoubert_107034_mw1449.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1416.eqiad.wmnet with OS bullseye completed:

mw1416 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231115_cgoubert_106887_mw1416.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1448.eqiad.wmnet with OS bullseye completed:

mw1448 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231117_cgoubert_106976_mw1448.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1415.eqiad.wmnet with OS bullseye completed:

mw1415 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231121_cgoubert_106784_mw1415.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-23T11:39:26Z] <claime> Running homer 'cr*eqiad*' commit 'T351074'

Mentioned in SAL (#wikimedia-operations) [2024-04-23T11:47:26Z] <claime> Pooling and uncordoning mw1414.eqiad.wmnet,mw1415.eqiad.wmnet,mw1416.eqiad.wmnet,mw1448.eqiad.wmnet,mw1449.eqiad.wmnet - T351074
