This is scheduled for Feb 21st, 09:00-16:00 UTC (the actual downtime of the cluster should be shorter than this window). We will piggyback on T327991: codfw row B switches upgrade, as codfw will be depooled for that anyway.
Some hosts relevant in this context will be affected by the 30 min downtime during the switch upgrade. Ideally, reimaging of those hosts should be completed before 14:00 UTC:
- kubetcd2006
- kubemaster2002
- kubernetes[2006,2009-2010,2020,2023]
Todos:
- Announce cluster downtime/reimage to ops@
- Ensure PKI intermediates have been created
- Depool wdqs and wcqs in codfw (should already be done as part of T327991)
- Downtime: etcd, master, nodes
- Properly stop rdf-streaming-updater flink job (@dcausse)
- Merge hiera changes for 1.23 (including PKI for etcd): https://gerrit.wikimedia.org/r/c/operations/puppet/+/890390/
- Reimage etcd nodes with bullseye
- Reimage masters
- Reimage ganeti node: kubernetes2006 @JMeybohm
- Reimage nodes: kubernetes[2009-2010,2013,2014,2020,2022] @JMeybohm
- Reimage ganeti nodes: kubernetes2005,kubernetes2015,kubernetes2016 @JMeybohm
- Reimage nodes: kubernetes[2007-2008,2011-2012] @elukey
- Reimage nodes: kubernetes[2017-2021] (failing with "Unable to establish IPMI v2 / RMCP+ session", probably caused by T328832 / T330048?)
- Verify basic k8s functionality (nodes joining the cluster)
- Merge deployment-charts changes for 1.23: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/890392/
- Deploy admin_ng & istio
- Deploy services
- Properly start rdf-streaming-updater flink job (@dcausse)
- Repool wdqs in codfw
- Lift downtimes (apart from kubernetes2017-2021)
- Reply to the ops/wikitech-l announcement "codfw wikikube kubernetes cluster upgrade on 2023-02-21" to announce the cluster operational again
- Repool services, see bottom of T327991
Detailed steps and commands can be found in T326340: Update staging-codfw to k8s 1.23
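For the wdqs/wcqs depool and later repool, a hedged sketch using conftool DNS discovery objects; the object type, the selector fields (`dnsdisc`, `name`), and the service names are assumptions to double-check before running:

```
# Depool the query services in codfw (if not already done as part of T327991).
sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false
sudo confctl --object-type discovery select 'dnsdisc=wcqs,name=codfw' set/pooled=false

# Repool wdqs once the cluster and the rdf-streaming-updater job are healthy again.
sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=true
```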
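A minimal sketch of the downtime step, assuming it runs from a cluster management (cumin) host and that the sre.hosts.downtime cookbook accepts a duration, a reason, and a host query; the flag names and query syntax below are assumptions to verify with `--help`:

```
# Downtime etcd, masters, and workers for the upgrade window.
# Query syntax and flag names are assumptions; adjust to the actual cookbook interface.
sudo cookbook sre.hosts.downtime --hours 8 \
  -r "wikikube codfw k8s 1.23 upgrade" \
  'kubetcd2*.codfw.wmnet or kubemaster2*.codfw.wmnet or kubernetes2*.codfw.wmnet'
```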
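For the reimages, a sketch assuming the sre.hosts.reimage cookbook with an `--os` flag and an optional task reference; the flag names, the short-hostname argument, and the placeholder task ID are assumptions:

```
TASK="Txxxxxx"   # placeholder: the tracking task to attach reimage output to
# etcd and control plane first, then workers.
sudo cookbook sre.hosts.reimage --os bullseye -t "$TASK" kubetcd2006
sudo cookbook sre.hosts.reimage --os bullseye -t "$TASK" kubemaster2002
# Workers, a few at a time:
for host in kubernetes2009 kubernetes2010 kubernetes2013 kubernetes2014 kubernetes2020 kubernetes2022; do
  sudo cookbook sre.hosts.reimage --os bullseye -t "$TASK" "$host"
done
```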
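Basic post-reimage verification, run from a host with a kubeconfig for the codfw wikikube cluster (how the kubeconfig is obtained is left out here):

```
# All nodes should be Ready and report a 1.23.x kubelet.
kubectl get nodes -o wide
# Anything not Running/Completed deserves a look.
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'
```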
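For the admin_ng and service deploys, a sketch assuming the usual deployment-server layout under /srv/deployment-charts/helmfile.d; the paths, the environment name, and the example service are assumptions, and istio may need its own deploy procedure, which is not shown here:

```
# Cluster-level components.
cd /srv/deployment-charts/helmfile.d/admin_ng
helmfile -e codfw apply

# Then each service, e.g. thumbor (repeat per service namespace).
cd /srv/deployment-charts/helmfile.d/services/thumbor
helmfile -e codfw apply
```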
Issues
Reimage of kubernetes2017-2021 fails with "Unable to establish IPMI v2 / RMCP+ session" (probably caused by T328832 / T330048). That means we're down 5 nodes. We have kubernetes2023-2024 in role::insetup, so we could compensate for 2 of them. Alternatively, we could just run puppet (without reimaging) on kubernetes2017-2021, which should work as well. I tried that in Pontoon, but never on real workers.
Because the cluster is missing 3 nodes, I:
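To confirm the IPMI failure independently of the cookbook, a hedged sketch probing the management interface directly; the mgmt FQDN pattern and the user are assumptions, and `-E` reads the password from the IPMI_PASSWORD environment variable:

```
# Expect this to fail with the same RMCP+ session error while T328832 / T330048 are unresolved.
ipmitool -I lanplus -H kubernetes2017.mgmt.codfw.wmnet -U root -E chassis power status
```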
- scaled thumbor down to 1 replica
- scaled mw-api-ext/mediawiki-main from 4 to 2 replicas
- scaled mw-debug/mediawiki-pinkunicorn from 2 to 1 replica
- scaled mw-web/mediawiki-main from 8 to 4 replicas
As I did this manually using kubectl scale, the change needs to be persisted in the deployment-charts repo for the mw deployments so it is not overridden by scap.
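The manual scale-downs were done with kubectl scale commands along these lines; the namespace and deployment names are taken from the list above and may not match the actual resource names exactly:

```
# Run with the kubeconfig for each namespace on the deployment server.
kubectl -n thumbor scale deployment thumbor --replicas=1                  # deployment name assumed
kubectl -n mw-api-ext scale deployment mediawiki-main --replicas=2
kubectl -n mw-debug scale deployment mediawiki-pinkunicorn --replicas=1
kubectl -n mw-web scale deployment mediawiki-main --replicas=4
```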
- Persist scale down of mw deployments in codfw