
Update wikikube codfw to k8s 1.23
Closed, Resolved, Public

Description

This is scheduled for Feb 21st, 09:00-16:00 UTC (the actual downtime of the cluster should be shorter than this window). We will piggyback on T327991: codfw row B switches upgrade, as codfw will be depooled for that anyway.
Some hosts relevant to this upgrade will be affected by the 30-minute downtime during the switch upgrade. Ideally, reimaging those hosts should be completed before 14:00 UTC:

  • kubetcd2006
  • kubemaster2002
  • kubernetes[2006,2009-2010,2020,2023]
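
The bracket notation in the last bullet is a compact host range. As a rough illustration of which individual hosts it covers, here is a minimal Python sketch that expands it. The function name `expand_hosts` is hypothetical and the parser is a hand-rolled subset written for this example; it is not the implementation used by the cookbooks.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand 'prefix[a,b-c,...]' into individual host names."""
    match = re.fullmatch(r"([a-z-]+)\[([0-9,\-]+)\]", expr)
    if match is None:
        # Plain host name without a bracket expression: nothing to expand.
        return [expr]
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        # "2009-2010" is an inclusive range, "2006" is a single host.
        start, _, end = part.partition("-")
        last = int(end) if end else int(start)
        for number in range(int(start), last + 1):
            hosts.append(f"{prefix}{number}")
    return hosts


# Hosts named in the list above.
affected = ["kubetcd2006", "kubemaster2002"]
affected += expand_hosts("kubernetes[2006,2009-2010,2020,2023]")
print(affected)
```

Under these assumptions the list above covers seven hosts in total: kubetcd2006, kubemaster2002 and five workers.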

Todos:

Detailed steps and commands can be found in T326340: Update staging-codfw to k8s 1.23

Issues

Reimage of kubernetes2017-2021 fails with "Unable to establish IPMI v2 / RMCP+ session" (probably caused by T328832 / T330048). That means we're down 5 nodes. We have kubernetes2023-2024 in role::insetup, so we could compensate for 2 of them. Alternatively, we could just run Puppet (without a reimage) on kubernetes2017-2021, which should work as well. I tried that in Pontoon, but never on real workers.

Because the cluster is missing 3 nodes, I:

  • scaled thumbor down to 1 replica
  • scaled mw-api-ext/mediawiki-main from 4 to 2 replicas
  • scaled mw-debug/mediawiki-pinkunicorn from 2 to 1 replicas
  • scaled mw-web/mediawiki-main from 8 to 4 replicas

As I did that manually using kubectl scale, this needs to be persisted in the deployment-charts repo for the mw deployments so that it is not overridden by scap.

  • Persist scale down of mw deployments in codfw
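
For reference, the manual scale-down of the mw deployments can be sketched as a small Python helper that prints the corresponding `kubectl scale` commands. This is only an illustration: reading "mw-web/mediawiki-main" as namespace/deployment is an assumption, the real object names in the cluster may differ, and `scale_command` is a hypothetical helper. Thumbor is left out because the task names only the service, not a namespace or deployment.

```python
# (namespace, deployment, target replicas).
# ASSUMPTION: "mw-web/mediawiki-main" is read as namespace/deployment;
# the actual Deployment object names may differ.
SCALE_DOWN = [
    ("mw-api-ext", "mediawiki-main", 2),
    ("mw-debug", "mediawiki-pinkunicorn", 1),
    ("mw-web", "mediawiki-main", 4),
]


def scale_command(namespace: str, deployment: str, replicas: int) -> str:
    """Build the kubectl invocation for one deployment."""
    return (
        f"kubectl --namespace {namespace} "
        f"scale deployment {deployment} --replicas={replicas}"
    )


for entry in SCALE_DOWN:
    print(scale_command(*entry))
```

Commands like these only change the live objects. The next scap deployment would restore the replica counts from deployment-charts, which is why the change has to be persisted there.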

Event Timeline


Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubetcd2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubetcd2006.codfw.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2023-02-21T11:24:41Z] <jayme@cumin1001> END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: T329664

Mentioned in SAL (#wikimedia-operations) [2023-02-21T11:25:27Z] <jayme@cumin1001> START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: T329664

Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubetcd2006.codfw.wmnet with OS bullseye executed with errors:

  • kubetcd2006 (FAIL)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302211100_jayme_2119277_kubetcd2006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubetcd2005.codfw.wmnet with OS bullseye completed:

  • kubetcd2005 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302211059_jayme_2118867_kubetcd2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Cookbook cookbooks.sre.ganeti.reimage started by root@cumin1001 for host kubetcd2004.codfw.wmnet with OS bullseye completed:

  • kubetcd2004 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302211059_root_2118175_kubetcd2004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Mentioned in SAL (#wikimedia-operations) [2023-02-21T12:34:33Z] <jayme@cumin1001> END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: T329664

Mentioned in SAL (#wikimedia-operations) [2023-02-21T12:35:42Z] <jayme@cumin1001> START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: T329664

Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubernetes2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubernetes2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubernetes2016.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubernetes2015.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2010.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2020.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2023.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2007.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2008.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2013.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2011.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2012.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2014.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2022.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2024.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubernetes2005.codfw.wmnet with OS bullseye completed:

  • kubernetes2005 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302211239_jayme_2395245_kubernetes2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubernetes2016.codfw.wmnet with OS bullseye completed:

  • kubernetes2016 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302211240_jayme_2396293_kubernetes2016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1001 for host kubernetes2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubernetes2006.codfw.wmnet with OS bullseye completed:

  • kubernetes2006 (WARN)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302211239_jayme_2395507_kubernetes2006.out
    • Unable to run puppet on puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubernetes2015.codfw.wmnet with OS bullseye completed:

  • kubernetes2015 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302211240_jayme_2396018_kubernetes2015.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2023.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2023 (FAIL)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211243_jayme_2404746_kubernetes2023.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2020.codfw.wmnet with OS bullseye completed:

  • kubernetes2020 (WARN)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211243_jayme_2404266_kubernetes2020.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2010.codfw.wmnet with OS bullseye completed:

  • kubernetes2010 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211243_jayme_2403367_kubernetes2010.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2007.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211249_elukey_2417790_kubernetes2007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2024.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2024 (FAIL)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211257_jayme_2442995_kubernetes2024.out
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2013.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2013 (FAIL)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211251_jayme_2423672_kubernetes2013.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2012.codfw.wmnet with OS bullseye completed:

  • kubernetes2012 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211254_elukey_2432650_kubernetes2012.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2008.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2008 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211250_elukey_2422510_kubernetes2008.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2022.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2022 (FAIL)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211257_jayme_2441599_kubernetes2022.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2011.codfw.wmnet with OS bullseye completed:

  • kubernetes2011 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211253_elukey_2429025_kubernetes2011.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2014.codfw.wmnet with OS bullseye completed:

  • kubernetes2014 (WARN)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211255_jayme_2436903_kubernetes2014.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 890392 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Update wikikube-codfw settings to k8s 1.23

https://gerrit.wikimedia.org/r/890392

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1001 for host kubernetes2009.codfw.wmnet with OS bullseye completed:

  • kubernetes2009 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211314_jayme_2496675_kubernetes2009.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
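
Each reimage report above opens with a `• host (STATUS)` bullet, where the status is PASS, WARN or FAIL. A minimal Python sketch for tallying those outcomes from pasted report text could look like the following. The helper name `tally` is hypothetical, and the sample is a handful of lines copied from the reports above rather than the full timeline.

```python
import re
from collections import Counter

# Matches only the per-host header bullet, e.g. "  • kubernetes2023 (FAIL)".
STATUS_RE = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|WARN|FAIL)\)\s*$")


def tally(lines):
    """Map each host to the status of its most recent report."""
    results = {}
    for line in lines:
        match = STATUS_RE.match(line)
        if match:
            # A later report for the same host overwrites the earlier one.
            results[match.group(1)] = match.group(2)
    return results


sample = [
    "  • kubernetes2023 (FAIL)",
    "    • Disabled Puppet",
    "  • kubernetes2020 (WARN)",
    "  • kubernetes2010 (PASS)",
    "  • kubernetes2007 (FAIL)",
]
results = tally(sample)
counts = Counter(results.values())
failed = sorted(host for host, status in results.items() if status == "FAIL")
print(counts, failed)
```

Keeping only the latest status per host is a design choice for this sketch, since some hosts were reimaged more than once.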

Change 890824 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add kubernetes202[3,4] to the wikikube-codfw cluster

https://gerrit.wikimedia.org/r/890824

Change 890832 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kubernetes::{master,worker}: add kubernetes202[34]

https://gerrit.wikimedia.org/r/890832

Change 890833 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] conftool: add kubernetes202[3,4] to kubesvc

https://gerrit.wikimedia.org/r/890833

Change 890834 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/homer/public@master] Add kubernetes202[3,4] to its k8s_neighbors list

https://gerrit.wikimedia.org/r/890834

Change 890824 merged by Elukey:

[operations/puppet@production] Add kubernetes202[3,4] to the wikikube-codfw cluster

https://gerrit.wikimedia.org/r/890824

Change 890834 merged by Elukey:

[operations/homer/public@master] Add kubernetes202[3,4] to its k8s_neighbors list

https://gerrit.wikimedia.org/r/890834

Change 890838 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] restbase: Update kubernetes ip ranges

https://gerrit.wikimedia.org/r/890838

Change 890838 merged by JMeybohm:

[operations/puppet@production] restbase: Update kubernetes ip ranges

https://gerrit.wikimedia.org/r/890838

Change 890832 merged by Elukey:

[operations/puppet@production] role::kubernetes::{master,worker}: add kubernetes202[34]

https://gerrit.wikimedia.org/r/890832

Change 890833 merged by Elukey:

[operations/puppet@production] conftool: add kubernetes202[3,4] to kubesvc

https://gerrit.wikimedia.org/r/890833

Mentioned in SAL (#wikimedia-operations) [2023-02-21T15:29:42Z] <jayme@cumin1001> END (PASS) - Cookbook sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: T329664
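
To put a number on how long the final cookbook run took, the SAL timestamps can be parsed directly. The Python sketch below assumes that the START entry at 12:35:42Z and the END (PASS) entry at 15:29:42Z belong to the same sre.k8s.upgrade-cluster run; `parse_sal` is a hypothetical helper written for this example.

```python
import re
from datetime import datetime

# Timestamp in brackets, then "<user@host>", then START or END (PASS|FAIL).
SAL_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] <[^>]+> (?P<event>START|END \((?:PASS|FAIL)\))"
)


def parse_sal(line):
    """Return (timestamp, event) for one SAL cookbook entry."""
    match = SAL_RE.search(line)
    if match is None:
        raise ValueError(f"not a SAL cookbook entry: {line!r}")
    timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%SZ")
    return timestamp, match.group("event")


start_line = (
    "[2023-02-21T12:35:42Z] <jayme@cumin1001> START - Cookbook "
    "sre.k8s.upgrade-cluster Upgrade K8s version: T329664"
)
end_line = (
    "[2023-02-21T15:29:42Z] <jayme@cumin1001> END (PASS) - Cookbook "
    "sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: T329664"
)

started, _ = parse_sal(start_line)
ended, outcome = parse_sal(end_line)
elapsed = ended - started
print(outcome, elapsed)  # END (PASS) 2:54:00
```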

JMeybohm updated the task description.

I noticed that the Istio Gateway and Istio Control Plane dashboards are missing metrics. Maybe something changed?

I just noticed that job/service Prometheus probes in codfw have been flapping a bit since the upgrade to 1.23. There seem to be significantly more "context deadline exceeded" errors than before. I noticed this for miscweb, but other Kubernetes services are affected too:
https://logstash.wikimedia.org/goto/c445d50124df1dcd85739700a26fd9bc
For comparison before the upgrade:
https://logstash.wikimedia.org/goto/1184f322d2d08333ae3c5ec5b85524e9

I was not able to find these timeouts in Grafana metrics, but I'll add the dashboard link if I find them.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2017.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2018.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2019.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2020.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2021.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2017.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2017 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302220946_elukey_2901390_kubernetes2017.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2018.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2018 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302221004_elukey_2907599_kubernetes2018.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2020.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2020 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302221004_elukey_2907857_kubernetes2020.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2021.codfw.wmnet with OS bullseye completed:

  • kubernetes2021 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302221005_elukey_2907956_kubernetes2021.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

The kubernetes2017-2021 nodes have been reimaged and are now cordoned, pending ServiceOps' final check.
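
For whoever does the final check: a cordoned node is one with `spec.unschedulable: true` in its Node object, so the list of nodes still waiting to be uncordoned can be pulled from the JSON that `kubectl get nodes -o json` prints. Below is a minimal sketch of that filter, not something that was run as part of this upgrade. The function name `cordoned_nodes` and the sample data are made up for illustration; only the `items`, `metadata.name` and `spec.unschedulable` fields come from the Kubernetes Node API.

```python
import json


def cordoned_nodes(node_list_json: str) -> list[str]:
    """Return the sorted names of nodes marked unschedulable (cordoned).

    Expects the JSON document printed by `kubectl get nodes -o json`.
    """
    data = json.loads(node_list_json)
    return sorted(
        item["metadata"]["name"]
        for item in data.get("items", [])
        # `unschedulable` is absent on nodes that were never cordoned
        if item.get("spec", {}).get("unschedulable", False)
    )


# Hypothetical sample input, trimmed down to the fields used above.
sample = json.dumps(
    {
        "items": [
            {"metadata": {"name": "kubernetes2017.codfw.wmnet"},
             "spec": {"unschedulable": True}},
            {"metadata": {"name": "kubernetes2016.codfw.wmnet"},
             "spec": {}},
            {"metadata": {"name": "kubernetes2021.codfw.wmnet"},
             "spec": {"unschedulable": True}},
        ]
    }
)

for name in cordoned_nodes(sample):
    print(name)
```

In practice the input would come from something like `kubectl get nodes -o json` piped into the script; once the check passes, each listed node can be returned to service with `kubectl uncordon <node>`.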

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2019.codfw.wmnet with OS bullseye completed:

  • kubernetes2019 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302221004_elukey_2907698_kubernetes2019.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
akosiaris subscribed.

The extra hosts have been reimaged, and the cluster has been put back in rotation and is serving traffic successfully. I am resolving this. \o/

Hi, just FYI, this did cause some issues in the Analytics Cluster. Context here: https://phabricator.wikimedia.org/T330236#8637831

This isn't your fault, more of a design flaw in a Hadoop ingestion monitoring pipeline. I don't have a better idea atm though.

Change 891508 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] custom.d: update istio configs for k8s 1.23

https://gerrit.wikimedia.org/r/891508

Change 891508 merged by jenkins-bot:

[operations/deployment-charts@master] custom.d: update istio configs for k8s 1.23

https://gerrit.wikimedia.org/r/891508