Upgrade DSE to k8s 1.23
Closed, Resolved (Public)

Description

The procedure should be the following:

  1. Add new Intermediate PKI CAs for DSE (requirement for k8s 1.23)
  2. Prepare puppet and deployment-charts changes. Examples:

Don't merge them yet :)

  3. Run the sre.k8s.upgrade-cluster cookbook on a cumin node. The command below includes --dry-run; drop that flag for the real run:
sudo cookbook --dry-run sre.k8s.upgrade-cluster --reason "Upgrade to k8s 1.23" --k8s-cluster "dse-eqiad" --os bullseye --etcd-wipe-only

As part of the cookbook you'll need to merge the patches prepared in step 2; the cookbook will prompt for them at the right moment.
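
The difference between the rehearsal and the real run is only the --dry-run flag. A minimal sketch of that, assuming nothing beyond the flags quoted in the command above; the helper name upgrade_cluster_command is hypothetical and is not part of spicerack or the cookbooks repository:

```python
import shlex
from typing import List


def upgrade_cluster_command(dry_run: bool) -> List[str]:
    """Build the cookbook invocation quoted in the task description.

    Hypothetical helper: it only assembles the argument list, it does not
    run anything.
    """
    cmd = ["sudo", "cookbook"]
    if dry_run:
        # --dry-run is a global cookbook option, so it precedes the cookbook name.
        cmd.append("--dry-run")
    cmd += [
        "sre.k8s.upgrade-cluster",
        "--reason", "Upgrade to k8s 1.23",
        "--k8s-cluster", "dse-eqiad",
        "--os", "bullseye",
        "--etcd-wipe-only",
    ]
    return cmd


# Print both variants so they can be compared before running either by hand.
print(shlex.join(upgrade_cluster_command(dry_run=True)))
print(shlex.join(upgrade_cluster_command(dry_run=False)))
```

Printing both forms side by side makes it harder to paste the rehearsal command when the real run was intended, or the other way round.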

Event Timeline

Is there a known timeline / ETA for this upgrade?

@gmodena are you running any experiments on it? If not, I can try to do it tomorrow or Friday. I'm asking because I have to wipe everything :)

Change 891280 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::dse_k8s::{master,worker}: update settings to k8s 1.23

https://gerrit.wikimedia.org/r/891280

Change 891284 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: upgrade the DSE cluster to k8s 1.23

https://gerrit.wikimedia.org/r/891284

@elukey we have some Flink ops tasks this sprint that will require re-deploying our PoC app on DSE. It's unlikely that we'll deploy this week, and if you give us a maintenance window (or some heads up) we'll work around it.
tl;dr: let me know when we can use DSE, so that we don't get in your way :).

cc / @Ottomata

@gmodena I think elukey can just do this anytime, no? We don't mind if our stuff is deleted, we can redeploy, right?

@Ottomata absolutely. Just wanted to sync so we avoid attempting deployments / experiments during a maintenance window.

Change 891321 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pki::root_ca: add new pkis for the DSE k8s cluster

https://gerrit.wikimedia.org/r/891321

@gmodena we'll probably do it on Friday; there is some prep work to do first, which will happen tomorrow :)

Change 891321 merged by Elukey:

[operations/puppet@production] profile::pki::root_ca: add new pkis for the DSE k8s cluster

https://gerrit.wikimedia.org/r/891321

Change 891344 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] Add fake intermediate PKI key for DSE k8s

https://gerrit.wikimedia.org/r/891344

Change 891344 merged by Elukey:

[labs/private@master] Add fake intermediate PKI key for DSE k8s

https://gerrit.wikimedia.org/r/891344

Change 891346 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add K8s DSE intermediate PKI configs and public certs

https://gerrit.wikimedia.org/r/891346

Change 891346 merged by Elukey:

[operations/puppet@production] Add K8s DSE intermediate PKI configs and public certs

https://gerrit.wikimedia.org/r/891346

New PKI intermediates added, puppet/deployment-charts changes are out for review as well :)

Icinga downtime and Alertmanager silence (ID=74cbf082-10f9-46b9-9315-de46465fbfba) set by elukey@cumin1001 for 8:00:00 on 8 host(s) and their services with reason: Downtime DSE workers for cluster upgrade

dse-k8s-worker[1001-1008].eqiad.wmnet

Change 891280 merged by Elukey:

[operations/puppet@production] role::dse_k8s::{master,worker}: update settings to k8s 1.23

https://gerrit.wikimedia.org/r/891280

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1008 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye completed:

  • dse-k8s-worker1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302240911_elukey_3529676_dse-k8s-worker1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye completed:

  • dse-k8s-worker1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302240910_elukey_3529569_dse-k8s-worker1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye completed:

  • dse-k8s-worker1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302240911_elukey_3529799_dse-k8s-worker1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye

Change 891284 merged by Elukey:

[operations/deployment-charts@master] admin_ng: upgrade the DSE cluster to k8s 1.23

https://gerrit.wikimedia.org/r/891284

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302240912_elukey_3530060_dse-k8s-worker1006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302240913_elukey_3530187_dse-k8s-worker1007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

Full exception:

Exception raised while executing cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 499, in run
    fingerprint = self.puppet_installer.regenerate_certificate()[self.fqdn]
  File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 276, in regenerate_certificate
    raise PuppetHostsError(
spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are:
dse-k8s-worker1006.eqiad.wmnet: Error: request https://puppet:8140//puppet-ca/v1/certificate/ca failed: Failed to open TCP connection to puppet:8140 (getaddrinfo: Temporary failure in name resolution)
dse-k8s-worker1006.eqiad.wmnet: Error: Could not request certificate: Failed to open TCP connection to puppet:8140 (getaddrinfo: Temporary failure in name resolution)
The reimage failed, see the cookbook logs for the details
Reimage executed with errors:

  • dse-k8s-worker1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details
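
The exception above comes down to the freshly installed host being unable to resolve the name puppet (the getaddrinfo failure), so no CSR fingerprint could be produced. A minimal sketch of a resolution check one could run on such a host before retrying; the function name can_resolve is hypothetical, and the host name and port are simply the ones shown in the traceback:

```python
import socket


def can_resolve(hostname: str, port: int = 8140) -> bool:
    """Return True if the resolver can turn hostname into at least one address.

    This only exercises name resolution (the step that failed in the
    traceback); it does not open a TCP connection to the Puppet server.
    """
    try:
        socket.getaddrinfo(hostname, port)
    except socket.gaierror:
        # Covers "Temporary failure in name resolution" and similar errors.
        return False
    return True


if __name__ == "__main__":
    # "puppet" is the short name the Puppet agent tried to reach in the log.
    print("puppet resolves:", can_resolve("puppet"))
```

If this prints False on a host that has just come up, the problem is in DNS or the network path rather than in Puppet itself, which is consistent with the row E/F issue described later in this task.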

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

elukey changed the task status from Open to Stalled. Feb 24 2023, 2:49 PM

There is a bug triggered while reimaging nodes in rows E/F in eqiad, tracked in T306421. Until it is fixed we cannot complete the reimages of dse-k8s-worker1005 through dse-k8s-worker1008.

Icinga downtime and Alertmanager silence (ID=b1fe3d5f-2fa2-4c9e-92d5-c7b84f294e1e) set by elukey@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Cluster half broken, in the middle of upgrading

dse-k8s-worker1007.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed, asking the operator what to do
    • The reimage failed, see the cookbook logs for the details

Icinga downtime and Alertmanager silence (ID=42315681-7e02-4c17-bd5c-eef6680a2aa9) set by elukey@cumin1001 for 4 days, 0:00:00 on 5 host(s) and their services with reason: Downtime DSE workers for cluster upgrade

dse-k8s-worker[1001-1004,1007].eqiad.wmnet
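
The downtime messages use a compact bracket notation for host sets, for example dse-k8s-worker[1001-1004,1007].eqiad.wmnet. A simplified sketch of how such an expression expands into individual host names; this is an illustration written for this note (the function expand_hosts is hypothetical), not the parser used by cumin or ClusterShell, and it handles only a single bracket group of numeric ranges:

```python
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand one bracket group such as 'host[1001-1004,1007].example'."""
    match = re.fullmatch(r"(.*)\[([0-9,\-]+)\](.*)", pattern)
    if match is None:
        # No bracket group: the pattern is already a single host name.
        return [pattern]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start          # a bare number is a one-element range
        width = len(start)          # preserve zero padding, if any
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


for host in expand_hosts("dse-k8s-worker[1001-1004,1007].eqiad.wmnet"):
    print(host)
```

Expanded this way, the second silence covers workers 1001 to 1004 plus 1007, which matches the "5 host(s)" count in the message.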

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye completed:

  • dse-k8s-worker1006 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302271414_cmooney_239320_dse-k8s-worker1006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye completed:

  • dse-k8s-worker1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302271454_elukey_249516_dse-k8s-worker1007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302271454_elukey_249516_dse-k8s-worker1007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye completed:

  • dse-k8s-worker1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302271543_elukey_263784_dse-k8s-worker1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye completed:

  • dse-k8s-worker1008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302271544_elukey_263871_dse-k8s-worker1008.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302271544_elukey_263871_dse-k8s-worker1008.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
    • The reimage failed, see the cookbook logs for the details

Change 892519 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: add istio network policy config for DSE

https://gerrit.wikimedia.org/r/892519

Change 892522 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::dse_k8s::worker: update istio-cni version

https://gerrit.wikimedia.org/r/892522

Change 892522 merged by Elukey:

[operations/puppet@production] role::dse_k8s::worker: update istio-cni version

https://gerrit.wikimedia.org/r/892522

Change 892519 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: add istio network policy config for DSE

https://gerrit.wikimedia.org/r/892519

The DSE cluster is on k8s 1.23! I deployed everything up to istio/cfssl; we'll deploy more as needed. There seems to be an issue with hosts in rows E/F (see T306421), but I managed to bootstrap all of them anyway.
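
With this many reimage attempts in the timeline, it can help to reduce the cookbook summaries to the last reported status per host. A rough sketch, assuming the summary lines keep the "• host (PASS)" or "• host (FAIL)" shape seen above; the function name latest_status and the sample lines are illustrative only. Note that it reports what the cookbook last posted, which may differ from the final state of a host that was fixed by hand afterwards:

```python
import re
from typing import Dict, Iterable

# Matches only the per-host summary line, e.g. "  • dse-k8s-worker1006 (FAIL)".
# Step lines such as "• Host rebooted via IPMI" do not match.
SUMMARY = re.compile(r"^\s*[•-]\s*(\S+)\s+\(\**(PASS|FAIL)\**\)\s*$")


def latest_status(lines: Iterable[str]) -> Dict[str, str]:
    """Map each host to the status of its most recent summary line."""
    status: Dict[str, str] = {}
    for line in lines:
        found = SUMMARY.match(line)
        if found:
            # Later lines overwrite earlier ones, so the last attempt wins.
            status[found.group(1)] = found.group(2)
    return status


sample = [
    "  • dse-k8s-worker1006 (FAIL)",
    "    • Host rebooted via IPMI",
    "  • dse-k8s-worker1006 (PASS)",
    "  • dse-k8s-worker1005 (PASS)",
]
print(latest_status(sample))
```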

elukey changed the task status from Stalled to Open. Feb 27 2023, 6:05 PM
elukey triaged this task as Medium priority.

Change 892870 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: refactor DSE 1.23 config and disable istio sidecars in ns

https://gerrit.wikimedia.org/r/892870

Change 892870 merged by Elukey:

[operations/deployment-charts@master] admin_ng: refactor DSE 1.23 config and disable istio sidecars in ns

https://gerrit.wikimedia.org/r/892870