Migrate dse-k8s cluster from docker to containerd
Closed, Resolved (Public)

Description

As per the instructions in https://wikitech.wikimedia.org/wiki/Kubernetes/Administration/containerd_migration

We need to do this in order to unblock the next Kubernetes version upgrade.

A few things to bear in mind:

  • The dse-k8s cluster is currently mixed between bullseye and bookworm.
  • We already use two different partition recipes for different nodes in the cluster, so we may need to adapt the recipes to accommodate the larger kubelet volume.
  • We do not currently use dragonfly, so the instructions for that will not apply to the dse-k8s cluster.
  • Update the recipe in use
  • Puppet changes to make the hosts use containerd as per instructions provided

Reimage the hosts

  • dse-k8s-worker1001.eqiad.wmnet
  • dse-k8s-worker1002.eqiad.wmnet
  • dse-k8s-worker1003.eqiad.wmnet
  • dse-k8s-worker1004.eqiad.wmnet
  • dse-k8s-worker1005.eqiad.wmnet
  • dse-k8s-worker1006.eqiad.wmnet
  • dse-k8s-worker1007.eqiad.wmnet
  • dse-k8s-worker1008.eqiad.wmnet
  • dse-k8s-worker1009.eqiad.wmnet
  • dse-k8s-ctrl1001.eqiad.wmnet
  • dse-k8s-ctrl1002.eqiad.wmnet
  • dse-k8s-etcd1001.eqiad.wmnet
  • dse-k8s-etcd1002.eqiad.wmnet
  • dse-k8s-etcd1003.eqiad.wmnet

Event Timeline

Let's not touch this during December, it seems too risky. But we can already come up with a plan and probably a number of steps as subtasks.

We might also need to look at the rocm packages that are installed on dse-k8s-worker1001, since we don't currently have these for bookworm.
https://wikitech.wikimedia.org/wiki/Machine_Learning/AMD_GPU#Upgrade_the_Debian_packages

This will currently prevent a reimage of this host to bookworm, although all of the others can be done.

Change #1118844 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] dse-k8s: Use partman recipes for containerd with local storage support

https://gerrit.wikimedia.org/r/1118844

Change #1118846 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] dse-k8s: Stop installing the amd rocm packages to dse-k8s-worker1001

https://gerrit.wikimedia.org/r/1118846

Change #1118844 merged by Btullis:

[operations/puppet@production] dse-k8s: Use partman recipes for containerd with local storage support

https://gerrit.wikimedia.org/r/1118844

Change #1118846 merged by Btullis:

[operations/puppet@production] dse-k8s: Stop installing the amd rocm packages to dse-k8s-worker1001

https://gerrit.wikimedia.org/r/1118846

Change #1119087 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] update dsek8s cluster to use containerd

https://gerrit.wikimedia.org/r/1119087

Mentioned in SAL (#wikimedia-analytics) [2025-02-12T12:27:14Z] <btullis> draining dse-k8s-worker1001 ready for reimage to bookworm and containerd for T377875

Change #1119087 merged by Stevemunene:

[operations/puppet@production] update dsek8s cluster to use containerd

https://gerrit.wikimedia.org/r/1119087

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm

Change #1119102 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Change dse-k8s-worker1002 to use containerd

https://gerrit.wikimedia.org/r/1119102

Change #1119103 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Change dse-k8s-worker1003 to use containerd

https://gerrit.wikimedia.org/r/1119103

Change #1119104 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Change dse-k8s-worker1004 to use containerd

https://gerrit.wikimedia.org/r/1119104

Change #1119105 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Change dse-k8s-worker1009 to use containerd

https://gerrit.wikimedia.org/r/1119105

Change #1119106 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Remove docker related referrences on dse-k8s worker and master

https://gerrit.wikimedia.org/r/1119106

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm executed with errors:

  • dse-k8s-worker1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dse-k8s-worker1001.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-worker1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502121331_stevemunene_233698_dse-k8s-worker1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2025-02-12T15:13:19Z] <stevemunene> draining dse-k8s-worker1002 ready for reimage to bookworm and containerd for T377875

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm

Change #1119102 merged by Stevemunene:

[operations/puppet@production] Change dse-k8s-worker1002 to use containerd

https://gerrit.wikimedia.org/r/1119102

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm executed with errors:

  • dse-k8s-worker1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dse-k8s-worker1002.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm executed with errors:

  • dse-k8s-worker1002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dse-k8s-worker1002.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1002.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-worker1002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502121702_stevemunene_263701_dse-k8s-worker1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2025-02-13T09:55:14Z] <stevemunene> draining dse-k8s-worker1003 ready for reimage to bookworm and containerd for T377875

Change #1119103 merged by Stevemunene:

[operations/puppet@production] Change dse-k8s-worker1003 to use containerd

https://gerrit.wikimedia.org/r/1119103

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-worker1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502131050_stevemunene_400512_dse-k8s-worker1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-02-13T12:18:14Z] <stevemunene> draining dse-k8s-worker1004 ready for reimage to bookworm and containerd for T377875

Mentioned in SAL (#wikimedia-analytics) [2025-02-13T12:19:56Z] <stevemunene> draining dse-k8s-worker1004 ready for reimage to bookworm and containerd for T377875

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1004.eqiad.wmnet with OS bookworm

Change #1119104 merged by Stevemunene:

[operations/puppet@production] Change dse-k8s-worker1004 to use containerd

https://gerrit.wikimedia.org/r/1119104

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1004.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-worker1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502131245_stevemunene_418579_dse-k8s-worker1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2025-02-13T13:23:35Z] <stevemunene> draining dse-k8s-worker1005 ready for reimage to bookworm and containerd for T377875

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1005.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-worker1005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502131348_stevemunene_428050_dse-k8s-worker1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2025-02-13T14:29:16Z] <stevemunene> draining dse-k8s-worker1006 ready for reimage to bookworm and containerd for T377875

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1006.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1006.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-worker1006 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502131457_stevemunene_437953_dse-k8s-worker1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2025-02-13T16:17:42Z] <stevemunene> draining dse-k8s-worker1007 ready for reimage to bookworm and containerd for T377875

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1007.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1007.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-worker1007 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502131642_stevemunene_452672_dse-k8s-worker1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2025-02-13T17:06:38Z] <stevemunene> draining dse-k8s-worker1008 ready for reimage to bookworm and containerd for T377875

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1008.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1008.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-worker1008 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502131734_stevemunene_462392_dse-k8s-worker1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2025-02-13T18:01:10Z] <stevemunene> draining dse-k8s-worker1009 ready for reimage to bookworm and containerd for T377875

Change #1119105 merged by Stevemunene:

[operations/puppet@production] Change dse-k8s-worker1009 to use containerd

https://gerrit.wikimedia.org/r/1119105

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-worker1009 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502131822_stevemunene_471569_dse-k8s-worker1009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1121335 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Create dse-k8s control panel partman recipes

https://gerrit.wikimedia.org/r/1121335

Change #1121415 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/alerts@master] Fix team name typo for hadoop worker

https://gerrit.wikimedia.org/r/1121415

Change #1121415 merged by jenkins-bot:

[operations/alerts@master] Fix team name typo for hadoop worker

https://gerrit.wikimedia.org/r/1121415

Change #1121335 abandoned by Stevemunene:

[operations/puppet@production] Create dse-k8s control panel partman recipes

Reason:

The dse control panel hosts do not need a different partition recipe

https://gerrit.wikimedia.org/r/1121335

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-ctrl1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-ctrl1001.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-ctrl1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503051451_stevemunene_3248918_dse-k8s-ctrl1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2025-03-05T15:24:14Z] <stevemunene> draining and depooling dse-k8s-ctrl1002 ready for reimage to bookworm and containerd for T377875

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-ctrl1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-ctrl1002.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-ctrl1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503051541_stevemunene_3295021_dse-k8s-ctrl1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2025-03-06T06:06:31Z] <stevemunene> removing dse-k8s-etcd1001 from the dse-k8s cluster to allow a reimage to bookworm T377875

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-etcd1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-etcd1001.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-etcd1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503060625_stevemunene_3579285_dse-k8s-etcd1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

For the etcd hosts, we will follow the procedure documented on Wikitech at https://wikitech.wikimedia.org/wiki/Etcd#Reimage_nodes_a_cluster and reimage them one at a time.
The first step is to identify the leader, then start work on a host that is not the leader.

stevemunene@dse-k8s-etcd1001:~$ etcdctl --endpoints=https://dse-k8s-etcd1001.eqiad.wmnet:2379  member list
redacted: name=dse-k8s-etcd1002 peerURLs=https://dse-k8s-etcd1002.eqiad.wmnet:2380 clientURLs=https://dse-k8s-etcd1002.eqiad.wmnet:2379 isLeader=true
redacted: name=dse-k8s-etcd1001 peerURLs=https://dse-k8s-etcd1001.eqiad.wmnet:2380 clientURLs=https://dse-k8s-etcd1001.eqiad.wmnet:2379 isLeader=false
redacted: name=dse-k8s-etcd1003 peerURLs=https://dse-k8s-etcd1003.eqiad.wmnet:2380 clientURLs=https://dse-k8s-etcd1003.eqiad.wmnet:2379 isLeader=false
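
The leader check above can also be scripted. The following is a minimal sketch, not part of the Wikitech procedure: it assumes the `member list` output keeps the `id: key=value ...` layout shown above, and the function names `parse_members` and `reimage_order` are hypothetical helpers introduced for illustration.

```python
# Sample text copied from the `etcdctl member list` output above
# (member IDs are redacted in the task, so they stay redacted here).
MEMBER_LIST = """\
redacted: name=dse-k8s-etcd1002 peerURLs=https://dse-k8s-etcd1002.eqiad.wmnet:2380 clientURLs=https://dse-k8s-etcd1002.eqiad.wmnet:2379 isLeader=true
redacted: name=dse-k8s-etcd1001 peerURLs=https://dse-k8s-etcd1001.eqiad.wmnet:2380 clientURLs=https://dse-k8s-etcd1001.eqiad.wmnet:2379 isLeader=false
redacted: name=dse-k8s-etcd1003 peerURLs=https://dse-k8s-etcd1003.eqiad.wmnet:2380 clientURLs=https://dse-k8s-etcd1003.eqiad.wmnet:2379 isLeader=false
"""


def parse_members(text):
    """Parse `member list` lines into dictionaries (hypothetical helper)."""
    members = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # "<id>: name=... peerURLs=... clientURLs=... isLeader=..."
        member_id, _, rest = line.partition(": ")
        fields = dict(pair.split("=", 1) for pair in rest.split())
        members.append(
            {
                "id": member_id,
                "name": fields["name"],
                "peer_url": fields["peerURLs"],
                "client_url": fields["clientURLs"],
                "is_leader": fields["isLeader"] == "true",
            }
        )
    return members


def reimage_order(members):
    """Return member names with followers first and the leader last."""
    # sorted() is stable and False sorts before True, so followers keep
    # their listed order and the leader moves to the end.
    return [m["name"] for m in sorted(members, key=lambda m: m["is_leader"])]


if __name__ == "__main__":
    print(reimage_order(parse_members(MEMBER_LIST)))
```

For the sample above this puts dse-k8s-etcd1001 and dse-k8s-etcd1003 ahead of the leader dse-k8s-etcd1002, which matches the order followed in this task.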

Remove the host from the cluster, then reimage it in the usual way.

stevemunene@dse-k8s-etcd1002:~$ etcdctl -C https://dse-k8s-etcd1002.eqiad.wmnet:2379 member remove 352f48bc392d9ef9
Removed member 352f48bc392d9ef9 from cluster

stevemunene@dse-k8s-etcd1002:~$  etcdctl --endpoints=https://dse-k8s-etcd1002.eqiad.wmnet:2379  member list
redacted: name=dse-k8s-etcd1002 peerURLs=https://dse-k8s-etcd1002.eqiad.wmnet:2380 clientURLs=https://dse-k8s-etcd1002.eqiad.wmnet:2379 isLeader=true
redacted: name=dse-k8s-etcd1003 peerURLs=https://dse-k8s-etcd1003.eqiad.wmnet:2380 clientURLs=https://dse-k8s-etcd1003.eqiad.wmnet:2379 isLeader=false

stevemunene@dse-k8s-etcd1002:~$  etcdctl --endpoints=https://dse-k8s-etcd1002.eqiad.wmnet:2379 cluster-health
member redacted is healthy: got healthy result from https://dse-k8s-etcd1002.eqiad.wmnet:2379
member redacted is healthy: got healthy result from https://dse-k8s-etcd1003.eqiad.wmnet:2379
cluster is healthy

Then re-add the reimaged host and run Puppet on it.

stevemunene@dse-k8s-etcd1002:~$ etcdctl -C https://dse-k8s-etcd1002.eqiad.wmnet:2379 member add dse-k8s-etcd1001 https://dse-k8s-etcd1001.eqiad.wmnet:2380
Added member named dse-k8s-etcd1001 with ID 881d286caf64a60d to cluster

ETCD_NAME="dse-k8s-etcd1001"
ETCD_INITIAL_CLUSTER="dse-k8s-etcd1002=https://dse-k8s-etcd1002.eqiad.wmnet:2380,dse-k8s-etcd1001=https://dse-k8s-etcd1001.eqiad.wmnet:2380,dse-k8s-etcd1003=https://dse-k8s-etcd1003.eqiad.wmnet:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
stevemunene@dse-k8s-etcd1002:~$ etcdctl --endpoints=https://dse-k8s-etcd1002.eqiad.wmnet:2379 cluster-health
member redacted is healthy: got healthy result from https://dse-k8s-etcd1002.eqiad.wmnet:2379
member redacted is unreachable: no available published client urls
member redacted is healthy: got healthy result from https://dse-k8s-etcd1003.eqiad.wmnet:2379
cluster is degraded
stevemunene@dse-k8s-etcd1002:~$ etcdctl --endpoints=https://dse-k8s-etcd1002.eqiad.wmnet:2379 cluster-health
member redacted is healthy: got healthy result from https://dse-k8s-etcd1002.eqiad.wmnet:2379
member redacted is healthy: got healthy result from https://dse-k8s-etcd1001.eqiad.wmnet:2379
member redacted is healthy: got healthy result from https://dse-k8s-etcd1003.eqiad.wmnet:2379
cluster is healthy
stevemunene@dse-k8s-etcd1002:~$ etcdctl --endpoints=https://dse-k8s-etcd1002.eqiad.wmnet:2379 member list
redacted: name=dse-k8s-etcd1002 peerURLs=https://dse-k8s-etcd1002.eqiad.wmnet:2380 clientURLs=https://dse-k8s-etcd1002.eqiad.wmnet:2379 isLeader=true
redacted: name=dse-k8s-etcd1001 peerURLs=https://dse-k8s-etcd1001.eqiad.wmnet:2380 clientURLs=https://dse-k8s-etcd1001.eqiad.wmnet:2379 isLeader=false
redacted: name=dse-k8s-etcd1003 peerURLs=https://dse-k8s-etcd1003.eqiad.wmnet:2380 clientURLs=https://dse-k8s-etcd1003.eqiad.wmnet:2379 isLeader=false
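
Before moving to the next host, the cluster should report healthy again, as in the second `cluster-health` run above. A minimal sketch of how that check could be automated follows. It assumes the output keeps the `member <id> is <state>: <detail>` and `cluster is <state>` lines shown above, and `parse_cluster_health` is a hypothetical helper, not an existing tool.

```python
# Both samples are copied from the `etcdctl cluster-health` output above.
DEGRADED = """\
member redacted is healthy: got healthy result from https://dse-k8s-etcd1002.eqiad.wmnet:2379
member redacted is unreachable: no available published client urls
member redacted is healthy: got healthy result from https://dse-k8s-etcd1003.eqiad.wmnet:2379
cluster is degraded
"""

HEALTHY = """\
member redacted is healthy: got healthy result from https://dse-k8s-etcd1002.eqiad.wmnet:2379
member redacted is healthy: got healthy result from https://dse-k8s-etcd1001.eqiad.wmnet:2379
member redacted is healthy: got healthy result from https://dse-k8s-etcd1003.eqiad.wmnet:2379
cluster is healthy
"""


def parse_cluster_health(text):
    """Return (cluster_healthy, [(member_id, state, detail), ...])."""
    statuses = []
    cluster_healthy = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("member "):
            # "member <id> is <state>: <detail>"
            head, _, detail = line.partition(": ")
            _, member_id, _, state = head.split(" ", 3)
            statuses.append((member_id, state, detail))
        elif line.startswith("cluster is "):
            cluster_healthy = line == "cluster is healthy"
    return cluster_healthy, statuses


def safe_to_continue(text, expected_members=3):
    """True only if the cluster is healthy and every expected member reports healthy."""
    healthy, statuses = parse_cluster_health(text)
    return (
        healthy
        and len(statuses) == expected_members
        and all(state == "healthy" for _, state, _ in statuses)
    )


if __name__ == "__main__":
    print(safe_to_continue(DEGRADED))
    print(safe_to_continue(HEALTHY))
```

Requiring the expected member count as well as the summary line is a deliberate choice here: right after a `member remove`, a two-member cluster also reports "cluster is healthy", so the summary line alone does not show that the reimaged host has rejoined.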

Then proceed with the rest of the hosts

Mentioned in SAL (#wikimedia-analytics) [2025-03-06T07:31:40Z] <stevemunene> removing dse-k8s-etcd1003 from the dse-k8s cluster to allow a reimage to bookworm T377875

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-etcd1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-etcd1003.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-etcd1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503060756_stevemunene_3600322_dse-k8s-etcd1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2025-03-06T13:30:36Z] <stevemunene> initialize election of new leader on dse-k8s-etcd cluster to allow reimage of dse-k8s-etcd1002 T377875

The reimage of dse-k8s-etcd1002.eqiad.wmnet ran into some challenges: the recently upgraded hosts automatically picked up etcd v3.4.23, while the previously installed version was v3.3.25. The changelog lists some breaking changes, but most pertain to the v2 API, which we rarely use.

The remaining problem is that the leader of the three-node cluster is running the older version, and we have been unable to force the election of a new leader as described in the documentation. The alternative is to temporarily stop the service on the current leader so that the upgraded hosts elect a new leader among themselves, and then reimage the host once that is done. This may cause a short period of downtime or slowness, but it should work in theory.
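
Before stopping a member to trigger an election, it helps to confirm which member currently holds leadership, and to check again afterwards that it has moved. A minimal sketch, assuming the v2-style `member list` output format shown earlier in this task; `current_leader` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: reads v2-style `etcdctl member list` output on stdin
# and prints the name of the member flagged with isLeader=true.
current_leader() {
    awk '/isLeader=true/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^name=/) { sub(/^name=/, "", $i); print $i }
    }'
}

# Usage (endpoint taken from the earlier transcript):
# etcdctl --endpoints=https://dse-k8s-etcd1002.eqiad.wmnet:2379 member list \
#     | current_leader
```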

Mentioned in SAL (#wikimedia-analytics) [2025-03-10T09:56:10Z] <stevemunene> stop etcd.service on dse-k8s-etcd1002 to initialise election of new leader cluster to allow reimage of dse-k8s-etcd1002 T377875

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host dse-k8s-etcd1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host dse-k8s-etcd1002.eqiad.wmnet with OS bookworm completed:

  • dse-k8s-etcd1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503101036_stevemunene_2443417_dse-k8s-etcd1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Stevemunene updated the task description.

The dse-k8s cluster has been migrated to containerd, and all of the hosts are now running bookworm.

Change #1119106 merged by Stevemunene:

[operations/puppet@production] Remove docker related referrences on dse-k8s worker and master

https://gerrit.wikimedia.org/r/1119106