
Update Ganeti servers in ulsfo to Bookworm
Closed, Resolved, Public

Description

Drain, reimage and re-add to the cluster:

  • ganeti4005
  • ganeti4006
  • ganeti4007
  • ganeti4008
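The rolling order implied by this task can be sketched as a small planning model. This is a minimal Python illustration, not the actual Spicerack cookbooks: `Step` and `rolling_plan` are hypothetical names, and treating ganeti4008 as the previous Ganeti master is an assumption inferred from the SAL entry that fails the master over to ganeti4005 before ganeti4008 is drained. Non-master nodes are handled one at a time, and the master role moves to an already-upgraded node before the old master is drained.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str  # one of: "drain", "reimage", "re-add", "failover-master"
    node: str


def rolling_plan(nodes: list[str], master: str) -> list[Step]:
    """Hypothetical planner for a one-node-at-a-time cluster upgrade.

    Assumption: the current master must not be drained until the master
    role has been moved to a node that is already upgraded.
    """
    if master not in nodes:
        raise ValueError(f"master {master!r} is not one of the nodes")
    # Upgrade non-master nodes first so an upgraded node exists to take over.
    order = [n for n in nodes if n != master] + [master]
    upgraded: list[str] = []
    plan: list[Step] = []
    for node in order:
        if node == master:
            if not upgraded:
                raise ValueError("need at least one upgraded node to fail over to")
            # Move the master role to the first upgraded node before draining.
            plan.append(Step("failover-master", upgraded[0]))
        plan.append(Step("drain", node))
        plan.append(Step("reimage", node))
        plan.append(Step("re-add", node))
        upgraded.append(node)
    return plan


if __name__ == "__main__":
    nodes = ["ganeti4005", "ganeti4006", "ganeti4007", "ganeti4008"]
    # Assumed master, inferred from the failover logged in this task.
    for step in rolling_plan(nodes, master="ganeti4008"):
        print(step.action, step.node)
```

With ganeti4008 assumed as master, the plan handles ganeti4005, ganeti4006 and ganeti4007 in turn, fails the master over to ganeti4005, and only then drains ganeti4008, which matches the order recorded in this task's timeline.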

Event Timeline

Volans triaged this task as Medium priority. Dec 23 2024, 11:36 AM

Draining ganeti4005.ulsfo.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=652b6b7b-5164-4a67-b73d-931451743ac2) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti4005.ulsfo.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bookworm completed:

  • ganeti4005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503271009_jmm_882626_ganeti4005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
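Each "First Puppet run" log path follows the same naming pattern, so the run details can be recovered from the file name alone. A minimal Python sketch, assuming the name has the form `<YYYYMMDDHHMM>_<user>_<number>_<host>.out` and that the number is a process ID (neither is documented in this task); `ReimageLog` and `parse_reimage_log` are hypothetical helpers, and timestamps are left naive because the time zone is not stated.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # assumed: start time of the run, time zone unknown
    user: str
    run_id: int        # assumed: process ID of the cookbook run
    host: str


def parse_reimage_log(path: str) -> ReimageLog:
    """Split a log name of the assumed form <timestamp>_<user>_<id>_<host>.out."""
    name = PurePosixPath(path).name
    if not name.endswith(".out"):
        raise ValueError(f"unexpected log file name: {name!r}")
    stem = name[: -len(".out")]
    # maxsplit=3 keeps any underscores in the host part intact;
    # fewer than four fields makes the unpacking raise ValueError.
    timestamp, user, run_id, host = stem.split("_", 3)
    started = datetime.strptime(timestamp, "%Y%m%d%H%M")
    return ReimageLog(started, user, int(run_id), host)


if __name__ == "__main__":
    log = parse_reimage_log(
        "/var/log/spicerack/sre/hosts/reimage/202503271009_jmm_882626_ganeti4005.out"
    )
    print(log.host, log.user, log.started.isoformat())
```

Applied to the four log paths in this task and sorted by `started`, this yields ganeti4005 (2025-03-27), ganeti4006 (2025-03-31), then ganeti4007 and ganeti4008 (both 2025-04-01).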

Draining ganeti4006.ulsfo.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=8b39a21c-2178-4c2d-85ec-b458f3c9ab46) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti4006.ulsfo.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bookworm completed:

  • ganeti4006 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503311156_jmm_514904_ganeti4006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti4007.ulsfo.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=754c015a-5966-406b-8711-e527c555dafe) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti4007.ulsfo.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4007.ulsfo.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4007.ulsfo.wmnet with OS bookworm completed:

  • ganeti4007 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202504010741_jmm_1271010_ganeti4007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-04-01T08:36:05Z] <moritzm> failover ganeti master in ulsfo to ganeti4005 T382511

Draining ganeti4008.ulsfo.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=0a440a7c-23d5-411a-82dc-b35d0662b15f) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti4008.ulsfo.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4008.ulsfo.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4008.ulsfo.wmnet with OS bookworm completed:

  • ganeti4008 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202504011314_jmm_1511944_ganeti4008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti6001.drmrs.wmnet of running VMs