Page MenuHomePhabricator

Update Ganeti servers in drmrs to Bookworm
Closed, ResolvedPublic

Description

Drain, reimage and re-add to cluster:

  • ganeti6001 B12
  • ganeti6002 B13
  • ganeti6003 B12
  • ganeti6004 B13

Event Timeline

Volans triaged this task as Medium priority.Dec 23 2024, 11:36 AM

Draining ganeti6004.drmrs.wmnet of running VMs

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6004.drmrs.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6004.drmrs.wmnet with OS bookworm completed:

  • ganeti6004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507021017_jmm_3957476_ganeti6004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti6002.drmrs.wmnet of running VMs

Mentioned in SAL (#wikimedia-operations) [2025-07-02T13:30:32Z] <moritzm> failover Ganeti master in drmrs02 to ganeti6004 T382513

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6002.drmrs.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6002.drmrs.wmnet with OS bookworm completed:

  • ganeti6002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507021432_jmm_4010244_ganeti6002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti6003.drmrs.wmnet of running VMs

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6003.drmrs.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6003.drmrs.wmnet with OS bookworm completed:

  • ganeti6003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507030853_jmm_195039_ganeti6003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti6001.drmrs.wmnet of running VMs

Mentioned in SAL (#wikimedia-operations) [2025-07-04T06:32:07Z] <moritzm> failover Ganeti master in drmrs01 to ganeti6003 T382513

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti6001.drmrs.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti6001.drmrs.wmnet with OS bookworm completed:

  • ganeti6001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507040743_jmm_673023_ganeti6001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-07-08T06:35:47Z] <moritzm> rebalance following reimages T382513