Update remaining Ganeti servers in eqiad to Bookworm
Closed, Resolved (Public)

Description

Drain, reimage and re-add to cluster:

  • ganeti1023 A
  • ganeti1024 C
  • ganeti1025 A
  • ganeti1026 A
  • ganeti1027 C
  • ganeti1028 C
  • ganeti1029 A
  • ganeti1030 A
  • ganeti1031 A
  • ganeti1032 A
  • ganeti1033 D
  • ganeti1034 D
  • ganeti1035 A
  • ganeti1036 B
  • ganeti1037 C
  • ganeti1038 D
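
As a planning aid, the host list above can be grouped by its trailing letter. This is only a sketch: it assumes the letter is the eqiad row (Ganeti node group) each server sits in, which the later "rebalance ganeti eqiad/A" log entries suggest but the description does not state. The names `HOSTS`, `hosts_by_row` and `BY_ROW` are hypothetical and not part of any cookbook.

```python
from collections import defaultdict

# Host -> letter pairs copied from the task description.
# Assumption: the letter is the eqiad row / Ganeti node group.
HOSTS = {
    "ganeti1023": "A", "ganeti1024": "C", "ganeti1025": "A", "ganeti1026": "A",
    "ganeti1027": "C", "ganeti1028": "C", "ganeti1029": "A", "ganeti1030": "A",
    "ganeti1031": "A", "ganeti1032": "A", "ganeti1033": "D", "ganeti1034": "D",
    "ganeti1035": "A", "ganeti1036": "B", "ganeti1037": "C", "ganeti1038": "D",
}


def hosts_by_row(hosts):
    """Group host names by row letter, with names sorted inside each row."""
    rows = defaultdict(list)
    for name, row in sorted(hosts.items()):
        rows[row].append(name)
    return dict(rows)


BY_ROW = hosts_by_row(HOSTS)

# Print one line per row so the remaining work per row is visible at a glance.
for row in sorted(BY_ROW):
    print(f"row {row}: {len(BY_ROW[row])} host(s): {', '.join(BY_ROW[row])}")
```

Grouping this way makes it easy to see how many servers in a row still need reimaging before that row is rebalanced.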

Event Timeline

VM kubestagemaster1004.eqiad.wmnet switching disk type to drbd

Draining ganeti1036.eqiad.wmnet of running VMs

VM kubestagemaster1004.eqiad.wmnet switching disk type to plain

Draining ganeti1036.eqiad.wmnet of running VMs

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bookworm completed:

  • ganeti1025 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502191254_jmm_1348790_ganeti1025.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=9ff89e50-cdd1-449a-a676-876c36729c2f) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1036.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1036.eqiad.wmnet with OS bookworm

Draining ganeti1026.eqiad.wmnet of running VMs

Draining ganeti1026.eqiad.wmnet of running VMs

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1036.eqiad.wmnet with OS bookworm completed:

  • ganeti1036 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502200820_jmm_1562186_ganeti1036.out
    • Unable to run puppet on config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=8efe0251-40ee-433b-a080-3bef582e4f79) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1026.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1026.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1026.eqiad.wmnet with OS bookworm completed:

  • ganeti1026 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502210702_jmm_1787091_ganeti1026.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti1024.eqiad.wmnet of running VMs

Draining ganeti1024.eqiad.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=dce06e0b-27de-4e76-8cf6-d4947764ef79) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1024.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1024 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1024.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm

Draining ganeti1030.eqiad.wmnet of running VMs

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bookworm completed:

  • ganeti1024 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502270753_jmm_3504904_ganeti1024.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=4ced1ba3-f166-422d-a9cb-6875dd47d2ed) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1030.eqiad.wmnet

Draining ganeti1027.eqiad.wmnet of running VMs

Draining ganeti1027.eqiad.wmnet of running VMs

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bookworm

VM install1004.wikimedia.org switching disk type to plain

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1027 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1027.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

VM install1004.wikimedia.org switching disk type to plain

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1030.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1030.eqiad.wmnet with OS bookworm completed:

  • ganeti1030 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502280748_jmm_4174070_ganeti1030.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bookworm completed:

  • ganeti1027 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502280849_jmm_9609_ganeti1027.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-03-03T13:24:22Z] <moritzm> failover Ganeti master in eqiad to ganeti1048 T382507

Draining ganeti1031.eqiad.wmnet of running VMs

VM kubestagemaster1005.eqiad.wmnet switching disk type to drbd

Draining ganeti1031.eqiad.wmnet of running VMs

VM kubestagemaster1005.eqiad.wmnet switching disk type to plain

Draining ganeti1031.eqiad.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=49bd4e46-521c-46ca-9334-5c777206e882) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1031.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1031.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1031.eqiad.wmnet with OS bookworm completed:

  • ganeti1031 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503041457_jmm_3967093_ganeti1031.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti1032.eqiad.wmnet of running VMs

VM aux-k8s-etcd1003.eqiad.wmnet switching disk type to drbd

Draining ganeti1032.eqiad.wmnet of running VMs

VM aux-k8s-etcd1003.eqiad.wmnet switching disk type to plain

VM dse-k8s-etcd1001.eqiad.wmnet switching disk type to drbd

Draining ganeti1032.eqiad.wmnet of running VMs

VM dse-k8s-etcd1001.eqiad.wmnet switching disk type to plain

Draining ganeti1032.eqiad.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=836a9ab9-c457-4a78-ab8b-24d0332b99af) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1032.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1032.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1032.eqiad.wmnet with OS bookworm completed:

  • ganeti1032 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503060746_jmm_1788021_ganeti1032.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti1035.eqiad.wmnet of running VMs

Draining ganeti1035.eqiad.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=25ef85c8-8d74-4903-a4fb-449180b148f4) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1035.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1035.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1035.eqiad.wmnet with OS bookworm completed:

  • ganeti1035 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503070846_jmm_3262634_ganeti1035.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti1028.eqiad.wmnet of running VMs

Draining ganeti1028.eqiad.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=9fc8bc6c-fcab-42ee-95e1-ca8c3f853132) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1028.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1028.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1028.eqiad.wmnet with OS bookworm completed:

  • ganeti1028 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503101333_jmm_3582683_ganeti1028.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti1037.eqiad.wmnet of running VMs

Draining ganeti1037.eqiad.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=43cdf866-0dde-4aee-ad05-0604c388b7b3) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1037.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1037.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1037.eqiad.wmnet with OS bookworm completed:

  • ganeti1037 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503120907_jmm_1944109_ganeti1037.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti1034.eqiad.wmnet of running VMs

Draining ganeti1034.eqiad.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=a0399a93-44e5-45af-80d2-7c6886b8bcc5) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1034.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1034.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1034.eqiad.wmnet with OS bookworm completed:

  • ganeti1034 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503130828_jmm_3325262_ganeti1034.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Draining ganeti1029.eqiad.wmnet of running VMs

Draining ganeti1029.eqiad.wmnet of running VMs

Icinga downtime and Alertmanager silence (ID=fbeb54b5-2eb9-44e3-bebb-3ffb0c131169) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1029.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1029.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1029.eqiad.wmnet with OS bookworm completed:

  • ganeti1029 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503181019_jmm_2053879_ganeti1029.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-03-18T12:35:33Z] <moritzm> rebalance ganeti eqiad/A following reimages T382507

Mentioned in SAL (#wikimedia-operations) [2025-03-19T07:51:21Z] <moritzm> rebalance ganeti eqiad/B following reimages T382507

Mentioned in SAL (#wikimedia-operations) [2025-03-20T07:24:37Z] <moritzm> rebalance ganeti eqiad/C following reimages T382507

Mentioned in SAL (#wikimedia-operations) [2025-03-24T07:28:36Z] <moritzm> rebalance ganeti eqiad/D following reimages T382507
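
Because some hosts were reimaged more than once, it can help to reduce the cookbook result lines to the latest outcome per host. The following is a minimal sketch, assuming result lines keep the `• host (STATUS)` shape seen in this timeline; `LINES` holds only a sample copied from it, and `RESULT`, `final_status` and `FINAL` are hypothetical names, not part of the reimage cookbook.

```python
import re

# Sample of result lines copied from the timeline, in order of appearance.
LINES = [
    "  • ganeti1025 (PASS)",
    "  • ganeti1036 (WARN)",
    "  • ganeti1024 (FAIL)",
    "  • ganeti1024 (PASS)",
    "  • ganeti1027 (FAIL)",
    "  • ganeti1027 (PASS)",
]

# Assumed line shape: bullet, host name, status in parentheses.
RESULT = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|WARN|FAIL)\)\s*$")


def final_status(lines):
    """Return the last reported status per host; later lines override earlier ones."""
    status = {}
    for line in lines:
        match = RESULT.match(line)
        if match:
            status[match.group(1)] = match.group(2)
    return status


FINAL = final_status(LINES)
for host in sorted(FINAL):
    print(host, FINAL[host])
```

With this sample, the two hosts whose first attempt failed end up as PASS, while ganeti1036 remains at WARN and is the one worth a manual follow-up on its downtime and SSH key steps.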