Migrate cloudweb, cloudbackup, cloudmetrics physical servers off buster
Closed, Resolved, Public

Description

According to https://os-reports.wikimedia.org/buster.html, the following cloud roles need to be migrated off buster:

role::wmcs::ceph::mon (6 host(s))
role::wmcs::ceph::osd (20 host(s))
role::wmcs::monitoring (2 host(s))
role::wmcs::openstack::codfw1dev::cloudweb (1 host(s))
role::wmcs::openstack::eqiad1::backy (2 host(s))
role::wmcs::openstack::eqiad1::labweb (2 host(s))

This equates to these physical hosts:

  • cloudbackup1003.eqiad.wmnet
  • cloudbackup1004.eqiad.wmnet
  • cloudmetrics1003.eqiad.wmnet (reclaimed T351077)
  • cloudmetrics1004.eqiad.wmnet (reclaimed T351077)
  • cloudweb1003.wikimedia.org (now bullseye)
  • cloudweb1004.wikimedia.org (now bullseye)
  • cloudweb2002-dev.wikimedia.org (now bullseye)

Ceph hosts are already covered / tracked by T309789.

Event Timeline

Note, WMCS would like to migrate directly to bookworm if possible.

Change #1023466 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmcs VM backups: move all backups to one host, cloudbackup1004

https://gerrit.wikimedia.org/r/1023466

Change #1023467 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmcs VM backups: move all backups to one host, cloudbackup1003

https://gerrit.wikimedia.org/r/1023467

Change #1023468 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "wmcs VM backups: move all backups to one host, cloudbackup1003" Revert "wmcs VM backups: move all backups to one host, cloudbackup1004"

https://gerrit.wikimedia.org/r/1023468

Change #1023466 merged by Andrew Bogott:

[operations/puppet@production] wmcs VM backups: move all backups to one host, cloudbackup1004

https://gerrit.wikimedia.org/r/1023466

It is safe to reimage cloudbackup1003 on April 30.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudbackup1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudbackup1003.eqiad.wmnet with OS bookworm completed:

  • cloudbackup1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404290217_andrew_2941912_cloudbackup1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1023467 merged by Andrew Bogott:

[operations/puppet@production] wmcs VM backups: move all backups to one host, cloudbackup1003

https://gerrit.wikimedia.org/r/1023467

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudbackup1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudbackup1004.eqiad.wmnet with OS bookworm completed:

  • cloudbackup1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405061353_andrew_839804_cloudbackup1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1023468 merged by Andrew Bogott:

[operations/puppet@production] Revert "wmcs VM backups: move all backups to one host"

https://gerrit.wikimedia.org/r/1023468

Andrew updated the task description.
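
For readers following the timeline, the ordering used for the two cloudbackup hosts (consolidate backups on one host, reimage the other, swap, then revert the consolidation) can be summarized as a small sketch. This is only an illustration of the sequence recorded above: the function name and the step labels are hypothetical, and none of them correspond to real cookbook or Puppet commands.

```python
from typing import List, Tuple

# (action, target) pairs; the action names are illustrative labels only,
# not real cookbook or Puppet invocations.
Step = Tuple[str, str]


def rolling_reimage_plan(hosts: List[str]) -> List[Step]:
    """Return the ordered steps for reimaging a pair of backup hosts one at a time.

    Assumption: exactly two hosts share the backup load, as with
    cloudbackup1003 and cloudbackup1004 in this task.
    """
    if len(hosts) != 2:
        raise ValueError("this sketch only handles a pair of hosts")
    first, second = hosts
    return [
        ("consolidate_backups_on", second),   # puppet change: all backups on the other host
        ("reimage", first),                   # first host is now idle and can be reimaged
        ("consolidate_backups_on", first),    # move all backups to the freshly reimaged host
        ("reimage", second),                  # second host is now idle and can be reimaged
        ("restore_split_backups", "both"),    # revert the consolidation changes
    ]


plan = rolling_reimage_plan(["cloudbackup1003", "cloudbackup1004"])
for action, target in plan:
    print(f"{action}: {target}")
```

The point of the ordering is that a host is reimaged only after the change moving its backups elsewhere has been merged, so one host is always carrying the backups.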