
Reimage physical lists hosts to have public IPs
Closed, Resolved (Public)

Description

The physical hosts (lists1004, lists2001) have private IPs, which won't work for incoming mail. Reimage them with these instructions, but keep the same hostnames.

  • lists1004
  • lists2001

Event Timeline

cookbooks.sre.hosts.decommission executed by eoghan@cumin1002 for hosts: lists2001.codfw.wmnet

  • lists2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1002 for host lists2001.wikimedia.org with OS bookworm

LSobanski moved this task from Incoming to Work in Progress on the collaboration-services board.

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1002 for host lists2001.wikimedia.org with OS bookworm executed with errors:

  • lists2001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" lists2001.wikimedia.org to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bullseye

The bookworm attempt started by Eoghan got stuck at the partitioning step in the Debian installer with "No root file system is defined".

Saw this via ssh root@lists2001.mgmt.codfw.wmnet -> console com2.

Attempted to reimage with bullseye and it went past the partitioning step and kept installing, with no change to the partman config.

Might this be a newer hardware model that nobody has installed with bookworm yet? Or maybe no other machine using the raid-2dev config has been installed with bookworm so far?

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bullseye completed:

  • lists2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404262121_dzahn_2467925_lists2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
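
Comparing the step list of the failed bookworm run with the first steps of the successful bullseye run above shows where the first attempt stopped. The following is only a rough sketch of that comparison: the step strings are copied from the two reports in this task, while the helper name first_missing_step is made up for illustration and is not part of the cookbook.

```python
# Steps reported by the failed bookworm run (the final "The reimage failed"
# message is left out because it is the error notice, not a completed step).
FAILED_BOOKWORM = [
    "Removed from Puppet and PuppetDB if present and deleted any certificates",
    "Removed from Debmonitor if present",
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "Host up (Debian installer)",
    "Add puppet_version metadata to Debian installer",
    "Checked BIOS boot parameters are back to normal",
]

# First steps reported by the successful bullseye run.
PASSED_BULLSEYE = FAILED_BOOKWORM + [
    "Host up (new fresh bullseye OS)",
    "Generated Puppet certificate",
    "Signed new Puppet certificate",
]


def first_missing_step(failed, passed):
    """Return the first step of `passed` that `failed` did not reach, or None."""
    for index, step in enumerate(passed):
        if index >= len(failed) or failed[index] != step:
            return step
    return None


if __name__ == "__main__":
    print(first_missing_step(FAILED_BOOKWORM, PASSED_BULLSEYE))
```

The first step the failed run never reached is "Host up (new fresh bullseye OS)", so the failure happened while the Debian installer was running. That is consistent with the partitioning error seen on the console.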

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bookworm

Somehow it worked on the next attempt with bookworm as well. It must have been a fluke. Host is up now with bookworm, with no config change compared to before.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bookworm completed:

  • lists2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404262207_dzahn_2507991_lists2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
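
The two log paths reported for lists2001 can be used to estimate how far apart the bullseye and bookworm runs were. The sketch below assumes the file name has the form timestamp_user_number_host.out, with the timestamp as YYYYMMDDHHMM. That layout is inferred from the paths in this task and is not confirmed anywhere here. The meaning of the numeric field is not stated, so the field name run_id is purely hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # assumed to be the timestamp prefix of the file name
    user: str
    run_id: int        # hypothetical name; the task does not say what this number is
    host: str


def parse_log_path(path: str) -> ReimageLog:
    """Split a reimage log path into its assumed components."""
    stem = PurePosixPath(path).stem  # drops the directory and the ".out" suffix
    stamp, user, run_id, host = stem.split("_", 3)
    return ReimageLog(datetime.strptime(stamp, "%Y%m%d%H%M"), user, int(run_id), host)


# Paths copied from the two successful lists2001 reports in this task.
BULLSEYE_LOG = "/var/log/spicerack/sre/hosts/reimage/202404262121_dzahn_2467925_lists2001.out"
BOOKWORM_LOG = "/var/log/spicerack/sre/hosts/reimage/202404262207_dzahn_2507991_lists2001.out"

if __name__ == "__main__":
    first = parse_log_path(BULLSEYE_LOG)
    second = parse_log_path(BOOKWORM_LOG)
    print(second.started - first.started)
```

If the assumption about the timestamp holds, the two log files were named 46 minutes apart.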

cookbooks.sre.hosts.decommission executed by aokoth@cumin1002 for hosts: lists1004.eqiad.wmnet

  • lists1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by aokoth@cumin1002 for host lists1004.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by aokoth@cumin1002 for host lists1004.wikimedia.org with OS bookworm executed with errors:

  • lists1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" lists1004.wikimedia.org to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by aokoth@cumin1002 for host lists1004.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by aokoth@cumin1002 for host lists1004.wikimedia.org with OS bookworm completed:

  • lists1004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404302021_aokoth_3611127_lists1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Both hosts have now been reprovisioned with public IPs. Thanks @Arnoldokoth for taking care of lists1004!
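
For a quick overview of the reimage attempts recorded in this task, the cookbook result lines can be grouped per host. This is a minimal sketch: the regular expression is written against the wording of the result lines as they appear in this timeline, and the names RESULT_LINE and summarize are illustrative only.

```python
import re
from collections import defaultdict

# Matches the result lines ("... completed:" / "... executed with errors:").
# The "was started by" announcement lines intentionally do not match.
RESULT_LINE = re.compile(
    r"Cookbook cookbooks\.sre\.hosts\.reimage started by "
    r"(?P<user>[^@\s]+)@(?P<cumin>\S+) for host (?P<host>\S+) "
    r"with OS (?P<os>\S+) (?P<outcome>completed|executed with errors):"
)

# Lines copied from this task's timeline.
TIMELINE = [
    "Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1002 for host lists2001.wikimedia.org with OS bookworm",
    "Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1002 for host lists2001.wikimedia.org with OS bookworm executed with errors:",
    "Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bullseye completed:",
    "Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bookworm completed:",
    "Cookbook cookbooks.sre.hosts.reimage started by aokoth@cumin1002 for host lists1004.wikimedia.org with OS bookworm executed with errors:",
    "Cookbook cookbooks.sre.hosts.reimage started by aokoth@cumin1002 for host lists1004.wikimedia.org with OS bookworm completed:",
]


def summarize(lines):
    """Group (os, result) pairs by host, in the order they were reported."""
    summary = defaultdict(list)
    for line in lines:
        match = RESULT_LINE.search(line)
        if match:
            result = "PASS" if match["outcome"] == "completed" else "FAIL"
            summary[match["host"]].append((match["os"], result))
    return dict(summary)


if __name__ == "__main__":
    for host, attempts in summarize(TIMELINE).items():
        print(host, attempts)
```

Both hosts show one failed bookworm attempt followed by a successful one, with an additional successful bullseye install on lists2001 in between.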