
installation tracking for hosts affected by magru re-shuffle
Closed, ResolvedPublic

Description

Due to the issues noted in T376737, we need to reimage a large number of ops-magru hosts. This task holds the per-host checklist: what needs to be done and checked for each host before it returns to service.

Migration Step Checklist

Rob has compiled the following order of operations after the day 1 server swaps. Completing these steps in any other order may leave the two hosts not ready for reimage:

  • Support swaps each host pair around; Rob remotely confirms the hardware is correct and remotely accessible.
  • Take screenshots / note the network port and cable ID info for both hosts A and B.
  • Run decommission cookbook for both hosts A and B.
  • Pull up both A and B in Netbox, swap the hostname and rack location of A and B, and set both from Decommissioning to Planned state.
  • Pull up host A's mgmt IP and enter the updated FQDN under DNS name for the mgmt IP entry.
  • Run Provision Host Network details for host A, using host B's network port and cable info recorded earlier.
  • Pull up host B's mgmt IP and enter the updated FQDN under DNS name for the mgmt IP entry.
  • Run Provision Host Network details for host B, using host A's network port and cable info recorded earlier.
  • Run sre.dns.netbox to push updated dns for hosts A and B.
  • Run sre.network.configure-switch-interfaces for host A and host B
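
The Netbox bookkeeping in the steps above can be sketched as a small data model. This is only an illustration, not the Netbox API: the `HostRecord` class, its field names, and the placeholder values are hypothetical, and it assumes each record takes over the other host's recorded port and cable info while the service tag stays with the hardware.

```python
from dataclasses import dataclass


@dataclass
class HostRecord:
    """Hypothetical stand-in for one Netbox device record."""
    hostname: str
    rack_location: str
    service_tag: str   # tied to the physical hardware, never swapped
    status: str
    switch_port: str
    cable_id: str


def swap_pair(a: HostRecord, b: HostRecord) -> None:
    """Swap hostname, rack location and link info between two records in place."""
    # Record the port and cable info for both hosts before changing anything.
    link_a = (a.switch_port, a.cable_id)
    link_b = (b.switch_port, b.cable_id)

    # Swap hostname and rack location.
    a.hostname, b.hostname = b.hostname, a.hostname
    a.rack_location, b.rack_location = b.rack_location, a.rack_location

    # Both records go from Decommissioning to Planned.
    a.status = b.status = "planned"

    # Assumption: each record inherits the other's recorded link details.
    a.switch_port, a.cable_id = link_b
    b.switch_port, b.cable_id = link_a


# Placeholder values only; none of these come from the task.
host_a = HostRecord("host-a", "rack-1", "TAG-A", "decommissioning", "port-1", "cable-1")
host_b = HostRecord("host-b", "rack-2", "TAG-B", "decommissioning", "port-2", "cable-2")
swap_pair(host_a, host_b)
```

Capturing the link details before the swap mirrors the checklist's instruction to note port and cable info first, since the decommission cookbook updates the switch interfaces in Netbox.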

Per host setup checklist

dns7001
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • NIC was not detected; remote hands task updated with info. Don't check this line off until the NIC is reseated and detected. Update: remote hands reseated the NIC; it now works and shows link to port 8 via the iDRAC interface.
  • Run the sre.hosts.reimage cookbook
dns7002
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook
ganeti7001
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
  • Create and link in service implementation task for Moritz for ganeti700[1-4]
ganeti7002
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
  • Create and link in service implementation task for Moritz for ganeti700[1-4]
ganeti7003
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook
ganeti7004
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook
lvs7001
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook

lvs7001 network connections
nic port 1: 70101 asw1-b3-magru (WMF12033) et-0/0/11 (black DAC)
nic port 2: 70114 asw1-b4-magru (WMF12034) et-0/0/12 (yellow singlemode)

lvs7003
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook

lvs7003 network info:
nic port 0: cable 70100 : asw1-b3-magru:et-0/0/8
nic port 1: cable 70113 : asw2-b4-magru:et-0/0/13

cp7001
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook
cp7002
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook
cp7003
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook
cp7004
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook
cp7006
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook
cp7008
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook
cp7010
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook
cp7015
  • Ensure host is fully migrated per T376737
  • Update netbox with host's new location and service tag, copy down old hostname cable IDs, delete off old hostname, and install via script onto this host.
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hosts.reimage cookbook
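
The ordering constraint shared by the per-host checklists (Netbox script first, then the DNS and provision cookbooks immediately after, then reimage) can be tracked with a small helper. This is a minimal sketch: the step labels are taken from the checklist text, but `REQUIRED_ORDER` and `next_step` are hypothetical names, and nothing here invokes the actual cookbooks.

```python
from typing import List, Optional

# Step labels in the order the checklist requires them.
REQUIRED_ORDER: List[str] = [
    "Update netbox with new location and service tag",
    "Provision a server's network attributes Netbox script",
    "sre.dns.netbox",
    "sre.hosts.provision",
    "sre.hosts.reimage",
]


def next_step(completed: List[str]) -> Optional[str]:
    """Return the next step to run, or None when the host is done.

    Raises ValueError if the completed steps are not a prefix of the
    required order, i.e. something was run out of sequence.
    """
    if completed != REQUIRED_ORDER[:len(completed)]:
        raise ValueError("steps completed out of order: %r" % (completed,))
    if len(completed) == len(REQUIRED_ORDER):
        return None
    return REQUIRED_ORDER[len(completed)]
```

Hosts with extra items (for example the operations/puppet update on ganeti7001 and ganeti7002) would need their own step list; the helper only assumes the list is ordered.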

Related Objects

Status: Resolved; Assigned: ssingh

Event Timeline

RobH added a parent task: Restricted Task. (Nov 19 2024, 6:05 PM)
RobH moved this task from Backlog to Racking Tasks on the ops-magru board.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: ganeti7003.magru.wmnet

  • ganeti7003.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7006.magru.wmnet

  • cp7006.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: ganeti7004.magru.wmnet

  • ganeti7004.magru.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7008.magru.wmnet

  • cp7008.magru.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: lvs7003.magru.wmnet

  • lvs7003.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Failed to wipe swraid, partition-table and filesystem signatures, manual intervention required to make it unbootable: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7015.magru.wmnet

  • cp7015.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti7003.magru.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti7003.magru.wmnet with OS bookworm completed:

  • ganeti7003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251935_robh_3696544_ganeti7003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bullseye executed with errors:

  • cp7015 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cp7015.magru.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp7015.magru.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp7015.magru.wmnet with OS bullseye executed with errors:

  • cp7015 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cp7015.magru.wmnet" to get a root shell, but depending on the failure this may not work.
RobH updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host lvs7003.magru.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm executed with errors:

  • dns7001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dns7001.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7015.magru.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host lvs7003.magru.wmnet with OS bullseye completed:

  • lvs7003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261320_fabfur_575669_lvs7003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7015.magru.wmnet with OS bullseye completed:

  • cp7015 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261335_fabfur_580242_cp7015.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bullseye completed:

  • dns7001 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261409_fabfur_587599_dns7001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bookworm completed:

  • dns7001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261511_fabfur_601294_dns7001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: ganeti7001.magru.wmnet

  • ganeti7001.magru.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7003.magru.wmnet

  • cp7003.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

RobH updated the task description.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: ganeti7002.magru.wmnet

  • ganeti7002.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Host steps raised exception: No non-mgmt connected interfaces found for ganeti7002. Please check Netbox.

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7004.magru.wmnet

  • cp7004.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Host steps raised exception: No non-mgmt connected interfaces found for cp7004. Please check Netbox.

ERROR: some step on some host failed, check the bolded items above

RobH updated the task description.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: dns7002.wikimedia.org

  • dns7002.wikimedia.org (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7002.magru.wmnet

  • cp7002.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7010.magru.wmnet

  • cp7010.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: lvs7001.magru.wmnet

  • lvs7001.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

@MoritzMuehlenhoff : ganeti700[12] are ready for reimage but I've just run out of steam for today. If you don't get to their reimage on Wednesday I'll do so on my Wednesday AM.

Change #1098554 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: fix magru ip addresses during migration

https://gerrit.wikimedia.org/r/1098554

Change #1098554 merged by Fabfur:

[operations/puppet@production] hiera: fix magru dns7001 ip address during migration

https://gerrit.wikimedia.org/r/1098554

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7010.magru.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7010.magru.wmnet with OS bullseye completed:

  • cp7010 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411271651_fabfur_818716_cp7010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

lvs7003 has been restarted after cable swap, all fine

BGP flag enabled on NetBox for lvs700[1-3] and dns700[12] and BGP enabled on TOR switches for those hosts

Mentioned in SAL (#wikimedia-operations) [2024-11-27T18:37:20Z] <fabfur@cumin1002> START - Cookbook sre.hosts.downtime for 4:00:00 on dns7001.wikimedia.org with reason: T380307

Mentioned in SAL (#wikimedia-operations) [2024-11-27T18:37:24Z] <fabfur@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dns7001.wikimedia.org with reason: T380307

Removed downtime from all lvs, dns and cp hosts in magru

Repooled dnsbox cluster and ran authdns-update

Repooled all depooled cp hosts before repooling whole DC

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm completed:

  • ganeti7004 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411252034_robh_3708720_ganeti7004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

FYI I have aborted the last reimage execution, which was at the last step waiting for user input for the netbox-hiera integration sync. Those changes had already shown up for others and been merged anyway. In general, please don't leave cookbooks hanging for user input for days.

Accidental, sorry about that! I had too many screen sessions and seem to have left one abandoned!

ssingh claimed this task.