
rename cloudswift1001 as cloudlb1001
Closed, Resolved (Public)

Description

Netbox device: https://netbox.wikimedia.org/dcim/devices/3524/

Procedure: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

  • decommission
  • netbox: edit the device name, and set its status from DECOMMISSIONING to PLANNED.
  • re-add the DNS Name field for the management interface
  • run sre.dns.netbox cookbook
  • run sre.network.configure-switch-interfaces cookbook
  • reimage server with new name
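
The procedure above can be sketched as an ordered checklist. This is only an illustrative model, not part of the cookbooks: the `Step` class and the `rename_steps` and `checklist` helpers are hypothetical names, and marking the Netbox edits as manual is an assumption. The cookbook names are the ones mentioned in this task.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One step of the rename procedure (hypothetical helper type)."""
    description: str
    cookbook: Optional[str] = None  # set when the step is a cookbook run


def rename_steps(old_name: str, new_name: str) -> List[Step]:
    """Return the rename-while-reimaging steps in the order listed in the task."""
    return [
        Step(f"decommission {old_name}", "sre.hosts.decommission"),
        # Assumption: the two Netbox edits are done by hand in the Netbox UI.
        Step(f"netbox: rename {old_name} to {new_name}, "
             "status DECOMMISSIONING -> PLANNED"),
        Step("netbox: re-add the DNS Name field for the management interface"),
        Step("propagate DNS changes", "sre.dns.netbox"),
        Step("configure switch interfaces",
             "sre.network.configure-switch-interfaces"),
        Step(f"reimage {new_name}", "sre.hosts.reimage"),
    ]


def checklist(steps: List[Step]) -> str:
    """Render the steps as a numbered plain-text checklist."""
    lines = []
    for number, step in enumerate(steps, start=1):
        suffix = f" [cookbook: {step.cookbook}]" if step.cookbook else " [manual]"
        lines.append(f"{number}. {step.description}{suffix}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(checklist(rename_steps("cloudswift1001", "cloudlb1001")))
```

Keeping the order explicit matters here: the Netbox status change has to happen after the decommission and before the DNS and switch cookbooks run.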

Event Timeline

aborrero changed the task status from Open to In Progress.
aborrero triaged this task as Medium priority.
aborrero moved this task from Backlog to Doing on the User-aborrero board.

cookbooks.sre.hosts.decommission executed by aborrero@cumin1001 for hosts: cloudswift1001.eqiad.wmnet

  • cloudswift1001.eqiad.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above
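
Since the bold markers are lost in plain-text copies of this summary, a small script can pick out the sections that need attention. The following is a minimal sketch that assumes the summary layout shown above (a bulleted `name (STATUS)` header followed by bulleted steps); `parse_summary` and `needs_attention` are hypothetical helper names, not part of spicerack.

```python
import re

# Header lines look like "  • cloudswift1001.eqiad.wmnet (WARN)".
HEADER = re.compile(r"^\s*•\s+(?P<name>\S+)\s+\((?P<status>[A-Z]+)\)\s*$")


def parse_summary(text):
    """Split a cookbook summary into sections with their status and steps."""
    sections = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            sections.append({"name": match.group("name"),
                             "status": match.group("status"),
                             "steps": []})
        elif sections and line.strip().startswith("•"):
            # Any other bulleted line is a step of the current section.
            sections[-1]["steps"].append(line.strip().lstrip("•").strip())
    return sections


def needs_attention(sections, statuses=("WARN", "FAIL")):
    """Return the names of sections whose status is in `statuses`."""
    return [s["name"] for s in sections if s["status"] in statuses]
```

In the summary above this would flag both `cloudswift1001.eqiad.wmnet` (WARN) and `COMMON_STEPS` (FAIL), the latter pointing at the `sre.dns.netbox` run that has to be repeated manually.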

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye

Change 936019 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb1001/1002: add role

https://gerrit.wikimedia.org/r/936019

Change 936019 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb1001/1002: add role

https://gerrit.wikimedia.org/r/936019

Change 936022 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: eqiad: bootstrap hiera data

https://gerrit.wikimedia.org/r/936022

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudlb1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

T341223: Configure eqiad cloudsw devices to support cloud-private is no longer a blocker.

Now blocked by some kind of DHCP error preventing the boot into the installer, apparently. I need to research more.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye

aborrero changed the task status from Stalled to In Progress. (Jul 7 2023, 3:03 PM)

> Now blocked by some kind of DHCP error preventing the boot into the installer, apparently. I need to research more.

The host can only DHCP-boot while the reimage cookbook is running, so this was not a real issue.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudlb1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details
aborrero added a project: ops-eqiad.
aborrero added subscribers: Jclark-ctr, Papaul.

Hey @Papaul or @Jclark-ctr, I'm requesting help with this host.

We are trying to reimage it after renaming it from cloudswift1001 to cloudlb1001.
The Debian installer shows up, but then it complains about the disks and the root partition.

I checked the iDRAC GUI and found this warning:

image.png (509×1 px, 92 KB)

I ran the two commands shown there, but now the reimage cookbook fails with:

aborrero@cumin1001:~ $ sudo cookbook sre.hosts.reimage --new --os bullseye --task-id T341200 cloudlb1001
==> ATTENTION: destructive action for host: cloudlb1001
Are you sure to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
Management Password: 
Running IPMI command: ipmitool -I lanplus -H cloudlb1001.mgmt.eqiad.wmnet -U root -E chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/ipmi.py", line 85, in command
    output = run(command + command_parts, env=self.env.copy(), stdout=PIPE, check=True).stdout.decode()
  File "/usr/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ipmitool', '-I', 'lanplus', '-H', 'cloudlb1001.mgmt.eqiad.wmnet', '-U', 'root', '-E', 'chassis', 'power', 'status']' returned non-zero exit status 1.

Trying by hand:

aborrero@cumin1001:~$ ipmitool -I lanplus -H cloudlb1001.mgmt.eqiad.wmnet -U root -E chassis power status
Unable to read password from environment
Password: 
Error: Unable to establish IPMI v2 / RMCP+ session

Maybe this is something you know how to fix quickly, rather than me messing around. Help is appreciated.
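
For reference, the manual check can be reproduced without the password prompt. The `-E` flag makes ipmitool read the password from the `IPMI_PASSWORD` environment variable, which is why the by-hand run printed "Unable to read password from environment". The sketch below wraps the same command line; `ipmi_command` and `power_status` are hypothetical helpers, not spicerack's API, and the injectable `run` parameter exists only so the function can be exercised without a real BMC.

```python
import os
import subprocess


def ipmi_command(mgmt_host, user="root"):
    """Build the same ipmitool invocation that appears in the traceback."""
    return ["ipmitool", "-I", "lanplus", "-H", mgmt_host,
            "-U", user, "-E", "chassis", "power", "status"]


def power_status(mgmt_host, password, run=subprocess.run):
    """Return (ok, message) for a chassis power status query.

    The password is passed via IPMI_PASSWORD, which ipmitool reads
    when called with -E.
    """
    env = dict(os.environ, IPMI_PASSWORD=password)
    try:
        result = run(ipmi_command(mgmt_host), env=env,
                     stdout=subprocess.PIPE, check=True)
    except subprocess.CalledProcessError as exc:
        # A non-zero exit does not say why the session failed; this
        # sketch does not try to tell the possible causes apart.
        return False, f"ipmitool exited with status {exc.returncode}"
    return True, result.stdout.decode().strip()
```

A call such as `power_status("cloudlb1001.mgmt.eqiad.wmnet", password)` would return a failure tuple in the situation described above instead of raising.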

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudlb1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

@aborrero the issue was that IPMI was disabled on the node. I enabled it and tried to install the OS; the installation completed, but the cookbook then failed on some other issue that I didn't take the time to look into. You can try to reimage again.

Change 936022 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: eqiad: bootstrap hiera data

https://gerrit.wikimedia.org/r/936022

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye completed:

  • cloudlb1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202307100938_aborrero_4003617_cloudlb1001.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307100944_aborrero_4003617_cloudlb1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
aborrero updated the task description.