rename cloudswift1001 as cloudlb1001
Closed, Resolved, Public

Assigned To

Authored By

	aborrero
	Jul 6 2023, 11:02 AM

Description

Netbox device: https://netbox.wikimedia.org/dcim/devices/3524/

Procedure: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

decommission
netbox: edit the device name, and set its status from DECOMMISSIONING to PLANNED.
re-add the DNS Name field for the management interface
run sre.dns.netbox cookbook
run sre.network.configure-switch-interfaces cookbook
reimage server with new name
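
The steps above can be written down as an ordered checklist. The following is only a sketch: the `Step` type and `rename_steps` helper are hypothetical names introduced here, and only the `sre.hosts.reimage` arguments are taken from the invocation quoted later in this task. The arguments of the other cookbooks are not shown in this task, so they are left empty rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    """One step of the rename procedure (hypothetical helper type)."""
    description: str
    cookbook: Optional[str] = None  # None means a manual action in Netbox
    args: Tuple[str, ...] = ()      # only filled in where this task shows the invocation


def rename_steps(old: str, new: str, task: str, os_name: str = "bullseye") -> List[Step]:
    """Return the rename-while-reimaging steps in the order listed in the description."""
    return [
        Step(f"decommission {old}", cookbook="sre.hosts.decommission"),
        Step(f"netbox: rename {old} to {new}, status DECOMMISSIONING -> PLANNED"),
        Step("netbox: re-add the DNS Name field for the management interface"),
        Step("propagate DNS changes", cookbook="sre.dns.netbox"),
        Step("configure switch ports", cookbook="sre.network.configure-switch-interfaces"),
        Step(
            f"reimage as {new}",
            cookbook="sre.hosts.reimage",
            # Arguments as they appear in the reimage command quoted in this task.
            args=("--new", "--os", os_name, "--task-id", task, new),
        ),
    ]


if __name__ == "__main__":
    for number, step in enumerate(rename_steps("cloudswift1001", "cloudlb1001", "T341200"), 1):
        kind = "manual" if step.cookbook is None else step.cookbook
        print(f"{number}. [{kind}] {step.description} {' '.join(step.args)}".rstrip())
```

Keeping the manual Netbox edits in the same list as the cookbook runs makes it harder to skip the DNS Name step between the decommission and the `sre.dns.netbox` run.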

Details

	Subject	Repo	Branch	Lines +/-
	cloudlb: eqiad: bootstrap hiera data	operations/puppet	production	+18 -0
	cloudlb1001/1002: add role	operations/puppet	production	+3 -4


Related Objects

Status	Assigned	Task
Resolved	aborrero	T296411 cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet
Resolved	aborrero	T297596 have cloud hardware servers in the cloud realm using a dedicated LB layer
Resolved	• taavi	T341060 openstack eqiad1: introduce cloud-private and cloudlb
Resolved	aborrero	T341061 eqiad1: repurpose 2 cloudswift servers as cloudlb
Resolved	Papaul	T341200 rename cloudswift1001 as cloudlb1001

Event Timeline

aborrero changed the task status from Open to In Progress. Jul 6 2023, 11:02 AM

aborrero triaged this task as Medium priority.

aborrero created this task.

aborrero moved this task from Backlog to Doing on the User-aborrero board.

aborrero updated the task description. Jul 6 2023, 11:07 AM

cookbooks.sre.hosts.decommission executed by aborrero@cumin1001 for hosts: cloudswift1001.eqiad.wmnet

cloudswift1001.eqiad.wmnet (WARN)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Management interface not found on Icinga, unable to downtime it
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

COMMON_STEPS (FAIL)
- Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above
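
The report above marks failures in bold, which is lost in plain text. As a rough sketch, the section headers and their status can be recovered with a small parser. The function names `parse_report` and `failed_sections` are hypothetical, and the header pattern is inferred only from the reports pasted in this task, not from the cookbook's source.

```python
import re
from typing import Dict, List, Tuple

# Header lines look like "cloudswift1001.eqiad.wmnet (WARN)" or "COMMON_STEPS (FAIL)".
HEADER = re.compile(r"^(\S+) \(([A-Z]+)\)$")


def parse_report(report: str) -> Dict[str, Tuple[str, List[str]]]:
    """Map each section name to (status, list of step messages)."""
    sections: Dict[str, Tuple[str, List[str]]] = {}
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            sections[current] = (match.group(2), [])
        elif line.startswith("- ") and current is not None:
            # Step lines belong to the most recent header.
            sections[current][1].append(line[2:])
    return sections


def failed_sections(report: str) -> List[str]:
    """Names of the sections whose status is FAIL, in report order."""
    return [name for name, (status, _) in parse_report(report).items() if status == "FAIL"]


if __name__ == "__main__":
    sample = """\
cloudswift1001.eqiad.wmnet (WARN)
- Powered off
- Removed from DebMonitor

COMMON_STEPS (FAIL)
- Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above
"""
    print(failed_sections(sample))  # ['COMMON_STEPS']
```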

aborrero updated the task description. Jul 6 2023, 11:27 AM

aborrero updated the task description. Jul 6 2023, 11:40 AM

saving this here for later

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye

Change 936019 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb1001/1002: add role

https://gerrit.wikimedia.org/r/936019

gerritbot added a project: Patch-For-Review. Jul 6 2023, 11:58 AM

Change 936019 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb1001/1002: add role

https://gerrit.wikimedia.org/r/936019

Change 936022 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: eqiad: bootstrap hiera data

https://gerrit.wikimedia.org/r/936022

aborrero updated the task description. Jul 6 2023, 12:17 PM

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye executed with errors:

cloudlb1001 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

blocked on T341223: Configure eqiad cloudsw devices to support cloud-private

aborrero moved this task from Blocked to Doing on the User-aborrero board. Jul 7 2023, 9:45 AM

T341223: Configure eqiad cloudsw devices to support cloud-private is no longer a blocker.

Now blocked by some kind of DHCP error preventing the boot into the installer, apparently. I need to research more.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye

In T341200#8997186, @aborrero wrote:

Now blocked by some kind of DHCP error preventing the boot into the installer, apparently. I need to research more.

The host can only do DHCP boot while the reimage cookbook is running, so this wasn't really an issue.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye executed with errors:

cloudlb1001 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

hey @Papaul or @Jclark-ctr I'm requesting help with this host.

We are trying to reimage after renaming it from cloudswift1001 to cloudlb1001.
The Debian installer showed up, but then it complained about the disks and the root partition.

I checked on the iDRAC GUI and found this warning:

I ran the two commands suggested there, but now the reimage cookbook fails with:

aborrero@cumin1001:~ $ sudo cookbook sre.hosts.reimage --new --os bullseye --task-id T341200 cloudlb1001
==> ATTENTION: destructive action for host: cloudlb1001
Are you sure to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
Management Password: 
Running IPMI command: ipmitool -I lanplus -H cloudlb1001.mgmt.eqiad.wmnet -U root -E chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/ipmi.py", line 85, in command
    output = run(command + command_parts, env=self.env.copy(), stdout=PIPE, check=True).stdout.decode()
  File "/usr/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ipmitool', '-I', 'lanplus', '-H', 'cloudlb1001.mgmt.eqiad.wmnet', '-U', 'root', '-E', 'chassis', 'power', 'status']' returned non-zero exit status 1.

Trying by hand:

aborrero@cumin1001:~$ ipmitool -I lanplus -H cloudlb1001.mgmt.eqiad.wmnet -U root -E chassis power status
Unable to read password from environment
Password: 
Error: Unable to establish IPMI v2 / RMCP+ session

Maybe this is something you know how to fix quickly, rather than me messing around. Help is appreciated.
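
For reference, the manual check above can be reproduced from a script. The "Unable to read password from environment" line appears because `ipmitool -E` reads the password from the `IPMI_PASSWORD` environment variable, which was not set in that shell. The sketch below builds the same command as in the traceback and passes the password through the environment. The helper names are hypothetical, and this is not spicerack's implementation.

```python
import os
import subprocess
from typing import Dict, List, Tuple


def ipmi_power_status_command(mgmt_host: str, user: str = "root") -> List[str]:
    """Same ipmitool invocation that appears in the cookbook traceback."""
    return ["ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", user, "-E",
            "chassis", "power", "status"]


def ipmi_env(password: str) -> Dict[str, str]:
    """Copy of the current environment with IPMI_PASSWORD set, as -E expects."""
    env = os.environ.copy()
    env["IPMI_PASSWORD"] = password
    return env


def ipmi_reachable(mgmt_host: str, password: str) -> Tuple[bool, str]:
    """Run the power status check; return (success, combined output).

    Requires ipmitool to be installed on the machine running this.
    """
    result = subprocess.run(
        ipmi_power_status_command(mgmt_host),
        env=ipmi_env(password),
        capture_output=True,
        text=True,
    )
    output = (result.stdout + result.stderr).strip()
    return result.returncode == 0, output


if __name__ == "__main__":
    import getpass

    ok, message = ipmi_reachable("cloudlb1001.mgmt.eqiad.wmnet",
                                 getpass.getpass("Management Password: "))
    print("OK" if ok else "FAILED", message)
```

A failure here with a correct password, as in the output above, points at the management controller itself rather than at the cookbook.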

Maintenance_bot added a project: SRE. Jul 7 2023, 4:29 PM

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye executed with errors:

cloudlb1001 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- The reimage failed, see the cookbook logs for the details

@aborrero the issue was that IPMI was disabled on the node. I enabled it and tried to install the OS; the installation completed but then failed on some other issue that I didn't take the time to look into. You can try to reimage again.

Change 936022 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: eqiad: bootstrap hiera data

https://gerrit.wikimedia.org/r/936022

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye

Maintenance_bot removed a project: Patch-For-Review. Jul 10 2023, 9:10 AM

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye completed:

cloudlb1001 (WARN)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202307100938_aborrero_4003617_cloudlb1001.out, asking the operator what to do
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307100944_aborrero_4003617_cloudlb1001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

aborrero closed this task as Resolved. Jul 10 2023, 3:41 PM

aborrero updated the task description.

fnegri moved this task from Backlog to Done on the cloud-services-team (FY2022/2023-Q4) board. Jul 27 2023, 3:14 PM

	F37132300: image.png
	Jul 7 2023, 3:58 PM

	F37130483: image.png
	Jul 6 2023, 11:44 AM
