
Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename
Closed, Resolved · Public

Description

Currently:

cloudnet1007-dev: E4
cloudnet1008-dev: F4

Desired:

cloudgw1003.eqiad.wmnet: C8
cloudgw1004.eqiad.wmnet: D5

We will also be renaming these hosts after the move. Do you currently tag hosts with a generic ID or should they be relabeled at the same time?

Event Timeline

Andrew renamed this task from Relocate cloudnet1007-dev and cloudnet1008-dev to new racks to Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename. Dec 18 2024, 2:45 PM
Andrew updated the task description. (Show Details)
fnegri moved this task from Inbox to Hardware on the cloud-services-team board.

Note that these servers are not currently in service, so this move can happen at any time without WMCS coordination.

@RobH do you think that this can be done in the next one/two weeks? We need these servers to replace the current pair of cloudgw servers, one of which is having hardware issues (T382356: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev).

If relocating these servers is gonna take longer, we should debug the hardware issues on the current one (cloudgw1002).

Items being renamed and reshuffled within WMCS dedicated hardware can be coordinated directly with @Jclark-ctr and @VRiley-WMF. No need to ping me when it is just renaming hardware within the WMCS stack.

Change #1113520 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] remove refs to renamed cloudnet100[78]-dev

https://gerrit.wikimedia.org/r/1113520

Change #1113521 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Initial setup for cloudnet100[34]

https://gerrit.wikimedia.org/r/1113521

Change #1113520 merged by Andrew Bogott:

[operations/puppet@production] remove refs to renamed cloudnet100[78]-dev

https://gerrit.wikimedia.org/r/1113520

cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: cloudnet1007-dev.eqiad.wmnet

  • cloudnet1007-dev.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above
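Note how the cookbook keeps attempting the remaining steps after one fails (Icinga downtime and the disk wipe both failed, yet power-off and the Netbox/DebMonitor/Puppet cleanup still ran), then flags the whole run as FAIL at the end. A minimal sketch of that keep-going pattern (step and function names are hypothetical, not the actual spicerack code):

```python
# Sketch of the decommission cookbook's keep-going behavior: every step is
# attempted, failures are collected, and the run is reported FAIL at the end.
# Step names mirror the log above; this is NOT the real spicerack code.

def run_steps(steps):
    """Run each (name, fn) step; never abort early, collect failures."""
    failures = []
    for name, fn in steps:
        try:
            fn()
        except Exception as exc:
            failures.append(f"{name}: {exc}")
    return failures

def downtime_icinga():
    raise RuntimeError("Host not found on Icinga, unable to downtime it")

def wipe_disks():
    raise RuntimeError("Cumin execution failed (exit_code=2)")

steps = [
    ("downtime on Icinga", downtime_icinga),
    ("wipe swraid/partition-table/filesystem", wipe_disks),
    ("power off", lambda: None),
    ("update Netbox", lambda: None),
    ("remove from DebMonitor", lambda: None),
    ("remove from Puppet master and PuppetDB", lambda: None),
]

failures = run_steps(steps)
if failures:
    print("ERROR: some step on some host failed")
```

This is why the reports above show both FAIL markers and completed cleanup steps for the same host.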

Change #1113521 merged by Andrew Bogott:

[operations/puppet@production] Initial setup for cloudnet100[34]

https://gerrit.wikimedia.org/r/1113521

cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: cloudnet1008-dev.eqiad.wmnet

  • cloudnet1008-dev.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: cloudnet1007-dev.eqiad.wmnet

  • cloudnet1007-dev.eqiad.wmnet (FAIL)
    • Missing DNSName in Nebox for cloudnet1007-dev, unable to verify it.
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.65.3.228
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: cloudnet1008-dev.eqiad.wmnet

  • cloudnet1008-dev.eqiad.wmnet (FAIL)
    • Missing DNSName in Nebox for cloudnet1008-dev, unable to verify it.
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.65.3.229
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Ran through decommission on both servers and moved them to the corresponding locations:

cloudgw1003.eqiad.wmnet: C8 U17 (CableID: 5201 Port: 32)
cloudgw1004.eqiad.wmnet: D5 U17 (CableID: 5350 Port: 17)

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudgw1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudgw1003.eqiad.wmnet with OS bookworm executed with errors:

  • cloudgw1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudgw1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudgw1004.eqiad.wmnet with OS bookworm

pxe booting is failing on both of these servers:

PXE-E51: No DHCP or proxyDHCP offers were received.

More Netbox work is needed, I assume.

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudgw1004.eqiad.wmnet with OS bookworm executed with errors:

  • cloudgw1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudgw1004.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

The servers were getting their IP addresses from private 1-C and private 1-D, not from the eqiad cloud-hosts VLAN. This has been corrected.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudgw1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudgw1003.eqiad.wmnet with OS bookworm completed:

  • cloudgw1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501281212_andrew_2823573_cloudgw1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

It's better for cloud hosts to go on the per-rack cloud-hosts vlans when running the Netbox provision script: for instance, cloud-hosts1-c8-eqiad and cloud-hosts1-d5-eqiad here, instead of the legacy stretched cloud-hosts1-eqiad. But it's no big deal; everything will work fine as they are.

@Papaul the additional trunked vlans on the switch perfectly match the Wikitech guidelines for cloudgw. Thanks for that!

@Andrew @aborrero remember we have these static routes on the cloudsw pointing to the cloudgw VRRP VIP on the cloud-instance-transport1-b-eqiad vlan (1120):

set routing-instances cloud routing-options static route 172.16.0.0/21 next-hop 185.15.56.244
set routing-instances cloud routing-options static route 185.15.56.0/25 next-hop 185.15.56.244
set routing-instances cloud routing-options static route 185.15.56.236/30 next-hop 185.15.56.244
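All three routes share the cloudgw VRRP VIP as next-hop, so that VIP has to sit inside the transport vlan's subnet for the cloudsw to resolve it. A quick sanity check with Python's `ipaddress` module; the /29 prefix below is an assumption for illustration only (the real prefix for cloud-instance-transport1-b-eqiad lives in Netbox):

```python
# Sanity-check the static routes above: every next-hop should be the same
# VRRP VIP, and that VIP should fall inside the transport vlan subnet.
# The /29 is an assumed value for illustration; check Netbox for the real one.
import ipaddress

VIP = ipaddress.ip_address("185.15.56.244")
TRANSPORT_SUBNET = ipaddress.ip_network("185.15.56.240/29")  # assumed prefix

routes = {
    "172.16.0.0/21": "185.15.56.244",
    "185.15.56.0/25": "185.15.56.244",
    "185.15.56.236/30": "185.15.56.244",
}

assert all(ipaddress.ip_address(nh) == VIP for nh in routes.values())
assert VIP in TRANSPORT_SUBNET
print("routes consistent: all next-hops are the VIP, VIP is on the transport vlan")
```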

I'm guessing you're gonna migrate by removing one old cloudgw from the VRRP group, adding one of the new ones in its place, flipping to the new one, and repeating? In other words, this VIP won't change? If that's the case, all good, but just mentioning it in case there are plans to change the IP, in which case we'll need to co-ordinate and update the switches too.

Yes, I confirm our plan is to replace in-place:

  • first, replace cloudgw1002 with cloudgw1003
  • a few days later, replace cloudgw1001 with cloudgw1004

In both cases, they take over the corresponding IPs on the cloud edge networks and get new IPs on the host management vlan.
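The key property of that plan is that the VIP belongs to the VRRP group, not to any one member, so swapping members one at a time never changes it and the cloudsw static routes stay valid throughout. A toy model of the invariant (pure illustration, not real VRRP or keepalived configuration):

```python
# Toy model of the in-place VRRP member swap: the VIP is a property of the
# group and stays fixed while members come and go one at a time, so the
# switches' static-route next-hop never needs updating.
VIP = "185.15.56.244"

group = {"vip": VIP, "members": {"cloudgw1001", "cloudgw1002"}}

def swap(group, old, new):
    """Replace one member; the group's VIP is untouched."""
    group["members"].discard(old)
    group["members"].add(new)

swap(group, "cloudgw1002", "cloudgw1003")  # step 1
swap(group, "cloudgw1001", "cloudgw1004")  # step 2, a few days later

assert group["vip"] == VIP  # invariant: VIP unchanged across both swaps
print(sorted(group["members"]))  # ['cloudgw1003', 'cloudgw1004']
```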

The replacement of cloudgw1002 with cloudgw1003 has been scheduled for tomorrow. I'll send you a calendar invite, @cmooney.

These moves are done.