
Relocate lvs1013-lvs1016 to rows E & F
Closed, ResolvedPublic

Description

After some conversations with @ayounsi and @cmooney, we need to relocate lvs1013-lvs1016 to rows E & F to be able to load test Katran without impacting the core routers in eqiad.

Those hosts are currently idle and use the puppet role insetup_noferm, so no depooling of any sort is required. Additionally, given that Katran uses a single NIC, only the main 10G NIC of each server needs to be connected to its respective switch. The per-host steps are listed in the checklists below; example cookbook invocations are sketched after them.

lvs1013 relocation checklist:

  • - decom cookbook run for host
  • - move server into new rack, update Netbox
  • - relocate server script run in Netbox for host (by dc ops)
  • - server confirmed online and remotely accessible
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

lvs1014 relocation checklist:

  • - decom cookbook run for host
  • - move server into new rack, update Netbox
  • - relocate server script run in Netbox for host (by dc ops)
  • - server confirmed online and remotely accessible
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

lvs1015 relocation checklist:

  • - decom cookbook run for host
  • - move server into new rack, update Netbox
  • - relocate server script run in Netbox for host (by dc ops)
  • - server confirmed online and remotely accessible
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

lvs1016 relocation checklist:

  • - decom cookbook run for host
  • - move server into new rack, update Netbox
  • - relocate server script run in Netbox for host (by dc ops)
  • - server confirmed online and remotely accessible
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
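
For reference, the decom and reimage steps in each checklist map to roughly the following commands, run from a cumin host. This is a sketch from memory: the exact flags, the task ID placeholder, and whether the reimage cookbook expects the short hostname or the FQDN should be double-checked with --help before running.

  # decommission before the physical move (frees the switch port and non-mgmt IPs in Netbox)
  sudo cookbook sre.hosts.decommission -t T<task> lvs1013.eqiad.wmnet
  # after the move and the Netbox relocation script, reinstall the OS and run puppet
  sudo cookbook sre.hosts.reimage --os bullseye -t T<task> lvs1013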

Event Timeline

Vgutierrez moved this task from Backlog to Radar/Not for Service on the Traffic board.

@ayounsi @cmooney could you let DCops know which racks would be better for these boxes? Thanks!

I am on-site in eqiad this week. Can I get some feedback on the preferred racks/switches to relocate these hosts into, so I can move them?

@RobH There is no real preference on my side.

I would say pick one rack from E1/E2/E3/F1/F2/F3 and put the first three in that one, then place lvs1016 in a different rack from that same list.

We need to follow these instructions to reimage / change IP for the new racks. The important thing to note is that the decom cookbook should be run before the move.

Thanks @cmooney, @Fabfur will take care of running the decom cookbook (thanks!)

RobH triaged this task as Medium priority. Jul 18 2023, 2:09 PM
RobH updated the task description.

cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: lvs1013.eqiad.wmnet

  • lvs1013.eqiad.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

lvs1013.eqiad.wmnet has been decommissioned via cookbook @Tue 18 Jul 2023 02:24:10 PM UTC

I will not log the next host decommissions separately, to avoid redundancy with the decommission cookbook comments.

cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: lvs1014.eqiad.wmnet

  • lvs1014.eqiad.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: lvs1015.eqiad.wmnet

  • lvs1015.eqiad.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye executed with errors:

  • lvs1013 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye

cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: lvs1016.eqiad.wmnet

  • lvs1016.eqiad.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye executed with errors:

  • lvs1013 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
RobH updated the task description.

@RobH I'm seeing in the cumin1001 logs that you interrupted the reimage of lvs1013 by pressing Ctrl+C:

2023-07-18 16:01:28,549 robh 2034852 [INFO] Completed command '/usr/local/sbin/dhcpincludes -r commit'                                                       
2023-07-18 16:01:28,550 robh 2034852 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'.                                  
2023-07-18 16:01:28,551 robh 2034852 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.                                         
2023-07-18 16:01:28,551 robh 2034852 [ERROR] Ctrl+c pressed

Any issues with the box? Should we take the task from here and reimage? Thanks!

Oh, take into account that these boxes have several PCI-E NICs that are preferred over the onboard NICs, so that could trigger some issues with PXE and DHCP.
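
A quick way to cross-check which NIC/MAC the DHCP config should target, once there is console or SSH access to the box (generic commands, nothing specific to these servers):

  # list every interface with its MAC address and link state
  ip -br link show
  # or dump the structured networking facts that the reimage cookbook reads
  sudo facter -p networking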

FYI this Netbox report is alerting:
https://netbox.wikimedia.org/extras/reports/results/4808787/#test_port_block_consistency

xe-0/0/41 [eqiad] Interface type '10gbase-x-sfpp' does not match '1000base-t' set on other(s) in same block on lsw1-e1-eqiad. Ports and 44 need to be same type.
xe-0/0/43 [eqiad] Interface type '10gbase-x-sfpp' does not match '1000base-t' set on other(s) in same block on lsw1-e1-eqiad. Ports and 44 need to be same type.

Because ge-0/0/40 and ge-0/0/42 are 1G

Thanks @ayounsi

@RobH you can probably connect them to 44 and 45 instead.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye

The installer failed and I didn't have time yesterday to check into it, so I'm relaunching it today to see what's up. This is no longer blocked on on-site work, though, so if you want to take over remotely on them that is also fine! I'm curious to see what the error was yesterday, so I'll just let the reimage script run.

Wait, these are all 10G NICs, so the ports on the switch are set up as 1G and I need to move them to ports not set up as 1G, is that correct? Just checking because I don't really understand why I'm getting 1G setup errors on 10G ports.

I didn't notice this until I left the datacenter today, so I'll move and update tomorrow.

Ok, the Bullseye OS has issues with the drivers for some of the hardware...

Considering these are R430s, I don't think it is worth putting in time to install support for them in Bullseye. Is there a different OS that would work for these tests on these hosts?

Screen Shot 2023-07-19 at 1.45.36 PM.png (670×970 px, 129 KB)
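
If it helps triage, one way to identify the NIC and work out which driver/firmware the installer is missing, from a rescue shell or the installed OS (a generic sketch; the actual package needed depends on the card):

  # identify the 10G NIC model and PCI IDs
  lspci -nn | grep -i ether
  # show which kernel driver is (or would be) bound to each device
  lspci -k | grep -iA3 ether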

@RobH they will need to have their switch port moved.

On QFX5120s, if one port is configured at 1G, the 3 other adjacent ports can only be 1G.

Here port 40 and port 42 are configured at 1G (for existing servers), so ports 41 and 43 (the 2 LVS) can't be at 10G. They unfortunately need to move further away from 40 and 42, e.g. to 44 and 45, as suggested by @cmooney.
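
For what it's worth, the speed currently configured/negotiated on the relevant ports can be confirmed from the switch CLI with standard Junos show commands (interface names as per the report above):

  show interfaces terse | match "0/0/4[0-5]"
  show interfaces xe-0/0/44 media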

Cool, I understand now. I'll move these two hosts to the 10G-configured ports 44/45 tomorrow and update Netbox/Homer accordingly.

I renumbered them to 44/45 in Netbox now and ran Homer for you, so they should be good to go once you re-cable.

Links moved; servers online for remote OS installation.

Ready for installation!

lvs1013-lvs1015 have been reimaged as expected, but we've been unable to reimage lvs1016.
The cookbook log shows an error when attempting to fetch the main NIC MAC address:

2023-07-21 14:58:20,092 sukhe 2648934 [INFO clustershell.py:78 in execute] Executing commands [cumin.transports.Command('/usr/bin/facter -p networking.mac')] on '1' hosts: lvs1016.eqiad.wmnet
2023-07-21 14:58:20,096 sukhe 2648934 [DEBUG clustershell.py:590 in ev_pickup] node=lvs1016.eqiad.wmnet, command='/usr/bin/facter -p networking.mac'
2023-07-21 14:58:20,288 sukhe 2648934 [DEBUG clustershell.py:783 in ev_hup] node=lvs1016.eqiad.wmnet, rc=255, command='/usr/bin/facter -p networking.mac'
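
An rc=255 from cumin/clustershell usually points at an SSH/connection-level failure rather than facter itself erroring, so it may be worth re-running that step by hand from the cumin host (same command the cookbook uses):

  sudo cumin 'lvs1016.eqiad.wmnet' '/usr/bin/facter -p networking.mac'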

We finally managed to reinstall lvs1016, thanks for all the support!