Q3:test NIC for lvs1017
Closed, Resolved (Public)

Description

We want to benchmark a Mellanox NIC to see whether we can get more throughput than with our current Broadcom chips: how does the combination of a new piece of hardware and its underlying drivers affect our workload?

lvs1017 will have a 4-port NIC, ordered via T381118. This task will track the installation.

lvs1016 will handle responsibilities while lvs1017 is being tested.

Racking and Testing High-level overview

  • Run decom cookbook for lvs1016
  • Physically move lvs1016 to rack A7
  • Connect lvs1016 primary 10G port (enp4s0f0) to a free 10G port on asw2-a7-eqiad
  • Run the Netbox provision script for lvs1016 to add this primary link in Netbox, assign it IPs etc.
  • Add lvs1016
    • Submit CRs
      • Update modules/profile/manifests/lvs/configuration.pp
        • Add lvs1016 to the end of the list for high-traffic1
        • Add lvs1016 to $lvs_classes, setting it to high-traffic1.
      • Add a hieradata override for lvs1016 (hieradata/hosts/lvs1016.yaml) and set profile::pybal::override_bgp_med: 200
      • Add lvs1016 IPs to haproxy_allowed_healthcheck_sources in hieradata/common.yaml
    • Reimage lvs1016
    • Create lvs1016 hieradata override for profile::lvs::interface_tweaks
    • Set BGP to true in lvs1016's netbox page
    • Run sudo homer "cr*-eqiad*" commit "enable BGP on lvs1016" on cumin
  • Remove lvs1017
    • Downtime lvs1017
    • Stop Puppet on lvs1017
    • Stop PyBal on lvs1017
    • Verify lvs1020 has taken over traffic via Grafana
    • Run the decommission cookbook for lvs1017
  • Promote lvs1016
    • Verify Icinga alerts and connectivity for lvs1016
    • Submit CRs
      • Promote lvs1016 in modules/profile/manifests/lvs/configuration.pp's high-traffic1 (Final list being lvs1016, lvs1020)
      • Remove lvs1017 from hieradata/common.yaml, modules/profile/manifests/lvs/configuration.pp, and hieradata/common/lvs/interfaces.yaml
      • Remove MED override for lvs1016 (hieradata/hosts/lvs1016.yaml)
    • Run run-puppet-agent on lvs1016
    • Restart pybal on lvs1016, setting it to primary
    • Restart pybal on lvs1020 to sync changes.
  • Set up lvs1017 with new NIC
    • (dcops) Remove lvs1017 from the rack and install the Mellanox NIC in its primary PCIe slot
    • (dcops) Move lvs1017 to rack E2 and connect its primary uplink to any spare port
    • Run the Netbox provision script for lvs1017 to document this link and assign the server appropriate IPs on private1-e2-eqiad vlan
    • Reimage lvs1017 to whatever role is needed for the testing

In case of unexpected emergency, depool eqiad with sudo cookbook sre.dns.admin depool eqiad on cumin.

Event Timeline

VRiley-WMF changed the task status from Open to In Progress. May 28 2025, 7:41 PM

Running the decom cookbook on lvs1016 soon

lvs1016 new location info

A7
U27
CableID: 5081
port 27

As per instructions, the following has been completed.

  1. Run decom cookbook for lvs1016
  2. Physically move lvs1016 to rack A7
  3. Connect lvs1016 primary 10G port (enp4s0f0) to a free 10G port on asw2-a7-eqiad
  4. Run the Netbox provision script for lvs1016 to add this primary link in Netbox, assign it IPs etc.

@cmooney would you be able to complete

  5. Manually update the trunked vlans for the new port on asw2-a-eqiad to add the other row A vlans
  6. Fail over lvs1017 to lvs1020

and we can pick it up from that point?

@VRiley-WMF no problem thanks!

I'm actually not sure whether step 5 is needed; it might not be required. If it is, we will also need the additional ports on lvs1016 connected as per the table in T387145#10720903. When I know for sure I'll add whatever is needed in Netbox/switches.

Step 6 the traffic team should be able to take care of when they are ready.

OK, step 5 is not needed: lvs1016 will only require its primary interface connected. I've configured the switch port it's connected to, so it should be good to go.

Brett / Valentin I will leave you guys to reimage it and do the next steps. Ping me if any questions cheers.

BCornwall renamed this task from Q3:test NIC for lvs1017 or lvs1018 to Q3:test NIC for lvs1017. May 30 2025, 3:51 PM
BCornwall updated the task description.

Change #1152817 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] lvs: Switch lvs1017/lvs1020 primary

https://gerrit.wikimedia.org/r/1152817

Change #1152817 abandoned by BCornwall:

[operations/puppet@production] lvs: Switch lvs1017/lvs1020 primary

Reason:

People seem to agree on not merging 1152817, as it messes with secondary availability. When time is scheduled with dcops, we should just stop pybal/disable puppet when decommissioning lvs1017 and let it fail over to lvs1020.

https://gerrit.wikimedia.org/r/1152817

BCornwall updated the task description.

Change #1153418 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] hiera: Replace lvs1017 with lvs1016

https://gerrit.wikimedia.org/r/1153418

Commenting on this with my own understanding and for review of others. After that, letting @BCornwall handle updating the task description.

IMO the way we have done this in the past was like this (I can explain further on why I think we should do it like this):

Adding lvs1016

  • Add lvs1016 to the end of the list for high-traffic1. Also add it to $lvs_classes below in the same file (modules/profile/manifests/lvs/configuration.pp), setting it to high-traffic1. So you will have three items: lvs1017, lvs1020, and lvs1016.
    • Also add a hieradata override for it (hieradata/hosts/lvs1016.yaml) so that when it comes up, it doesn't start serving traffic until we have checked it. profile::pybal::override_bgp_med: 200 ensures this.
    • We also need entries in hieradata/common.yaml for lvs1016.
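
As a rough illustration of the override described above, the following shell sketch prints the one-line hieradata file. The key name and the value 200 come from this task; the helper name med_override_yaml is hypothetical. BGP prefers routes with a lower MED, so a deliberately high value should keep the routers from selecting the new host until the override is removed.

```shell
# Print a hieradata host override that sets a high BGP MED for PyBal.
# $1: MED value to emit (defaults to 200, the value used in this task).
# The function name is made up for this example.
med_override_yaml() {
  printf 'profile::pybal::override_bgp_med: %s\n' "${1:-200}"
}
```

For example, running `med_override_yaml 200 > hieradata/hosts/lvs1016.yaml` inside a checkout of the puppet repository would produce the file named in the task description.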

Removing lvs1017

  • Stop Pybal on lvs1017, making double-sure Puppet is disabled; otherwise Puppet will start Pybal again.
  • Once you have confirmed that lvs1020 is serving traffic as expected (Grafana dashboard for lvs connections), decom lvs1017 by running the decommissioning cookbook.
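
The ordering constraint above (Puppet disabled first, then PyBal stopped) could be guarded with something like the sketch below. The function name, the DRY_RUN variable, and the pybal.service unit name are assumptions for illustration. The lockfile path is passed in rather than guessed; on a real host, `puppet config print agent_disabled_lockfile` should report it.

```shell
# Refuse to stop PyBal unless the Puppet agent has been disabled, since an
# enabled agent would start PyBal again on its next run.
# $1: path to Puppet's agent-disabled lockfile (caller supplies it).
# DRY_RUN=1 prints the command instead of executing it.
# NOTE: "pybal.service" is an assumed unit name, not taken from the task.
stop_pybal_if_puppet_disabled() {
  lockfile="$1"
  if [ ! -e "$lockfile" ]; then
    echo "puppet agent is not disabled; refusing to stop pybal" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "systemctl stop pybal.service"
  else
    sudo systemctl stop pybal.service
  fi
}
```

Running it with DRY_RUN=1 first shows what would be executed without touching the service.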

Switching lvs1016 to high-traffic1

  • ensure all is OK on lvs1016, check Icinga, check connectivity.
  • Make a commit to do this (one commit with all steps, perhaps you can stack it on the commit above):
    • put lvs1016 as the first item in the list of high-traffic1. Final list: lvs1016, lvs1020
    • Remove all traces of lvs1017, including from modules/profile/manifests/lvs/configuration.pp, hieradata/common/lvs/interfaces.yaml, hieradata/common.yaml and anywhere else you think is fit (site.pp?).
    • Remove MED override for lvs1016 so that when you restart Pybal, it becomes primary high-traffic1 from lvs1020.

^ With the above commit when you merge and restart Pybal, lvs1016 should be primary again.

Check with @cmooney for changes required to hieradata/common/lvs/interfaces.yaml (to add lvs1016 there) and also to hieradata/role/eqiad/lvs/balancer.yaml in case profile::lvs::tagged_subnets needs to be updated (don't think so but check).

To my knowledge this only now applies to low-traffic services in eqiad/codfw (behind K8s and some search ones). I believe everything else is using IPIP so L2 adjacency / extra interfaces is not required on the LVS nodes serving those.

Yes, that was my understanding as well but I wanted to be absolutely sure. Thanks for confirming! (Brett: Updating the comment above to remove these bits since we are only doing high-traffic1 with no requirement for L2 adjacency and hence the hieras are not required.)

Change #1154905 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Promote lvs1016 over lvs1017

https://gerrit.wikimedia.org/r/1154905

Change #1153418 merged by BCornwall:

[operations/puppet@production] hiera: Add lvs1016 to high-traffic1

https://gerrit.wikimedia.org/r/1153418

Okay, so we're ready to reimage lvs1016 but it appears that the mgmt interface isn't reachable. Could dcops look into this, please?

@BCornwall Hey there, thanks for letting us know. I did replace the cable and it seems to respond to ping. Would you be able to check again? It seems to be active on the server.

@VRiley-WMF Thanks for the quick response! I've not been able to ping the mgmt interface (10.65.0.75) from lvs1017, cumin1002, and cumin2002. It's timing out.

Okay, I found the problem: I had pinged the incorrect IP. I've set the IP address on the iDRAC to the one listed in Netbox, tested the ping again, and it responds. It's all yours!

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1016.eqiad.wmnet with OS bullseye

Change #1160202 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] hiera: override interface names for lvs1016

https://gerrit.wikimedia.org/r/1160202

Change #1160202 merged by BCornwall:

[operations/puppet@production] hiera: override interface names for lvs1016

https://gerrit.wikimedia.org/r/1160202

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1016.eqiad.wmnet with OS bullseye completed:

  • lvs1016 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506171625_brett_1484312_lvs1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Mentioned in SAL (#wikimedia-operations) [2025-06-17T17:25:51Z] <brett> homer "cr*-eqiad*" commit "enable BGP on lvs1016" - T387145

Mentioned in SAL (#wikimedia-operations) [2025-06-17T17:38:31Z] <brett> stopping pybal on lvs1017 to move traffic over to lvs1020 - T387145

Mentioned in SAL (#wikimedia-operations) [2025-06-17T17:43:00Z] <brett@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on lvs1017.eqiad.wmnet with reason: T387145

cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: lvs1017.eqiad.wmnet

  • lvs1017.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above

Change #1154905 merged by BCornwall:

[operations/puppet@production] Promote lvs1016 over lvs1017

https://gerrit.wikimedia.org/r/1154905

Mentioned in SAL (#wikimedia-operations) [2025-06-17T18:59:17Z] <brett> Restarting pybal on lvs1016, setting it to primary - T387145

@VRiley-WMF Okay! We've reimaged lvs1016 as the new primary and have lvs1020 as secondary. lvs1017 has been decommissioned and is ready to be removed/serviced.

Thank you!

Unracked lvs1017 and installing the card now

Inserted new NIC. Moved the server to the new location (E2, U39, Port 39), ran the netbox script, and everything went through smoothly. @BCornwall it should be ready for the reimage! Please let us know if you need anything else.

@BCornwall Hey, I just wanted to check in to see whether anything else is needed here at the moment. If not, are we able to close this, or would you like to keep it open for now?

Sorry for the delay; I had to take a few unexpected days off but will get back to this shortly!

Change #1167675 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] site: Set lvs1017 to insetup_noferm

https://gerrit.wikimedia.org/r/1167675

Change #1167675 merged by BCornwall:

[operations/puppet@production] site: Set lvs1017 to insetup_noferm

https://gerrit.wikimedia.org/r/1167675

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye

I've updated lvs1017's BIOS and Mellanox firmware to the latest versions (2.23.0 and 16.35.30.06) prior to reimaging

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors:

  • lvs1017 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console lvs1017.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Hi, @VRiley-WMF, I'm unable to get lvs1017 to PXE boot - I'm getting media test failure errors that advise checking the cables. I'm able to ping the connected switch (lsw1-e2, 10.65.1.229) from lvs1017's idrac but I'm denied access to log in to the switch itself to investigate any further. Is this possibly a connectivity issue?

I will look into this. I believe it may be due to lvs1017's NIC being misconfigured. I will update it and test it out.

Hey @BCornwall, I have swapped the cables. Would you be able to test this again? (I was going to try to reimage it, but didn't know what version of bullseye to put on it. I was going to assume 7, but I'd rather be safe.)

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye

Looks like that worked, it's booting PXE now. Thanks!

Awesome! Is this okay to close out?

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors:

  • lvs1017 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console lvs1017.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm completed:

  • lvs1017 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507142028_brett_3269879_lvs1017.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

BCornwall updated the task description.

We're all set. Thank you for all your help, @VRiley-WMF!

Ah, @VRiley-WMF, it seems that connectivity is no longer through the Mellanox card:

[    9.128067] mlx5_core 0000:3b:00.0: Port module event: module 0, Cable unplugged
[   10.183804] mlx5_core 0000:3b:00.1: Port module event: module 1, Cable unplugged

We're going to need this connectivity through that card and not any others... would it be possible to get that set up?
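
A small filter along these lines can summarise such kernel messages per port. It is only a sketch: the function name mlx5_port_state is hypothetical, and it recognises just the two message forms that appear in this task ("Cable unplugged" and "Link up").

```shell
# Read kernel log lines on stdin and print "<pci-address> <state>" for each
# mlx5_core line that reports an unplugged cable or a link coming up.
# Only the two message forms seen in this task are recognised.
mlx5_port_state() {
  awk '
    /mlx5_core/ {
      addr = ""
      # The PCI address is the field right after the driver name.
      for (i = 1; i < NF; i++) if ($i == "mlx5_core") { addr = $(i + 1); break }
      sub(/:$/, "", addr)   # drop the trailing colon, if any
      if (/Cable unplugged/) print addr, "unplugged"
      else if (/Link up/)    print addr, "up"
    }'
}
```

On the host it could be fed from the kernel log, for example `sudo dmesg | mlx5_port_state`.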

I originally put the cable into the onboard port. Once the host was able to reimage, I moved it over. It should be all set now.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm executed with errors:

  • lvs1017 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console lvs1017.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm completed:

  • lvs1017 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507151755_brett_3906134_lvs1017.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

The link was re-connected to the Mellanox card. We then reconfigured the interface with:

$ sudo -i cookbook sre.dns.netbox -t T387145 'update lvs1017'
$ sudo -i cookbook sre.network.configure-switch-interfaces lvs1017
$ sudo -i cookbook sre.hosts.provision lvs1017.eqiad.wmnet --no-user --no-dhcp

And a reimage worked!

[   19.986056] mlx5_core 0000:3b:00.0 ens1f0np0: Link up

We're good now. Thanks for the help!
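
To double-check that the primary interface really is served by the Mellanox card rather than an onboard port, one option is to look up the kernel driver bound to it through sysfs. This is a minimal sketch, not something that was run in this task: the function name nic_driver and the optional sysfs-root argument (there only to make the helper testable) are assumptions.

```shell
# Print the name of the kernel driver bound to a network interface.
# $1: interface name, e.g. ens1f0np0
# $2: sysfs root (defaults to /sys; overridable for testing)
nic_driver() {
  iface="$1"
  sysroot="${2:-/sys}"
  link="$sysroot/class/net/$iface/device/driver"
  if [ ! -e "$link" ]; then
    echo "no driver link for $iface" >&2
    return 1
  fi
  # The driver entry is a symlink whose target directory is named after the driver.
  basename "$(readlink -f "$link")"
}
```

For the interface in the log line above, `nic_driver ens1f0np0` would be expected to print mlx5_core if the link runs through the Mellanox card.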

cmooney mentioned this in Unknown Object (Task). Aug 20 2025, 7:13 PM