
Take advantage of 10Gb NICs in the new network stack
Open, LowPublic

Description

As a follow-up to a chat that occurred during the SRE Summit, here's a tentative list of hosts that have 10G NICs but are not currently using them as their primary interface and are instead connected via 1G.

I extracted the list from:

sudo cumin 'A:all and not A:vms' 'facter -p "net_driver.$(facter -p interface_primary).speed" | grep -v "^10000\$" && lspci | grep -i ethernet'

I didn't want to rely on a grep for 10G because the lspci output denotes a 10G card in several different ways.
Also, the Ganeti hosts needed to be checked in a different way, as their interface_primary is the private bridge rather than the predictable interface name reported in the net_driver fact.
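As a rough illustration of that second point, something along these lines could spot-check the Ganeti hosts (the A:ganeti Cumin alias and the exact bridge layout are assumptions on my side, not verified): it lists the members of the "private" bridge on each host and reads their link speed in Mb/s from sysfs, so the physical NIC's speed shows up next to the virtual tap devices, which report none.

sudo cumin 'A:ganeti' 'for i in $(ls /sys/class/net/private/brif/); do echo "$i: $(cat /sys/class/net/$i/speed 2>/dev/null)"; done'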

This is the list, a total of 257 hosts:

analytics[1070,1072].eqiad.wmnet,cloudcephmon[2005-2006]-dev.codfw.wmnet,cloudcontrol2005-dev.codfw.wmnet,clouddb2002-dev.codfw.wmnet,clouddb1021.eqiad.wmnet,cloudgw2003-dev.codfw.wmnet,cloudnet[2005-2006]-dev.codfw.wmnet,cloudservices[2004-2005]-dev.codfw.wmnet,cloudweb2002-dev.wikimedia.org,db[2136-2182,2185-2195,2206-2220].codfw.wmnet,db[1150-1182,1184-1226,1229-1249].eqiad.wmnet,dbstore1007.eqiad.wmnet,elastic1067.eqiad.wmnet,es[2020-2040].codfw.wmnet,es[1020-1040].eqiad.wmnet,ml-serve[2001-2008].codfw.wmnet,ml-serve[1001-1008].eqiad.wmnet,ml-staging[2001-2002].codfw.wmnet,pc[2011-2016].codfw.wmnet,pc[1011-1016].eqiad.wmnet

And this is the same list grouped by names:

analytics[1070,1072]
cloudcephmon[2005-2006]-dev
cloudcontrol2005-dev
clouddb1021
clouddb2002-dev
cloudgw2003-dev
cloudnet[2005-2006]-dev
cloudservices[2004-2005]-dev
cloudweb2002-dev
db[1150-1182,1184-1226,1229-1249,2136-2182,2185-2195,2206-2220]
dbstore1007
elastic1067
es[1020-1040,2020-2040]
ml-serve[1001-1008,2001-2008]
ml-staging[2001-2002]
pc[1011-1016,2011-2016]

Of those, 21 are in codfw row A:

db[2136,2142,2145-2146,2153-2158,2175-2176].codfw.wmnet,es[2020,2024,2026-2028].codfw.wmnet,ml-serve[2001,2005].codfw.wmnet,ml-staging2001.codfw.wmnet,pc2011.codfw.wmnet

and 33 in row B:

cloudcephmon[2005-2006]-dev.codfw.wmnet,cloudcontrol2005-dev.codfw.wmnet,clouddb2002-dev.codfw.wmnet,cloudgw2003-dev.codfw.wmnet,cloudnet[2005-2006]-dev.codfw.wmnet,cloudservices[2004-2005]-dev.codfw.wmnet,cloudweb2002-dev.wikimedia.org,db[2137,2143,2147-2148,2159-2164,2177-2178,2185,2188-2189].codfw.wmnet,es[2021,2025,2029-2030,2035].codfw.wmnet,ml-serve[2002,2006].codfw.wmnet,pc2012.codfw.wmnet

Once a migration plan is formed we should open sub-tasks for each team with the action plan.

Event Timeline

Restricted Application added a subscriber: Aklapper.
ayounsi triaged this task as Low priority. Edited · Mar 18 2024, 12:10 PM
ayounsi removed a project: WMF-NDA.
ayounsi subscribed.

Thanks for the task, nothing private in there.

I think we should:
1/ filter out the hosts that are due for a refresh, if any
2/ make sure we have appropriate docs/tooling to migrate a host to 10G, so the service owner only has to interact with DCops for the cable/NIC change
3/ notify the relevant teams about the possibility to "upgrade" to 10G in exchange for a small downtime (but we shouldn't force them) and to follow up with DCops if needed

A possible follow-up is to keep track of the SFP-T stock for future switch upgrades.

Edit: looks like (2) is already done in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Change_server_NIC_and_switch_connection,_keeping_IPs
We should then double-check that it's still up to date and whether it can be simplified.

ayounsi changed the visibility from "Custom Policy" to "Public (No Login Required)".Mar 18 2024, 12:10 PM

Change 1012680 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/software/netbox-extras@master] Add Netbox script to change a server's NIC

https://gerrit.wikimedia.org/r/1012680

For the ML hosts - our K8s clusters don't currently require 10G bandwidth, and at the time we didn't want to "waste" 10G ports if not really needed. But if now it is not a problem anymore, we'd be happy to switch (let us know what is the current best practice regarding 1G vs 10G) :)

Screenshot 2024-03-19 at 14-49-16 Upgrade a server to a faster interface NetBox.png (564×1 px, 61 KB)

Screenshot 2024-03-19 at 14-49-29 Upgrade a server to a faster interface NetBox.png (656×1 px, 121 KB)

Feel free to test it on Netbox next

The steps to follow once this script is deployed:

  1. (Optional) Upgrade idrac and NIC firmware: cookbook sre.hardware.upgrade-firmware -n -c idrac -n nic <fqdn>
  2. Run the import from the PuppetDB Netbox script
  3. Run the netbox "upgrade nic" script (this one)
  4. Update E/N/I with the new interface name
  5. Run homer to update the switch port
  6. Physical recable
  7. Reboot the server through console (or just reload the networking service if you're brave)
  8. Run the import from the PuppetDB Netbox script a last time

So it should be straightforward enough to run by DCops and the service owner together. If there is a big need, we can look at writing a cookbook to automate those steps further.
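For illustration, the sequence could look roughly like this from a cumin host; the host and switch names are examples, the homer invocation and the reboot cookbook are assumptions on my side, and the Netbox scripts (steps 2, 3 and 8) stay in the UI:

# step 1 (optional): firmware upgrade via the existing cookbook
sudo cookbook sre.hardware.upgrade-firmware -n -c idrac -n nic db2136.codfw.wmnet
# steps 2-4: run the PuppetDB import and "upgrade NIC" Netbox scripts, then update E/N/I (/etc/network/interfaces) with the new interface name
# step 5: push the switch port change (switch name is illustrative)
homer 'asw-a2-codfw*' commit "Move db2136 to a 10G port"
# step 6: DCops recables the host
# step 7: clean reboot, e.g. via the reboot cookbook (or from the console)
sudo cookbook sre.hosts.reboot-single db2136.codfw.wmnet
# step 8: run the PuppetDB import Netbox script one last time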

For the ML hosts - our K8s clusters don't currently require 10G bandwidth, and at the time we didn't want to "waste" 10G ports if not really needed. But if now it is not a problem anymore, we'd be happy to switch (let us know what is the current best practice regarding 1G vs 10G) :)

All new switches are 1/10/25G compatible; the rule of thumb is "please have 10G NICs on the servers if budget permits", and for that I'd redirect you to @wiki_willy.

On the technical side, if your server needs 10G in the short term we can look at racking it in a 10G rack. If it can wait for the switch refresh, then that's even better, and we can "upgrade" it when possible.

Hi @elukey - do you want me to change the Lift Wing expansion requests for 16x servers in FY24-25 to 10g? Thanks, Willy

@ayounsi thanks for the patch! LGTM.

Unfortunately I think the approach might not work in a lot of cases, due to the Trident 3 port-block restriction.

Most servers connected to SFP-based 5100s - i.e. a switch that can do 1G or 10G on any port - are already at 10G, I think. We have a lot of servers connected to 5120s using 1G SFPs, which are probably prime candidates for upgrading, but the port-block restriction will probably bite us here: if a switch port is currently running at 1G we can't change it to 10G unless the rest of the ports in its block of four are unused, or are also changed at the same time.
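To make the restriction concrete, here is a tiny sketch of the arithmetic, assuming Juniper-style port names and blocks starting at port 0 (both should be checked against the actual switch config):

port=14
base=$(( port / 4 * 4 ))
echo "xe-0/0/${port} shares its block with ports ${base} to $(( base + 3 )), which must be unused or migrated together"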

So we probably need to add another field for the port number, so a different (and compatible) port can be selected. If we add a 'switch' field that will auto-populate once the device is selected we could even make that a drop-down, and restrict the options shown to the user to ports which are valid for the selected speed?

I started implementing a fix for that, but it quickly gets complex as it means shutting down one port and fully setting up another one. Before going that way, let's see if it's something we want/need to do.
Also, as I think you mentioned somewhere else, it would mess with @Papaul's rack-U-to-switch-port mapping.

The validator in https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/985113 might make it more complex as well: if it were in effect it would block any migration. I'd need to check whether the validator triggers if we migrate all 4 ports at once. It's something we will eventually face.

If we go with the "move all 4 ports at once" option, the script could have a checkbox to skip the switch-side changes (or speed changes), so they can be done in bulk.