We want to benchmark a Mellanox NIC to see whether it gives us more throughput than our current Broadcom chips: how does the combination of new hardware and its underlying drivers affect our workload?
lvs1017 will receive a 4-port NIC, ordered via T381118. This task tracks the installation.
lvs1016 will take over lvs1017's responsibilities while lvs1017 is being tested.
Racking and Testing High-level overview
- Run decom cookbook for lvs1016
- Physically move lvs1016 to rack A7
- Connect lvs1016 primary 10G port (enp4s0f0) to a free 10G port on asw2-a7-eqiad
- Run the Netbox provision script for lvs1016 to add this primary link in Netbox, assign it IPs etc.
- Add lvs1016
- Submit CRs
- Update modules/profile/manifests/lvs/configuration.pp
- Add lvs1016 to the end of the list for high-traffic1
- Add lvs1016 to $lvs_classes, setting it to high-traffic.
- Add a hieradata override for lvs1016 (hieradata/hosts/lvs1016.yaml) and set profile::pybal::override_bgp_med: 200
- Add lvs1016 IPs to haproxy_allowed_healthcheck_sources in hieradata/common.yaml
- Reimage lvs1016
- Create lvs1016 hieradata override for profile::lvs::interface_tweaks
- Set BGP to true on lvs1016's Netbox page
- Run sudo homer "cr*-eqiad*" commit "enable BGP on lvs1016" on cumin
- Submit CRs
- Remove lvs1017
- Downtime lvs1017
- Stop Puppet on lvs1017
- Stop PyBal on lvs1017
- Verify lvs1020 has taken over traffic via Grafana
- Run the decommission cookbook for lvs1017
- Promote lvs1016
- Verify Icinga alerts and connectivity for lvs1016
- Submit CRs
- Promote lvs1016 to the front of high-traffic1 in modules/profile/manifests/lvs/configuration.pp (final list: lvs1016, lvs1020)
- Remove lvs1017 from hieradata/common.yaml, modules/profile/manifests/lvs/configuration.pp, and hieradata/common/lvs/interfaces.yaml
- Remove MED override for lvs1016 (hieradata/hosts/lvs1016.yaml)
- Run run-puppet-agent on lvs1016
- Restart PyBal on lvs1016, making it the primary
- Restart PyBal on lvs1020 to sync changes
- Set up lvs1017 with new NIC
- dcops: Remove lvs1017 from the rack and install the Mellanox NIC in its primary PCIe slot
- dcops: Move lvs1017 to rack E2 and connect its primary uplink to any spare port
- Run the Netbox provision script for lvs1017 to document this link and assign the server appropriate IPs on the private1-e2-eqiad VLAN
- Reimage lvs1017 to whatever role is needed for the testing
In case of an emergency, depool eqiad by running sudo cookbook sre.dns.admin depool eqiad on cumin.
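
The ordering changes to high-traffic1 in the plan above can be sketched roughly as follows. This is only an illustration in Python, not how configuration.pp is actually edited. The starting list (lvs1017, lvs1020) is an assumption inferred from the steps, and the helper names add_at_end and promote_and_retire are hypothetical.

```python
def add_at_end(hosts, new_host):
    """Return a copy of hosts with new_host appended at the end."""
    if new_host in hosts:
        raise ValueError(f"{new_host} is already in the list")
    return [*hosts, new_host]


def promote_and_retire(hosts, promoted, retired):
    """Return a copy with promoted moved to the front and retired removed."""
    if promoted not in hosts:
        raise ValueError(f"{promoted} is not in the list")
    remaining = [h for h in hosts if h not in (promoted, retired)]
    return [promoted, *remaining]


# Assumption: the list before this task starts (not stated in the plan).
initial = ["lvs1017", "lvs1020"]

# "Add lvs1016" phase: lvs1016 goes to the end of the list.
staged = add_at_end(initial, "lvs1016")

# "Promote lvs1016" phase: lvs1016 moves to the front, lvs1017 is removed.
final = promote_and_retire(staged, "lvs1016", "lvs1017")

print(staged)
print(final)
```

Keeping the two phases as separate steps mirrors the plan: lvs1016 is first added without displacing anything, and only promoted once lvs1017 has been taken out of service.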