While handling the incident caused by asw-a2-codfw going down, I brought the row A NIC interfaces down and up on both lvs2010 and lvs2009. This left the routing table of both load balancers in an undesired state in which row A traffic is routed through the default route. This is bad because pybal healthchecks could still reach the realservers located in row A, but IPVS, which works at L2, failed to forward traffic to them, causing "random" TCP connection errors every time a request was handed to a realserver located in row A.
Please proceed with these steps in the order in which they are listed here:
[] Bring lvs2007 online
[] Run puppet on lvs2007
[] Restart pybal on lvs2007
At this point lvs2010 shouldn't be handling traffic anymore, as lvs2007, lvs2008 and lvs2009 should all be running properly.
[] Stop pybal on lvs2010
[] Clear the manually added routes on lvs2010 to avoid row A realservers being reached via the default route:
-- sudo -i ip route del 10.192.0.0/22 dev ens2f1np1
-- sudo -i ip route del 208.80.153.0/27 dev ens2f1np1
-- sudo -i ip -6 route del 2620:0:860:1::/64 dev ens2f1np1
-- sudo -i ip -6 route del 2620:0:860:101::/64 dev ens2f1np1
[] Bring ens2f1np1 down and up on lvs2010 (ifdown ens2f1np1 && ifup ens2f1np1) and make sure that the routes for VLAN IDs 2017 and 2001 are present. The output should look like this:
```
~ $ ip route |grep ens2f1np1
10.192.0.0/22 dev ens2f1np1.2017 proto kernel scope link src 10.192.1.8
208.80.153.0/27 dev ens2f1np1.2001 proto kernel scope link src 208.80.153.19
~ $ ip -6 route |grep ens2f1np1
2620:0:860:1::/64 dev ens2f1np1.2001 proto kernel metric 256 expires 2591996sec pref medium
2620:0:860:101::/64 dev ens2f1np1.2017 proto kernel metric 256 expires 2591996sec pref medium
fe80::/64 dev ens2f1np1 proto kernel metric 256 pref medium
fe80::/64 dev ens2f1np1.2017 proto kernel metric 256 pref medium
fe80::/64 dev ens2f1np1.2001 proto kernel metric 256 pref medium
default via fe80::1 dev ens2f1np1.2001 proto ra metric 1024 expires 596sec hoplimit 64 pref medium
default via fe80::1 dev ens2f1np1.2017 proto ra metric 1024 expires 596sec hoplimit 64 pref medium
```
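The visual comparison above can also be done mechanically. The following is a minimal sketch, not existing tooling: the function name `check_row_a_routes` is hypothetical, and the prefix-to-VLAN mapping is taken only from the sample output above. It reads route listings on stdin and fails if any of the four row A prefixes is attached to anything other than the expected VLAN sub-interface, or is missing altogether.

```shell
#!/bin/sh
# check_row_a_routes IFACE
# Hypothetical helper: reads the combined output of "ip route" and
# "ip -6 route" on stdin and verifies that each row A prefix is routed
# via the expected VLAN sub-interface of IFACE (mapping assumed from
# the sample output in this task). Returns 0 if all is well, 1 otherwise.
check_row_a_routes() {
    awk -v iface="$1" '
    BEGIN {
        bad = 0
        want["10.192.0.0/22"]       = iface ".2017"
        want["208.80.153.0/27"]     = iface ".2001"
        want["2620:0:860:1::/64"]   = iface ".2001"
        want["2620:0:860:101::/64"] = iface ".2017"
    }
    # Only lines of the form "<prefix> dev <device> ..." for row A prefixes.
    ($1 in want) && $2 == "dev" {
        if ($3 == want[$1]) {
            seen[$1] = 1
        } else {
            printf "BAD: %s is on %s, expected %s\n", $1, $3, want[$1]
            bad = 1
        }
    }
    END {
        for (p in want) {
            if (!(p in seen)) {
                printf "MISSING: %s on %s\n", p, want[p]
                bad = 1
            }
        }
        exit bad
    }'
}
```

On the load balancer it could be fed with `{ ip route; ip -6 route; } | check_row_a_routes ens2f1np1`. A leftover manually added route on the bare interface is reported as `BAD`, so the check also catches a skipped cleanup step.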
[] Reenable & run puppet on lvs2010
[] Restart pybal on lvs2010
At this point lvs2010 should be ready to handle incoming traffic. Check that everything is green on icinga.
[] Stop pybal on lvs2009. This effectively depools lvs2009. Check that everything looks fine and that lvs2010 takes over lvs2009's traffic as expected: https://grafana.wikimedia.org/d/000000343/load-balancers-lvs
[] Clear the manually added routes on lvs2009 to avoid row A realservers being reached via the default route:
-- sudo -i ip route del 10.192.0.0/22 dev ens2f1np1
-- sudo -i ip route del 208.80.153.0/27 dev ens2f1np1
-- sudo -i ip -6 route del 2620:0:860:1::/64 dev ens2f1np1
-- sudo -i ip -6 route del 2620:0:860:101::/64 dev ens2f1np1
[] Bring ens2f1np1 down and up on lvs2009 (ifdown ens2f1np1 && ifup ens2f1np1) and make sure that the routes for VLAN IDs 2017 and 2001 are present. The output should look like this (the src addresses are host-specific):
```
~ $ ip route |grep ens2f1np1
10.192.0.0/22 dev ens2f1np1.2017 proto kernel scope link src 10.192.1.8
208.80.153.0/27 dev ens2f1np1.2001 proto kernel scope link src 208.80.153.19
~ $ ip -6 route |grep ens2f1np1
2620:0:860:1::/64 dev ens2f1np1.2001 proto kernel metric 256 expires 2591996sec pref medium
2620:0:860:101::/64 dev ens2f1np1.2017 proto kernel metric 256 expires 2591996sec pref medium
fe80::/64 dev ens2f1np1 proto kernel metric 256 pref medium
fe80::/64 dev ens2f1np1.2017 proto kernel metric 256 pref medium
fe80::/64 dev ens2f1np1.2001 proto kernel metric 256 pref medium
default via fe80::1 dev ens2f1np1.2001 proto ra metric 1024 expires 596sec hoplimit 64 pref medium
default via fe80::1 dev ens2f1np1.2017 proto ra metric 1024 expires 596sec hoplimit 64 pref medium
```
[] Reenable & run puppet on lvs2009
[] Restart pybal on lvs2009
At this point lvs2009 should be handling traffic again.
[] Revert https://gerrit.wikimedia.org/r/c/operations/dns/+/705348