
Actions to restore lvs2009/lvs2010 network configuration
Closed, Resolved, Public

Description

While handling the incident caused by asw-a2-codfw going down, I brought the row A NIC interfaces down and back up on both lvs2010 and lvs2009. This left the routing table of both load balancers in an undesired state in which row A traffic was routed through the default route. That is a problem because pybal healthchecks could still reach the realservers located in row A, but IPVS, which works at L2, failed to redirect traffic to them. The result was "random" TCP connection errors every time a request was handed to a realserver located in row A.
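The failure mode described above can be spotted by asking the kernel how it would reach a row A address. The following is a minimal sketch, not part of the original procedure: `route_kind` is a hypothetical helper name, and `10.192.0.10` in the usage comment is only an illustrative address inside 10.192.0.0/22, not a known realserver.

```shell
#!/bin/sh
# Classify the output of `ip route get <address>`.
# Prints "gateway" when the kernel would send the packet through a next hop
# (for example the default route), and "direct" when the destination is
# on-link, which is what an L2 load balancer needs for its realservers.
route_kind() {
    case "$1" in
        *" via "*) echo gateway ;;
        *)         echo direct ;;
    esac
}

# Usage on a load balancer (address is a made-up example in the row A range):
#   route_kind "$(ip route get 10.192.0.10)"
```

A result of `gateway` for a row A address would match the broken state described in this task, while `direct` is what the healthy configuration should produce.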

Please proceed with these steps in the order in which they are listed here:

  • Enable lvs2007 port on asw-a2-codfw replacement
  • Bring lvs2007 online
  • Run puppet on lvs2007
  • Restart pybal on lvs2007

At this point lvs2010 shouldn't be handling traffic anymore, as lvs2007, lvs2008 and lvs2009 should all be running properly.

  • Stop pybal on lvs2010
  • Enable lvs2010 port on asw-a2-codfw replacement
  • Clear the routes that were manually added on lvs2010 to keep row A realservers from being reached via the default route:
    • sudo -i ip route del 10.192.0.0/22 dev ens2f1np1
    • sudo -i ip route del 208.80.153.0/27 dev ens2f1np1
    • sudo -i ip -6 route del 2620:0:860:1::/64 dev ens2f1np1
    • sudo -i ip -6 route del 2620:0:860:101::/64 dev ens2f1np1
  • Bring ens2f1np1 down and back up on lvs2010 (ifdown ens2f1np1 && ifup ens2f1np1) and make sure that the routes for VLAN IDs 2017 and 2001 are present. The output should look like this:
~ $ ip route |grep ens2f1np1
10.192.0.0/22 dev ens2f1np1.2017 proto kernel scope link src 10.192.1.8 
208.80.153.0/27 dev ens2f1np1.2001 proto kernel scope link src 208.80.153.19
~ $ ip -6 route |grep ens2f1np1
2620:0:860:1::/64 dev ens2f1np1.2001 proto kernel metric 256 expires 2591996sec pref medium
2620:0:860:101::/64 dev ens2f1np1.2017 proto kernel metric 256 expires 2591996sec pref medium
fe80::/64 dev ens2f1np1 proto kernel metric 256 pref medium
fe80::/64 dev ens2f1np1.2017 proto kernel metric 256 pref medium
fe80::/64 dev ens2f1np1.2001 proto kernel metric 256 pref medium
default via fe80::1 dev ens2f1np1.2001 proto ra metric 1024 expires 596sec hoplimit 64 pref medium
default via fe80::1 dev ens2f1np1.2017 proto ra metric 1024 expires 596sec hoplimit 64 pref medium
  • Re-enable and run puppet on lvs2010
  • Restart pybal on lvs2010
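
The four route deletions listed above are identical on both hosts, so they can be generated rather than typed. The following is a sketch under that assumption; `manual_route_cmds` is a hypothetical helper, and it only prints the commands so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Print the commands that delete the manually added row A routes.
# Nothing is executed here; the output is meant to be reviewed first.
IFACE="ens2f1np1"
V4_ROUTES="10.192.0.0/22 208.80.153.0/27"
V6_ROUTES="2620:0:860:1::/64 2620:0:860:101::/64"

manual_route_cmds() {
    for prefix in $V4_ROUTES; do
        echo "ip route del $prefix dev $IFACE"
    done
    for prefix in $V6_ROUTES; do
        echo "ip -6 route del $prefix dev $IFACE"
    done
}
```

Once the printed commands look right, one way to apply them is `manual_route_cmds | sudo sh -ex`, which stops at the first failing deletion.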

At this point lvs2010 should be ready to handle incoming traffic. Check that everything is green on icinga.

  • Stop pybal on lvs2009. This effectively depools lvs2009. Check that everything looks fine and that lvs2010 takes over lvs2009's traffic as expected: https://grafana.wikimedia.org/d/000000343/load-balancers-lvs
  • Enable lvs2009 port on asw-a2-codfw replacement
  • Clear the routes that were manually added on lvs2009 to keep row A realservers from being reached via the default route:
    • sudo -i ip route del 10.192.0.0/22 dev ens2f1np1
    • sudo -i ip route del 208.80.153.0/27 dev ens2f1np1
    • sudo -i ip -6 route del 2620:0:860:1::/64 dev ens2f1np1
    • sudo -i ip -6 route del 2620:0:860:101::/64 dev ens2f1np1
  • Bring ens2f1np1 down and back up on lvs2009 (ifdown ens2f1np1 && ifup ens2f1np1) and make sure that the routes for VLAN IDs 2017 and 2001 are present. The output should look like this:
~ $ ip route |grep ens2f1np1
10.192.0.0/22 dev ens2f1np1.2017 proto kernel scope link src 10.192.1.8 
208.80.153.0/27 dev ens2f1np1.2001 proto kernel scope link src 208.80.153.19
~ $ ip -6 route |grep ens2f1np1
2620:0:860:1::/64 dev ens2f1np1.2001 proto kernel metric 256 expires 2591996sec pref medium
2620:0:860:101::/64 dev ens2f1np1.2017 proto kernel metric 256 expires 2591996sec pref medium
fe80::/64 dev ens2f1np1 proto kernel metric 256 pref medium
fe80::/64 dev ens2f1np1.2017 proto kernel metric 256 pref medium
fe80::/64 dev ens2f1np1.2001 proto kernel metric 256 pref medium
default via fe80::1 dev ens2f1np1.2001 proto ra metric 1024 expires 596sec hoplimit 64 pref medium
default via fe80::1 dev ens2f1np1.2017 proto ra metric 1024 expires 596sec hoplimit 64 pref medium
  • Re-enable and run puppet on lvs2009
  • Restart pybal on lvs2009
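
The route check after the ifdown/ifup step is easy to get wrong by eye, because the broken and the healthy state differ only in the interface suffix (ens2f1np1 versus ens2f1np1.2017 or ens2f1np1.2001). The following is a rough sketch of an automated check; `has_route` and `check_vlan_routes` are hypothetical helper names, and the prefix-to-interface pairs are taken from the sample output in this task.

```shell
#!/bin/sh
# Read `ip route` style output on stdin and succeed only if the given prefix
# is attached to exactly the given device.
has_route() {
    awk -v p="$1" -v d="$2" '
        $1 == p && $2 == "dev" && $3 == d { found = 1 }
        END { exit !found }
    '
}

# $1: file holding the combined output of `ip route` and `ip -6 route`.
# Prints one status line per expected route and returns non-zero if any
# expected route is missing from its VLAN sub-interface.
check_vlan_routes() {
    rc=0
    while read -r prefix dev; do
        if has_route "$prefix" "$dev" < "$1"; then
            echo "ok      $prefix dev $dev"
        else
            echo "MISSING $prefix dev $dev"
            rc=1
        fi
    done <<EOF
10.192.0.0/22 ens2f1np1.2017
208.80.153.0/27 ens2f1np1.2001
2620:0:860:1::/64 ens2f1np1.2001
2620:0:860:101::/64 ens2f1np1.2017
EOF
    return "$rc"
}
```

A possible invocation is `{ ip route; ip -6 route; } > /tmp/routes.txt && check_vlan_routes /tmp/routes.txt`, where the temporary file path is arbitrary.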

At this point lvs2009 should be handling the traffic again.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-07-19T18:20:29Z] <vgutierrez> enabling pybal on lvs2007 - T286921

Mentioned in SAL (#wikimedia-operations) [2021-07-19T18:22:57Z] <vgutierrez> disable puppet & stop pybal on lvs2010 - T286921

Mentioned in SAL (#wikimedia-operations) [2021-07-19T18:35:46Z] <vgutierrez> running puppet and restarting pybal on lvs2010 - T286921

Mentioned in SAL (#wikimedia-operations) [2021-07-19T18:40:24Z] <vgutierrez> stop pybal on lvs2009 - T286921

Mentioned in SAL (#wikimedia-operations) [2021-07-19T18:46:51Z] <topranks> Running homer to re-enable port xe-2/0/43 on asw2-a2-codfw (lvs2009) - T286921

Mentioned in SAL (#wikimedia-operations) [2021-07-19T18:53:12Z] <vgutierrez> running puppet and restarting pybal on lvs2009 - T286921

Change 705466 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/dns@master] Revert "admin_state: Depool codfw text"

https://gerrit.wikimedia.org/r/705466

Change 705466 merged by Vgutierrez:

[operations/dns@master] Revert "admin_state: Depool codfw text"

https://gerrit.wikimedia.org/r/705466

Vgutierrez claimed this task.
Vgutierrez updated the task description.