
Audit eqiad & codfw LVS network links
Open, High, Public

Description

During T286787 we detected that the primary LVS for high-traffic1 and the secondary LVS in codfw were getting their row A traffic from the very same switch (A2), effectively weakening our HA LVS setup.

LVS networking links for codfw and eqiad should be checked to avoid the same kind of issue in the future.

Follow-up to https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-07-16_asw-a2-codfw_network

Event Timeline

eqiad:

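| Host | Row | Host iface | Switch iface |
| --- | --- | --- | --- |
| lvs1013 | A | enp4s0f0 | xe-7/0/34 |
| lvs1014 | A | enp4s0f1 | xe-4/0/18 |
| lvs1015 | A | enp5s0f0 | xe-2/0/0 |
| lvs1016 | A | enp5s0f0 | xe-4/0/7 |
| lvs1013 | B | enp4s0f1 | xe-4/0/15 |
| lvs1014 | B | enp4s0f0 | xe-7/0/29 |
| lvs1015 | B | enp4s0f1 | xe-2/0/3 |
| lvs1016 | B | enp5s0f0 | xe-4/0/34 |
| lvs1013 | C | enp5s0f0 | xe-4/0/32 |
| lvs1014 | C | enp5s0f0 | xe-2/0/13 |
| lvs1015 | C | enp5s0f1 | xe-7/0/19 |
| lvs1016 | C | enp5s0f0 | xe-4/0/5 |
| lvs1013 | D | enp5s0f1 | xe-2/0/12 |
| lvs1014 | D | enp5s0f1 | xe-7/0/4 |
| lvs1015 | D | enp5s0f1 | xe-2/0/4 |
| lvs1016 | D | enp4s0f0 | xe-7/0/15 |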

codfw:

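| Host | Row | Host iface | Switch iface |
| --- | --- | --- | --- |
| lvs2007 | A | ens2f0np0 | xe-2/0/45 |
| lvs2008 | A | ens2f1np1 | xe-7/0/45 |
| lvs2009 | A | ens2f1np1 | xe-2/0/43 |
| lvs2010 | A | ens2f1np1 | xe-2/0/44 |
| lvs2007 | B | ens2f1np1 | xe-7/0/45 |
| lvs2008 | B | ens2f0np0 | xe-2/0/45 |
| lvs2009 | B | ens3f0np0 | xe-2/0/43 |
| lvs2010 | B | ens3f0np0 | xe-2/0/44 |
| lvs2007 | C | ens3f0np0 | xe-7/0/45 |
| lvs2008 | C | ens3f0np0 | xe-2/0/45 |
| lvs2009 | C | ens2f0np0 | xe-2/0/44 |
| lvs2010 | C | ens3f1np1 | xe-2/0/43 |
| lvs2007 | D | ens3f1np1 | xe-7/0/45 |
| lvs2008 | D | ens3f1np1 | xe-2/0/46 |
| lvs2009 | D | ens3f1np1 | xe-2/0/43 |
| lvs2010 | D | ens2f0np0 | xe-2/0/44 |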

@Papaul maybe I'm missing some limitation, but is there any reason to use only two switches per row in codfw for the LVS network links? Adding a third switch per row would be great for availability. Could we do that?

@Vgutierrez the only limitation I can see is the number of NIC ports on each server. Each server has 4 NICs, each connected to 1 switch in 1 row. If we keep the existing configuration, we will have to add another 4-port NIC to each server so that each server can have 2 connections in each row.

@Papaul nope, the idea would be to replace some of the current links with new ones to additional switches.

@Vgutierrez if my understanding is right, you want, for example, lvsX NIC 1 connected to switch asw-a2 and NIC 2 to switch asw-a7 (2 switches in row A), NIC 3 to switch asw-b2 and NIC 4 to switch asw-b7 (2 switches in row B), and so on?

The current problem is that both lvs2007 (primary for high-traffic1) and lvs2010 (secondary) get row A traffic from the very same switch, so if that switch fails we lose row A in both the primary and the secondary LVS. The same happens between lvs2009 and lvs2010 for row A, and between lvs2008, lvs2009 and lvs2010 for rows B, C and D.
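The overlap described above can be reproduced from the codfw table with a short script. This is only a sketch: it assumes that the first number of a switch port name (the `2` in `xe-2/0/45`) identifies the switch within the row, so that `xe-2/...` in row A means switch A2. The names `CODFW_LINKS`, `switch_member` and `shared_switches` are made up for this example.

```python
from collections import defaultdict

# (host, row, switch iface) copied from the codfw table earlier in this task.
CODFW_LINKS = [
    ("lvs2007", "A", "xe-2/0/45"), ("lvs2008", "A", "xe-7/0/45"),
    ("lvs2009", "A", "xe-2/0/43"), ("lvs2010", "A", "xe-2/0/44"),
    ("lvs2007", "B", "xe-7/0/45"), ("lvs2008", "B", "xe-2/0/45"),
    ("lvs2009", "B", "xe-2/0/43"), ("lvs2010", "B", "xe-2/0/44"),
    ("lvs2007", "C", "xe-7/0/45"), ("lvs2008", "C", "xe-2/0/45"),
    ("lvs2009", "C", "xe-2/0/44"), ("lvs2010", "C", "xe-2/0/43"),
    ("lvs2007", "D", "xe-7/0/45"), ("lvs2008", "D", "xe-2/0/46"),
    ("lvs2009", "D", "xe-2/0/43"), ("lvs2010", "D", "xe-2/0/44"),
]


def switch_member(switch_iface):
    """Return the first number of a port name, e.g. 'xe-2/0/45' -> 2.

    Assumption: this number identifies the switch within its row.
    """
    return int(switch_iface.split("-", 1)[1].split("/")[0])


def shared_switches(links):
    """Map (row, member) -> sorted hosts, keeping only switches used by 2+ hosts."""
    hosts_by_switch = defaultdict(set)
    for host, row, iface in links:
        hosts_by_switch[(row, switch_member(iface))].add(host)
    return {
        key: sorted(hosts)
        for key, hosts in hosts_by_switch.items()
        if len(hosts) > 1
    }


if __name__ == "__main__":
    for (row, member), hosts in sorted(shared_switches(CODFW_LINKS).items()):
        print(f"row {row}, switch {row}{member}: {', '.join(hosts)}")
```

Run against the codfw table, this reports A2 as shared by lvs2007, lvs2009 and lvs2010, and B2, C2 and D2 as shared by lvs2008, lvs2009 and lvs2010.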

So it looks like we should move the lvs2010 row A, B and C links to another switch in those rows, and move the lvs2008 and lvs2009 row D links to another switch in row D.
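A rough way to sanity-check this plan is to apply the moves to the current layout and confirm that lvs2010 no longer shares a switch with any other LVS host. The sketch below makes several assumptions: the switch names are derived from the port names as in the tables in this task (with lvs2007 on B7 and D7), the target switches A4, B4, C4 and D4 are the ones requested later in this thread, and lvs2010 is treated as the secondary for every other host. `CURRENT`, `MOVES`, `apply_moves` and `overlaps` are illustrative names only.

```python
import copy

# Current codfw layout as host -> {row: switch}.
CURRENT = {
    "lvs2007": {"A": "A2", "B": "B7", "C": "C7", "D": "D7"},
    "lvs2008": {"A": "A7", "B": "B2", "C": "C2", "D": "D2"},
    "lvs2009": {"A": "A2", "B": "B2", "C": "C2", "D": "D2"},
    "lvs2010": {"A": "A2", "B": "B2", "C": "C2", "D": "D2"},
}

# Proposed moves as (host, row, new switch).
MOVES = [
    ("lvs2010", "A", "A4"),
    ("lvs2010", "B", "B4"),
    ("lvs2010", "C", "C4"),
    ("lvs2008", "D", "D4"),
    ("lvs2009", "D", "D4"),
]


def apply_moves(layout, moves):
    """Return a copy of the layout with the given links moved."""
    new_layout = copy.deepcopy(layout)
    for host, row, switch in moves:
        new_layout[host][row] = switch
    return new_layout


def overlaps(layout, backup):
    """Return {(host, row): switch} wherever a host shares a switch with `backup`."""
    return {
        (host, row): switch
        for host, rows in layout.items()
        if host != backup
        for row, switch in rows.items()
        if layout[backup][row] == switch
    }


if __name__ == "__main__":
    planned = apply_moves(CURRENT, MOVES)
    print("shared with lvs2010 before:", len(overlaps(CURRENT, "lvs2010")))
    print("shared with lvs2010 after: ", len(overlaps(planned, "lvs2010")))
```

Note that this check only covers lvs2010. After the moves, lvs2008 and lvs2009 would still share B2, C2 and D4, and lvs2007 and lvs2009 would still share A2.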

OK, understood. Please provide the configuration you want in a table like the one above, showing for each server which NIC connects to which switch, and I can do my site audit on how best to re-route those links.

| Host | Row | Host iface | Switch iface | Switch name |
| --- | --- | --- | --- | --- |
| lvs2007 | A | ens2f0np0 | xe-2/0/45 | A2 |
| lvs2008 | A | ens2f1np1 | xe-7/0/45 | A7 |
| lvs2009 | A | ens2f1np1 | xe-2/0/43 | A2 |
| lvs2010 | A | ens2f1np1 | Switch A4 | A4 |
| lvs2007 | B | ens2f1np1 | xe-7/0/45 | A7 |
| lvs2008 | B | ens2f0np0 | xe-2/0/45 | B2 |
| lvs2009 | B | ens3f0np0 | xe-2/0/43 | B2 |
| lvs2010 | B | ens3f0np0 | Switch B4 | B4 |
| lvs2007 | C | ens3f0np0 | xe-7/0/45 | C7 |
| lvs2008 | C | ens3f0np0 | xe-2/0/45 | C2 |
| lvs2009 | C | ens2f0np0 | xe-2/0/44 | C2 |
| lvs2010 | C | ens3f1np1 | Switch C4 | C4 |
| lvs2007 | D | ens3f1np1 | xe-7/0/45 | C7 |
| lvs2008 | D | ens3f1np1 | Switch D4 | D4 |
| lvs2009 | D | ens3f1np1 | Switch D4 | D4 |
| lvs2010 | D | ens2f0np0 | xe-2/0/44 | D2 |

| Host | Host iface | Switch iface | Switch name | Change notes | Iface on new switch | Complete |
| --- | --- | --- | --- | --- | --- | --- |
| lvs2007 | ens2f0np0 | xe-2/0/45 | asw-a2-codfw | no change | no change | |
| lvs2007 | ens2f1np1 | xe-7/0/45 | asw-b7-codfw | no change | no change | |
| lvs2007 | ens3f0np0 | xe-7/0/45 | asw-c7-codfw | no change | no change | |
| lvs2007 | ens3f1np1 | xe-7/0/45 | asw-d7-codfw | no change | no change | |
| lvs2008 | ens2f0np0 | xe-2/0/45 | asw-b2-codfw | no change | no change | |
| lvs2008 | ens2f1np1 | xe-7/0/45 | asw-a7-codfw | no change | no change | |
| lvs2008 | ens3f0np0 | xe-2/0/45 | asw-c2-codfw | no change | no change | |
| lvs2008 | ens3f1np1 | xe-2/0/46 | asw-d4-codfw | from D2 to D4 | xe-4/0/47 | yes |
| lvs2009 | ens2f0np0 | xe-2/0/44 | asw-c2-codfw | no change | no change | |
| lvs2009 | ens2f1np1 | xe-2/0/43 | asw-a2-codfw | no change | no change | |
| lvs2009 | ens3f0np0 | xe-2/0/43 | asw-b2-codfw | no change | no change | |
| lvs2009 | ens3f1np1 | xe-2/0/43 | asw-d2-codfw | from d2 to d4 | xe-4/0/46 | yes |
| lvs2010 | ens2f0np0 | xe-2/0/44 | asw-d2-codfw | no change | no change | |
| lvs2010 | ens2f1np1 | xe-2/0/44 | asw-a4-codfw | from A2 to a4 | xe-4/0/47 | yes |
| lvs2010 | ens3f0np0 | xe-2/0/44 | asw-b4-codfw | from b2 to b4 | xe-4/0/47 | yes |
| lvs2010 | ens3f1np1 | xe-2/0/43 | asw-c4-codfw | from c2 to c4 | xe-4/0/47 | yes |

@Vgutierrez Based on the table you provided, lvs2007 has its first 2 NICs connected to row A (asw-a2 and asw-a7) and the other 2 NICs connected to row C (asw-c7 and asw-c7). Can you please double-check this? Thanks.

@Papaul that's a mistake on my side, thanks for spotting it. The second NIC, ens2f1np1, is actually connected to B7.

@Vgutierrez thank you. What about lvs2007 ens3f1np1? It is currently connected to D7. Do you want it moved to C7? lvs2007 ens3f0np0 is already connected to C7.

@Papaul same thing: lvs2007 ens3f1np1 is connected to D7. The only desired changes are the new links to A4, B4, C4 and D4.

@Vgutierrez thank you, I have all the information I need. I will do my site audit and get back to you next week to set up a day and time to start moving those links.

@Vgutierrez do you have time today at 10am CT to move only the following link?
lvs2010 ens2f1np1 xe-2/0/44 asw-a4-codfw, from A2 to A4, xe-4/0/47

@Papaul we need to coordinate with @ayounsi or @cmooney to let them configure the ports on asw-a4-codfw. For me it's basically a NOOP on lvs2010, I just need to check that the interfaces are up and working after you change it
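The "interfaces are up" check mentioned above could look roughly like the following on the LVS host. This is a hedged sketch, not the procedure that was actually used: it reads the Linux kernel's link state from `/sys/class/net/<iface>/operstate`, and the function names and the `sysfs_root` parameter are invented for this example. The interface names are the lvs2010 ones from the tables in this task.

```python
from pathlib import Path


def iface_state(iface, sysfs_root="/sys/class/net"):
    """Return the kernel operstate ('up', 'down', ...) or 'missing'."""
    path = Path(sysfs_root) / iface / "operstate"
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        return "missing"


def ifaces_not_up(ifaces, sysfs_root="/sys/class/net"):
    """Return the interfaces whose operstate is anything other than 'up'."""
    return [i for i in ifaces if iface_state(i, sysfs_root) != "up"]


if __name__ == "__main__":
    lvs2010_ifaces = ["ens2f0np0", "ens2f1np1", "ens3f0np0", "ens3f1np1"]
    bad = ifaces_not_up(lvs2010_ifaces)
    print("all links up" if not bad else f"not up: {', '.join(bad)}")
```

An operstate of `up` only confirms that the link is established. It does not show that the new switch port is configured correctly, which is why the port setup on asw-a4-codfw still needs coordinating.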

I can set up the interface on asw-a4-codfw

Mentioned in SAL (#wikimedia-operations) [2021-08-05T14:53:52Z] <vgutierrez@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2008.codfw.wmnet with reason: T286881

Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: T286881

lvs2008.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-08-05T14:53:59Z] <vgutierrez@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2008.codfw.wmnet with reason: T286881

Mentioned in SAL (#wikimedia-operations) [2021-08-05T15:11:16Z] <vgutierrez@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2009.codfw.wmnet with reason: T286881

Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: T286881

lvs2009.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-08-05T15:11:22Z] <vgutierrez@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2009.codfw.wmnet with reason: T286881