
LVS should handle losing a NIC on eqiad and codfw
Open, Medium, Public

Description

Our current LVS setup assumes that the NICs can be set up properly, to the point that the proper routing rules are created. If for some reason that fails to happen (a down port, for example), the traffic for that row reaches the realservers via the default route on the LVS box. This is undesirable because the pybal healthchecks mark the affected servers as up and pool them, but ipvs traffic is not able to reach those servers.
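
The fallback described above can be illustrated with a toy longest-prefix-match lookup. This is only a sketch: the prefix `192.0.2.0/24` (a documentation range), the route labels, and the `lookup` helper are hypothetical, and the model ignores metrics and everything else a real kernel routing table does.

```python
import ipaddress


def lookup(routes, destination):
    """Return the (network, label) pair with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, label in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network, label))
    if not matches:
        return None
    # The most specific (longest) prefix wins, as in a normal route lookup.
    return max(matches, key=lambda match: match[0].prefixlen)


# Hypothetical table while the row-specific NIC is up.
healthy = [
    ("0.0.0.0/0", "default route"),
    ("192.0.2.0/24", "row NIC (connected route)"),
]

# Same table after the row NIC fails to come up: the connected route is gone.
degraded = [route for route in healthy if route[0] != "192.0.2.0/24"]

print(lookup(healthy, "192.0.2.10")[1])
print(lookup(degraded, "192.0.2.10")[1])
```

With the connected route missing, the lookup for a realserver in that row silently falls through to the default route, which matches the failure mode described in this task.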

A potential fix would be to inject static routes of type unreachable or blackhole with a lower metric. This would prevent row-specific traffic from reaching the realservers through another row via the default route.
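
As a rough sketch of what injecting such routes could look like, the helper below builds `ip route add` commands for a list of row subnets. The function name, the subnets (documentation ranges), and the metric value are all assumptions made for illustration and do not come from the actual LVS configuration. The metric in particular is a placeholder: it would have to be chosen so that the interface's own connected route stays preferred while the NIC is healthy. The sketch only covers IPv4 and only prints the commands, it does not run them.

```python
import ipaddress


def fallback_route_commands(row_subnets, route_type="unreachable", metric=4096):
    """Build iproute2 commands for per-row fallback routes (hypothetical helper)."""
    if route_type not in ("unreachable", "blackhole"):
        raise ValueError(f"unsupported route type: {route_type}")
    commands = []
    for subnet in row_subnets:
        # ip_network() raises ValueError on malformed prefixes.
        network = ipaddress.ip_network(subnet)
        commands.append(f"ip route add {route_type} {network} metric {metric}")
    return commands


# Hypothetical per-row subnets; real values would come from the LVS host config.
for command in fallback_route_commands(["192.0.2.0/24", "198.51.100.0/24"]):
    print(command)
```

Because these prefixes are more specific than the default route, a lookup for a realserver in a row whose NIC is down would hit the unreachable or blackhole entry instead of leaving via another row, so the healthchecks for that row would fail rather than succeed misleadingly.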

Follow-up to https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-07-16_asw-a2-codfw_network

Event Timeline

Legoktm triaged this task as Medium priority. Jul 26 2021, 11:14 PM
Legoktm added a subscriber: Legoktm.

[ Setting priority as part of clinic duty, please retriage if incorrect ]

Krinkle renamed this task from LVS can't handle losing a NIC on eqiad and codfw to LVS should handle losing a NIC on eqiad and codfw. Aug 18 2021, 4:46 AM
Krinkle updated the task description.
BBlack added a subscriber: BBlack.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!