LVS should handle losing a NIC on eqiad and codfw
Open, Medium, Public

Description

Our current LVS setup assumes that the NICs can be set up properly, to the point that the proper routing rules are created. If that fails for some reason (a down port, for example), traffic for that row would reach the realservers via the default route on the LVS box. This is undesirable because the pybal healthchecks would mark the affected servers as up and pool them, but ipvs traffic would not be able to reach those servers.

A potential fix would be to inject static routes of type unreachable or blackhole with a lower priority (that is, a higher metric) than the regular row-specific routes. This would prevent row-specific traffic from reaching the realservers through another row via the default route.
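To illustrate why such a route helps, here is a minimal sketch of a route lookup (longest prefix match, ties broken by lowest metric). It is a simplified model, not the kernel's actual implementation, and the subnet, interface names, realserver address, and metric value are all hypothetical.

```python
import ipaddress
from typing import List, NamedTuple, Optional


class Route(NamedTuple):
    prefix: str  # destination network in CIDR notation
    kind: str    # "unicast" or "blackhole"
    dev: str     # outgoing interface ("" for a blackhole route)
    metric: int  # lower metric wins among routes with the same prefix length


def lookup(table: List[Route], dst: str) -> Optional[Route]:
    """Simplified lookup: longest prefix first, then lowest metric."""
    addr = ipaddress.ip_address(dst)
    matches = [r for r in table if addr in ipaddress.ip_network(r.prefix)]
    if not matches:
        return None
    return min(
        matches,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, r.metric),
    )


# All values below are made up for illustration.
ROW_SUBNET = "10.0.16.0/22"
REALSERVER = "10.0.16.10"

default = Route("0.0.0.0/0", "unicast", "eth0", 0)
row_route = Route(ROW_SUBNET, "unicast", "eth1", 0)
guard = Route(ROW_SUBNET, "blackhole", "", 1024)

# Healthy: the row-specific route wins over the guard (lower metric).
healthy = lookup([default, row_route, guard], REALSERVER)

# NIC lost, no guard: traffic silently falls through to the default route.
nic_lost_no_guard = lookup([default], REALSERVER)

# NIC lost, guard present: the blackhole is more specific than the default.
nic_lost_guard = lookup([default, guard], REALSERVER)

print(healthy.dev, nic_lost_no_guard.dev, nic_lost_guard.kind)
```

In this model the guard route only takes effect once the row-specific route is missing, so healthchecks towards that row would fail instead of succeeding via another row.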

Follow-up to https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-07-16_asw-a2-codfw_network

Event Timeline

Legoktm triaged this task as Medium priority. Jul 26 2021, 11:14 PM
Legoktm subscribed.

[ Setting priority as part of clinic duty, please retriage if incorrect ]

Krinkle renamed this task from "LVS can't handle losing a NIC on eqiad and codfw" to "LVS should handle losing a NIC on eqiad and codfw". Aug 18 2021, 4:46 AM
Krinkle updated the task description.
BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

@Vgutierrez Since there is a project to replace LVS on the horizon, is this still worth pursuing?

akosiaris subscribed.

Removing SRE; this has already been triaged to a more specific SRE subteam.

Another approach is to put them in a distinct namespace (one without a default route); see T114979: Run IPVS in a separate network namespace

My own instincts would be to avoid all that complexity. Definitely an interesting suggestion, but given we are working towards a new L4LB I think maybe such a major change is best avoided.

A simple blackhole route for each range, with a high metric, ought to do the trick?
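As a rough sketch of what that could look like, the snippet below generates one `ip route add blackhole` command per range. The ranges and the metric value are placeholders, not the real row subnets, and the helper name is made up for illustration; it only prints the commands and does not apply them.

```python
import ipaddress
from typing import Iterable, List


def blackhole_route_commands(ranges: Iterable[str], metric: int = 1024) -> List[str]:
    """Build one blackhole route command per range (hypothetical helper)."""
    commands = []
    for cidr in ranges:
        # ip_network() raises ValueError on malformed or misaligned prefixes.
        net = ipaddress.ip_network(cidr)
        family = "-6" if net.version == 6 else "-4"
        commands.append(f"ip {family} route add blackhole {net} metric {metric}")
    return commands


# Placeholder ranges; substitute the actual per-row subnets.
EXAMPLE_RANGES = ["10.0.0.0/22", "10.0.16.0/22", "2001:db8:0:1::/64"]

for command in blackhole_route_commands(EXAMPLE_RANGES):
    print(command)
```

The metric only needs to be higher than that of the regular route for the same range, so the blackhole stays dormant while the NIC is healthy.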