
LVS hosts: Monitor/alert when pooled nodes are outside broadcast domain
Open, High, Public

Description

Layer 2 load balancing (as used by PyBal, WMF's production load balancer) requires backend nodes to be in the load balancer's broadcast domain(s).

We had an incident Wednesday-Friday where a couple of hosts were added to the PyBal rotation without the necessary VLAN plumbing in place, so they were pooled but unable to send traffic to users. This had significant user impact (1% of all searches resulted in errors), so I'm requesting that we monitor and alert for this situation.

The linked phab comment demonstrates a way to detect this situation:

Bad: the LVS host is routing traffic to the backend via its default gateway, which will never work:

cmooney@lvs1019:~$ ip route get fibmatch 10.64.152.2
default via 10.64.32.1 dev eno1np0 onlink

Good: the pooled node is directly connected:

cmooney@lvs1019:~$ ip route get fibmatch 10.64.152.2
10.64.152.0/24 dev vlan1047 proto kernel scope link src 10.64.152.19
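A script could wrap this detection. The sketch below is a rough illustration with hypothetical function names (not the actual lvs_l2_checker code); it assumes the JSON form of the same command, `ip --json route get fibmatch`, and treats a match on the default route as a failure:

```python
import json
import subprocess


def is_directly_connected(route: dict) -> bool:
    """A pooled backend is L2-reachable only if the matching FIB entry
    is a connected route, not the default route."""
    return route.get("dst") != "default"


def check_backend(ip: str) -> bool:
    """Run `ip --json route get fibmatch <ip>` and inspect the result.
    Hypothetical wrapper; error handling kept minimal for illustration."""
    out = subprocess.run(
        ["ip", "--json", "route", "get", "fibmatch", ip],
        capture_output=True, text=True, check=True,
    ).stdout
    routes = json.loads(out)
    return bool(routes) and all(is_directly_connected(r) for r in routes)


# Sample route entries modelled on the transcripts above (key names
# assumed to follow iproute2's JSON output):
bad = {"dst": "default", "gateway": "10.64.32.1", "dev": "eno1np0",
       "flags": ["onlink"]}
good = {"dst": "10.64.152.0/24", "dev": "vlan1047", "protocol": "kernel",
        "scope": "link", "prefsrc": "10.64.152.19"}
assert not is_directly_connected(bad)
assert is_directly_connected(good)
```

The check would be run once per pooled backend IP; any backend whose best-matching route is the default route gets flagged for alerting.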

Details

Title / Reference / Author / Source Branch / Dest Branch

Output status to stdout instead of stderr / repos/search-platform/sre/lvs_l2_checker!4 / bking / stdout / main
Source backend node info from pools.json. / repos/search-platform/sre/lvs_l2_checker!3 / bking / pools.json / main
Source backend node info from pools.json. / repos/search-platform/sre/lvs_l2_checker!2 / bking / pools.json / main
LVS: monitor for l2 connectivity / repos/search-platform/sre/lvs_l2_checker!1 / bking / mvp / main

Event Timeline

Thanks Brian.

Yeah, I think the check would need to do the following:

  1. Get the current list of active back-end IPs.

I'm not at all sure how to do that from etcd, but I'm sure it's not too hard.

  2. Check that the system has a valid ARP entry for that host:

cmooney@lvs1019:~$ ip --json neigh show 10.64.152.2
[{"dst":"10.64.152.2","dev":"vlan1047","lladdr":"14:23:f2:c2:96:e0","state":["REACHABLE"]}]

We could probably just run "ip --json neigh show" once and parse the whole table rather than spawning multiple commands.

  3. If the system does not have one, try to ping the IP, then repeat the check to see if the ARP table has been populated for it.
  4. If it still hasn't, alert that the IP is configured but LVS can't ARP for it.
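The ARP check and ping fallback described above could be sketched as follows. This is a rough illustration with hypothetical names, parsing a single `ip --json neigh show` dump rather than spawning a command per backend; the set of "valid" neighbour states is an assumption:

```python
import json
import subprocess

# Neighbour states we'd treat as a usable ARP entry (assumed set).
GOOD_STATES = {"REACHABLE", "STALE", "DELAY", "PROBE", "PERMANENT"}


def parse_neigh_table(neigh_json: str) -> dict:
    """Map IP address -> neighbour state from `ip --json neigh show` output."""
    table = {}
    for entry in json.loads(neigh_json):
        states = entry.get("state", [])
        table[entry["dst"]] = states[0] if states else None
    return table


def has_arp_entry(ip: str, table: dict) -> bool:
    return table.get(ip) in GOOD_STATES


def check_with_ping_fallback(ip: str) -> bool:
    """If there is no valid entry, ping once and re-read the table; a
    False result means the IP is pooled but LVS can't ARP for it."""
    def current_table():
        out = subprocess.run(["ip", "--json", "neigh", "show"],
                             capture_output=True, text=True, check=True).stdout
        return parse_neigh_table(out)

    if has_arp_entry(ip, current_table()):
        return True
    subprocess.run(["ping", "-c", "1", "-W", "1", ip], capture_output=True)
    return has_arp_entry(ip, current_table())
```

Parsing the whole table once, as suggested, means `parse_neigh_table` runs a single time per check cycle and each backend lookup is just a dict access.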

Actually, it may just be easier to check the route for each pooled IP and make sure the command doesn't return saying it's using the default route, as per the task description.

cmooney@lvs1019:~$ ip --json route get fibmatch 1.1.1.1 
[{"dst":"default","gateway":"10.64.32.1","dev":"eno1np0","flags":["onlink"]}]

The ARP check could maybe catch some other edge cases, such as a particular backend being down, but checking that the system is not using its default route for a given IP is enough to catch a missing VLAN interface.

FWIW, I've started work on a simple script we could use for monitoring, but I ran into some weirdness with config-master (see T364037). Should be easy enough to work around, but I just wanted to give a progress report.

bking renamed this task from LVS hosts: Monitor/alert on when pooled nodes are outside broadcast domain to LVS hosts: Monitor/alert when pooled nodes are outside broadcast domain.Thu, May 9, 8:06 PM

Change #1030185 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] lvs: add script to check for L2 connectivity

https://gerrit.wikimedia.org/r/1030185

@ssingh @cmooney If you want to see the script in action, you can try it from my homedir on lvs1019 (expected to pass) or on cumin2002 (expected to fail). Not required, but it could be more practical than a code review.

I think the proposed check covers only a very specific failure scenario that is unlikely to happen again before Liberica, while it doesn't cover other possible failure scenarios related to the L2 reachability of the backend server on the VIP address.

I still think a better solution could be achieved without touching PyBal's code, given that one of its monitors (runcommand) IIRC allows executing an external command. This external monitor could properly check the connectivity of the backend using the destination VIP and prevent traffic from being sent to the bogus backend in the first place.
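An external monitor of that shape might look like the sketch below. This is purely illustrative and makes assumptions not confirmed in this task: the exact PyBal runcommand interface isn't described here, and the probe shown (pinging the backend with the VIP as source address, via iputils ping's -I flag) is one plausible way to exercise the L2 path the real traffic would take. The convention assumed is that a non-zero exit code marks the backend down:

```python
import subprocess
import sys


def build_vip_ping(vip: str, backend_ip: str) -> list:
    """Build a ping command that sources the probe from the VIP, so it
    exercises the same L2 path real load-balanced traffic would take.
    (Hypothetical sketch; -I takes an address with iputils ping.)"""
    return ["ping", "-c", "1", "-W", "1", "-I", vip, backend_ip]


def main(vip: str, backend_ip: str) -> int:
    result = subprocess.run(build_vip_ping(vip, backend_ip),
                            capture_output=True)
    # Assumed contract: exit 0 = backend reachable, non-zero = depool.
    return 0 if result.returncode == 0 else 1


if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Run per backend by the monitor, this would catch the missing-VLAN case before traffic is ever sent, rather than alerting after the fact.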

I even suggested a solution like this when I joined back in 2016, as the year before I had written a simple C program that would perform exactly that check, to prevent exactly these issues.

My 2 cents.

I think that the proposed check covers only a very specific failure scenario that is unlikely to happen again before Liberica

That is true. Still, I feel it doesn't hurt to have this one scenario alerted on.

I still think that a better solution could be achieved without having to touch Pybal's code given that one of the monitors (runcommand) IIRC allows to execute an external command. This external monitor could properly check the connectivity of the backend using the destination VIP and prevent traffic to be sent to the bogus backend in the first place.

That sounds better indeed. That option didn't come up in the previous discussion, but if it can be done easily and is agreeable to Traffic then let's do it instead!

This comment was removed by bking.

I still think that a better solution could be achieved without having to touch Pybal's code given that one of the monitors (runcommand) IIRC allows to execute an external command. This external monitor could properly check the connectivity of the backend using the destination VIP and prevent traffic to be sent to the bogus backend in the first place.

@Volans Can you share more details about this "external monitor?" I assume it runs on the LVS host itself? If so, can we make this the default for all new pools? IMHO if the load balancer requires L2 connectivity, we should be checking for it.

while it doesn't cover other possible failure scenarios related to the L2 reachability of the backend server on the VIP address.

Can you elaborate on these other failure scenarios? Maybe we could add them to the script.

I even suggested a solution like this when I joined back in 2016 as I had wrote the year before a simple C program that would perform exactly that check to prevent exactly those issues.

Do you still have the code, and does it account for those other failure scenarios you mentioned? Maybe we could adapt it for Python.

Before any more digression, I'd like Traffic to chime in and clarify whether an external binary monitor is an option in current PyBal that can be achieved with configuration-only changes, and is actually on the table.