
LVS hosts: Monitor/alert when pooled nodes are outside broadcast domain
Open, High, Public

Description

Layer 2 load balancing (as used by PyBal, WMF's production load balancer) requires backend nodes to be in the load balancer's broadcast domain(s).

We had an incident Wednesday-Friday where a couple of hosts were added to the PyBal rotation without the necessary VLAN plumbing in place, so they were pooled but unable to send traffic to users. This had significant user impact (1% of all searches resulted in errors), so I'm requesting that we monitor and alert for this situation.

The linked phab comment demonstrates a way to detect this situation:

Bad: the LVS host is routing traffic to the backend via its default gateway, which will never work:

cmooney@lvs1019:~$ ip route get fibmatch 10.64.152.2
default via 10.64.32.1 dev eno1np0 onlink

Good: the pooled node is directly connected:

cmooney@lvs1019:~$ ip route get fibmatch 10.64.152.2
10.64.152.0/24 dev vlan1047 proto kernel scope link src 10.64.152.19
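A script could wrap this detection. The sketch below is a rough illustration with hypothetical function names (not the actual lvs_l2_checker code); it assumes the JSON form of the same command, `ip --json route get fibmatch`, and treats a match on the default route as a failure:

```python
import json
import subprocess


def is_directly_connected(route: dict) -> bool:
    """A pooled backend is L2-reachable only if the matching FIB entry
    is a connected route, not the default route."""
    return route.get("dst") != "default"


def check_backend(ip: str) -> bool:
    """Run `ip --json route get fibmatch <ip>` and inspect the result.
    Hypothetical wrapper; error handling kept minimal for illustration."""
    out = subprocess.run(
        ["ip", "--json", "route", "get", "fibmatch", ip],
        capture_output=True, text=True, check=True,
    ).stdout
    routes = json.loads(out)
    return bool(routes) and all(is_directly_connected(r) for r in routes)


# Sample route entries modelled on the transcripts above (key names
# assumed to follow iproute2's JSON output):
bad = {"dst": "default", "gateway": "10.64.32.1", "dev": "eno1np0",
       "flags": ["onlink"]}
good = {"dst": "10.64.152.0/24", "dev": "vlan1047", "protocol": "kernel",
        "scope": "link", "prefsrc": "10.64.152.19"}
assert not is_directly_connected(bad)
assert is_directly_connected(good)
```

The check would be run once per pooled backend IP; any backend whose best-matching route is the default route gets flagged for alerting.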

Details

Title / Reference / Author / Source Branch / Dest Branch

Output status to stdout instead of stderr / repos/search-platform/sre/lvs_l2_checker!4 / bking / stdout / main
Source backend node info from pools.json. / repos/search-platform/sre/lvs_l2_checker!3 / bking / pools.json / main
Source backend node info from pools.json. / repos/search-platform/sre/lvs_l2_checker!2 / bking / pools.json / main
LVS: monitor for l2 connectivity / repos/search-platform/sre/lvs_l2_checker!1 / bking / mvp / main

Event Timeline

Thanks Brian.

Yeah, I think the check would need to do the following:

  1. Get the current list of active back-end IPs.

I'm not at all sure how to do that from etcd, but I'm sure it's not too hard.

  2. Check that the system has a valid ARP entry for that host:

cmooney@lvs1019:~$ ip --json neigh show 10.64.152.2
[{"dst":"10.64.152.2","dev":"vlan1047","lladdr":"14:23:f2:c2:96:e0","state":["REACHABLE"]}]

We could probably just run "ip --json neigh show" once and parse the whole table rather than spawning multiple commands.

  3. If the system does not have one, try to ping the IP, then repeat the check to see if the ARP table has been populated for it.
  4. If it still hasn't, alert that the IP is configured but LVS can't ARP for it.
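The ARP check and ping fallback described above could be sketched as follows. This is a rough illustration with hypothetical names, parsing a single `ip --json neigh show` dump rather than spawning a command per backend; the set of "valid" neighbour states is an assumption:

```python
import json
import subprocess

# Neighbour states we'd treat as a usable ARP entry (assumed set).
GOOD_STATES = {"REACHABLE", "STALE", "DELAY", "PROBE", "PERMANENT"}


def parse_neigh_table(neigh_json: str) -> dict:
    """Map IP address -> neighbour state from `ip --json neigh show` output."""
    table = {}
    for entry in json.loads(neigh_json):
        states = entry.get("state", [])
        table[entry["dst"]] = states[0] if states else None
    return table


def has_arp_entry(ip: str, table: dict) -> bool:
    return table.get(ip) in GOOD_STATES


def check_with_ping_fallback(ip: str) -> bool:
    """If there is no valid entry, ping once and re-read the table; a
    False result means the IP is pooled but LVS can't ARP for it."""
    def current_table():
        out = subprocess.run(["ip", "--json", "neigh", "show"],
                             capture_output=True, text=True, check=True).stdout
        return parse_neigh_table(out)

    if has_arp_entry(ip, current_table()):
        return True
    subprocess.run(["ping", "-c", "1", "-W", "1", ip], capture_output=True)
    return has_arp_entry(ip, current_table())
```

Parsing the whole table once, as suggested, means `parse_neigh_table` runs a single time per check cycle and each backend lookup is just a dict access.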

Actually, it may just be easier to check the route for each pooled IP and make sure the command doesn't return saying it's using the default route, as per the task description.

cmooney@lvs1019:~$ ip --json route get fibmatch 1.1.1.1 
[{"dst":"default","gateway":"10.64.32.1","dev":"eno1np0","flags":["onlink"]}]

The ARP check could maybe catch some other edge cases, such as a particular backend being down, but checking that the system is not using its default route for a given IP is enough to catch a missing VLAN interface.

FWIW, I've started work on a simple script we could use for monitoring, but I ran into some weirdness with config-master (see T364037). Should be easy enough to work around, but I just wanted to give a progress report.

bking renamed this task from LVS hosts: Monitor/alert on when pooled nodes are outside broadcast domain to LVS hosts: Monitor/alert when pooled nodes are outside broadcast domain.Thu, May 9, 8:06 PM

Change #1030185 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] lvs: add script to check for L2 connectivity

https://gerrit.wikimedia.org/r/1030185

@ssingh @cmooney If you want to see the script in action, you can try it from my homedir on lvs1019 (expected to pass) or on cumin2002 (expected to fail). Not required, but it could be more practical than a code review.

I think the proposed check covers only a very specific failure scenario that is unlikely to happen again before Liberica, while it doesn't cover other possible failure scenarios related to the L2 reachability of the backend server on the VIP address.

I still think a better solution could be achieved without touching PyBal's code, given that one of its monitors (runcommand) IIRC allows executing an external command. This external monitor could properly check the connectivity of the backend using the destination VIP and prevent traffic from being sent to the bogus backend in the first place.
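An external monitor of that shape might look like the sketch below. This is purely illustrative and makes assumptions not confirmed in this task: the exact PyBal runcommand interface isn't described here, and the probe shown (pinging the backend with the VIP as source address, via iputils ping's -I flag) is one plausible way to exercise the L2 path the real traffic would take. The convention assumed is that a non-zero exit code marks the backend down:

```python
import subprocess
import sys


def build_vip_ping(vip: str, backend_ip: str) -> list:
    """Build a ping command that sources the probe from the VIP, so it
    exercises the same L2 path real load-balanced traffic would take.
    (Hypothetical sketch; -I takes an address with iputils ping.)"""
    return ["ping", "-c", "1", "-W", "1", "-I", vip, backend_ip]


def main(vip: str, backend_ip: str) -> int:
    result = subprocess.run(build_vip_ping(vip, backend_ip),
                            capture_output=True)
    # Assumed contract: exit 0 = backend reachable, non-zero = depool.
    return 0 if result.returncode == 0 else 1


if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Run per backend by the monitor, this would catch the missing-VLAN case before traffic is ever sent, rather than alerting after the fact.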

I even suggested a solution like this when I joined back in 2016, as the year before I had written a simple C program that would perform exactly that check, to prevent exactly these issues.

My 2 cents.

I think that the proposed check covers only a very specific failure scenario that is unlikely to happen again before Liberica

That is true. Still, I feel it doesn't hurt to have this one scenario alerted on.

I still think that a better solution could be achieved without having to touch Pybal's code given that one of the monitors (runcommand) IIRC allows to execute an external command. This external monitor could properly check the connectivity of the backend using the destination VIP and prevent traffic to be sent to the bogus backend in the first place.

That sounds better indeed. That option didn't come up in the previous discussion, but if it can be done easily and is agreeable to Traffic then let's do it instead!

This comment was removed by bking.

I still think that a better solution could be achieved without having to touch Pybal's code given that one of the monitors (runcommand) IIRC allows to execute an external command. This external monitor could properly check the connectivity of the backend using the destination VIP and prevent traffic to be sent to the bogus backend in the first place.

@Volans Can you share more details about this "external monitor?" I assume it runs on the LVS host itself? If so, can we make this the default for all new pools? IMHO if the load balancer requires L2 connectivity, we should be checking for it.

while it doesn't cover other possible failure scenarios related to the L2 reachability of the backend server on the VIP address.

Can you elaborate on these other failure scenarios? Maybe we could add them to the script.

I even suggested a solution like this when I joined back in 2016 as I had wrote the year before a simple C program that would perform exactly that check to prevent exactly those issues.

Do you still have the code, and does it account for those other failure scenarios you mentioned? Maybe we could adapt it for Python.

Before any more digression, I'd like Traffic to chime in and clarify whether an external binary monitor is an option in current PyBal that can be achieved with configuration-only changes, and is actually on the table.