Avoid unnecessary keepalived flap after rebooting servers
Closed, Resolved (Public)

Assigned To

Authored By

	aborrero
	Nov 3 2021, 5:21 PM

Description

I detected that cloudgw's keepalived daemon flaps when the primary server is rebooted:

primary server A, secondary server B
reboot A, B takes over as primary
A boots, takes over as primary

This adds an unnecessary transition. Instead, it should be:

primary server A, secondary server B
reboot A, B takes over as primary
A boots, nothing else happens, B stays as primary and A is secondary.

The extra transition could introduce further instability in the network, because when A takes over after the boot, the conntrack information might not be synced yet.

We currently configure keepalived with nopreempt and initial state BACKUP, so I suspect there is a bug somewhere.

Details

	Subject	Repo	Branch	Lines +/-
	cloudgw: keepalived: introduce service startup delay	operations/puppet	production	+11 -0
	cloudgw: keepalived: set same priority on the 2 VRRP instances	operations/puppet	production	+2 -1


Related Objects

Status	Assigned	Task
Resolved	nskaggs	T294853 2021-11-02 Cloud VPS network outage
Resolved	aborrero	T294955 cloud network: improve automated testing & monitoring
Resolved	aborrero	T294956 Avoid unnecessary keepalived flap after rebooting servers

Event Timeline

aborrero created this task. Nov 3 2021, 5:21 PM

aborrero updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2021-11-03T17:22:24Z] <arturo> [codfw1dev] installing keepalived 2.1.5 from buster-backports on cloudgw2001-dev/2002-dev (T294956)

Heads up: keepalived 2.1.5 has a bug that prevents it from working on a VRF, which is how we run it in cloudgw: https://github.com/acassen/keepalived/issues/1972

There is a fix upstream, but no keepalived release contains it yet.

This is a blocker for migrating the cloudgw servers to Debian Bullseye. A potential workaround is to run VRRP over a non-VRF-attached interface.

However, I'm not sure the flapping is meaningful for the stability of the network.

I think that when this was developed in T272963 (cloudgw: develop HA setup), I decided this was a non-issue.

Change 736548 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: keepalived: set same priority on the 2 VRRP instances

https://gerrit.wikimedia.org/r/736548

gerritbot added a project: Patch-For-Review. Nov 3 2021, 6:22 PM

Reported upstream: https://github.com/acassen/keepalived/issues/2032

aborrero triaged this task as Low priority. Nov 3 2021, 6:41 PM

Change 736548 abandoned by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: keepalived: set same priority on the 2 VRRP instances

Reason:

does not make a difference, see phab task.

https://gerrit.wikimedia.org/r/736548

Maintenance_bot removed a project: Patch-For-Review. Nov 4 2021, 9:10 AM

Change 736999 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: keepalived: introduce service startup delay

https://gerrit.wikimedia.org/r/736999

gerritbot added a project: Patch-For-Review. Nov 5 2021, 10:39 AM

Change 736999 abandoned by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: keepalived: introduce service startup delay

Reason:

still doesn't solve the issue, apparently

https://gerrit.wikimedia.org/r/736999

Maintenance_bot removed a project: Patch-For-Review. Nov 5 2021, 11:10 AM

Krinkle added a project: Sustainability (Incident Followup). Jan 10 2022, 6:05 PM

Krinkle renamed this task from "keepalived: flap when rebooting servers" to "Avoid unnecessary keepalived flap after rebooting servers". Jan 10 2022, 6:07 PM

I think the root problem here was fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/851087 (T320975). Let me explain:

The firewall is stateful, but we didn't have an explicit rule accepting the VRRP traffic from the other peer.

When a node sends its own VRRP advert packet, it creates a local conntrack entry that adverts from the remote peer can then match. This means that a rebooting node could not see VRRP adverts from the peer node until it had sent its own, and by that point the failover was already underway.

I just merged the change, but I had already run some tests, and the reboot-time flapping is gone.
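
To make the explanation above easier to follow, here is a rough toy model of the failure mode in Python. It is only a sketch under several assumptions, not how keepalived or nftables are implemented: time is counted in abstract ticks, `MASTER_DOWN_TICKS`, the `Node` class, the priorities (150 and 100) and the `accept_vrrp_rule` flag are all hypothetical names and values chosen for illustration, and the firewall is reduced to "accept an advert if there is an explicit rule or a conntrack entry created by our own advert".

```python
from dataclasses import dataclass

# Assumed value: silent ticks after which a BACKUP node promotes itself.
MASTER_DOWN_TICKS = 3


@dataclass
class Node:
    name: str
    priority: int
    state: str = "BACKUP"            # initial state BACKUP, as in the task
    accept_vrrp_rule: bool = False   # hypothetical: explicit firewall accept rule
    conntrack_entry: bool = False    # created once this node sends an advert
    silent_ticks: int = 0

    def accepts_advert(self) -> bool:
        # Simplified stateful firewall: explicit rule, or an existing entry.
        return self.accept_vrrp_rule or self.conntrack_entry


def tick(a: Node, b: Node) -> None:
    # Snapshot who is sending, so state changes only take effect next tick.
    sending = {n.name: n.state == "MASTER" for n in (a, b)}
    for n in (a, b):
        if sending[n.name]:
            n.conntrack_entry = True  # own advert creates the conntrack entry
    for node, peer in ((a, b), (b, a)):
        heard = sending[peer.name] and node.accepts_advert()
        if node.state == "BACKUP":
            if heard:
                node.silent_ticks = 0  # nopreempt: stay BACKUP whatever the priority
            else:
                node.silent_ticks += 1
                if node.silent_ticks >= MASTER_DOWN_TICKS:
                    node.state = "MASTER"
        elif heard and peer.priority > node.priority:
            # A MASTER that hears a higher-priority advert steps down.
            node.state = "BACKUP"
            node.silent_ticks = 0


def reboot_primary(accept_vrrp_rule: bool) -> tuple[str, str]:
    # A has just booted; B took over during the reboot and is sending adverts.
    a = Node("A", 150, accept_vrrp_rule=accept_vrrp_rule)
    b = Node("B", 100, state="MASTER",
             accept_vrrp_rule=accept_vrrp_rule, conntrack_entry=True)
    for _ in range(10):
        tick(a, b)
    return a.state, b.state


print("without accept rule:", reboot_primary(False))
print("with accept rule:   ", reboot_primary(True))
```

In this model, without the explicit rule A never hears B, promotes itself after the timeout and then wins on priority, which is the extra transition described in the task. With the rule, A hears B from the first tick and nopreempt keeps it as BACKUP.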
