Avoid unnecessary keepalived flap after rebooting servers
Closed, Resolved, Public

Description

I detected that cloudgw's keepalived daemon flaps when the primary server is rebooted:

  • primary server A, secondary server B
  • reboot A, B takes over as primary
  • A boots, takes over as primary

This adds an unnecessary transition. Instead, the sequence should be:

  • primary server A, secondary server B
  • reboot A, B takes over as primary
  • A boots, nothing else happens, B stays as primary and A is secondary.

The extra transition could make the network less stable, because when A takes over after booting, the conntrack information might not be synced yet.

We currently configure keepalived with nopreempt and initial state BACKUP, so I suspect there is a bug somewhere.
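To make the expected behaviour concrete, here is a minimal Python sketch of the decision a node makes after booting. It is a simplified toy model, not keepalived's actual state machine: the `Node` class, the `decide_state` function and the priority values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    priority: int
    nopreempt: bool = True
    state: str = "BACKUP"  # initial state BACKUP, as described in this task


def decide_state(node: Node, advert_priority: Optional[int]) -> str:
    """Toy model: pick the node's state after it has listened for adverts.

    advert_priority is the priority carried by an advert heard from the
    current primary, or None if no advert was heard at all.
    """
    if node.state == "MASTER":
        return node.state
    if advert_priority is None:
        # Nobody is advertising, so this node takes over.
        node.state = "MASTER"
    elif not node.nopreempt and node.priority > advert_priority:
        # Preemption allowed and we outrank the current primary.
        node.state = "MASTER"
    # Otherwise stay BACKUP: with nopreempt, a live primary is never displaced.
    return node.state


# A (higher priority) boots while B is primary and advertising.
a = Node("A", priority=100)
print(decide_state(a, advert_priority=50))    # BACKUP

# Same boot, but A never hears B's adverts.
a = Node("A", priority=100)
print(decide_state(a, advert_priority=None))  # MASTER
```

In this model the flap can only happen with nopreempt if A fails to hear B's adverts, which is one direction a search for the bug could take.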

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2021-11-03T17:22:24Z] <arturo> [codfw1dev] installing keepalived 2.1.5 from buster-backports on cloudgw2001-dev/2002-dev (T294956)

Heads up: keepalived 2.1.5 has a bug that prevents it from working inside a VRF, which is how we run it in cloudgw: https://github.com/acassen/keepalived/issues/1972

There is a fix upstream, but there is no keepalived release containing it.

This is a blocker for migrating the cloudgw servers to Debian Bullseye. A potential workaround is to run VRRP over a non-VRF-attached interface.

However, I'm not sure the flapping is meaningful for the stability of the network.

I think that when this was developed in T272963 (cloudgw: develop HA setup), I decided this was a non-issue.

Change 736548 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: keepalived: set same priority on the 2 VRRP instances

https://gerrit.wikimedia.org/r/736548

Change 736548 abandoned by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: keepalived: set same priority on the 2 VRRP instances

Reason:

does not make a difference, see phab task.

https://gerrit.wikimedia.org/r/736548

Change 736999 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: keepalived: introduce service startup delay

https://gerrit.wikimedia.org/r/736999

Change 736999 abandoned by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: keepalived: introduce service startup delay

Reason:

still doesn't solve the issue, apparently

https://gerrit.wikimedia.org/r/736999

Krinkle renamed this task from "keepalived: flap when rebooting servers" to "Avoid unnecessary keepalived flap after rebooting servers". Jan 10 2022, 6:07 PM
aborrero claimed this task.

I think the root problem here was fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/851087 (T320975). Let me explain:

The firewall is stateful, but we didn't have an explicit rule accepting the VRRP traffic from the other peer.

When a node sends its own VRRP advert packet, it creates a local conntrack entry that adverts from the remote peer can then match. Without an explicit accept rule, this means a rebooting node cannot see VRRP adverts from the peer node until it has sent its own, and at that point the failover is already underway.
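The mechanism can be sketched with a small Python toy model. It is an illustration under stated assumptions, not the real nftables/conntrack behaviour: it assumes adverts are exchanged directly between the two peers, it reduces a conntrack entry to a (local, remote, protocol) tuple, and `StatefulFirewall`, `send`, `accepts` and `accept_vrrp_from` are hypothetical names.

```python
VRRP_PROTO = 112  # IP protocol number assigned to VRRP


class StatefulFirewall:
    """Toy stateful firewall: inbound traffic is dropped unless tracked or allowed."""

    def __init__(self, local_addr, accept_vrrp_from=()):
        self.local_addr = local_addr
        # Peers covered by an explicit "accept VRRP" rule (empty before the fix).
        self.accept_vrrp_from = set(accept_vrrp_from)
        # Connection tracking table; empty right after a reboot.
        self.conntrack = set()

    def send(self, dst, proto):
        # Outbound traffic is allowed and creates a conntrack entry.
        self.conntrack.add((self.local_addr, dst, proto))

    def accepts(self, src, proto):
        # An inbound packet passes if it matches a flow we started ourselves...
        if (self.local_addr, src, proto) in self.conntrack:
            return True
        # ...or if an explicit rule accepts VRRP from that peer.
        return proto == VRRP_PROTO and src in self.accept_vrrp_from


# Before the fix: A has just rebooted, B is primary and sending adverts.
fw = StatefulFirewall("A")
print(fw.accepts("B", VRRP_PROTO))  # False: B's adverts are dropped
fw.send("B", VRRP_PROTO)            # A sends its own advert (takeover begins)
print(fw.accepts("B", VRRP_PROTO))  # True: only now are B's adverts visible

# After the fix: an explicit rule accepts VRRP from the peer.
fixed = StatefulFirewall("A", accept_vrrp_from=["B"])
print(fixed.accepts("B", VRRP_PROTO))  # True from the first advert
```

With the explicit rule, the rebooted node hears the peer's adverts before it ever sends one, so it has no reason to start a takeover.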

I just merged the change, but I had already run some tests, and the reboot-time flapping is gone.