I detected 2 issues with how the neutron l3 agent handles failover scenarios:
- it doesn't synchronize conntrack entries to the passive node. When a failover happens, all NAT'ed TCP connections are dropped and have to be re-established.
- the VRRP instrumentation by the internal keepalived is possibly not working due to a config mistake (it uses some weird vxlan interface that is not connected anywhere). This might leave failover detection to just the openstack neutron internal logic. When a failover happens, keepalived is unable to know the status of the other peer, resulting in the annoying state change ping-pong that we have been experiencing.
Combined, the two issues make the l3 agent less reliable than it should be.
I believe I can fix both of them, but unfortunately upstream openstack conntrackd adoption seems stalled, so we might need to deploy it ourselves.
Action items:
- investigate the weird keepalived VXLAN setup
- investigate ways to deploy conntrackd in a neutron-aware fashion to "manually" sync conntrack NAT states between the l3 agent peers
- set net.netfilter.nf_conntrack_tcp_be_liberal=1 in the neutron netns
- do more tests!
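
For the sysctl item, here is a rough sketch of how the setting could be applied by hand while testing. It assumes the relevant namespaces are the router ones that neutron names with a `qrouter-` prefix, and that `ip` and `sysctl` are available on the l3 agent host. The helper names (`router_namespaces`, `sysctl_command`, `apply_to_all_routers`) are made up for illustration and are not part of neutron.

```python
import subprocess

# The setting from the action item above.
SYSCTL = "net.netfilter.nf_conntrack_tcp_be_liberal=1"


def router_namespaces(ip_netns_output):
    """Return the qrouter-* namespace names found in `ip netns list` output."""
    names = []
    for line in ip_netns_output.splitlines():
        fields = line.split()
        # Lines look like "qrouter-<uuid>" or "qrouter-<uuid> (id: 3)".
        # Assumption: only the router namespaces need the setting.
        if fields and fields[0].startswith("qrouter-"):
            names.append(fields[0])
    return names


def sysctl_command(namespace):
    """Build the command that sets the sysctl inside one namespace."""
    return ["ip", "netns", "exec", namespace, "sysctl", "-w", SYSCTL]


def apply_to_all_routers(dry_run=True):
    """Print (dry_run=True) or run the sysctl command in every router netns."""
    listing = subprocess.run(
        ["ip", "netns", "list"],
        check=True, capture_output=True, text=True,
    ).stdout
    for namespace in router_namespaces(listing):
        command = sysctl_command(namespace)
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
```

Calling `apply_to_all_routers()` as root on an l3 agent node only prints the commands; `apply_to_all_routers(dry_run=False)` runs them. A value set this way is lost when the namespace is deleted and re-created, so this is only good for testing, not as the final deployment mechanism.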