cloud: neutron l3 agent: improve failover handling
Open, Medium, Public

Description

I detected two issues with how the neutron l3 agent handles failover scenarios:

  • it doesn't synchronize conntrack entries to the passive node. When a failover happens, all NAT'ed TCP connections are dropped and have to be re-established.
  • the VRRP instrumentation by the internal keepalived is possibly not working due to a config mistake (it uses some weird vxlan interface that is not connected anywhere). This might leave failover detection to just the openstack neutron internal logic. When a failover happens, keepalived is unable to know the status of the other peer, resulting in the annoying state change ping-pong that we have been experiencing.

Combined, these two issues make the l3 agent less reliable than it should be.
I believe I can fix both of them, but unfortunately upstream openstack adoption of conntrackd seems stalled, so we might need to deploy it ourselves.

Action items:

  • investigate the weird keepalived VXLAN setup
  • investigate ways to deploy conntrackd in a neutron-aware fashion to "manually" sync conntrack NAT states between the l3 agent peers
  • set net.netfilter.nf_conntrack_tcp_be_liberal=1 in neutron netns
  • do more tests!

Event Timeline

aborrero renamed this task from cloud: neutron l3 agent doesn't sync conntrack info, resulting in lost connections in failover scenarios to cloud: neutron l3 agent: improve failover handling. Nov 24 2020, 12:10 PM
aborrero triaged this task as Medium priority.
aborrero updated the task description.
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.
aborrero added subscribers: Bstorm, Andrew.

On the keepalived/VRRP/vxlan thing, I might be wrong after all:

root@cloudnet2002-dev:~# ip -d link show vxlan-1
12: vxlan-1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master brqd967e056-ef state UNKNOW
    link/ether c2:dc:1f:9b:83:13 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535 
    vxlan id 1 group 224.0.0.1 dev eno1 srcport 0 0 dstport 8472 ttl auto ageing 300 udpcsum noudp6zeroc
    bridge_slave state forwarding priority 32 cost 100 hairpin off guard off root_block off fastleave of 8000.e:6b:be:86:24:ae designated_root 8000.e:6b:be:86:24:ae hold_timer    0.00 message_age_timer    0.0_router 1 mcast_fast_leave off mcast_flood on neigh_suppress off group_fwd_mask 0 group_fwd_mask_str 0x0 65535 
root@cloudnet2002-dev:~# tcpdump -i eno1 udp port 8472
12:50:58.070025 IP cloudnet2003-dev.codfw.wmnet.40942 > all-systems.mcast.net.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 169.254.192.5 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20

The multicast packets are received by all hosts in the production vlan cloud-hosts1-b-codfw:

aborrero@cloudvirt2001-dev:~$ sudo tcpdump -i eno1 udp port 8472
12:52:30.075652 IP cloudnet2003-dev.codfw.wmnet.40942 > all-systems.mcast.net.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 169.254.192.5 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 50, authtype none, intvl 2s, length 20

So this seems to work. The only bit I would improve is to stop using the production vlan for this kind of UDP/multicast messages. Related docs: https://docs.openstack.org/neutron/rocky/admin/deploy-lb-ha-vrrp.html
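
To run this kind of check from a script rather than by eye, the inner VRRP line of the capture can be parsed. This is a minimal sketch, assuming tcpdump prints the advertisement in the same textual form as in the captures above. The function name and the regular expression are illustrative only and are not part of any tooling mentioned in this task.

```python
import re
from typing import Dict, Optional, Union

# Matches the inner VRRP line printed by tcpdump in the captures above, e.g.
# "IP 169.254.192.5 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 50, ..."
VRRP_ADVERT_RE = re.compile(
    r"IP (?P<source>\S+) > vrrp\.mcast\.net: "
    r"VRRPv(?P<version>\d+), Advertisement, "
    r"vrid (?P<vrid>\d+), prio (?P<prio>\d+), "
    r"authtype (?P<authtype>\w+), intvl (?P<intvl>\d+)s"
)


def parse_vrrp_advert(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Return the advertisement fields from one tcpdump line, or None."""
    match = VRRP_ADVERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "source": fields["source"],
        "version": int(fields["version"]),
        "vrid": int(fields["vrid"]),
        "prio": int(fields["prio"]),
        "authtype": fields["authtype"],
        "intvl_seconds": int(fields["intvl"]),
    }


if __name__ == "__main__":
    sample = (
        "IP 169.254.192.5 > vrrp.mcast.net: VRRPv2, Advertisement, "
        "vrid 1, prio 50, authtype none, intvl 2s, length 20"
    )
    print(parse_vrrp_advert(sample))
```

Seeing advertisements from the peer's address on each node is what suggests that keepalived can learn the other side's state.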

(I verified the same behavior in eqiad1)

Change 644556 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: add conntrackd for better neutron l3 agent failover

https://gerrit.wikimedia.org/r/644556

Mentioned in SAL (#wikimedia-cloud) [2020-12-02T12:41:48Z] <arturo> disable puppet in all cloudnet servers to merge conntrackd change T268335

Change 644556 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: add conntrackd for better neutron l3 agent failover

https://gerrit.wikimedia.org/r/644556

Change 644805 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: l3 agent: fix conntrackd hiera configuration

https://gerrit.wikimedia.org/r/644805

Change 644805 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: l3 agent: fix conntrackd hiera configuration

https://gerrit.wikimedia.org/r/644805

Change 644841 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: l3_agent: fix typo in hiera

https://gerrit.wikimedia.org/r/644841

Change 644841 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: l3_agent: fix typo in hiera

https://gerrit.wikimedia.org/r/644841

Mentioned in SAL (#wikimedia-cloud) [2020-12-02T15:33:00Z] <arturo> [codfw1dev] conntrackd is now up and running in cloudnet200x-dev nodes (T268335)

Mentioned in SAL (#wikimedia-cloud) [2020-12-02T15:36:25Z] <arturo> conntrackd is now up and running in cloudnet1003/1004 nodes (T268335)

Mentioned in SAL (#wikimedia-cloud) [2020-12-02T17:25:27Z] <arturo> [15:51] failovering neutron virtual router in eqiad1 (T268335)

Change 645107 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: l3 agent: refresh conntrackd configuration

https://gerrit.wikimedia.org/r/645107

Change 645107 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: l3 agent: refresh conntrackd configuration

https://gerrit.wikimedia.org/r/645107

Mentioned in SAL (#wikimedia-cloud) [2021-01-13T12:40:52Z] <arturo> try increasing systemd watchdog timeout for conntrackd in cloudnet1004 (T268335)

Change 655890 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] neutron: conntrackd: double systemd watchdog timeout

https://gerrit.wikimedia.org/r/655890

Change 655890 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] neutron: conntrackd: double systemd watchdog timeout

https://gerrit.wikimedia.org/r/655890

NOTE: at openstack upgrade time, when neutron-l3-agent restarts, its netns is deleted, dropping all the conntrack entries. If the timing is unlucky, the netns on both nodes are deleted at the same time. In that case it doesn't matter that conntrackd synced the entries: there is no netns left to hold them, and therefore all connections are lost.

HINT: activate conntrack_tcp_be_liberal in the neutron netns.
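
A minimal sketch of how the setting could be applied by hand, assuming the l3 agent's router namespaces follow neutron's usual `qrouter-<router id>` naming. The helper names are hypothetical, and the sketch defaults to a dry run that only prints the commands. The real deployment is handled by puppet, not by a script like this.

```python
import subprocess
from typing import List

SYSCTL_KEY = "net.netfilter.nf_conntrack_tcp_be_liberal"


def parse_router_namespaces(ip_netns_list_output: str) -> List[str]:
    """Extract qrouter-* namespace names from `ip netns list` output."""
    names = []
    for line in ip_netns_list_output.splitlines():
        fields = line.split()
        # Lines look like "qrouter-<id> (id 0)"; the name is the first field.
        if fields and fields[0].startswith("qrouter-"):
            names.append(fields[0])
    return names


def be_liberal_command(netns: str, value: int = 1) -> List[str]:
    """Build the command that sets the sysctl inside the given netns."""
    return ["ip", "netns", "exec", netns, "sysctl", "-w", f"{SYSCTL_KEY}={value}"]


def apply_be_liberal(dry_run: bool = True) -> None:
    """Set the sysctl in every router netns (needs root unless dry_run)."""
    listing = subprocess.run(
        ["ip", "netns", "list"], capture_output=True, text=True, check=True
    ).stdout
    for netns in parse_router_namespaces(listing):
        command = be_liberal_command(netns)
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    apply_be_liberal(dry_run=True)
```

With this setting, conntrack no longer marks out-of-window TCP packets as invalid, so a node that takes over mid-connection is more tolerant of flows whose window state it did not track itself.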

Change 664845 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: neutron l3: activate net.netfilter.nf_conntrack_tcp_be_liberal

https://gerrit.wikimedia.org/r/664845

Change 664845 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: neutron l3: activate net.netfilter.nf_conntrack_tcp_be_liberal

https://gerrit.wikimedia.org/r/664845

Change 675491 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: l3_agent: conntrackd: stop using systemd Watchdog

https://gerrit.wikimedia.org/r/675491

Change 675491 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: l3_agent: conntrackd: stop using systemd Watchdog

https://gerrit.wikimedia.org/r/675491

Change 675493 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: nrpe-neutron-conntrack: increase magic value to 90

https://gerrit.wikimedia.org/r/675493

Change 675493 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: nrpe-neutron-conntrack: increase magic value to 90

https://gerrit.wikimedia.org/r/675493

Change 675496 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: neutron: l3 agent: reduce net.netfilter.nf_conntrack_tcp_timeout_established

https://gerrit.wikimedia.org/r/675496

Change 675496 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: neutron: reduce net.netfilter.nf_conntrack_tcp_timeout_established

https://gerrit.wikimedia.org/r/675496

Change 675530 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: neutron: conntrackd: disable startup resync

https://gerrit.wikimedia.org/r/675530

Change 675530 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: neutron: conntrackd: disable startup resync

https://gerrit.wikimedia.org/r/675530