
Tune arp cache garbage collection for Neutron
Closed, Resolved (Public)

Description

On Sept 28 at 12:36:21 UTC, the Neutron router supporting all virtual machines in the Cloud VPS environment failed over from cloudnet1004 to cloudnet1003. The failover completed cleanly and caused no outage, but it did uncover a potential cause and a tuning opportunity.

Router instance:

+--------------------------------------+---------------------+--------+-------+-------------+------+---------+
| ID                                   | Name                | Status | State | Distributed | HA   | Project |
+--------------------------------------+---------------------+--------+-------+-------------+------+---------+
| d93771ba-2711-4f88-804a-8df6fd03978a | cloudinstances2b-gw | ACTIVE | UP    | False       | True | admin   |
+--------------------------------------+---------------------+--------+-------+-------------+------+---------+

Failover log messages:

Sep 28 12:36:21 cloudnet1004 Keepalived_vrrp[2617]: VRRP_Instance(VR_1) Received advert with higher priority 50, ours 50
Sep 28 12:36:21 cloudnet1004 Keepalived_vrrp[2617]: VRRP_Instance(VR_1) Entering BACKUP STATE
...
Sep 28 12:36:21 cloudnet1003 Keepalived_vrrp[2563]: VRRP_Instance(VR_1) Transition to MASTER STATE
Sep 28 12:36:23 cloudnet1003 Keepalived_vrrp[2563]: VRRP_Instance(VR_1) Entering MASTER STATE

The only events leading up to this were arp_cache table overflows. We hit the hard maximum ARP cache table size, which can cause hard-to-debug network connectivity problems:

Sep 28 12:36:16 cloudnet1004 kernel: [6991332.246983] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.466109] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.595604] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.621507] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.671883] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.700840] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.797729] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.824922] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.879720] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.907211] neighbour: arp_cache: neighbor table overflow!
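
To gauge how often a host has hit the limit, the overflow messages can be counted. This is a minimal sketch: `count_overflows` is a hypothetical helper name, and it assumes the kernel messages are piped in on standard input from wherever the host keeps them.

```shell
#!/bin/sh
# Hypothetical helper: count ARP cache overflow messages read from stdin.
count_overflows() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so normalise the exit status.
    grep -c 'arp_cache: neighbor table overflow' || true
}
```

For example, `dmesg | count_overflows` would count the messages still held in the kernel ring buffer.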

Under normal load we're really close to the hard maximum table size:

sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a arp -an | wc -l
926

The ARP table garbage collection thresholds are currently set to their default values:
minimum number of entries: net.ipv4.neigh.default.gc_thresh1 = 128
soft maximum number of entries: net.ipv4.neigh.default.gc_thresh2 = 512
hard maximum number of entries: net.ipv4.neigh.default.gc_thresh3 = 1024
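
The entry count can be compared against the hard maximum with a short script. This is a minimal sketch, assuming `ip netns exec`, `arp` and `sysctl` are available on the cloudnet host and that it runs as root; `arp_usage_pct` and `check_router_arp` are hypothetical helper names, not existing tooling.

```shell
#!/bin/sh
# Integer percentage of the hard limit in use (rounded down).
arp_usage_pct() {
    count=$1
    limit=$2
    echo $(( count * 100 / limit ))
}

# Hypothetical wrapper: count the ARP entries in one router namespace and
# compare them with the hard maximum (gc_thresh3) read on the host.
check_router_arp() {
    netns=$1
    count=$(ip netns exec "$netns" arp -an | wc -l)
    limit=$(sysctl -n net.ipv4.neigh.default.gc_thresh3)
    echo "$netns: $count of $limit entries ($(arp_usage_pct "$count" "$limit")%)"
}
```

With the figures in this task, `arp_usage_pct 926 1024` prints 90, meaning the table sits at about 90% of the hard limit under normal load.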

We currently have over 950 virtual machines, plus DHCP and router instances, on this network. These kernel parameters should be increased on the cloudnet servers hosting the Neutron routers to suit our environment.

Event Timeline

To accommodate our current environment and future growth, I'd like to increase these values on the cloudnet servers to:

net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
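
The three thresholds are meant to be ordered (minimum below soft maximum below hard maximum), so it can help to generate the settings with a small helper that checks the ordering first. This is a minimal sketch: `render_gc_thresholds` is a hypothetical helper, and it only prints the settings in sysctl `key = value` syntax without applying anything.

```shell
#!/bin/sh
# Hypothetical helper: print the three neighbour-table GC thresholds,
# refusing values that are not strictly increasing.
render_gc_thresholds() {
    min=$1
    soft=$2
    hard=$3
    if [ "$min" -ge "$soft" ] || [ "$soft" -ge "$hard" ]; then
        echo "expected gc_thresh1 < gc_thresh2 < gc_thresh3" >&2
        return 1
    fi
    printf 'net.ipv4.neigh.default.gc_thresh1 = %s\n' "$min"
    printf 'net.ipv4.neigh.default.gc_thresh2 = %s\n' "$soft"
    printf 'net.ipv4.neigh.default.gc_thresh3 = %s\n' "$hard"
}
```

Calling `render_gc_thresholds 1024 2048 4096` prints the proposed values. The output could be placed in a file under /etc/sysctl.d/ and loaded with `sysctl --system`; in this task the change was deployed through Puppet instead.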

Change 540216 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] openstack: increase apr cache table for cloudnet hosts

https://gerrit.wikimedia.org/r/540216

Change 540216 merged by Jhedden:
[operations/puppet@production] openstack: increase arp cache table for cloudnet hosts

https://gerrit.wikimedia.org/r/540216