On September 28 at 12:36:21 UTC the Neutron router supporting all virtual machines in the Cloud VPS environment failed over from cloudnet1004 to cloudnet1003. The failover completed properly and caused no outage, but it did uncover a potential cause of the failover and a tuning opportunity.
Router instance:
+--------------------------------------+---------------------+--------+-------+-------------+------+---------+
| ID                                   | Name                | Status | State | Distributed | HA   | Project |
+--------------------------------------+---------------------+--------+-------+-------------+------+---------+
| d93771ba-2711-4f88-804a-8df6fd03978a | cloudinstances2b-gw | ACTIVE | UP    | False       | True | admin   |
+--------------------------------------+---------------------+--------+-------+-------------+------+---------+
Failover log messages:
Sep 28 12:36:21 cloudnet1004 Keepalived_vrrp[2617]: VRRP_Instance(VR_1) Received advert with higher priority 50, ours 50
Sep 28 12:36:21 cloudnet1004 Keepalived_vrrp[2617]: VRRP_Instance(VR_1) Entering BACKUP STATE
...
Sep 28 12:36:21 cloudnet1003 Keepalived_vrrp[2563]: VRRP_Instance(VR_1) Transition to MASTER STATE
Sep 28 12:36:23 cloudnet1003 Keepalived_vrrp[2563]: VRRP_Instance(VR_1) Entering MASTER STATE
The only events leading up to the failover were arp_cache table overflows on cloudnet1004. We hit the hard maximum ARP cache table size, a condition that can cause hard-to-debug network connectivity problems:
Sep 28 12:36:16 cloudnet1004 kernel: [6991332.246983] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.466109] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.595604] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.621507] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.671883] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.700840] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.797729] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.824922] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.879720] neighbour: arp_cache: neighbor table overflow!
Sep 28 12:36:17 cloudnet1004 kernel: [6991332.907211] neighbour: arp_cache: neighbor table overflow!
Under normal load we're really close to the hard maximum table size:
sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a arp -an | wc -l
926
The ARP table garbage collection thresholds are currently set to their default values:
minimum number of entries: net.ipv4.neigh.default.gc_thresh1 = 128
soft maximum number of entries: net.ipv4.neigh.default.gc_thresh2 = 512
hard maximum number of entries: net.ipv4.neigh.default.gc_thresh3 = 1024
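To make the margin concrete, the entry count can be expressed as a percentage of the hard maximum. The following is a minimal sketch rather than an existing tool: neigh_usage_pct is a hypothetical helper name, and the commented-out lines only suggest how the two inputs might be collected on a cloudnet host.

```shell
#!/bin/sh
# Sketch only: neigh_usage_pct is a hypothetical helper, not an existing tool.
# Prints how full the neighbour table is as an integer percentage.
#   $1 = current number of entries, $2 = hard maximum (gc_thresh3)
neigh_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Figures from this report: 926 entries against the default hard maximum of 1024.
neigh_usage_pct 926 1024   # prints 90

# On a cloudnet host the inputs could be gathered live, for example:
#   entries=$(sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a arp -an | wc -l)
#   limit=$(sysctl -n net.ipv4.neigh.default.gc_thresh3)
#   neigh_usage_pct "$entries" "$limit"
```

With the numbers above the table is about 90% full under normal load, which leaves little room for a burst of new neighbours.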
Currently we have over 950 virtual machines, plus DHCP and router instances, on this network. These kernel parameters should be increased for our environment on the cloudnet servers hosting the Neutron routers.
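One way to stage such a change is to generate a sysctl fragment and review it before applying. This is a sketch under stated assumptions: gen_neigh_sysctl is a hypothetical helper, and the values passed to it (four times the defaults) are purely illustrative, since this report does not settle on target numbers.

```shell
#!/bin/sh
# Sketch only: gen_neigh_sysctl is a hypothetical helper and the numbers passed
# to it below are illustrative, not values taken from this report.
#   $1 = gc_thresh1, $2 = gc_thresh2, $3 = gc_thresh3
gen_neigh_sysctl() {
    printf 'net.ipv4.neigh.default.gc_thresh1 = %s\n' "$1"
    printf 'net.ipv4.neigh.default.gc_thresh2 = %s\n' "$2"
    printf 'net.ipv4.neigh.default.gc_thresh3 = %s\n' "$3"
}

# Example: four times the defaults (128/512/1024), chosen here as an assumption.
gen_neigh_sysctl 512 2048 4096
```

The output could be saved as a file under /etc/sysctl.d/ and loaded with sysctl --system, or a single value could be tried first with sysctl -w. Either way, the real thresholds should be sized against the expected number of instances on the network.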