Last week's reboot of hydrogen, one of the two recursive DNS servers (recdns) in eqiad, caused a number of issues.
Currently, pybal depools servers by removing them from the virtual service (ipvsadm -d). IPVS has known packet loss issues when removing servers from UDP virtual services.
We should update pybal to do the following in case of planned maintenance:
- set weight to zero
- schedule server removal after a certain amount of time (if still under maintenance)
Similarly, in case of service failure:
- set weight to zero
- if failure persists, remove server
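
A minimal Python sketch of the two flows above, not pybal's actual code. The `Server` class, the `step` function, the default weight of 10 and the `grace` parameter are all hypothetical names and values chosen for illustration. It only builds the `ipvsadm` command lines (`-e` to edit the weight, `-d` to delete, `-a` to add back) rather than running them, and it assumes a UDP virtual service using the default forwarding method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    """Hypothetical per-server state; not a pybal class."""
    service: str                        # UDP virtual service, "address:port"
    address: str                        # real server, "address:port"
    weight: int = 10                    # normal weight (assumed value)
    drained_at: Optional[float] = None  # time the weight was set to zero
    removed: bool = False               # True once deleted from the service


def edit_cmd(s: Server, weight: int) -> List[str]:
    # ipvsadm -e edits an existing real server; weight 0 stops new scheduling.
    return ["ipvsadm", "-e", "-u", s.service, "-r", s.address, "-w", str(weight)]


def delete_cmd(s: Server) -> List[str]:
    # ipvsadm -d removes the real server from the virtual service.
    return ["ipvsadm", "-d", "-u", s.service, "-r", s.address]


def add_cmd(s: Server) -> List[str]:
    # ipvsadm -a adds the real server back with its normal weight.
    return ["ipvsadm", "-a", "-u", s.service, "-r", s.address, "-w", str(s.weight)]


def step(s: Server, down: bool, now: float, grace: float) -> Optional[List[str]]:
    """Return the next ipvsadm command for this server, or None if nothing to do.

    `down` covers both cases: still under maintenance, or still failing checks.
    """
    if down:
        if s.removed:
            return None
        if s.drained_at is None:
            s.drained_at = now           # first step: set weight to zero
            return edit_cmd(s, 0)
        if now - s.drained_at >= grace:  # still down after the grace period
            s.removed = True
            return delete_cmd(s)
        return None
    # Server is healthy again: undo whichever step was taken.
    if s.removed:
        s.removed, s.drained_at = False, None
        return add_cmd(s)
    if s.drained_at is not None:
        s.drained_at = None
        return edit_cmd(s, s.weight)
    return None
```

A caller would invoke `step` on every monitoring tick and pass any returned list to something like `subprocess.run`. Using one function for both maintenance and failure keeps the "weight zero first, remove later" ordering in a single place; the two cases could still use different `grace` values.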
We should also consider enabling expire_nodest_conn. From ipvs-sysctl.txt:
expire_nodest_conn - BOOLEAN
    0 - disabled (default)
    not 0 - enabled

    The default value is 0, the load balancer will silently drop packets when its destination server is not available. It may be useful, when user-space monitoring program deletes the destination server (because of server overload or wrong detection) and add back the server later, and the connections to the server can continue.

    If this feature is enabled, the load balancer will expire the connection immediately when a packet arrives and its destination server is not available, then the client program will be notified that the connection is closed. This is equivalent to the feature some people requires to flush connections when its destination is not available.
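
For reference, a small Python sketch of enabling the setting at runtime. It assumes the sysctl is exposed at `/proc/sys/net/ipv4/vs/expire_nodest_conn` (the usual location once the `ip_vs` module is loaded) and that the caller is root. The function name is hypothetical, and a persistent change would more likely go through the host's sysctl configuration.

```python
from pathlib import Path

# Assumed location of the IPVS sysctl; present only when ip_vs is loaded.
SYSCTL_PATH = Path("/proc/sys/net/ipv4/vs/expire_nodest_conn")


def set_expire_nodest_conn(enabled: bool, path: Path = SYSCTL_PATH) -> str:
    """Write 1 or 0 to the sysctl file and return the value read back."""
    path.write_text("1\n" if enabled else "0\n")
    return path.read_text().strip()
```

The `path` parameter exists only so the function can be exercised against an ordinary file; on a load balancer it would be called as `set_expire_nodest_conn(True)`.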