Today we rebooted conf2001.codfw.wmnet into a new kernel. After the system came back online, we noticed that none of the codfw LVSs, which use conf2001 as their etcd backend, had an established TCP connection to it. As a result, hosts were silently not being pooled/depooled upon admin request.
We need to:
- make sure pybal attempts to reconnect to etcd in these situations
- implement an icinga check to alert us whenever a running pybal has 0 established TCP connections to its etcd(s)
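A rough sketch of what the icinga check could look like, as a starting point rather than a finished plugin. It parses the kernel's /proc/net/tcp table and returns CRITICAL when there are no ESTABLISHED connections to the etcd client port. The port number (2379), the function names and the message wording are assumptions for illustration. Note that this counts connections host-wide, so a real check would also need to confirm that pybal is running and that the connections belong to it.

```python
import sys

ETCD_PORT = 2379          # assumption: etcd client port used by pybal
TCP_ESTABLISHED = "01"    # state code for ESTABLISHED in /proc/net/tcp

# Standard Nagios/Icinga plugin exit codes
OK, CRITICAL = 0, 2


def count_established(proc_net_tcp_text, remote_port):
    """Count ESTABLISHED sockets whose remote port equals remote_port."""
    count = 0
    # The first line of /proc/net/tcp is a column header; skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        remote, state = fields[2], fields[3]
        # Remote address is HEXIP:HEXPORT; the port is plain hexadecimal.
        port = int(remote.rsplit(":", 1)[1], 16)
        if state == TCP_ESTABLISHED and port == remote_port:
            count += 1
    return count


def check(proc_net_tcp_text, remote_port=ETCD_PORT):
    """Return an (exit_code, message) pair in Nagios plugin style."""
    n = count_established(proc_net_tcp_text, remote_port)
    if n == 0:
        return CRITICAL, "CRITICAL: no established connections to etcd"
    return OK, "OK: %d established connection(s) to etcd" % n


def main(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Read the kernel TCP tables, print the status line, return the code."""
    text = "header\n"
    for path in paths:
        try:
            with open(path) as f:
                # Drop each file's own header; keep one placeholder header.
                text += "".join(f.readlines()[1:])
        except IOError:
            continue  # e.g. IPv6 disabled, so /proc/net/tcp6 is absent
    code, message = check(text)
    print(message)
    return code
```

A wrapper script would call `sys.exit(main())` so that icinga picks up the exit code. Keeping `check()` separate from the file reading makes the logic easy to exercise against canned /proc/net/tcp content.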