As shown in T113151, without setting a TCP keepalive on the sockets, idleconnection is almost useless as a monitoring mechanism for services like apache.
Nonetheless, since pybal 0.13 hit production, we've seen some pools show issues with single backends getting disconnected in swarms within the first few minutes after startup.
The error is the same one we get when an actual disconnection happens (e.g. when we manually stop apache on a backend).
I was able to reproduce the problem on our test systems and showed that:
- Without keepalive, no disconnection happens. In fact, it seems no failure on the backend is ever noticed (if the backend is e.g. apache).
- With keepalive, we can experience some disconnections early after startup, but the situation seems to stabilize in the long run. This only happens if the total number of idleconnectionmonitors is higher than ~100 (for the whole pybal process).
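
For reference, the keepalive setting compared in the two cases above can be sketched with the plain Python socket API. This is only an illustration, not pybal's code: the helper name `enable_keepalive` and the timer values are assumptions, and the three tuning options are platform-specific, so the sketch applies them only where the `socket` module exposes them. In Twisted, the equivalent switch on a TCP transport is `setTcpKeepAlive`.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=3):
    """Turn on TCP keepalive for sock; tune timers where the platform allows.

    Hypothetical helper with assumed timer values, for illustration only.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below are not available on every platform,
    # so each one is applied only if the socket module defines it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
# Some platforms report a nonzero value other than 1 here.
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```
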
So there seems to be a problem somewhere between the twisted socket library and the kernel. I still have no idea what is really causing these disconnects under the hood. A stopgap solution would be to make it possible for idleconnection to depool a server only after a new connection attempt fails.
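
A minimal sketch of that stopgap logic, assuming a simple per-backend state object. The class `BackendState` and its method names are hypothetical and are not part of pybal; the point is only that a lost idle connection triggers a reconnect, and the depool decision waits for the outcome of that reconnect.

```python
class BackendState:
    """Hypothetical per-backend monitor state (not pybal's real class)."""

    def __init__(self, name):
        self.name = name
        self.pooled = True
        self.reconnecting = False

    def connection_lost(self):
        # A dropped idle connection alone is not treated as a failure;
        # it only triggers a fresh connection attempt.
        self.reconnecting = True

    def reconnect_succeeded(self):
        # The backend accepted a new connection, so it stays pooled.
        self.reconnecting = False
        self.pooled = True

    def reconnect_failed(self):
        # Only a failed new connection attempt depools the backend.
        self.reconnecting = False
        self.pooled = False


# A backend whose idle connection dropped but which accepts a new one.
flaky = BackendState("flaky-backend")
flaky.connection_lost()
still_pooled_after_drop = flaky.pooled
flaky.reconnect_succeeded()

# A backend that is really down: the reconnect attempt fails.
dead = BackendState("dead-backend")
dead.connection_lost()
dead.reconnect_failed()
```

With this ordering, the spurious early disconnects described above would cost one extra connection attempt per backend instead of a depool, while a backend that is really down is still removed as soon as the reconnect fails.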