Page MenuHomePhabricator

Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved.
Closed, DeclinedPublic

Description

As shown in T113151, without setting a tcp keepalive on the sockets, idleconnection is almost useless as a mechanism for services like apache.

Nonetheless, since pybal 0.13 has hit production, we've seen that some pools started to show issues with single backends getting disconnected in swarms shortly after startup (or in the first few minutes after startup).

The error is the same we get when an effective disconnection happens (e.g. when we manually stop apache on a backend).

I was able to reproduce the problem on our test systems and shown that:

  1. Without keepalive, no disconnection happens. In fact, it seems no failure on the backend is ever noticed (if the backend is e.g. apache).
  2. With keepalive, we can experience some disconnections early after startup, but the situation seems to stabilize on the long run. This only happens if the total of idleconnectionmonitors is higher than ~ 100 (for the whole pybal process).

So there seems to be a problem somewhere in between the twisted socket library and the kernel here. I still have no idea what is really causing these disconnects under the hood. A stopgap solution would be to make it possible for idleconnection to depool a server only after a new connection attempt fails.

Event Timeline

Joe raised the priority of this task from to High.
Joe updated the task description. (Show Details)
Joe added projects: PyBal, SRE.
Joe subscribed.
Joe added subscribers: mark, BBlack.

From a full strace of pybal, I was able to see what happens when we get that warning in pybal: the socket closes as it receives an ECONNRESET error. Here is the relevant stack trace:

1448283239.082998 fcntl(320, F_GETFL)   = 0x2 (flags O_RDWR)
1448283239.083008 fcntl(320, F_SETFL, O_RDWR|O_NONBLOCK) = 0
1448283239.083024 fcntl(320, F_GETFD)   = 0
1448283239.083036 fcntl(320, F_SETFD, FD_CLOEXEC) = 0
1448283239.083054 fcntl(320, F_GETFL)   = 0x802 (flags O_RDWR|O_NONBLOCK)
1448283239.083063 fcntl(320, F_SETFL, O_RDWR|O_NONBLOCK) = 0
1448283239.292285 getsockopt(320, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
1448283239.292312 connect(320, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.192.16.64")}, 16) = -1 EINPROGRESS (Operation now in progress)
1448283239.316708 getsockopt(320, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
1448283239.316727 connect(320, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.192.16.64")}, 16) = 0
1448283239.316791 setsockopt(320, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
1448283239.316806 setsockopt(320, SOL_TCP, TCP_KEEPIDLE, [1], 4) = 0
1448283239.316819 setsockopt(320, SOL_TCP, TCP_KEEPCNT, [10], 4) = 0
1448283239.316832 setsockopt(320, SOL_TCP, TCP_KEEPINTVL, [1], 4) = 0
1448283303.291546 recvfrom(320, 0x2ceae94, 65536, 0, 0, 0) = -1 ECONNRESET (Connection reset by peer)
1448283303.291895 shutdown(320, SHUT_RDWR) = -1 ENOTCONN (Transport endpoint is not connected)
1448283303.291960 close(320)

So, it sounds like apache sends a RST after about 64 seconds? That's probably not normal behavior for a normal connection, but then again our IdleConn connections probably aren't "normal" to apache either, since they hang after sending zero bytes of output (no request header). Since IdleConn knows nothing about the application layer protocol, and many application protocol servers probably do something similar on lack of valid input after some timeout, I'd say there are basically two paths forward we can do either/or with:

  1. Set a configurable IdleConn timeout on the pybal side, and disconnect->reconnect when it expires without triggering an alert or failure (assuming success on the reconnect), then configure that timeout on a per-service level to be lower than the appserver's timeout.
  2. Have IdleConn not care about the "reason" for disconnect (e.g. RST) and in general assume remote connection close is OK, so long as immediate reconnect succeeds quickly.

With both of these, there are tricky issues around response-time lag on true host failure, as the whole point of IdleConn is to get a very fast indicator when things go south.

So, it sounds like apache sends a RST after about 64 seconds? That's probably not normal behavior for a normal connection, but then again our IdleConn connections probably aren't "normal" to apache either, since they hang after sending zero bytes of output (no request header). Since IdleConn knows nothing about the application layer protocol, and many application protocol servers probably do something similar on lack of valid input after some timeout, I'd say there are basically two paths forward we can do either/or with:

But if it were Apache doing that, presumably we wouldn't see a difference between keepalive off or on, as (likely) Apache is oblivious to whether keepalive is used.

We should probably match this to a pcap output to see if this is the result of a RST sent from the backend, or the local kernel deciding the other side is gone (keepalive).

@mark we know for a fact it's the keepalive intervening, as if we don't set the keepalive the idleconnection will remain alive;

Also, this is dependent on the size of the pool at the pybal level, it cannot depend on backend actions, or it would show up for smaller pools too.

@mark we know for a fact it's the keepalive intervening, as if we don't set the keepalive the idleconnection will remain alive;

Also, this is dependent on the size of the pool at the pybal level, it cannot depend on backend actions, or it would show up for smaller pools too.

Fair enough. But I guess we could look at the timing of those keepalive packets. :)

I've looked at this some more. Using a pcap, I found that the affected backend hosts were in fact sending a TCP RST after exactly 64s, indicating the teardown of the connection. Further investigation showed that after exactly 30s after the establishment of the connection, the backend host was also sending a retransmission of the SYN/ACK packet, as if it never received the initial ACK from the PyBal host in the first place.

With TCP keepalive disabled, this seemed to be happening too - except perhaps the kernel on the PyBal side was ignoring it.

I played with the TCP KeepAlive timing parameters some (increasing KEEPIDLE, KEEPINTVL) which seemed to directly lengthen the time it takes for the connection to get torn down, but it would still happen.

Looking on the side of the backend MediaWiki application server, I noticed that when IdleConnection would create a connection to Apache, it would stay in SYN_RECV instead of ESTABLISHED for a long time - as if it didn't receive the ACK indeed. However, tcpdump showed that the ACK was indeed being received, and confirmed that the backend was sending retransmits of the SYN/ACK as well - why?

A socat on port 80 instead of Apache didn't display that at all.

I figured the lack of any data on the connection might be the problem, and thought of FreeBSD's accept filters. It turns out that Apache on Linux uses TCP_DEFER_ACCEPT by default nowadays, which is a lighter version of it, and defers the accept() until there's data on the connection.

Putting a

AcceptFilter http none

in the Apache configuration indeed made the problem go away. It also caused IdleConnection to be notified of Apache terminating/dying again (T113151) - probably this didn't happen as the sockets for IdleConnection never got associated to the Apache processes in the first place, as they never send any data.

As we don't need AcceptFilter on our internal Apache servers behind PyBal, it seems best to just turn that off and make IdleConnection functional again. That also means we can probably set the TCP KeepAlive timers to be less aggressive and only use it for backup.

Change 256968 had a related patch set uploaded (by Ori.livneh):
Prevent Apache from setting TCP_DEFER_ACCEPT by default

https://gerrit.wikimedia.org/r/256968

Change 256968 merged by Ori.livneh:
Disable accept filters for HTTP on canary app servers

https://gerrit.wikimedia.org/r/256968

Change 257388 had a related patch set uploaded (by Ori.livneh):
Prevent Apache from setting TCP_DEFER_ACCEPT by default

https://gerrit.wikimedia.org/r/257388

Change 257388 merged by Ori.livneh:
Prevent Apache from setting TCP_DEFER_ACCEPT by default

https://gerrit.wikimedia.org/r/257388

akosiaris lowered the priority of this task from High to Medium.Jan 21 2016, 3:42 PM
akosiaris subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!

BCornwall subscribed.

Hello!

PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development of non-trivial changes for stability/predictability. While we would love to service this ticket we do not have the resources to devote to PyBal any more. To keep our work queue organized and prioritized, I will decline this ticket.

(For the unlikely case of PyBal's continued usage here's a common search string to search for re-opening these declined tickets: aiZ6ohm6)