
pybal fails to detect dead servers under production lb IPs for port 80
Closed, Resolved · Public

Description

We had an outage last night during which a server (cp1046) failed. pybal successfully detected it as dead and depooled it from mobile-lb on both IPv4 and IPv6, but only for port 443, not port 80. It remained pooled for port 80, which meant roughly a quarter of requests were essentially dropped, causing the Icinga LVS check to flap.

From a quick initial look, I see two issues here that need further investigation:

  • port 80 has only "IdleConnection" configured, not ProxyFetch. This is presumably because ProxyFetch fails when the response code is a 3xx, but it should be fixed regardless.
  • IdleConnection failed to detect the dead server. That sounds like a larger issue.

Event Timeline

faidon raised the priority of this task to Unbreak Now!.
faidon updated the task description. (Show Details)
faidon added projects: acl*sre-team, Traffic, PyBal.
faidon subscribed.
Joe set Security to None.

I was not able to reproduce this behaviour in a small test setup, but in the meantime I implemented support for 3xx responses in ProxyFetch, which should mitigate the practical effects of this error in production.
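
Conceptually, the change boils down to treating redirects as healthy responses rather than failures. A minimal sketch of that idea using Twisted's Agent (this is not the actual pybal patch; the function name and the 200-399 range are assumptions for illustration):

from twisted.internet import reactor
from twisted.web.client import Agent

def check_http(url):
    # Agent does not follow redirects by default, so a 3xx response
    # arrives with its original status code instead of being chased.
    agent = Agent(reactor)
    d = agent.request(b'GET', url)

    def inspect(response):
        # Treat any 2xx or 3xx status as a passing health check.
        if 200 <= response.code < 400:
            return response.code
        raise RuntimeError('unhealthy: HTTP %d' % response.code)

    return d.addCallback(inspect)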

I finally got a repeatable way to reproduce this behaviour:

  • set up a service that tests IdleConnection only against a backend running apache (see the hypothetical stanza below)
  • stop apache on the backend with 'service apache2 stop'
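
For the first step, a service stanza along the lines of the following would do. This is a hypothetical sketch patterned on pybal's example configuration (section name, addresses and scheduler are invented), with IdleConnection deliberately configured as the only monitor:

[test_http]
protocol = tcp
; invented service IP for the test setup
ip = 10.192.16.250
port = 80
scheduler = wrr
; deliberately no ProxyFetch
monitors = [ "IdleConnection" ]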

What you see on the pybal host is that it still thinks there is an established connection to the backend (.140):

tcp        0      0 10.192.16.139:43643     10.192.16.140:80        ESTABLISHED 6750/python

so it seems the connection never effectively failed from the Python process's perspective. To debug further, I looked at what happens with tcpdump.

Here is what happens when I start pybal, let it run undisturbed for two minutes, and then stop apache on the backend:

root@pybal-test2001:~# tcpdump port 80 and host pybal-test2002.codfw.wmnet  -vv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:32:16.902618 IP (tos 0x0, ttl 64, id 57185, offset 0, flags [DF], proto TCP (6), length 60)
    pybal-test2001.codfw.wmnet.43654 > pybal-test2002.codfw.wmnet.http: Flags [S], cksum 0x36c5 (incorrect -> 0x6a61), seq 2467863677, win 29200, options [mss 1460,sackOK,TS val 393036572 ecr 0,nop,wscale 9], length 0
15:32:16.902985 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    pybal-test2002.codfw.wmnet.http > pybal-test2001.codfw.wmnet.43654: Flags [S.], cksum 0x5d3f (correct), seq 3052934214, ack 2467863678, win 28960, options [mss 1460,sackOK,TS val 393030741 ecr 393036572,nop,wscale 9], length 0
15:32:16.903004 IP (tos 0x0, ttl 64, id 57186, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2001.codfw.wmnet.43654 > pybal-test2002.codfw.wmnet.http: Flags [.], cksum 0x36bd (incorrect -> 0xfcf3), seq 1, ack 1, win 58, options [nop,nop,TS val 393036572 ecr 393030741], length 0
15:32:47.902570 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    pybal-test2002.codfw.wmnet.http > pybal-test2001.codfw.wmnet.43654: Flags [S.], cksum 0x3ef9 (correct), seq 3052934214, ack 2467863678, win 28960, options [mss 1460,sackOK,TS val 393038491 ecr 393036572,nop,wscale 9], length 0
15:32:47.902616 IP (tos 0x0, ttl 64, id 57187, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2001.codfw.wmnet.43654 > pybal-test2002.codfw.wmnet.http: Flags [.], cksum 0x36bd (incorrect -> 0xdead), seq 1, ack 1, win 58, options [nop,nop,TS val 393044322 ecr 393030741], length 0



15:34:16.158808 IP (tos 0x0, ttl 64, id 24284, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2002.codfw.wmnet.http > pybal-test2001.codfw.wmnet.43654: Flags [F.], cksum 0x6a37 (correct), seq 1, ack 1, win 57, options [nop,nop,TS val 393060555 ecr 393044322], length 0
15:34:16.159311 IP (tos 0x0, ttl 64, id 57188, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2001.codfw.wmnet.43654 > pybal-test2002.codfw.wmnet.http: Flags [F.], cksum 0x36bd (incorrect -> 0x1405), seq 1, ack 2, win 58, options [nop,nop,TS val 393066386 ecr 393060555], length 0
15:34:16.159752 IP (tos 0x0, ttl 64, id 24285, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2002.codfw.wmnet.http > pybal-test2001.codfw.wmnet.43654: Flags [.], cksum 0x1406 (correct), seq 2, ack 2, win 57, options [nop,nop,TS val 393060555 ecr 393066386], length 0
15:34:16.161121 IP (tos 0x0, ttl 64, id 25448, offset 0, flags [DF], proto TCP (6), length 60)
    pybal-test2001.codfw.wmnet.43656 > pybal-test2002.codfw.wmnet.http: Flags [S], cksum 0x36c5 (incorrect -> 0xa3b3), seq 2899236091, win 29200, options [mss 1460,sackOK,TS val 393066387 ecr 0,nop,wscale 9], length 0
15:34:16.161670 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    pybal-test2002.codfw.wmnet.http > pybal-test2001.codfw.wmnet.43656: Flags [S.], cksum 0x8e78 (correct), seq 2121981798, ack 2899236092, win 28960, options [mss 1460,sackOK,TS val 393060555 ecr 393066387,nop,wscale 9], length 0
15:34:16.161705 IP (tos 0x0, ttl 64, id 25449, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2001.codfw.wmnet.43656 > pybal-test2002.codfw.wmnet.http: Flags [.], cksum 0x36bd (incorrect -> 0x2e2d), seq 1, ack 1, win 58, options [nop,nop,TS val 393066387 ecr 393060555], length 0

Change 244717 had a related patch set uploaded (by Ori.livneh):
IdleConnection: set keepalive

https://gerrit.wikimedia.org/r/244717

@Joe pointed out on IRC that the default tcp_keepalive_time is 300s, which is much longer than we'd like to take to recognize a dead connection. Setting it to a lower value is possible, but the setting is system-wide, so we'd have to think carefully about the ramifications.

Another possibility would be to have IdleConnection create a new connection every N seconds and then terminate the previous one. If we took that approach, we would continue to be able to instantly detect a connection which has been closed politely by the peer, and we would detect dead connections in N seconds or less.
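
A rough sketch of that second idea (invented names; pybal's real monitor interface is not shown): every N seconds open a fresh connection, promote it, and politely close the previous one, so a dead peer is noticed no later than the next connect attempt:

from twisted.internet import protocol, reactor, task
from twisted.internet.endpoints import TCP4ClientEndpoint

class CyclingIdleMonitor(object):
    """Keep one idle connection open, replacing it every `interval` seconds."""

    def __init__(self, host, port, interval=30):
        self.endpoint = TCP4ClientEndpoint(reactor, host, port)
        self.interval = interval
        self.current = None

    def start(self):
        task.LoopingCall(self._cycle).start(self.interval)

    def _cycle(self):
        factory = protocol.Factory.forProtocol(protocol.Protocol)
        self.endpoint.connect(factory).addCallbacks(self._replace, self._down)

    def _replace(self, proto):
        # Open the new connection first, then close the old one politely.
        if self.current is not None:
            self.current.transport.loseConnection()
        self.current = proto

    def _down(self, failure):
        # A refused or timed-out connect is how a dead server gets noticed.
        print('server down: %s' % failure.getErrorMessage())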

So even though the other end closed the connection properly, pybal doesn't find out until the keepalive timeout elapses? Simply because the connection is idle? Would occasionally attempting to read from the idle connection resolve this? It seems very counter-intuitive that a "closed" connection would linger until the timeout elapses, but several things about TCP are counter-intuitive, so I guess I shouldn't complain.

Change 244717 merged by jenkins-bot:
IdleConnection: set keepalive

https://gerrit.wikimedia.org/r/244717

As we worked out on the patch, the intervals are all configurable per-socket. This is now resolved with the new pybal package.
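
For reference, the per-socket knobs look roughly like this (a sketch with assumed timing values, not the literal change in 244717):

import socket

def enable_keepalive(transport, idle=10, interval=5, count=3):
    # Twisted's TCP transports expose setTcpKeepAlive(); the finer-grained
    # timing options below are Linux-specific and apply to this socket
    # only, leaving the system-wide tcp_keepalive_time untouched.
    transport.setTcpKeepAlive(True)
    sock = transport.getHandle()
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)      # idle seconds before the first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval) # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)      # unanswered probes before the connection is declared dead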

This was actually caused by AcceptFilter being enabled by default in Apache nowadays - this results in connections without data never being passed to Apache, and consequently also not being terminated when Apache dies. See T119372 for more details.
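
For illustration, Apache's AcceptFilter for HTTP maps to the TCP_DEFER_ACCEPT socket option, which a toy Linux-only server can reproduce (port and timeout invented):

import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Hold completed handshakes in the kernel for up to 30s, waiting for data.
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_DEFER_ACCEPT, 30)
srv.bind(('127.0.0.1', 8080))
srv.listen(5)

# A client that connects and sends nothing (exactly what IdleConnection
# does) shows ESTABLISHED on both ends, yet accept() never returns for
# it, mirroring the behaviour described above when the server dies.
conn, addr = srv.accept()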

I think we should disable AcceptFilter in our (internal) Apache configuration, so IdleConnection can do what it was intended for again. Having TCP keepalive enabled is useful as a backup as well.

Change 256968 had a related patch set uploaded (by Ori.livneh):
Prevent Apache from setting TCP_DEFER_ACCEPT by default

https://gerrit.wikimedia.org/r/256968

Change 256968 merged by Ori.livneh:
Disable accept filters for HTTP on canary app servers

https://gerrit.wikimedia.org/r/256968

Actually, I don't think *this* was caused by the Apache accept filter - we were seeing this on Varnish as well. Rather, it was the failures once keepalive was enabled that were caused by the AcceptFilter directive in Apache.

So testing this with a single apache in a pool, I can see that AcceptFilter http none doesn't change the behaviour of IdleConnection: it will still fail to detect a faulty link unless we also enable keepalive.

So we need both the keepalive and this fix for apache.

Change 257388 had a related patch set uploaded (by Ori.livneh):
Prevent Apache from setting TCP_DEFER_ACCEPT by default

https://gerrit.wikimedia.org/r/257388

Change 257388 merged by Ori.livneh:
Prevent Apache from setting TCP_DEFER_ACCEPT by default

https://gerrit.wikimedia.org/r/257388

> So testing this with a single apache in a pool, I can see that AcceptFilter http none doesn't change the behaviour of IdleConnection: it will still fail to detect a faulty link unless we also enable keepalive.
>
> So we need both the keepalive and this fix for apache.

@Joe: I don't think that's correct. I tested it as well, and it worked perfectly fine without keepalive. Why would keepalive be needed when a RST is sent?

I just tested it again, and reconfirmed that it does work without KeepAlive, when TCP_DEFER_ACCEPT is not used.

But Varnish seems to use TCP_DEFER_ACCEPT as well. We should think about what to do with that - it's at least somewhat user-facing.

One idea I'm toying with is to perhaps extend IdleConnection by having it send some (custom) data on connect - just enough to get past the accept() stage.
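
A hypothetical sketch of that idea as a Twisted protocol (not pybal's actual monitor code):

from twisted.internet import protocol

class EagerIdleConnection(protocol.Protocol):
    def connectionMade(self):
        # A single CRLF should be enough to get past a deferred accept();
        # HTTP servers are expected to ignore empty lines before a request.
        self.transport.write(b'\r\n')

    def connectionLost(self, reason):
        # This is where the monitor would mark the server as down.
        print('connection lost: %s' % reason.getErrorMessage())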

Of course, not using IdleConnection and relying on ProxyFetch for Varnish would work as well - just with a slower response to some failures.

This task has been "Unbreak Now!" priority since it was created and has seen no updates for nearly two months.
Is the current priority still realistic?

@Aklapper: the bug is solved in the code but needs to be deployed to production, which will happen very soon.