
pybal fails to detect dead servers under production lb IPs for port 80
Closed, Resolved · Public

Description

We had an outage last night during which a server (cp1046) failed. pybal successfully detected it as dead and depooled it from mobile-lb on both IPv4 and IPv6, but only for port 443, not port 80. It remained pooled for port 80, which meant roughly a quarter of requests were essentially dropped, causing the Icinga LVS check to flap.

From a quick initial look, I see two issues here that need further investigation:

  • port 80 has only "IdleConnection" configured, not ProxyFetch. This is presumably because ProxyFetch fails when the response code is a 3xx, but it should be fixed regardless.
  • IdleConnection failed to detect the dead server. That sounds like a larger issue.

Event Timeline

faidon raised the priority of this task to Unbreak Now!.
faidon updated the task description. (Show Details)
faidon added projects: acl*sre-team, Traffic, PyBal.
faidon subscribed.
Joe set Security to None.

I was not able to reproduce this behaviour in a small test setup, but in the meantime I implemented support for 3xx responses in ProxyFetch, which should mitigate the practical effects of this error in production.
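
Conceptually, the change boils down to treating redirects as healthy responses rather than failures. A minimal sketch of that idea using Twisted's Agent (this is not the actual pybal patch; the function name and the 200-399 range are assumptions for illustration):

from twisted.internet import reactor
from twisted.web.client import Agent

def check_http(url):
    # Agent does not follow redirects by default, so a 3xx response
    # arrives with its original status code instead of being chased.
    agent = Agent(reactor)
    d = agent.request(b'GET', url)

    def inspect(response):
        # Treat any 2xx or 3xx status as a passing health check.
        if 200 <= response.code < 400:
            return response.code
        raise RuntimeError('unhealthy: HTTP %d' % response.code)

    return d.addCallback(inspect)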

I finally got a repeatable way to reproduce this behaviour:

  • set up a service that tests IdleConnection only against a backend running apache (see the hypothetical stanza below)
  • stop apache on the backend with 'service apache2 stop'
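
For the first step, a service stanza along the lines of the following would do. This is a hypothetical sketch patterned on pybal's example configuration (section name, addresses and scheduler are invented), with IdleConnection deliberately configured as the only monitor:

[test_http]
protocol = tcp
; invented service IP for the test setup
ip = 10.192.16.250
port = 80
scheduler = wrr
; deliberately no ProxyFetch
monitors = [ "IdleConnection" ]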

What you see on the pybal host is that it still thinks there is an established connection to the backend (.140):

tcp        0      0 10.192.16.139:43643     10.192.16.140:80        ESTABLISHED 6750/python

so it seems the connection never effectively failed from the Python process's perspective. To debug further, I looked at what happens with tcpdump.

Here is what happens when I start pybal, let it run undisturbed for two minutes, and then stop apache on the backend:

root@pybal-test2001:~# tcpdump port 80 and host pybal-test2002.codfw.wmnet  -vv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:32:16.902618 IP (tos 0x0, ttl 64, id 57185, offset 0, flags [DF], proto TCP (6), length 60)
    pybal-test2001.codfw.wmnet.43654 > pybal-test2002.codfw.wmnet.http: Flags [S], cksum 0x36c5 (incorrect -> 0x6a61), seq 2467863677, win 29200, options [mss 1460,sackOK,TS val 393036572 ecr 0,nop,wscale 9], length 0
15:32:16.902985 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    pybal-test2002.codfw.wmnet.http > pybal-test2001.codfw.wmnet.43654: Flags [S.], cksum 0x5d3f (correct), seq 3052934214, ack 2467863678, win 28960, options [mss 1460,sackOK,TS val 393030741 ecr 393036572,nop,wscale 9], length 0
15:32:16.903004 IP (tos 0x0, ttl 64, id 57186, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2001.codfw.wmnet.43654 > pybal-test2002.codfw.wmnet.http: Flags [.], cksum 0x36bd (incorrect -> 0xfcf3), seq 1, ack 1, win 58, options [nop,nop,TS val 393036572 ecr 393030741], length 0
15:32:47.902570 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    pybal-test2002.codfw.wmnet.http > pybal-test2001.codfw.wmnet.43654: Flags [S.], cksum 0x3ef9 (correct), seq 3052934214, ack 2467863678, win 28960, options [mss 1460,sackOK,TS val 393038491 ecr 393036572,nop,wscale 9], length 0
15:32:47.902616 IP (tos 0x0, ttl 64, id 57187, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2001.codfw.wmnet.43654 > pybal-test2002.codfw.wmnet.http: Flags [.], cksum 0x36bd (incorrect -> 0xdead), seq 1, ack 1, win 58, options [nop,nop,TS val 393044322 ecr 393030741], length 0



15:34:16.158808 IP (tos 0x0, ttl 64, id 24284, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2002.codfw.wmnet.http > pybal-test2001.codfw.wmnet.43654: Flags [F.], cksum 0x6a37 (correct), seq 1, ack 1, win 57, options [nop,nop,TS val 393060555 ecr 393044322], length 0
15:34:16.159311 IP (tos 0x0, ttl 64, id 57188, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2001.codfw.wmnet.43654 > pybal-test2002.codfw.wmnet.http: Flags [F.], cksum 0x36bd (incorrect -> 0x1405), seq 1, ack 2, win 58, options [nop,nop,TS val 393066386 ecr 393060555], length 0
15:34:16.159752 IP (tos 0x0, ttl 64, id 24285, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2002.codfw.wmnet.http > pybal-test2001.codfw.wmnet.43654: Flags [.], cksum 0x1406 (correct), seq 2, ack 2, win 57, options [nop,nop,TS val 393060555 ecr 393066386], length 0
15:34:16.161121 IP (tos 0x0, ttl 64, id 25448, offset 0, flags [DF], proto TCP (6), length 60)
    pybal-test2001.codfw.wmnet.43656 > pybal-test2002.codfw.wmnet.http: Flags [S], cksum 0x36c5 (incorrect -> 0xa3b3), seq 2899236091, win 29200, options [mss 1460,sackOK,TS val 393066387 ecr 0,nop,wscale 9], length 0
15:34:16.161670 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    pybal-test2002.codfw.wmnet.http > pybal-test2001.codfw.wmnet.43656: Flags [S.], cksum 0x8e78 (correct), seq 2121981798, ack 2899236092, win 28960, options [mss 1460,sackOK,TS val 393060555 ecr 393066387,nop,wscale 9], length 0
15:34:16.161705 IP (tos 0x0, ttl 64, id 25449, offset 0, flags [DF], proto TCP (6), length 52)
    pybal-test2001.codfw.wmnet.43656 > pybal-test2002.codfw.wmnet.http: Flags [.], cksum 0x36bd (incorrect -> 0x2e2d), seq 1, ack 1, win 58, options [nop,nop,TS val 393066387 ecr 393060555], length 0

Change 244717 had a related patch set uploaded (by Ori.livneh):
IdleConnection: set keepalive

https://gerrit.wikimedia.org/r/244717

@Joe pointed out on IRC that the default tcp_keepalive_time is 300s, which is much longer than we'd like to take to recognize a dead connection. Setting it to a lower value is possible, but the setting is system-wide, so we'd have to think carefully about the ramifications.

Another possibility would be to have IdleConnection create a new connection every N seconds and then terminate the previous one. If we took that approach, we would continue to be able to instantly detect a connection which has been closed politely by the peer, and we would detect dead connections in N seconds or less.
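
A rough sketch of that second idea (invented names; pybal's real monitor interface is not shown): every N seconds open a fresh connection, promote it, and politely close the previous one, so a dead peer is noticed no later than the next connect attempt:

from twisted.internet import protocol, reactor, task
from twisted.internet.endpoints import TCP4ClientEndpoint

class CyclingIdleMonitor(object):
    """Keep one idle connection open, replacing it every `interval` seconds."""

    def __init__(self, host, port, interval=30):
        self.endpoint = TCP4ClientEndpoint(reactor, host, port)
        self.interval = interval
        self.current = None

    def start(self):
        task.LoopingCall(self._cycle).start(self.interval)

    def _cycle(self):
        factory = protocol.Factory.forProtocol(protocol.Protocol)
        self.endpoint.connect(factory).addCallbacks(self._replace, self._down)

    def _replace(self, proto):
        # Open the new connection first, then close the old one politely.
        if self.current is not None:
            self.current.transport.loseConnection()
        self.current = proto

    def _down(self, failure):
        # A refused or timed-out connect is how a dead server gets noticed.
        print('server down: %s' % failure.getErrorMessage())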

So even though the other end closed the connection properly, pybal doesn't find out until the keepalive timeout elapses? Simply because the connection is idle? Would occasionally attempting to read from the idle connection resolve this? It seems very counter-intuitive that a "closed" connection would linger until the timeout elapses, but several things about TCP are counter-intuitive, so I guess I shouldn't complain.

Change 244717 merged by jenkins-bot:
IdleConnection: set keepalive

https://gerrit.wikimedia.org/r/244717

As we worked out on the patch, the intervals are all configurable per-socket. This is now resolved with the new pybal package.
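
For reference, the per-socket knobs look roughly like this (a sketch with assumed timing values, not the literal change in 244717):

import socket

def enable_keepalive(transport, idle=10, interval=5, count=3):
    # Twisted's TCP transports expose setTcpKeepAlive(); the finer-grained
    # timing options below are Linux-specific and apply to this socket
    # only, leaving the system-wide tcp_keepalive_time untouched.
    transport.setTcpKeepAlive(True)
    sock = transport.getHandle()
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)      # idle seconds before the first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval) # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)      # unanswered probes before the connection is declared dead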

This was actually caused by AcceptFilter being enabled by default in Apache nowadays - this results in connections without data never being passed to Apache, and consequently also not being terminated when Apache dies. See T119372 for more details.
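
For illustration, Apache's AcceptFilter for HTTP maps to the TCP_DEFER_ACCEPT socket option, which a toy Linux-only server can reproduce (port and timeout invented):

import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Hold completed handshakes in the kernel for up to 30s, waiting for data.
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_DEFER_ACCEPT, 30)
srv.bind(('127.0.0.1', 8080))
srv.listen(5)

# A client that connects and sends nothing (exactly what IdleConnection
# does) shows ESTABLISHED on both ends, yet accept() never returns for
# it, mirroring the behaviour described above when the server dies.
conn, addr = srv.accept()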

I think we should disable AcceptFilter in our (internal) Apache configuration, so IdleConnection can do what it was intended for again. Having TCP keepalive enabled is useful as a backup as well.

Change 256968 had a related patch set uploaded (by Ori.livneh):
Prevent Apache from setting TCP_DEFER_ACCEPT by default

https://gerrit.wikimedia.org/r/256968

Change 256968 merged by Ori.livneh:
Disable accept filters for HTTP on canary app servers

https://gerrit.wikimedia.org/r/256968

Actually, I don't think *this* was caused by the Apache accept filter - we were seeing this on Varnish as well. Rather, it was the failures once keepalive was enabled that were caused by the AcceptFilter directive in Apache.

So testing this with a single apache in a pool, I can see that AcceptFilter http none doesn't change the behaviour of IdleConnection: it will still fail to detect a faulty link unless we also enable keepalive.

So we need both the keepalive and this fix for apache.

Change 257388 had a related patch set uploaded (by Ori.livneh):
Prevent Apache from setting TCP_DEFER_ACCEPT by default

https://gerrit.wikimedia.org/r/257388

Change 257388 merged by Ori.livneh:
Prevent Apache from setting TCP_DEFER_ACCEPT by default

https://gerrit.wikimedia.org/r/257388

> So testing this with a single apache in a pool, I can see that AcceptFilter http none doesn't change the behaviour of IdleConnection: it will still fail to detect a faulty link unless we also enable keepalive.
>
> So we need both the keepalive and this fix for apache.

@Joe: I don't think that's correct. I tested it as well, and it worked perfectly fine without keepalive. Why would keepalive be needed when a RST is sent?

I just tested it again, and reconfirmed that it does work without KeepAlive, when TCP_DEFER_ACCEPT is not used.

But Varnish seems to use TCP_DEFER_ACCEPT as well. We should think about what to do with that - it's at least somewhat user-facing.

One idea I'm toying with is to perhaps extend IdleConnection by having it send some (custom) data on connect - just enough to get past the accept() stage.
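
A hypothetical sketch of that idea as a Twisted protocol (not pybal's actual monitor code):

from twisted.internet import protocol

class EagerIdleConnection(protocol.Protocol):
    def connectionMade(self):
        # A single CRLF should be enough to get past a deferred accept();
        # HTTP servers are expected to ignore empty lines before a request.
        self.transport.write(b'\r\n')

    def connectionLost(self, reason):
        # This is where the monitor would mark the server as down.
        print('connection lost: %s' % reason.getErrorMessage())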

Of course, not using IdleConnection and relying on ProxyFetch for Varnish would work as well - just with a slower response to some failures.

This task has been "Unbreak Now!" priority since it was created and has seen no updates for nearly two months.
Is the current priority still realistic?

@Aklapper: the bug is solved in the code but needs to be deployed to production, which will happen very soon.