
HTTP/1.1 keepalive for local nginx->varnish conns
Closed, Declined · Public

Description

Our current nginx config uses a separate HTTP/1.0 TCP connection for each request from nginx->varnish. These happen over the loopback interface, so there's virtually zero latency impact from the handshakes, but it would be more efficient in general to use HTTP/1.1 keepalives and reduce the bloat of TIME_WAIT sockets on the servers.

Typical socket states for nginx->varnish local connections currently (cp1065's nginx has 32 worker processes, and each would have at most one connection established at any given moment, constantly breaking and re-making it):

root@cp1065:~# netstat -an|grep '10.64.0.102:80[        ]*10.64.0.102'|awk '{print $6}'|sort|uniq -c
     32 ESTABLISHED
      9 FIN_WAIT2
  63688 TIME_WAIT

With keepalive patch manually applied (each of the 32 worker processes spawns parallel keepalive connections as necessary, with a maximum of 4x idle keepalive connections per process before it starts pruning them - so we'd expect 128 ESTABLISHED minimum at all times, and far fewer TIME_WAIT):

root@cp1065:~# netstat -an|grep '10.64.0.102:80[        ]*10.64.0.102'|awk '{print $6}'|sort|uniq -c
    144 ESTABLISHED
      2 FIN_WAIT2
   2053 TIME_WAIT
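
For reference, the keepalive patch boils down to putting the local varnish-fe behind an nginx upstream{} block with an idle-connection pool, roughly like this (a minimal sketch of the behavior described above, not the literal puppet template; the upstream name varnish_fe is illustrative):

upstream varnish_fe {
    server 127.0.0.1:80;   # in production this is the host's own service IP, e.g. 10.64.0.102 on cp1065
    keepalive 4;           # up to 4 idle keepalive connections cached per worker process
}

server {
    # ... TLS listener config ...
    location / {
        proxy_pass http://varnish_fe;
        proxy_http_version 1.1;          # keepalive to the upstream requires HTTP/1.1
        proxy_set_header Connection "";  # clear the default "Connection: close"
    }
}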

Details

Repo                Branch       Lines +/-
operations/puppet   production   +2 -0
operations/puppet   production   +0 -11
operations/puppet   production   +0 -4
operations/puppet   production   +2 -2
operations/puppet   production   +5 -0
operations/puppet   production   +1 -1
operations/puppet   production   +9 -7
operations/puppet   production   +1 -1
operations/puppet   production   +1 -1
operations/puppet   production   +1 -1
operations/puppet   production   +0 -2
operations/puppet   production   +1 -0
operations/puppet   production   +7 -1
operations/puppet   production   +1 -1
operations/puppet   production   +0 -1
operations/puppet   production   +2 -1
operations/puppet   production   +1 -3
operations/puppet   production   +1 -1
operations/puppet   production   +0 -3
operations/puppet   production   +1 -0
operations/puppet   production   +23 -0
operations/puppet   production   +1 -1
operations/puppet   production   +2 -0
operations/puppet   production   +14 -0
operations/puppet   production   +10 -1

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 269708 had a related patch set uploaded (by BBlack):
tlsproxy: nginx keepalives param for testing

https://gerrit.wikimedia.org/r/269708

Change 269708 merged by BBlack:
tlsproxy: nginx keepalives param for testing

https://gerrit.wikimedia.org/r/269708

I'm live-testing this (at a setting of 1) on an eqiad text machine (cp1065) right now. The connection/thread stuff I'm watching is the output of:

root@cp1065# echo -n "inbound TLS estab: "; ss -tan state established|egrep '208.80.154.224:443'|wc -l; echo -n "nginx->varnish-fe estab: "; ss -tan state established|egrep '10.64.0.102:80\s+10.64.0.102:' |wc -l; echo -n "varnish-fe live threads: "; ps -efL|grep 'wikimedia_text-fronten[d]\.vcl'|wc -l; echo -n "nginx->varnish-fe time-wait: "; ss -tan state time-wait|egrep '10.64.0.102:80\s+10.64.0.102:' |wc -l; uptime

In cp1065's normal unaltered state, this shows numbers about like these (bouncing around a bit, other than varnish-fe live threads):

inbound TLS estab: 30453
nginx->varnish-fe estab: 42
varnish-fe live threads: 550
nginx->varnish-fe time-wait: 63868
 16:28:04 up 1 day, 22:45,  1 user,  load average: 4.53, 4.78, 4.84

After reloading with 1x keepalive setting and waiting for stats to settle in a little, it's more like:

inbound TLS estab: 31821
nginx->varnish-fe estab: 75
varnish-fe live threads: 550
nginx->varnish-fe time-wait: 1842
 16:33:35 up 1 day, 22:51,  1 user,  load average: 4.73, 4.55, 4.72

No spikes in 5xx data or funky log messages that I've observed so far. Ganglia cpu% looks very slightly reduced, but too small to be sure or care much.

The above (if it holds) basically confirms earlier thinking: that this is the "right" thing to do and helps, but we've got upload-cluster-specific issues that caused us to revert earlier (due to upload's longer transaction lengths and thus higher inherent parallelism, plus limits on inbound varnish-fe connections and the lack of limits on nginx keepalive connection parallelism). I think after a bit more live testing on cp1065, we should perhaps go ahead and roll this out for all clusters except 'upload', and then look at upload's problems a bit deeper...

Should also note: the initial test of this was with keepalive's idle conns parameter set to 4, whereas the new tests are at 1. That alone may improve the situation on upload, but we may still have to also muck with thread pool parameters there to get this working, and see if it's even feasible without nginx outbound keepalive upper limits.

BBlack changed the task status from Open to Stalled. Feb 11 2016, 6:28 PM

Concluding the test on cp1065 and returning it to its normal config. The stats were stable, and IMHO we should move on to deploying this to non-upload clusters. Needs to wait for other experiments to conclude (e.g. SPDY) so we get some isolation of any effects on perf graphs, though.

Change 271317 had a related patch set uploaded (by BBlack):
nginx keepalives: enable for maps misc

https://gerrit.wikimedia.org/r/271317

Change 271318 had a related patch set uploaded (by BBlack):
nginx keepalives: enable for text

https://gerrit.wikimedia.org/r/271318

Change 271317 merged by BBlack:
nginx keepalives: enable for maps misc

https://gerrit.wikimedia.org/r/271317

Change 271318 merged by BBlack:
nginx keepalives: enable for text

https://gerrit.wikimedia.org/r/271318

Change 271350 had a related patch set uploaded (by BBlack):
enable tcp_tw_reuse on caches

https://gerrit.wikimedia.org/r/271350

Change 271350 merged by BBlack:
enable tcp_tw_reuse on caches

https://gerrit.wikimedia.org/r/271350

Change 271429 had a related patch set uploaded (by BBlack):
nginx keepalives: enable for upload

https://gerrit.wikimedia.org/r/271429

I'm enabling this for upload now as well, as I've been testing one esams cache with live hacks for a while now and not seen any issues. Will try to keep an eye on 502s (which tend to show up in pybal if they're showing up for users).

The primary diffs vs when this didn't work back in August are:

  1. keepalive idle conns set to 1 instead of 4 (fewer baseline open connections to varnish; see the sketch after this list)
  2. tcp_tw_reuse set in https://gerrit.wikimedia.org/r/271350 (most likely what fixed it)
  3. Possibly ulimit diff in https://gerrit.wikimedia.org/r/#/c/271423/ (I highly doubt this, but it's still more-correct this way).
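
A minimal sketch of what changed relative to the August attempt, using the same hypothetical varnish_fe upstream as above (the kernel knob lives in sysctl rather than nginx, so it is shown only as a comment):

upstream varnish_fe {
    server 127.0.0.1:80;
    keepalive 1;   # was 4 in August: only one idle keepalive connection cached per worker process
}
# net.ipv4.tcp_tw_reuse=1 is set at the kernel level via https://gerrit.wikimedia.org/r/271350,
# letting new outbound connects reuse local TIME_WAIT sockets instead of piling them up.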

Change 271429 merged by BBlack:
nginx keepalives: enable for upload

https://gerrit.wikimedia.org/r/271429

Resolving for now, unless something new pops up.

See T127931 - this caused elevated (but still small) 503s on cache_text; reverted for now, and will revisit after further varnish/nginx software upgrades, etc.

Change 273012 had a related patch set uploaded (by BBlack):
disable keepalives everywhere

https://gerrit.wikimedia.org/r/273012

Change 273012 merged by BBlack:
disable keepalives everywhere

https://gerrit.wikimedia.org/r/273012

Change 273013 had a related patch set uploaded (by BBlack):
Revert "cache_misc: disable do_gzip"

https://gerrit.wikimedia.org/r/273013

Change 273013 merged by BBlack:
Revert "cache_misc: disable do_gzip"

https://gerrit.wikimedia.org/r/273013

I've been reviewing and re-testing a bunch of related things today. There are several intermixed issues, and I'm not even going to try to separate them all out into separate tasks and updates. These are the reasons I'm looking again:

  • T134870 (websockets for cache_misc) - needs at least proxy_http_version 1.1; from nginx, plus other things (a sketch of the websocket-related pieces follows this list). Switching to 1.1 was also part of the keepalive work for related reasons.
  • T134870 (websockets for cache_misc) - will probably also need (or at least want, but I think need) proxy_request_buffering off;
  • Disabling request buffering, I suspect, will not interact well with our HTTP/1.0 non-keepalive proxy conns, as lack of request buffering would slow each request flow down, and they're contending for limited slots (1/worker) to talk to varnish.
  • We've recently upgraded our nginx package by several versions, and one of the included bugfixes was related to the keepalive parameter we've tried to use here in the past.
  • I suspect lack of max_fails=0 on our singular upstream server line in the previous keepalive config attempts may have played a role (but I'm really not sure).
  • There may be some negative interaction (perf and/or functional) between buffered requests and HTTP/2, at least for some requests from some clients. It's possible this is interrelated with T134869.
  • T127294 - The most-critical reason we rolled back last time was strange behavior on cache_misc when combining nginx keepalive with do_stream + do_gzip on cache_misc. Since then, we turned on do_stream for cache_text as well, and discovered a nasty trail of bug stuff in T133866, which is also showing that varnish is buggy for do_stream + do_gzip and non-gzip clients regardless of nginx (although nginx may have exacerbated). We fixed that by killing varnish3 do_stream on both cache_misc and cache_text, which may well have incidentally resolved the nginx keepalive + cache_misc issues noted in T127294.
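
As a rough illustration of the websocket-related pieces above (a sketch only; the upstream name and the catch-all location are hypothetical, not the actual cache_misc template), the standard nginx upgrade-proxy pattern looks like:

# http-level map: forward "Connection: upgrade" only when the client asked to upgrade
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

# inside the existing tlsproxy server block:
location / {
    proxy_pass http://varnish_fe;
    proxy_http_version 1.1;                            # websockets need an HTTP/1.1 hop
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_request_buffering off;                       # stream request bodies through instead of buffering
}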

For all of these reasons, I've been experimenting with keepalive, request buffering, and all related things today. Mostly I experimented on cp1065 and cp1071 with live hacks while watching request logs, 5xx graphs/logs, and connection-state counting tools on the hosts themselves. These are the basic results and where things have been left off:

  1. After a long period of manual testing, I've merged and deployed to all: https://gerrit.wikimedia.org/r/287940 + https://gerrit.wikimedia.org/r/287995 . The net effect of these is very minimal in practice, but they switch our known-working, un-keepalive configuration to using the upstream module, and switch the protocol of the un-keepalive connections to varnish from HTTP/1.0 to HTTP/1.1. There should be little risk of fallout from this, but I'd like to leave all clusters in this state a while before going further, to confirm sanity.
  2. Testing keepalives manually on top of that, the results looked very positive. I don't think there's any dysfunction in turning on keepalives now, and we do get the expected increases in client concurrency. I played with tuning while I was looking at this, and reasonably-close-ish keepalive_workers values seem to be ~4 for text and ~8 for upload (a sketch of the combined config follows this list). There's more risk in this, though, so I'd like to get some isolation from (1) before we look for fallout on graphs from turning this on.
  3. Tested turning off request buffering while keepalives are enabled, and the results look positive: no big changes in connection counts under keepalive, and doesn't seem to spike up errors. Since this is a separate risk, it should go out after we're comfortable with the above.
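
Putting the items above together, the staged end state would look roughly like this (a sketch under the same hypothetical names as earlier; the keepalive count is the ballpark figure from item 2):

upstream varnish_fe {
    server 127.0.0.1:80 max_fails=0;   # single local varnish-fe; max_fails=0 per the earlier suspicion
    # item 2 adds:  keepalive 4;       # ~4 for text, ~8 for upload
}

location / {
    proxy_pass http://varnish_fe;
    proxy_http_version 1.1;
    proxy_set_header Connection close;  # item 2 changes this to:  proxy_set_header Connection "";
    # item 3 adds:  proxy_request_buffering off;
}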

And of course, there are still elevated random 503's on text, like before. Need to confirm whether it's unrelated and coincidental (unlikely), or which of the two changes (HTTP/1.1, upstream module) causes it.

Change 288089 had a related patch set uploaded (by BBlack):
Revert "tlsproxy: switch to (non-persistent) HTTP/1.1"

https://gerrit.wikimedia.org/r/288089

Change 288089 merged by BBlack:
Revert "tlsproxy: switch to (non-persistent) HTTP/1.1"

https://gerrit.wikimedia.org/r/288089

FWIW, the random 503s look like this on GET of plain article URLs (and other things, of course):

421 VCL_return   c hash
421 VCL_call     c miss fetch
421 Backend      c 96 appservers be_appservers
421 FetchError   c http format error
421 VCL_call     c error deliver
421 VCL_call     c deliver deliver
421 TxProtocol   c HTTP/1.1
421 TxStatus     c 503
421 TxResponse   c Service Unavailable
421 TxHeader     c Server: Varnish
421 TxHeader     c Content-Type: text/html; charset=utf-8
421 TxHeader     c Content-Length: 1624
421 TxHeader     c Accept-Ranges: bytes
421 TxHeader     c Date: Tue, 10 May 2016 21:09:20 GMT
421 TxHeader     c X-Varnish: 612169238
421 TxHeader     c Age: 0
421 TxHeader     c Via: 1.1 varnish
421 TxHeader     c Connection: close
421 TxHeader     c X-Cache: cp1065 miss(0)
421 Length       c 1624
421 ReqEnd       c 612169238 1462914560.392444134 1462914560.392889261 0.089637280 0.000409365 0.000035763

Looking into the Varnish3 code, it seems like "http format error" should only happen if it can't parse the basic response status line (e.g. "HTTP/1.1 200 OK").

Reverting just the HTTP/1.1 nginx patch makes the 503s go away (still using the upstream module). This needs more digging in more-isolated testing.

So, recapping the strange evidence above, because I've been thinking about this off and on all night and it still makes no sense:

  • The only change involved was adding two lines to the nginx proxy statements (sketched just below this list): proxy_http_version 1.1; and proxy_set_header Connection close;. The first should have almost no effect in our configuration except switching the request line to end in HTTP/1.1 instead of HTTP/1.0. The latter is supposedly the default anyway (regardless of proto version); I put it there to be explicit, and so that it's visually obvious what's going on in future patches that toggle other settings for the Connection header when we turn on other features.
  • Nothing else is being toggled here: we're not re-using connections with keepalives, and stats debugging on connection states/counts seems to confirm that.
  • When we turn on this protocol version change, most things still seem sane, but our global 503 error rate on the text caches rises from nearly-zero to roughly 5/sec (which is a small fraction of overall traffic, even if you only count cache misses). The requests that hit the 503 seemed fairly random at a glance in terms of UA, URL, etc.
  • The 503s are happening when the backend-most varnish contacts the appserver, and the error shown is NNN FetchError c http format error, which means varnish is failing to parse a legitimate HTTP response status line from the application server (HTTP/1.1 200 OK and such).
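
For concreteness, the entire change under test amounts to adding something like these two directives to each proxy block (a sketch of the directives named above, not the literal puppet template):

proxy_http_version 1.1;              # request line to varnish-fe now ends in HTTP/1.1 instead of HTTP/1.0
proxy_set_header Connection close;   # keep the connection explicitly non-persistent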

How does switching the protocol line from HTTP/1.0 to HTTP/1.1 when nginx makes a request to varnish-fe end up causing a 503 due to a malformed status line in the response from the applayer backend when the varnish-be process contacts the applayer? Keep in mind that by then, the nginx request has already been successfully parsed and amended by 2x layers of varnish and sent to the applayer, and now the applayer response seems junky. It makes no basic sense.

The only semi-bright ideas I have at present are these:

  • We can try leaving off the supposedly-default proxy_set_header Connection close;, in case the default behavior is more subtle than documented.
  • We can try looking into the code for and playing with the settings for nginx's proxy_ignore_client_abort in case that's related.
  • It may be that we're not going to get correct behavior until we do both the 1.1 change and the keepalives (no Connection: close) together. I don't think this is likely, but at some point it may be worth trying.

Trying a patch that will test the first two ideas above...

Change 288177 had a related patch set uploaded (by BBlack):
tlsproxy: protocol 1.1 ignore_client_abort

https://gerrit.wikimedia.org/r/288177

Change 288177 merged by BBlack:
tlsproxy: protocol 1.1 ignore_client_abort

https://gerrit.wikimedia.org/r/288177

Change 288179 had a related patch set uploaded (by BBlack):
tlsproxy: remove ignore_client_abort

https://gerrit.wikimedia.org/r/288179

Change 288179 merged by BBlack:
tlsproxy: remove ignore_client_abort

https://gerrit.wikimedia.org/r/288179

Change 288182 had a related patch set uploaded (by BBlack):
tlsproxy: remove protocol 1.1

https://gerrit.wikimedia.org/r/288182

Change 288182 merged by BBlack:
tlsproxy: remove protocol 1.1

https://gerrit.wikimedia.org/r/288182

I'm out of good ideas for now. I've tried both of the obvious alternatives (1.1 w/o explicit Connection setting, and 1.1 w/o explicit Connection setting but with ignore_client_abort), and the behavior is unchanged: we still get a low but notable rate of 503s as described earlier.

Change 288213 had a related patch set uploaded (by BBlack):
tlsproxy: use http/1.1 only in conjunction with varnish4

https://gerrit.wikimedia.org/r/288213

Change 288213 merged by BBlack:
tlsproxy: use http/1.1 only in conjunction with varnish4

https://gerrit.wikimedia.org/r/288213

Change 294301 had a related patch set uploaded (by BBlack):
cache_maps: turn on keepalives

https://gerrit.wikimedia.org/r/294301

Change 294301 merged by BBlack:
cache_maps: turn on keepalives

https://gerrit.wikimedia.org/r/294301

Change 294305 had a related patch set uploaded (by BBlack):
disable keepalives on maps misc

https://gerrit.wikimedia.org/r/294305

Change 294305 merged by BBlack:
disable keepalives on maps misc

https://gerrit.wikimedia.org/r/294305

nginx has added max_conns to the open source master branch in http://hg.nginx.org/nginx/rev/29bf0dbc0a77, which should appear in 1.11.5.

That wasn't our only keepalives blocker (the other is varnish4, and there are possibly others related to backend bugs in MW/apache/hhvm), but having it makes this easier to manage.
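
Once available, max_conns would give us a hard cap on nginx's outbound connection parallelism per upstream server, something like this (values and names illustrative only, assuming nginx >= 1.11.5):

upstream varnish_fe {
    server 127.0.0.1:80 max_conns=1000;   # cap concurrent connections from nginx to varnish-fe
    keepalive 4;
}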

However, because there will always be corner cases with keepalives for nginx->varnish where HTTP bugs can "leak" between disparate varnish backends, I'd still rather see us pursue unix domain sockets without keepalives as a superior alternative.
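
On the nginx side, the UDS alternative would look something like the following (a sketch only; it assumes a varnish frontend that could listen on a unix socket, which isn't something our current varnish builds support, and the socket path is hypothetical):

upstream varnish_fe {
    server unix:/run/varnish/frontend.sock;   # no TCP, so no TIME_WAIT / ephemeral-port pressure
}

location / {
    proxy_pass http://varnish_fe;
    proxy_http_version 1.1;
    proxy_set_header Connection close;        # still one request per connection, as today
}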

Either that or move towards using a purer inbound TLS proxy that operates at the PROXY or TCP level with some solution for client IP forwarding and SNI forwarding, allowing varnish to handle all HTTP protocol details. However, at this point even if we had such a solution we'd want to block that on varnish5 (which isn't that far off and could've just as easily been called varnish4.2) stabilizing/improving their nascent HTTP/2 support as well, so we don't regress there. The TLS terminator would need to support all of the important features we're using in nginx as well (OCSP Stapling, Dual ECC/RSA certs, etc).

Change 317405 had a related patch set uploaded (by BBlack):
tlsproxy: use 8x FE ports to balance

https://gerrit.wikimedia.org/r/317405

Change 317404 had a related patch set uploaded (by BBlack):
cache frontends: 8x local ports 3120-3127

https://gerrit.wikimedia.org/r/317404

Change 317414 had a related patch set uploaded (by BBlack):
tlsproxy: raise worker connection limits, too

https://gerrit.wikimedia.org/r/317414

Change 317499 had a related patch set uploaded (by BBlack):
reduce cache local ports slightly

https://gerrit.wikimedia.org/r/317499

Change 317404 merged by BBlack:
cache frontends: 8x local ports 3120-3127

https://gerrit.wikimedia.org/r/317404

Mentioned in SAL (#wikimedia-operations) [2016-10-24T14:36:32Z] <bblack> disabling puppet on all caches ahead of port# work, to test - T107749 / https://gerrit.wikimedia.org/r/#/c/317405

Change 317499 merged by BBlack:
reduce cache local ports slightly

https://gerrit.wikimedia.org/r/317499

Change 317414 merged by BBlack:
tlsproxy: raise worker connection limits, too

https://gerrit.wikimedia.org/r/317414

Change 317405 merged by BBlack:
tlsproxy: use 8x FE ports to balance

https://gerrit.wikimedia.org/r/317405

Change 317514 had a related patch set uploaded (by BBlack):
bugfix for e23f699fe2

https://gerrit.wikimedia.org/r/317514

Mentioned in SAL (#wikimedia-operations) [2016-10-24T15:04:16Z] <bblack> enabling/running puppet on caches for 8x varnish ports changes - T107749

Change 317528 had a related patch set uploaded (by BBlack):
tlsproxy: http2: limit concurrency to 64 (was 128)

https://gerrit.wikimedia.org/r/317528

Change 317528 merged by BBlack:
tlsproxy: http2: limit concurrency to 64 (was 128)

https://gerrit.wikimedia.org/r/317528

Change 317543 had a related patch set uploaded (by BBlack):
http2 concurrency: 32

https://gerrit.wikimedia.org/r/317543

Change 317544 had a related patch set uploaded (by BBlack):
http2 concurrency: back to defaults

https://gerrit.wikimedia.org/r/317544

Change 317544 merged by BBlack:
http2 concurrency: back to defaults

https://gerrit.wikimedia.org/r/317544

Should have resolved/rejected this back in T107749#2662491 - at this point it's just a collector of semi-related commits, but I don't think we plan to actually turn on keepalives at this layer, with this software stack. It's too risky.

Change 323865 had a related patch set uploaded (by BBlack):
tlsproxy: remove unused keepalives complexity

https://gerrit.wikimedia.org/r/323865

Change 323866 had a related patch set uploaded (by BBlack):
tlsproxy: be explicit about Conn:close

https://gerrit.wikimedia.org/r/323866

Change 323865 merged by BBlack:
tlsproxy: remove unused keepalives complexity

https://gerrit.wikimedia.org/r/323865

Change 323866 merged by BBlack:
tlsproxy: be explicit about Conn:close

https://gerrit.wikimedia.org/r/323866