Right now ats-tls doesn't try to reuse connections between itself and varnish-fe, resulting in a large number of connections between these two layers. It has been like this for a very long time and we are no longer sure of the side effects of changing it.
|Open||None||T243634 ulsfo varnish-fe vcache processes overflow on FDs|
|Resolved||Vgutierrez||T244464 Investigate side-effects of enabling KA between ats-tls and varnish-fe|
ATS does not properly handle the connect and TTFB timeouts when KA is enabled and parent proxies are being used (ats-tls). This has been reported upstream as https://github.com/apache/trafficserver/issues/6415
After some investigation, it looks like upstream PR 5811 could fix the issue. I've backported it as part of 8.0.5-1wm16: https://gerrit.wikimedia.org/r/c/operations/debs/trafficserver/+/571869
The issue with timeouts and KeepAlive can be easily understood with a small environment using curl + ATS + httpbin.
- curl requests /delay/20, the request returns successfully after 20 seconds
- curl requests (again) /delay/20, the request returns successfully after 23 seconds
- httpbin sees the following requests:
time="2020-02-12T16:37:41.5787" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.73
time="2020-02-12T16:37:52.2979" status=200 method="GET" uri="/delay/20" size_bytes=0 duration_ms=3717.91
time="2020-02-12T16:38:12.2721" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.50
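What's happening? On the second curl request, ATS triggers a timeout after ~3 seconds (the connect timeout in the test scenario) and retries the request. That is why httpbin logs three requests for two curl invocations: the aborted attempt (size_bytes=0, ~3.7 seconds) plus the 20-second retry add up to the ~23 seconds seen by curl. In our production environment this means that ats-be has been unnecessarily retrying every request that takes longer than 10 seconds (the connect timeout for ats-be). This has been happening since January 14th.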
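As a rough cross-check of that reading, the three httpbin log entries can be parsed and summed. This is only a sketch: the key=value parsing is an assumption about the log format based on the three lines quoted above, and treating `size_bytes=0` as "attempt abandoned by ATS" is an interpretation, not something httpbin reports explicitly.

```python
import re

# The three httpbin log entries quoted above, one per line.
LOG = '''\
time="2020-02-12T16:37:41.5787" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.73
time="2020-02-12T16:37:52.2979" status=200 method="GET" uri="/delay/20" size_bytes=0 duration_ms=3717.91
time="2020-02-12T16:38:12.2721" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.50
'''

# key=value pairs, where the value is either double-quoted or a bare token.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse(line):
    """Turn one log line into a dict of field name -> string value."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


entries = [parse(line) for line in LOG.splitlines() if line.strip()]

# Assumption: an entry with no response body is an attempt that was
# abandoned (timed out on the ATS side) before httpbin could answer.
aborted = [e for e in entries if int(e["size_bytes"]) == 0]

# The second curl request is served by the aborted attempt plus its retry,
# assuming the two attempts ran back to back.
second_curl_ms = float(entries[1]["duration_ms"]) + float(entries[2]["duration_ms"])

print(f"origin requests: {len(entries)}, aborted: {len(aborted)}")
print(f"second curl request, approx. total: {second_curl_ms / 1000:.1f} s")
```

The sum comes out at roughly 23.7 seconds, in line with the ~23 seconds curl reported for the second request.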
The issue described above should now be fixed almost everywhere:
===== NODE GROUP =====
(76) cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1075-1090].eqiad.wmnet,cp[5001-5012].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4021-4025,4027-4031].ulsfo.wmnet
----- OUTPUT of 'apt-cache policy...r|grep Installed' -----
Installed: 8.0.5-1wm16
===== NODE GROUP =====
Only 2 nodes in ulsfo, which are running 8.0.6-rc0, don't have the patch applied.
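The (76) count in the node-group output above can be sanity-checked by expanding the bracketed host ranges. The sketch below handles only the simple `prefix[a-b,c-d].suffix` form that appears in that output; it is not meant as an implementation of any tool's full host-selection grammar.

```python
import re
from collections import Counter

# Host expression copied from the node-group output above.
NODES = (
    "cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,"
    "cp[1075-1090].eqiad.wmnet,cp[5001-5012].eqsin.wmnet,"
    "cp[3050-3065].esams.wmnet,cp[4021-4025,4027-4031].ulsfo.wmnet"
)

# prefix, bracketed list of numbers or ranges, then the domain suffix.
GROUP = re.compile(r"(\w+)\[([\d,\-]+)\]([\w.]+)")


def expand(expr):
    """Expand 'cp[1-3,5].site.wmnet' style groups into individual hostnames."""
    hosts = []
    for prefix, ranges, suffix in GROUP.findall(expr):
        for part in ranges.split(","):
            low, _, high = part.partition("-")
            high = high or low  # a single number is a range of one
            hosts.extend(
                f"{prefix}{n}{suffix}" for n in range(int(low), int(high) + 1)
            )
    return hosts


hosts = expand(NODES)

# Group by site, i.e. the first label after the hostname.
per_site = Counter(host.split(".")[1] for host in hosts)

print(len(hosts))
print(dict(per_site))
```

The expansion yields 76 hosts, matching the count reported in the output, with 10 of them in ulsfo.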