Right now ats-tls doesn't try to reuse connections between itself and varnish-fe, which results in a large number of connections between these two layers. It has been like this for a very long time, and we are no longer sure of the side effects.
ATS does not properly handle the connect and TTFB timeouts when KA (keep-alive) is enabled and parent proxies are being used (ats-tls). This has been reported upstream as https://github.com/apache/trafficserver/issues/6415
After some investigation, it looks like upstream PR 5811 could fix the issue. I've backported it as part of 8.0.5-1wm16: https://gerrit.wikimedia.org/r/c/operations/debs/trafficserver/+/571869
The issue with timeouts and KeepAlive can be easily understood with a small environment using curl + ATS + httpbin:
- curl requests /delay/20, the request returns successfully after 20 seconds
- curl requests (again) /delay/20, the request returns successfully after 23 seconds
- httpbin sees the following requests:
time="2020-02-12T16:37:41.5787" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.73
time="2020-02-12T16:37:52.2979" status=200 method="GET" uri="/delay/20" size_bytes=0 duration_ms=3717.91
time="2020-02-12T16:38:12.2721" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.50
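What's happening? On the second curl request, a timeout is triggered on ATS after ~3 seconds (the connect timeout in the test scenario) and ATS retries the request. That is why httpbin sees three requests for two curl invocations: the second one (size_bytes=0, ~3.7 seconds) is the attempt that timed out, and the third is the retry. In our production environment this means that ats-be has been unnecessarily retrying every request that takes longer than 10 seconds (the connect timeout for ats-be). This has been happening since January 14th.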
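As a rough illustration of how the retry shows up in the httpbin log, here is a small Python sketch that parses the three log lines above and pairs the aborted attempt with its retry. Treating size_bytes=0 as "aborted by the timeout" is an assumption made only for this test scenario, and the helper names (parse, find_retries) are hypothetical, not part of ATS or httpbin.

```python
import re

# The three httpbin log lines from the test scenario above.
LOG = """\
time="2020-02-12T16:37:41.5787" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.73
time="2020-02-12T16:37:52.2979" status=200 method="GET" uri="/delay/20" size_bytes=0 duration_ms=3717.91
time="2020-02-12T16:38:12.2721" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.50
"""

# key=value pairs, where the value is either quoted or a bare token.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse(line):
    """Turn one log line into a dict of string fields (quotes stripped)."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


def find_retries(entries):
    """Pair each aborted request with the next request for the same URI.

    Assumption: a response with size_bytes == 0 is an attempt that was
    cut short by the timeout; the following request is its retry.
    """
    retries = []
    for prev, cur in zip(entries, entries[1:]):
        if int(prev["size_bytes"]) == 0 and prev["uri"] == cur["uri"]:
            total_ms = float(prev["duration_ms"]) + float(cur["duration_ms"])
            retries.append((prev, cur, total_ms))
    return retries


entries = [parse(line) for line in LOG.splitlines() if line.strip()]
retries = find_retries(entries)

for aborted, retried, total_ms in retries:
    print(
        f'{aborted["uri"]}: aborted after {aborted["duration_ms"]} ms, '
        f'retry took {retried["duration_ms"]} ms, '
        f'total {total_ms / 1000:.1f} s'
    )
```

The combined duration of the aborted attempt and its retry is about 23.7 seconds, which is in line with the ~23 seconds curl observed for the second request.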
The issue described above should now be fixed almost everywhere:
===== NODE GROUP =====
(76) cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1075-1090].eqiad.wmnet,cp[5001-5012].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4021-4025,4027-4031].ulsfo.wmnet
----- OUTPUT of 'apt-cache policy...r|grep Installed' -----
Installed: 8.0.5-1wm16
===== NODE GROUP =====
Only 2 nodes in ulsfo, which are running 8.0.6-rc0, don't have the patch applied.