
Investigate side-effects of enabling KA between ats-tls and varnish-fe
Open, Medium, Public

Description

Right now ats-tls doesn't try to reuse connections between itself and varnish-fe, which results in a large number of connections between these two layers. It has been this way for a very long time, and we are no longer sure what the side effects of enabling keepalive (KA) would be.

Details

Related Gerrit Patches:
[operations/puppet@production] cache: revert Connection:KA probe experiment on cp4026
[operations/puppet@production] cache: use Connection:KA for varnish-ats checks
[operations/debs/trafficserver@8.0.6] Release 8.0.5-1wm16
[operations/puppet@production] ATS: Extend KA experiment between ats-tls and varnish-fe to all ulsfo
[operations/debs/trafficserver@master] Release 8.0.5-1wm16
[operations/puppet@production] ATS: Test KA between ats-tls and varnish-fe on cp4031 + ATS 8.0.5-1wm16
[operations/puppet@production] Revert "ATS: Test KA on cp4031 whilst parent proxies are disabled"
[operations/puppet@production] ATS: Test KA on cp4031 whilst parent proxies are disabled
[operations/puppet@production] Revert "Revert "ATS: Disable KA on cp4031""
[operations/puppet@production] Revert "ATS: Disable KA on cp4031"
[operations/puppet@production] ATS: Don't assume that http_settings is mandatory on ats-tls profile
[operations/puppet@production] ATS: Disable KA on cp4031
[operations/puppet@production] varnish: Sync idle timeout with ats-tls on cp4031
[operations/puppet@production] ATS: Avoid hardcoding Connection: Close on ats-tls when keepalive is ON
[operations/puppet@production] Revert "ATS: Enable KeepAlive for the whole caching cluster in ulsfo"
[operations/puppet@production] ATS: Log ConnReuse on ats-tls
[operations/puppet@production] ATS: Enable KeepAlive for the whole caching cluster in ulsfo
[operations/puppet@production] ATS: Allow server session sharing by ip on ats-tls in cp4031
[operations/puppet@production] ATS: Allow configuring via hiera server session sharing settings
[operations/puppet@production] ATS: Enable KA between ats-tls and varnish-fe on cp4031
[operations/puppet@production] ATS: Allow configuring via hiera KA against origin servers

Event Timeline

Change 570594 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Allow configuring via hiera KA against origin servers

https://gerrit.wikimedia.org/r/570594

Vgutierrez triaged this task as Medium priority. (Thu, Feb 6, 9:58 AM)
Vgutierrez moved this task from Triage to Caching on the Traffic board.

Change 570594 merged by Vgutierrez:
[operations/puppet@production] ATS: Allow configuring via hiera KA against origin servers

https://gerrit.wikimedia.org/r/570594

Change 570599 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable KA between ats-tls and varnish-fe on cp4031

https://gerrit.wikimedia.org/r/570599

Change 570599 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable KA between ats-tls and varnish-fe on cp4031

https://gerrit.wikimedia.org/r/570599

Mentioned in SAL (#wikimedia-operations) [2020-02-06T10:19:20Z] <vgutierrez> Enabling HTTP keepalive between ats-tls and varnish-frontend on cp4031 - T244464

Change 570622 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Allow configuring via hiera server session sharing settings

https://gerrit.wikimedia.org/r/570622

ayounsi removed a subscriber: ayounsi. (Thu, Feb 6, 12:46 PM)

Change 570622 merged by Vgutierrez:
[operations/puppet@production] ATS: Allow configuring via hiera server session sharing settings

https://gerrit.wikimedia.org/r/570622

Change 570633 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Allow server session sharing by ip on ats-tls in cp4031

https://gerrit.wikimedia.org/r/570633

Change 570633 merged by Vgutierrez:
[operations/puppet@production] ATS: Allow server session sharing by ip on ats-tls in cp4031

https://gerrit.wikimedia.org/r/570633

Mentioned in SAL (#wikimedia-operations) [2020-02-06T13:22:32Z] <vgutierrez> Enable server session sharing on ats-tls in cp4031 - T244464

Change 571291 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable KeepAlive for the whole caching cluster in ulsfo

https://gerrit.wikimedia.org/r/571291

Change 571294 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Log ConnReuse on ats-tls

https://gerrit.wikimedia.org/r/571294

Change 571291 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable KeepAlive for the whole caching cluster in ulsfo

https://gerrit.wikimedia.org/r/571291

Change 571294 merged by Vgutierrez:
[operations/puppet@production] ATS: Log ConnReuse on ats-tls

https://gerrit.wikimedia.org/r/571294

Mentioned in SAL (#wikimedia-operations) [2020-02-10T14:52:13Z] <vgutierrez> rolling restart of ats-tls in ulsfo - T244464

Change 571311 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Avoid hardcoding Connection: Close on ats-tls when keepalive is ON

https://gerrit.wikimedia.org/r/571311

Change 571459 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "ATS: Enable KeepAlive for the whole caching cluster in ulsfo"

https://gerrit.wikimedia.org/r/571459

Change 571459 merged by Vgutierrez:
[operations/puppet@production] Revert "ATS: Enable KeepAlive for the whole caching cluster in ulsfo"

https://gerrit.wikimedia.org/r/571459

Mentioned in SAL (#wikimedia-operations) [2020-02-11T10:07:54Z] <vgutierrez> rolling restart of ats-tls in ulsfo - T244464

Change 571472 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: use Connection:KA for varnish-ats checks

https://gerrit.wikimedia.org/r/571472

Change 571311 merged by Vgutierrez:
[operations/puppet@production] ATS: Avoid hardcoding Connection: Close on ats-tls when keepalive is ON

https://gerrit.wikimedia.org/r/571311

Mentioned in SAL (#wikimedia-operations) [2020-02-11T11:20:20Z] <vgutierrez> ats-tls effectively reusing connections between ats-tls and varnish-fe on cp4031 - T244464

Change 571483 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] varnish: Sync idle timeout with ats-tls on cp4031

https://gerrit.wikimedia.org/r/571483

Change 571483 merged by Vgutierrez:
[operations/puppet@production] varnish: Sync idle timeout with ats-tls on cp4031

https://gerrit.wikimedia.org/r/571483

Mentioned in SAL (#wikimedia-operations) [2020-02-11T14:20:31Z] <vgutierrez> restart varnish-fe on cp4031 - T244464

Change 571540 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable KA on cp4031

https://gerrit.wikimedia.org/r/571540

Change 571540 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable KA on cp4031

https://gerrit.wikimedia.org/r/571540

Mentioned in SAL (#wikimedia-operations) [2020-02-11T17:13:15Z] <vgutierrez> Disable KA on cp4031 - T244464

Change 571688 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "ATS: Disable KA on cp4031"

https://gerrit.wikimedia.org/r/571688

Change 571688 merged by Vgutierrez:
[operations/puppet@production] Revert "ATS: Disable KA on cp4031"

https://gerrit.wikimedia.org/r/571688

Mentioned in SAL (#wikimedia-operations) [2020-02-12T10:32:21Z] <vgutierrez> Enable KA between ats-tls and varnish-fe on cp4031 - T244464

Change 571690 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Don't assume that http_settings is mandatory on ats-tls profile

https://gerrit.wikimedia.org/r/571690

Change 571690 merged by Vgutierrez:
[operations/puppet@production] ATS: Don't assume that http_settings is mandatory on ats-tls profile

https://gerrit.wikimedia.org/r/571690

Change 571719 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "Revert "ATS: Disable KA on cp4031""

https://gerrit.wikimedia.org/r/571719

Change 571719 merged by Vgutierrez:
[operations/puppet@production] Revert "Revert "ATS: Disable KA on cp4031""

https://gerrit.wikimedia.org/r/571719

Mentioned in SAL (#wikimedia-operations) [2020-02-12T13:15:52Z] <vgutierrez> disabling KA between ats-tls and varnish-fe on cp4031 - T244464

Vgutierrez changed the task status from Open to Stalled. (Wed, Feb 12, 2:03 PM)

ATS does not properly handle the connect and TTFB timeouts when KA is enabled and parent proxies are in use (as on ats-tls). This has been reported upstream as https://github.com/apache/trafficserver/issues/6415

Change 571736 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Test KA on cp4031 whilst disable parent proxies are disabled

https://gerrit.wikimedia.org/r/571736

Change 571736 merged by Vgutierrez:
[operations/puppet@production] ATS: Test KA on cp4031 whilst parent proxies are disabled

https://gerrit.wikimedia.org/r/571736

Mentioned in SAL (#wikimedia-operations) [2020-02-12T15:56:16Z] <vgutierrez> Enable KA and disable parent proxies on cp4031 - T244464

Change 571763 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "ATS: Test KA on cp4031 whilst parent proxies are disabled"

https://gerrit.wikimedia.org/r/571763

Change 571763 merged by Vgutierrez:
[operations/puppet@production] Revert "ATS: Test KA on cp4031 whilst parent proxies are disabled"

https://gerrit.wikimedia.org/r/571763

Mentioned in SAL (#wikimedia-operations) [2020-02-12T17:09:39Z] <vgutierrez> disabling KA between ats-tls and varnish-fe on cp4031 - T244464

Change 571869 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Release 8.0.5-1wm16

https://gerrit.wikimedia.org/r/571869

Vgutierrez changed the task status from Stalled to Open. (Thu, Feb 13, 7:13 AM)

After some investigation, it looks like upstream PR 5811 could fix the issue. I've backported it as part of 8.0.5-1wm16: https://gerrit.wikimedia.org/r/c/operations/debs/trafficserver/+/571869

Change 571876 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Test KA between ats-tls and varnish-fe on cp4031 + ATS 8.0.5-1wm16

https://gerrit.wikimedia.org/r/571876

Change 571876 merged by Vgutierrez:
[operations/puppet@production] ATS: Test KA between ats-tls and varnish-fe on cp4031 + ATS 8.0.5-1wm16

https://gerrit.wikimedia.org/r/571876

Mentioned in SAL (#wikimedia-operations) [2020-02-13T07:49:18Z] <vgutierrez> testing ATS 8.0.5-1wm16 + KA between ats-tls and varnish-fe in cp4031 - T244464

Change 571869 merged by Vgutierrez:
[operations/debs/trafficserver@master] Release 8.0.5-1wm16

https://gerrit.wikimedia.org/r/571869

Mentioned in SAL (#wikimedia-operations) [2020-02-13T11:08:29Z] <vgutierrez> upload trafficserver 8.0.5-1wm16 to apt.wm.o (buster) - T244464

The issue with timeouts and KeepAlive can be reproduced in a small test environment using curl + ATS + httpbin:

  1. curl requests /delay/20; the request returns successfully after 20 seconds
  2. curl requests /delay/20 again; this time the request returns successfully after 23 seconds
  3. httpbin sees the following requests:
time="2020-02-12T16:37:41.5787" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.73
time="2020-02-12T16:37:52.2979" status=200 method="GET" uri="/delay/20" size_bytes=0 duration_ms=3717.91
time="2020-02-12T16:38:12.2721" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.50

What's happening? On the second curl request, ATS triggers a timeout after ~3 seconds (the connect timeout in the test scenario) and retries the request, which is why httpbin sees three requests and curl waits 23 seconds instead of 20. In our production environment this means that ats-be has been unnecessarily retrying every request that takes longer than 10 seconds (the connect timeout for ats-be). This has been happening since January 14th.
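
To spot this pattern in a larger httpbin log, here is a minimal Python sketch. It is a heuristic for illustration, not something used in production: it flags any request that httpbin logged with an empty body and a duration close to the configured connect timeout. The function names, the `slack_s` tolerance and the assumption that an aborted attempt always shows up as `size_bytes=0` are hypothetical; only the three log lines come from the test above.

```python
import re

# Matches key=value and key="quoted value" pairs in an httpbin-style log line.
FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

# The three requests seen by httpbin in the test scenario above.
LOG_LINES = [
    'time="2020-02-12T16:37:41.5787" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.73',
    'time="2020-02-12T16:37:52.2979" status=200 method="GET" uri="/delay/20" size_bytes=0 duration_ms=3717.91',
    'time="2020-02-12T16:38:12.2721" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.50',
]


def parse_line(line):
    """Return the key/value fields of one log line as a dict of strings."""
    return {key: quoted or bare for key, quoted, bare in FIELD_RE.findall(line)}


def find_aborted_attempts(lines, connect_timeout_s, slack_s=1.0):
    """Return (time, uri, duration_s) for requests that look like aborted attempts.

    Assumption (hypothetical heuristic): an attempt aborted by the proxy is
    logged with size_bytes=0 and a duration within slack_s seconds of the
    proxy's connect timeout.
    """
    aborted = []
    for line in lines:
        fields = parse_line(line)
        duration_s = float(fields["duration_ms"]) / 1000.0
        empty_body = int(fields["size_bytes"]) == 0
        near_timeout = abs(duration_s - connect_timeout_s) <= slack_s
        if empty_body and near_timeout:
            aborted.append((fields["time"], fields["uri"], duration_s))
    return aborted


if __name__ == "__main__":
    # 3 seconds is the connect timeout used in the test scenario.
    for when, uri, duration_s in find_aborted_attempts(LOG_LINES, connect_timeout_s=3.0):
        print(f"{when} {uri} aborted after {duration_s:.2f}s")
```

With the 3 second connect timeout of the test scenario, only the second log line is flagged. For ats-be in production the timeout to pass would be 10 seconds.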

Mentioned in SAL (#wikimedia-operations) [2020-02-13T11:18:21Z] <vgutierrez> rolling upgrade of ATS to version 8.0.5-1wm16 fleet wide - T244464

The issue described above should now be fixed almost everywhere:

===== NODE GROUP =====
(76) cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1075-1090].eqiad.wmnet,cp[5001-5012].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4021-4025,4027-4031].ulsfo.wmnet
----- OUTPUT of 'apt-cache policy...r|grep Installed' -----
  Installed: 8.0.5-1wm16
===== NODE GROUP =====

Only the 2 nodes in ulsfo that are running 8.0.6-rc0 don't have the patch applied.

Change 572515 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Extend KA experiment between ats-tls and varnish-fe to all ulsfo

https://gerrit.wikimedia.org/r/572515

Change 572515 merged by Vgutierrez:
[operations/puppet@production] ATS: Extend KA experiment between ats-tls and varnish-fe to all ulsfo

https://gerrit.wikimedia.org/r/572515

Mentioned in SAL (#wikimedia-operations) [2020-02-17T10:20:41Z] <vgutierrez> rolling restart of ats-tls and varnish-fe on ulsfo to enable KA between them - T244464

Change 573220 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@8.0.6] Release 8.0.5-1wm16

https://gerrit.wikimedia.org/r/573220

Change 573220 merged by Vgutierrez:
[operations/debs/trafficserver@8.0.6] Release 8.0.5-1wm16

https://gerrit.wikimedia.org/r/573220

Change 571472 merged by Ema:
[operations/puppet@production] cache: use Connection:KA for varnish-ats checks

https://gerrit.wikimedia.org/r/571472

Change 573337 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: revert Connection:KA probe experiment on cp4026

https://gerrit.wikimedia.org/r/573337

Change 573337 merged by Ema:
[operations/puppet@production] cache: revert Connection:KA probe experiment on cp4026

https://gerrit.wikimedia.org/r/573337