
Investigate side-effects of enabling KA between ats-tls and varnish-fe
Closed, Resolved (Public)

Description

Right now ats-tls doesn't try to reuse connections between itself and varnish-fe, which results in a large number of connections between these two layers. It has been like this for a very long time, and we are no longer sure what the side effects of enabling keepalive (KA) would be.

Details

Project                        Branch      Lines +/-
operations/puppet              production  +13 -0
operations/puppet              production  +1 -14
operations/puppet              production  +3 -2
operations/puppet              production  +5 -22
operations/puppet              production  +0 -12
operations/puppet              production  +16 -0
operations/debs/trafficserver  8.0.6       +41 -0
operations/puppet              production  +4 -4
operations/debs/trafficserver  master      +41 -0
operations/puppet              production  +3 -3
operations/puppet              production  +3 -4
operations/puppet              production  +4 -3
operations/puppet              production  +3 -3
operations/puppet              production  +3 -3
operations/puppet              production  +1 -1
operations/puppet              production  +3 -3
operations/puppet              production  +4 -0
operations/puppet              production  +26 -4
operations/puppet              production  +0 -0
operations/puppet              production  +1 -1
operations/puppet              production  +0 -0
operations/puppet              production  +1 -1
operations/puppet              production  +28 -22
operations/puppet              production  +12 -0
operations/puppet              production  +33 -28

Event Timeline


Change 570594 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Allow configuring via hiera KA against origin servers

https://gerrit.wikimedia.org/r/570594

Vgutierrez triaged this task as Medium priority. Feb 6 2020, 9:58 AM
Vgutierrez moved this task from Triage to Caching on the Traffic board.

Change 570594 merged by Vgutierrez:
[operations/puppet@production] ATS: Allow configuring via hiera KA against origin servers

https://gerrit.wikimedia.org/r/570594

Change 570599 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable KA between ats-tls and varnish-fe on cp4031

https://gerrit.wikimedia.org/r/570599

Change 570599 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable KA between ats-tls and varnish-fe on cp4031

https://gerrit.wikimedia.org/r/570599

Mentioned in SAL (#wikimedia-operations) [2020-02-06T10:19:20Z] <vgutierrez> Enabling HTTP keepalive between ats-tls and varnish-frontend on cp4031 - T244464

Change 570622 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Allow configuring via hiera server session sharing settings

https://gerrit.wikimedia.org/r/570622

Change 570622 merged by Vgutierrez:
[operations/puppet@production] ATS: Allow configuring via hiera server session sharing settings

https://gerrit.wikimedia.org/r/570622

Change 570633 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Allow server session sharing by ip on ats-tls in cp4031

https://gerrit.wikimedia.org/r/570633

Change 570633 merged by Vgutierrez:
[operations/puppet@production] ATS: Allow server session sharing by ip on ats-tls in cp4031

https://gerrit.wikimedia.org/r/570633

Mentioned in SAL (#wikimedia-operations) [2020-02-06T13:22:32Z] <vgutierrez> Enable server session sharing on ats-tls in cp4031 - T244464

Change 571291 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable KeepAlive for the whole caching cluster in ulsfo

https://gerrit.wikimedia.org/r/571291

Change 571294 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Log ConnReuse on ats-tls

https://gerrit.wikimedia.org/r/571294

Change 571291 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable KeepAlive for the whole caching cluster in ulsfo

https://gerrit.wikimedia.org/r/571291

Change 571294 merged by Vgutierrez:
[operations/puppet@production] ATS: Log ConnReuse on ats-tls

https://gerrit.wikimedia.org/r/571294

Mentioned in SAL (#wikimedia-operations) [2020-02-10T14:52:13Z] <vgutierrez> rolling restart of ats-tls in ulsfo - T244464

Change 571311 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Avoid hardcoding Connection: Close on ats-tls when keepalive is ON

https://gerrit.wikimedia.org/r/571311

Change 571459 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "ATS: Enable KeepAlive for the whole caching cluster in ulsfo"

https://gerrit.wikimedia.org/r/571459

Change 571459 merged by Vgutierrez:
[operations/puppet@production] Revert "ATS: Enable KeepAlive for the whole caching cluster in ulsfo"

https://gerrit.wikimedia.org/r/571459

Mentioned in SAL (#wikimedia-operations) [2020-02-11T10:07:54Z] <vgutierrez> rolling restart of ats-tls in ulsfo - T244464

Change 571472 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: use Connection:KA for varnish-ats checks

https://gerrit.wikimedia.org/r/571472

Change 571311 merged by Vgutierrez:
[operations/puppet@production] ATS: Avoid hardcoding Connection: Close on ats-tls when keepalive is ON

https://gerrit.wikimedia.org/r/571311

Mentioned in SAL (#wikimedia-operations) [2020-02-11T11:20:20Z] <vgutierrez> ats-tls effectively reusing connections between ats-tls and varnish-fe on cp4031 - T244464

Change 571483 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] varnish: Sync idle timeout with ats-tls on cp4031

https://gerrit.wikimedia.org/r/571483

Change 571483 merged by Vgutierrez:
[operations/puppet@production] varnish: Sync idle timeout with ats-tls on cp4031

https://gerrit.wikimedia.org/r/571483

Mentioned in SAL (#wikimedia-operations) [2020-02-11T14:20:31Z] <vgutierrez> restart varnish-fe on cp4031 - T244464

Change 571540 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable KA on cp4031

https://gerrit.wikimedia.org/r/571540

Change 571540 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable KA on cp4031

https://gerrit.wikimedia.org/r/571540

Change 571688 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "ATS: Disable KA on cp4031"

https://gerrit.wikimedia.org/r/571688

Change 571688 merged by Vgutierrez:
[operations/puppet@production] Revert "ATS: Disable KA on cp4031"

https://gerrit.wikimedia.org/r/571688

Mentioned in SAL (#wikimedia-operations) [2020-02-12T10:32:21Z] <vgutierrez> Enable KA between ats-tls and varnish-fe on cp4031 - T244464

Change 571690 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Don't assume that http_settings is mandatory on ats-tls profile

https://gerrit.wikimedia.org/r/571690

Change 571690 merged by Vgutierrez:
[operations/puppet@production] ATS: Don't assume that http_settings is mandatory on ats-tls profile

https://gerrit.wikimedia.org/r/571690

Change 571719 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "Revert "ATS: Disable KA on cp4031""

https://gerrit.wikimedia.org/r/571719

Change 571719 merged by Vgutierrez:
[operations/puppet@production] Revert "Revert "ATS: Disable KA on cp4031""

https://gerrit.wikimedia.org/r/571719

Mentioned in SAL (#wikimedia-operations) [2020-02-12T13:15:52Z] <vgutierrez> disabling KA between ats-tls and varnish-fe on cp4031 - T244464

Vgutierrez changed the task status from Open to Stalled. Feb 12 2020, 2:03 PM

ATS does not properly handle the connect timeout and the TTFB (time to first byte) timeout when KA is enabled and parent proxies are being used (ats-tls). This has been reported upstream as https://github.com/apache/trafficserver/issues/6415

Change 571736 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Test KA on cp4031 whilst disable parent proxies are disabled

https://gerrit.wikimedia.org/r/571736

Change 571736 merged by Vgutierrez:
[operations/puppet@production] ATS: Test KA on cp4031 whilst parent proxies are disabled

https://gerrit.wikimedia.org/r/571736

Mentioned in SAL (#wikimedia-operations) [2020-02-12T15:56:16Z] <vgutierrez> Enable KA and disable parent proxies on cp4031 - T244464

Change 571763 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "ATS: Test KA on cp4031 whilst parent proxies are disabled"

https://gerrit.wikimedia.org/r/571763

Change 571763 merged by Vgutierrez:
[operations/puppet@production] Revert "ATS: Test KA on cp4031 whilst parent proxies are disabled"

https://gerrit.wikimedia.org/r/571763

Mentioned in SAL (#wikimedia-operations) [2020-02-12T17:09:39Z] <vgutierrez> disabling KA between ats-tls and varnish-fe on cp4031 - T244464

Change 571869 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Release 8.0.5-1wm16

https://gerrit.wikimedia.org/r/571869

Vgutierrez changed the task status from Stalled to Open. Feb 13 2020, 7:13 AM

After some investigation, it looks like upstream PR 5811 could fix the issue. I've backported it as part of 8.0.5-1wm16: https://gerrit.wikimedia.org/r/c/operations/debs/trafficserver/+/571869

Change 571876 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Test KA between ats-tls and varnish-fe on cp4031 + ATS 8.0.5-1wm16

https://gerrit.wikimedia.org/r/571876

Change 571876 merged by Vgutierrez:
[operations/puppet@production] ATS: Test KA between ats-tls and varnish-fe on cp4031 + ATS 8.0.5-1wm16

https://gerrit.wikimedia.org/r/571876

Mentioned in SAL (#wikimedia-operations) [2020-02-13T07:49:18Z] <vgutierrez> testing ATS 8.0.5-1wm16 + KA between ats-tls and varnish-fe in cp4031 - T244464

Change 571869 merged by Vgutierrez:
[operations/debs/trafficserver@master] Release 8.0.5-1wm16

https://gerrit.wikimedia.org/r/571869

Mentioned in SAL (#wikimedia-operations) [2020-02-13T11:08:29Z] <vgutierrez> upload trafficserver 8.0.5-1wm16 to apt.wm.o (buster) - T244464

The issue with timeouts and KeepAlive can be easily understood with a small environment using curl + ATS + httpbin.

  1. curl requests /delay/20, the request returns successfully after 20 seconds
  2. curl requests (again) /delay/20, the request returns successfully after 23 seconds
  3. httpbin sees the following requests:
time="2020-02-12T16:37:41.5787" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.73
time="2020-02-12T16:37:52.2979" status=200 method="GET" uri="/delay/20" size_bytes=0 duration_ms=3717.91
time="2020-02-12T16:38:12.2721" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.50
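The three log lines above can also be checked mechanically. The following is only a rough sketch in Python: it assumes that the `time` field is written when httpbin finishes handling a request (so start = time - duration), which the log itself does not state. The helper name `parse` is made up for this example.

```python
import re
from datetime import datetime, timedelta

# The three httpbin log lines quoted above, copied verbatim.
LOG = [
    'time="2020-02-12T16:37:41.5787" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.73',
    'time="2020-02-12T16:37:52.2979" status=200 method="GET" uri="/delay/20" size_bytes=0 duration_ms=3717.91',
    'time="2020-02-12T16:38:12.2721" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.50',
]

FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse(line):
    """Turn one key=value log line into a small record (hypothetical helper)."""
    rec = {key: value.strip('"') for key, value in FIELD.findall(line)}
    # Assumption: the timestamp marks the END of the request.
    end = datetime.strptime(rec["time"], "%Y-%m-%dT%H:%M:%S.%f")
    duration = timedelta(milliseconds=float(rec["duration_ms"]))
    return {
        "start": end - duration,
        "end": end,
        "size_bytes": int(rec["size_bytes"]),
        "duration_s": duration.total_seconds(),
    }


records = [parse(line) for line in LOG]

# A request that was cut short sends no body back: size_bytes == 0.
aborted = [r for r in records if r["size_bytes"] == 0]

# Wall-clock time covered by the aborted attempt plus the attempt after it.
second_curl_elapsed = (records[2]["end"] - records[1]["start"]).total_seconds()

print(f"aborted attempts: {len(aborted)}, lasting {aborted[0]['duration_s']:.1f}s")
print(f"second curl request, as seen by httpbin: {second_curl_elapsed:.1f}s")
```

Under that assumption, the aborted attempt and the one that follows it together span a little under 24 seconds, which lines up with the ~23 seconds observed by curl in step 2.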

What's happening? On the second curl request, a timeout is triggered on ATS after ~3 seconds (the connect timeout in the test scenario), and ATS then retries the request against httpbin. In our production environment this means that ats-be has been unnecessarily retrying every request that takes longer than 10 seconds (the connect timeout for ats-be). This has been happening since January 14th.
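The arithmetic behind this can be sketched as a toy model. This is not ATS code and says nothing about how ATS implements its timers; it simply assumes that, with the bug, a request sent over a reused (keepalive) connection is cut off at the connect timeout instead of the TTFB timeout, and is then retried once on a fresh connection. The function name `simulate` and the 30 and 180 second TTFB values are made up for illustration; only the 3 second and 10 second connect timeouts come from this task.

```python
def simulate(delay_s, connect_timeout_s, ttfb_timeout_s, reused, buggy):
    """Return (client_latency_s, origin_requests) for one client request.

    Toy model, assuming delay_s <= ttfb_timeout_s so the retry always succeeds.
    """
    assert delay_s <= ttfb_timeout_s, "model only covers responses within the TTFB timeout"
    # Assumption: with the bug, a reused connection is governed by the connect timeout.
    effective_timeout = connect_timeout_s if (reused and buggy) else ttfb_timeout_s
    if delay_s <= effective_timeout:
        return delay_s, 1
    # The timeout fires first; the request is retried on a fresh connection,
    # so the origin sees it twice and the client waits for both attempts.
    return effective_timeout + delay_s, 2


# Test scenario from this task: /delay/20 with a 3 second connect timeout.
first = simulate(20, 3, 30, reused=False, buggy=True)
second = simulate(20, 3, 30, reused=True, buggy=True)
print(first, second)  # (20, 1) (23, 2)

# Production-like case: ats-be with a 10 second connect timeout and a 12 second response.
print(simulate(12, 10, 180, reused=True, buggy=True))  # (22, 2)
```

In this model the two curl requests produce three requests on the origin side, and the second one takes about 23 seconds, which matches what curl and httpbin reported.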

Mentioned in SAL (#wikimedia-operations) [2020-02-13T11:18:21Z] <vgutierrez> rolling upgrade of ATS to version 8.0.5-1wm16 fleet wide - T244464

The issue described above should now be fixed almost everywhere:

===== NODE GROUP =====
(76) cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1075-1090].eqiad.wmnet,cp[5001-5012].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4021-4025,4027-4031].ulsfo.wmnet
----- OUTPUT of 'apt-cache policy...r|grep Installed' -----
  Installed: 8.0.5-1wm16
===== NODE GROUP =====

Only 2 nodes in ulsfo, which are running 8.0.6-rc0, don't have the patch applied.
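As a cross-check of the host count in the NODE GROUP header, the bracketed ranges can be expanded by hand. The snippet below is a minimal sketch that only handles the `prefix[a-b,c-d].suffix` shape seen in the output above; it is not the parser used by the tool that produced that output, and `expand` is a hypothetical helper.

```python
import re

# Node group expression copied from the output above.
NODES = (
    "cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,"
    "cp[1075-1090].eqiad.wmnet,cp[5001-5012].eqsin.wmnet,"
    "cp[3050-3065].esams.wmnet,cp[4021-4025,4027-4031].ulsfo.wmnet"
)

# prefix, bracketed ranges, suffix; commas outside brackets separate the groups.
GROUP = re.compile(r"([^,\[\]]+)\[([^\]]+)\]([^,\[\]]*)")


def expand(expr):
    """Expand 'cp[1-3,5].site' style groups into a flat list of hostnames."""
    hosts = []
    for prefix, ranges, suffix in GROUP.findall(expr):
        for part in ranges.split(","):
            low, _, high = part.partition("-")
            high = high or low  # a single number is a range of one
            hosts.extend(f"{prefix}{n}{suffix}" for n in range(int(low), int(high) + 1))
    return hosts


hosts = expand(NODES)
print(len(hosts))  # 76, the same count as in the NODE GROUP header
```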

Change 572515 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Extend KA experiment between ats-tls and varnish-fe to all ulsfo

https://gerrit.wikimedia.org/r/572515

Change 572515 merged by Vgutierrez:
[operations/puppet@production] ATS: Extend KA experiment between ats-tls and varnish-fe to all ulsfo

https://gerrit.wikimedia.org/r/572515

Mentioned in SAL (#wikimedia-operations) [2020-02-17T10:20:41Z] <vgutierrez> rolling restart of ats-tls and varnish-fe on ulsfo to enable KA between them - T244464

Change 573220 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@8.0.6] Release 8.0.5-1wm16

https://gerrit.wikimedia.org/r/573220

Change 573220 merged by Vgutierrez:
[operations/debs/trafficserver@8.0.6] Release 8.0.5-1wm16

https://gerrit.wikimedia.org/r/573220

Change 571472 merged by Ema:
[operations/puppet@production] cache: use Connection:KA for varnish-ats checks

https://gerrit.wikimedia.org/r/571472

Change 573337 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: revert Connection:KA probe experiment on cp4026

https://gerrit.wikimedia.org/r/573337

Change 573337 merged by Ema:
[operations/puppet@production] cache: revert Connection:KA probe experiment on cp4026

https://gerrit.wikimedia.org/r/573337

Change 577198 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable KA between ats-tls and varnish-fe globally

https://gerrit.wikimedia.org/r/577198

Change 577198 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable KA between ats-tls and varnish-fe globally

https://gerrit.wikimedia.org/r/577198

Mentioned in SAL (#wikimedia-operations) [2020-03-05T10:14:30Z] <vgutierrez> Enable keep alive between ats-tls and varnish-fe globally - T244464

Change 577210 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable parent proxies on ulsfo

https://gerrit.wikimedia.org/r/577210

Change 577210 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable parent proxies on ulsfo

https://gerrit.wikimedia.org/r/577210

Mentioned in SAL (#wikimedia-operations) [2020-03-05T11:10:33Z] <vgutierrez> Disable parent proxies on ats-tls in ulsfo - T244464

Change 578182 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable parent proxies globally

https://gerrit.wikimedia.org/r/578182

Change 578182 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable parent proxies globally

https://gerrit.wikimedia.org/r/578182

Mentioned in SAL (#wikimedia-operations) [2020-03-09T10:04:55Z] <vgutierrez> disable parent proxies globally on ats-tls - T244464

Vgutierrez claimed this task.

KA between ats-tls and varnish-fe is working successfully and is enabled globally in the caching cluster for text & upload.

Change 666838 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable parent proxies in cp5006

https://gerrit.wikimedia.org/r/666838

Change 666838 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable parent proxies in cp5006

https://gerrit.wikimedia.org/r/666838