Page MenuHomePhabricator

ats-tls performance issues under production load
Closed, InvalidPublic

Description

As shown on https://phabricator.wikimedia.org/P10324, ats-tls has some performance issues. While discussing this on IRC with @BBlack we figured out a nice way of getting a slowlog request on-demand that should help debugging this issue and it could be useful on the future as well:

  1. Create a log format containing the milestones differences included in the ATS slow log: https://github.com/apache/trafficserver/blob/8.0.x/proxy/http/HttpSM.cc#L7008-L7054
  2. Create a log filter matching a header like X-WMF-Slow-Log or something of that kind
  3. Log it to a named pipe

Event Timeline

Vgutierrez moved this task from Backlog to TLS on the Traffic board.

Change 570802 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Provide a slowlog for ats-tls and ats-backend

https://gerrit.wikimedia.org/r/570802

So I had a beautiful CR ready to log this data, but ATS ability of filter logs based on IPs is currently broken, so I just applied the CR manually on cp1075 and retrieved the data, you can find it here: https://phabricator.wikimedia.org/P10346

Change 570904 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Release 8.0.5-1wm15

https://gerrit.wikimedia.org/r/570904

To mitigate the DNS based delay that can be seen on the milestone log, I've backported https://github.com/apache/trafficserver/pull/6332 in https://gerrit.wikimedia.org/r/c/operations/debs/trafficserver/+/570904, I'll test it on Monday and if everything goes as expected, during next week we will have it on all cp hosts as soon as they get upgraded to Buster :)

Mentioned in SAL (#wikimedia-operations) [2020-02-10T12:06:34Z] <vgutierrez> testing ats 8.0.5-1-wm15 on cp4032 - T244538

Mentioned in SAL (#wikimedia-operations) [2020-02-10T12:17:26Z] <vgutierrez> upload trafficserver 8.0.5-1wm15 to apt.wm.o (buster) - T244538

Change 571270 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable DNS resolution on ats-tls

https://gerrit.wikimedia.org/r/571270

Change 570904 merged by Vgutierrez:
[operations/debs/trafficserver@master] Release 8.0.5-1wm15

https://gerrit.wikimedia.org/r/570904

From #wikimedia-traffic:
  • <Krinkle> vgutierrez: what kind of DNS queries was ats-tls performing though? Surely not the address to localhost varnish-fe?
  • <vgutierrez> the PTR for localhost :(
  • <vgutierrez> localhost being the 10.x main IP address of cpX
  • <vgutierrez> that's why it can be safely disabled
  • <bblack> in general it's the right thing to do. It's a cache and it's expecting upstreams to be various origins with real hostnames
  • <bblack> and you have to follow dns changes for failover events, etc
  • <bblack> it's just that in this particular case, it's all traffic to the same local machine via IP address.
  • <bblack> so there's a flag to disable dns lookups for cases like this. We had it turned on for this case before, but it caused the triggering of a bug in the loadbalancing over the ports.
  • <bblack> now we have fixes for that, so we'll get back to the no-dns-lookups state eventually here

Change 571270 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable DNS resolution on ats-tls

https://gerrit.wikimedia.org/r/571270

Tested again on cp1075 (now running buster) before disabling DNS on ats-tls and after:

before
vgutierrez@cp1075:~$ ./hey -c 1 -z 10s https://en.wikipedia.org/favicon.ico

Summary:
  Total:	10.0166 secs
  Slowest:	0.0785 secs
  Fastest:	0.0005 secs
  Average:	0.0019 secs
  Requests/sec:	530.8187


Response time histogram:
  0.001 [1]	|
  0.008 [4927]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.016 [249]	|■■
  0.024 [94]	|■
  0.032 [28]	|
  0.039 [6]	|
  0.047 [7]	|
  0.055 [1]	|
  0.063 [3]	|
  0.071 [0]	|
  0.078 [1]	|


Latency distribution:
  10% in 0.0006 secs
  25% in 0.0006 secs
  50% in 0.0007 secs
  75% in 0.0007 secs
  90% in 0.0011 secs
  95% in 0.0097 secs
  99% in 0.0222 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0005 secs, 0.0785 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0007 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0004 secs
  resp wait:	0.0017 secs, 0.0004 secs, 0.0783 secs
  resp read:	0.0001 secs, 0.0001 secs, 0.0009 secs

Status code distribution:
  [200]	5317 responses
after
vgutierrez@cp1075:~$ ./hey -c 1 -z 10s https://en.wikipedia.org/favicon.ico

Summary:
  Total:	10.0005 secs
  Slowest:	0.0279 secs
  Fastest:	0.0005 secs
  Average:	0.0007 secs
  Requests/sec:	1349.2330


Response time histogram:
  0.000 [1]	|
  0.003 [13440]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [0]	|
  0.009 [0]	|
  0.011 [0]	|
  0.014 [1]	|
  0.017 [3]	|
  0.020 [38]	|
  0.022 [7]	|
  0.025 [1]	|
  0.028 [2]	|


Latency distribution:
  10% in 0.0006 secs
  25% in 0.0006 secs
  50% in 0.0007 secs
  75% in 0.0007 secs
  90% in 0.0008 secs
  95% in 0.0008 secs
  99% in 0.0010 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0005 secs, 0.0279 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0007 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0001 secs
  resp wait:	0.0006 secs, 0.0004 secs, 0.0278 secs
  resp read:	0.0001 secs, 0.0001 secs, 0.0008 secs

Status code distribution:
  [200]	13493 responses

Mentioned in SAL (#wikimedia-operations) [2020-02-25T17:15:45Z] <vgutierrez> restart ats-tls on cp1075 - T244538

it looks like ats-tls performance gets degraded over time, cp1075 was showing similar values to the ones exhibited before disabling DNS on ats-tls, a service restart seems to fix it.

Change 594670 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Add the client TTFB to ats-tls log

https://gerrit.wikimedia.org/r/594670

Change 594677 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Add atstls.mtail program

https://gerrit.wikimedia.org/r/594677

Change 594670 abandoned by Vgutierrez:
ATS: Add the client TTFB to ats-tls log

Reason:
As pointed out by ema on IRC, we can get this data already from the atskafka log

https://gerrit.wikimedia.org/r/594670

Change 594677 merged by Vgutierrez:
[operations/puppet@production] ATS: Add atstls.mtail program

https://gerrit.wikimedia.org/r/594677

Change 655876 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disable JIT compiler on ats-tls too

https://gerrit.wikimedia.org/r/655876

Change 655876 merged by Ema:
[operations/puppet@production] ATS: disable JIT compiler on ats-tls too

https://gerrit.wikimedia.org/r/655876

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!

BCornwall changed the task status from Open to Stalled.Oct 19 2022, 8:45 PM
BCornwall subscribed.

@Vgutierrez Now that HAProxy is used for TLS termination, can this safely be closed?

Closing as per bblack's recommendation.