
15% response start regression as of 2019-11-11 (Varnish->ATS)
Open, High, Public

Details

(Related Gerrit patches: 36 changes to operations/puppet, branch production; the individual patches are listed in the Event Timeline below.)

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 554483 merged by Ema:
[operations/puppet@production] ATS: add milestone timing logs, update TTFetchHeaders

https://gerrit.wikimedia.org/r/554483

Change 554503 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add atsbackendttfb.mtail

https://gerrit.wikimedia.org/r/554503

Change 554503 merged by Ema:
[operations/puppet@production] ATS: add client<->ats-be interactions metrics

https://gerrit.wikimedia.org/r/554503

Change 554558 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: allow to toggle request coalescing

https://gerrit.wikimedia.org/r/554558

Change 554570 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add trafficserver_backend_client_requests_total to mtail

https://gerrit.wikimedia.org/r/554570

Change 554558 merged by Ema:
[operations/puppet@production] ATS: allow to toggle request coalescing

https://gerrit.wikimedia.org/r/554558

Change 554808 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disable text@esams origin server request coalescing

https://gerrit.wikimedia.org/r/554808

Change 554808 merged by Ema:
[operations/puppet@production] ATS: disable text@esams origin server request coalescing

https://gerrit.wikimedia.org/r/554808

Mentioned in SAL (#wikimedia-operations) [2019-12-05T09:54:17Z] <ema> text@esams: disable ats-be origin server request coalescing T238494

Change 554863 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add varnish-fe <-> ats-be TTFB histogram

https://gerrit.wikimedia.org/r/554863

Change 554570 abandoned by Ema:
ATS: add trafficserver_backend_client_requests_total to mtail

Reason:
Made obsolete by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554863/

https://gerrit.wikimedia.org/r/554570

Change 554863 merged by Ema:
[operations/puppet@production] ATS: add varnish-fe <-> ats-be TTFB histogram

https://gerrit.wikimedia.org/r/554863

ema added a comment. Edited Dec 6 2019, 10:04 AM

As a way to identify more specifically where the TTFB regression comes from, in particular to understand precisely how much ats-be contributes to it, I thought of analyzing TTFB as seen by varnish-fe when acting as a client (that is: when varnish-fe fetches objects from its backends, which can be either ats-be or varnish-be).

To this end I have gathered 60 seconds of fetches initiated by varnish-fe (miss/pass traffic from varnish-fe POV, in other words: non-hits at the frontend layer) and printed TTFB as well as X-Cache-Int. Notice the -b argument, "only backend requests trigger log lines".

timeout 60 varnishncsa -n frontend -b -F '%{Varnish:time_firstbyte}x %{X-Cache-Int}o' > x.log

For each cache_text backend in esams I have then computed the median, p75 and p95 TTFB in milliseconds:

for host in cp3050 cp3052 cp3054 cp3056 cp3058 cp3060 cp3062 cp3064 ; do
    printf "$host "; sed -E 's/(hit|miss|pass|,|\/[0-9]+)//g' x.log | awk '{print $1*1000, $NF}' |
        grep $host | sort -n | awk '{v[NR]=$1} END {print int(v[int(NR*.5)]), int(v[int(NR*.75)]), int(v[int(NR*.95)])}'
done

The results follow; cp3064 is the only varnish-be, and the last column shows the p75 difference between ATS and Varnish, which is consistently ~60/70ms. If we add the ~40ms introduced by the TLS cert change T238494#5711289, I think we get pretty close to the whole responseStart regression tracked here.

| host | p50 | p75 | p95 | software | p75 diff |
| cp3050 | 163 | 252 | 624 | ats | 69 |
| cp3052 | 165 | 251 | 480 | ats | 68 |
| cp3054 | 160 | 248 | 468 | ats | 65 |
| cp3056 | 162 | 247 | 473 | ats | 64 |
| cp3058 | 164 | 254 | 482 | ats | 71 |
| cp3060 | 162 | 255 | 514 | ats | 72 |
| cp3062 | 158 | 244 | 469 | ats | 61 |
| cp3064 | 118 | 183 | 381 | varnish | 0 |

The results above include both backend miss/pass and backend hits, but there are also very large differences when it comes to backend hits (we filter with grep hit, not perfect as that also includes remote hits for cp3064, but good enough given that it's p75):

for host in cp3050 cp3052 cp3054 cp3056 cp3058 cp3060 cp3062 cp3064 ; do
    printf "|$host "; grep hit x.log | sed -E 's/(hit|miss|pass|,|\/[0-9]+)//g'  | awk '{print $1*1000, $NF}' |
        grep $host | sort -n | awk '{v[NR]=$1} END {print "|", v[int(NR*.5)], "|", v[int(NR*.75)], "|", v[int(NR*.95)]}'; done
| host | p50 | p75 | p95 | software | p75 diff |
| cp3050 | 9.148 | 12.264 | 62.599 | ats | 11.729 |
| cp3052 | 9.418 | 12.518 | 51.579 | ats | 11.983 |
| cp3054 | 8.835 | 12.008 | 54.231 | ats | 11.473 |
| cp3056 | 8.898 | 13.75 | 53.285 | ats | 13.215 |
| cp3058 | 8.978 | 11.736 | 48.262 | ats | 11.201 |
| cp3060 | 8.792 | 12.053 | 50.347 | ats | 11.518 |
| cp3062 | 8.84 | 12.208 | 48.361 | ats | 11.673 |
| cp3064 | 0.351 | 0.535 | 0.875 | varnish | 0 |

The next step is to understand why ATS introduces this visible performance penalty. As the lowest-possible hanging fruit, I have tried logging on cp3050 TS_MILESTONE_PLUGIN_ACTIVE (time running plugin code) and TS_MILESTONE_PLUGIN_TOTAL (time spent in or waiting for plugins). See the docs for TSHttpTxnMilestoneGet. Out of 120k requests, only 3 had either PLUGIN_ACTIVE or PLUGIN_TOTAL greater than 0 (two requests had 1ms for both, and one had 2ms). This seems to indicate that plugins, including Lua, are not to blame.
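
For reference, repeating that check is a one-liner along these lines (a sketch only: it assumes a log where the last two fields are the PLUGIN_ACTIVE and PLUGIN_TOTAL values in milliseconds, which is an assumption about the field layout rather than the actual cp3050 log format, and milestones.log is a placeholder file name):

# assumption: the last two fields of each line are PLUGIN_ACTIVE and PLUGIN_TOTAL in ms
awk '{ total++; if ($(NF-1) > 0 || $NF > 0) nonzero++ }
    END { printf "%d out of %d requests spent measurable time in plugins\n", nonzero, total }' milestones.log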

Could you generate a separate stats table for misses and passthroughs?

ema added a comment. Dec 6 2019, 10:25 AM

Could you generate a separate stats table for misses and passthroughs?

Certainly. Non-hits (filtered with grep -v hit):

| host | p50 | p75 | p95 | software | p75 diff |
| cp3050 | 184 | 271 | 675 | ats | 42 |
| cp3052 | 187 | 271 | 507 | ats | 42 |
| cp3054 | 180 | 268 | 498 | ats | 39 |
| cp3056 | 183 | 268 | 517 | ats | 39 |
| cp3058 | 187 | 272 | 525 | ats | 43 |
| cp3060 | 185 | 276 | 554 | ats | 47 |
| cp3062 | 178 | 261 | 509 | ats | 32 |
| cp3064 | 159 | 229 | 492 | varnish | 0 |

I meant specifically misses (ATS/Varnish did a lookup and didn't find the object) vs passthroughs (ATS/Varnish merely acted as a proxy), separately. One table for each. Is it possible to separate them?

ema added a comment. Dec 6 2019, 10:38 AM

I meant specifically misses (ATS/Varnish did a lookup and didn't find the object) vs passthroughs (ATS/Varnish merely acted as a proxy), separately. One table for each. Is it possible to separate them?

Not for ATS, at the moment they're all reported as miss. I was planning on fixing that eventually though, and this seems the right time. I assume you're asking to check if cache lookups can contribute to the performance penalty? We do log the precise CacheReadTime, so that's definitely something we can analyze.

Yes, from a cache application perspective they are different tasks, and therefore the issues affecting each could have different causes. If the overhead is high for passthrough as well, that's particularly suspicious, as ATS shouldn't be doing anything other than running a regex to figure out that it's a passthrough path and then proxying things transparently, right? Or does it do some kind of decompression/recompression or any other type of processing for passthrough requests?

Change 555396 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: mark uncacheable responses as 'pass' in X-Cache-Int

https://gerrit.wikimedia.org/r/555396

Change 553132 merged by Ema:
[operations/puppet@production] ATS: pass uncacheable requests

https://gerrit.wikimedia.org/r/553132

Mentioned in SAL (#wikimedia-operations) [2019-12-06T13:41:47Z] <ema> cp2004: adding do_global_ doesn't seem to work with reload, restart ats-be T238494

Change 555396 abandoned by Ema:
ATS: mark uncacheable responses as 'pass' in X-Cache-Int

Reason:
The idea here is wrong. Thanks to hit-for-pass, Varnish can skip cache lookups and coalescing for responses that are known to be uncacheable. In ATS hfp does not exist, hence it would be wrong to label as "pass" things for which we did lookup the cache.

https://gerrit.wikimedia.org/r/555396

ema added a comment. Dec 6 2019, 4:45 PM

We can now distinguish between hit, miss, and pass in text@esams ATS too.

An important caveat when looking at these numbers is that Varnish supports hit-for-pass, essentially allowing it to pass much more traffic by remembering that responses to certain requests were uncacheable (notably, those with Cache-Control: private/no-cache). ATS does not support such a feature, hence it can only decide to pass based on the request. This explains the difference between the number of pass requests (column "pass n") and miss requests ("miss n"): the former is much higher on Varnish, while misses are higher on ATS.

| host | hit p75 | hit diff | hit n | miss p75 | miss diff | miss n | pass p75 | pass diff | pass n |
| cp3050 | 11.817 | 11.359 | 944 | 377.978 | 73.85 | 5569 | 398.128 | 144.747 | 937 |
| cp3052 | 12.193 | 11.735 | 886 | 350.65 | 46.522 | 5478 | 380.078 | 126.697 | 879 |
| cp3054 | 11.971 | 11.513 | 731 | 363.512 | 59.384 | 4547 | 397.028 | 143.647 | 846 |
| cp3056 | 14.91 | 14.452 | 784 | 382.746 | 78.618 | 4782 | 412.411 | 159.03 | 794 |
| cp3058 | 12.345 | 11.887 | 877 | 368.436 | 64.308 | 5811 | 407.02 | 153.639 | 958 |
| cp3060 | 11.849 | 11.391 | 755 | 357.015 | 52.887 | 4787 | 393.6 | 140.219 | 878 |
| cp3062 | 11.749 | 11.291 | 763 | 348.7 | 44.572 | 4969 | 377.698 | 124.317 | 816 |
| cp3064 | 0.458 | 0 | 2647 | 304.128 | 0 | 1539 | 253.381 | 0 | 3778 |

Some initial observations:

  • On cp3064 (Varnish) things look reasonable: cache hits are very cheap, misses are the most expensive (cache lookup needed) and pass is cheaper than miss
  • On ATS hits are cheaper than anything else (luckily!) but there's a ~11ms slowdown compared to Varnish which may or may not be something to worry about at this stage given the low hitrate at the backend layer
  • Pass is more expensive than miss (!) and both introduce a significant regression compared to Varnish

Next week I am planning on studying the overhead introduced by ATS, and I was thinking of defining a new metric called proxy_overhead = client_ttfb - origin_ttfb, with client_ttfb being the number of milliseconds between the moment when ATS starts processing the request from varnish-fe and when the first byte of response to varnish-fe is written. origin_ttfb, instead, is the amount of time between the moment when ATS first tries to contact the origin (or recycle a connection) and when the first byte from the origin is received. The difference between these two, I think, should be a useful indicator of how much overhead ATS (or a proxy in general) is introducing.
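
As a rough offline illustration of that metric before it's wired into mtail/prometheus, the same percentile idiom used above would work (purely a sketch: it assumes a hypothetical two-column log with client_ttfb and origin_ttfb in milliseconds per request, which is not an existing log format):

# assumption: each line is "<client_ttfb_ms> <origin_ttfb_ms>" for one request
awk '{ print $1 - $2 }' ttfb.log | sort -n |
    awk '{ v[NR] = $1 } END { print "proxy_overhead p50", v[int(NR*.5)], "p75", v[int(NR*.75)], "p95", v[int(NR*.95)] }'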

ema added a comment. Edited Dec 9 2019, 7:58 AM

Analysis repeated right now capturing requests for 60s. The numbers don't look as bad. p75 TTFB in milliseconds:

| host | hit | miss | pass |
| cp3062 (ATS) | 12.385 | 266.964 | 226.537 |
| cp3064 (Varnish) | 0.421 | 268.309 | 205.125 |
| diff | 11.964 | -1.3 | 21 |

We should add this data to prometheus to try to draw useful conclusions, see T240180.

Gilles added a comment. Dec 9 2019, 8:37 AM

We're still seeing an extra 100-150ms on the p75 TTFB reported by clients in Europe compared to before 11/11. Only 20ms of which can be attributed to TLS.

Change 555909 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmtail: add origin server logging support

https://gerrit.wikimedia.org/r/555909

Change 555909 merged by Ema:
[operations/puppet@production] varnishmtail: add origin server logging support

https://gerrit.wikimedia.org/r/555909

Change 555930 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmtail: add varnishttfb.mtail

https://gerrit.wikimedia.org/r/555930

Change 555930 merged by Ema:
[operations/puppet@production] varnishmtail: add varnishttfb.mtail

https://gerrit.wikimedia.org/r/555930

Change 555953 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmtail: install varnishttfb.mtail

https://gerrit.wikimedia.org/r/555953

Change 555953 merged by Ema:
[operations/puppet@production] varnishmtail: install varnishttfb.mtail

https://gerrit.wikimedia.org/r/555953

Change 556139 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmtail: new buckets for varnishttfb, use seconds for sum

https://gerrit.wikimedia.org/r/556139

Change 556139 merged by Ema:
[operations/puppet@production] varnishmtail: new buckets for varnishttfb, use seconds for sum

https://gerrit.wikimedia.org/r/556139

BBlack added a subscriber: mark. Dec 12 2019, 3:24 PM

Change 556973 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: enable xdebug plugin on 3 hosts

https://gerrit.wikimedia.org/r/556973

Change 556974 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: log origin Transfer-Encoding

https://gerrit.wikimedia.org/r/556974

Change 556973 merged by Ema:
[operations/puppet@production] ATS: enable xdebug plugin on 3 hosts

https://gerrit.wikimedia.org/r/556973

Change 556974 merged by Ema:
[operations/puppet@production] ATS: log origin Transfer-Encoding

https://gerrit.wikimedia.org/r/556974

Change 557031 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: assign 8G instead of 2G to RAM caches on ats-be

https://gerrit.wikimedia.org/r/557031

Change 557039 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: log Connection response header from origins

https://gerrit.wikimedia.org/r/557039

Change 557039 merged by Ema:
[operations/puppet@production] ATS: log Connection response header from origins

https://gerrit.wikimedia.org/r/557039

Change 557066 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add SystemTap probe to trace session teardown

https://gerrit.wikimedia.org/r/557066

Change 557031 merged by Ema:
[operations/puppet@production] ATS: assign 8G instead of 2G to RAM caches on ats-be

https://gerrit.wikimedia.org/r/557031

Mentioned in SAL (#wikimedia-operations) [2019-12-16T08:54:04Z] <ema> cp1077: ats-backend-restart to increase RAM cache size T238494

Change 557883 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: allow to configure server_session_sharing.pool

https://gerrit.wikimedia.org/r/557883

Change 557066 merged by Ema:
[operations/puppet@production] ATS: add SystemTap probe to trace session teardown

https://gerrit.wikimedia.org/r/557066

Mentioned in SAL (#wikimedia-operations) [2019-12-16T13:33:56Z] <ema> cp-ats: rolling ats-backend-restart to apply ram cache size changes T238494

Change 558048 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: separate wikidata sessions from others

https://gerrit.wikimedia.org/r/558048

Change 558089 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disable compress plugin in text@ulsfo

https://gerrit.wikimedia.org/r/558089

Change 557883 merged by Ema:
[operations/puppet@production] ATS: allow to configure server_session_sharing.pool

https://gerrit.wikimedia.org/r/557883

Change 558048 merged by Ema:
[operations/puppet@production] ATS: separate wikidata sessions from others

https://gerrit.wikimedia.org/r/558048

Change 558089 merged by Ema:
[operations/puppet@production] ATS: disable compress plugin in text@ulsfo

https://gerrit.wikimedia.org/r/558089

Change 558970 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable debug mode in cp3050

https://gerrit.wikimedia.org/r/558970

Change 558970 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable debug mode in cp3050

https://gerrit.wikimedia.org/r/558970

Change 558984 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: increase keep_alive_no_activity_timeout_out on ats-be

https://gerrit.wikimedia.org/r/558984

Change 558985 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] mediawiki::webserver: increase TLS termination keepalive_timeout

https://gerrit.wikimedia.org/r/558985

Change 559053 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disable compress plugin on ats-be

https://gerrit.wikimedia.org/r/559053

Change 559053 merged by Ema:
[operations/puppet@production] ATS: disable compress plugin on ats-be

https://gerrit.wikimedia.org/r/559053

Mentioned in SAL (#wikimedia-operations) [2019-12-19T07:30:26Z] <ema> cp: rolling ats-backend-restart to disable compress plugin everywhere T238494

ema added a comment. Dec 19 2019, 10:45 AM

Today at 7:30ish we've disabled the compress plugin everywhere. It's clearly buggy and introduces ~25-30ms slowdowns at p75 on cache hits. We're gonna have to either fix it or find alternative solutions in the future.

At this point the general p75 ttfb difference between ats-be (cp1079 in the example) and varnish-be (cp1089) in eqiad is ~15ms which isn't dramatic I think. The big dip in the graph is when we disabled compress.

Things are much different in esams, where ttfb for a test article from ats-be takes ~350ms at worst on a reused ats-be<->applayer connection and ~500ms at best on a newly established one. We knew of course that switching from plain-http to TLS would decrease performance in this case.

There are thus two fronts to work on now: (1) increase connection reuse, and (2) decrease the cost of establishing a new connection. There are obvious low-hanging fruits that come to mind for (2): we're currently using weighted round-robin as the load balancing policy for appservers.discovery.wmnet. Instead of that, we should switch to consistent hashing based on the client (ats-be) IP, enable TCP Fast Open, and make sure we're reusing TLS sessions appropriately.
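
For the TCP Fast Open item in (2), the kernel side can at least be checked and toggled with a sysctl (a sketch, assuming Linux defaults; ATS would additionally need to opt in to TFO on its origin-side sockets, and a permanent change would go through puppet rather than sysctl -w):

# Linux bitmask: 1 = client support, 2 = server support, 3 = both
sysctl net.ipv4.tcp_fastopen
sudo sysctl -w net.ipv4.tcp_fastopen=3    # runtime only, not persisted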

(1) is trickier: we are seeing good connection reuse on restbase, currently serving 81 requests per second over 0.25 new connections per second, decent reuse for api (363 rps over 4 cps), and bad reuse for the mw appservers (410 rps over 20 cps). Playing with timeouts might help, and there's already some work ongoing in that direction (see https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/558984/ and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/558985/), but ultimately improving connection reuse for mediawiki will likely require improving the behavior of the applayer too: there's very little ATS can do other than time out and close the connection if, for example, an origin server stops sending output half-way. As another example, there are often situations where mediawiki origins send weird HTTP that causes parse errors:

[PARSE_ERROR] to 10.2.2.1 for 'https://appservers-rw.discovery.wmnet/w/index.php?title=Special:HideBanners&duration=604800&category=fundraising&reason=close'

In those cases too ATS understandably gives up and closes the connection.

Other than the above, it is going to be interesting to explore the option of always keeping a bunch of connections open and ready to be used as soon as requests come in. ATS seems to have the ability to do that (proxy.config.http.origin_min_keep_alive_connections), and we should work on this front next quarter.
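
If we do experiment with that next quarter, a minimal way to try it on a single host could be something like the following (a sketch, not a tested production change: the value 8 is arbitrary, and in practice the setting would be managed via puppet rather than set by hand):

# keep a minimum pool of idle origin connections alive on this ats-be instance
traffic_ctl config set proxy.config.http.origin_min_keep_alive_connections 8
traffic_ctl config reload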

ema added a subscriber: Vgutierrez. Dec 19 2019, 2:29 PM

There are thus two fronts to work on now: (1) increase connection reuse, and (2) decrease the cost of establishing a new connection. There are obvious low-hanging fruits that come to mind for (2): we're currently using weighted round-robin as the load balancing policy for appservers.discovery.wmnet. Instead of that, we should switch to consistent hashing based on the client (ats-be) IP, enable TCP Fast Open, and make sure we're reusing TLS session appropriately.

It turns out that ATS does not support TLS session reuse when acting as a client as found out by @Vgutierrez. Also, paraphrasing @BBlack, chashing is obviously a bad idea if the destination set is larger than the source. The rest of the update still stands.

Change 559816 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Switch to digicert-2019a for eqsin and esams

https://gerrit.wikimedia.org/r/559816

Change 559816 merged by BBlack:
[operations/puppet@production] Switch to digicert-2019a for eqsin and esams

https://gerrit.wikimedia.org/r/559816

Krinkle renamed this task from 15% response start regression starting around 2019-11-11 to 15% response start regression as of 2019-11-11 (Varnish->ATS). Jan 6 2020, 10:06 PM
Krinkle added a comment. Edited Feb 6 2020, 12:35 AM

Looks like we've recovered about 5-6%, but we are still consistently regressed by a significant 9-10% overall at p75. Much more at the median, see below.

Global p75: +9%
e.g. 484ms raised to 535ms (+51ms)

For a change, I also looked at the 50th percentile, and discovered to my surprise that the difference is about the same amount of time there. This makes sense in retrospect, given that the delay is expected to be related to a server-side cost, and should not relate to the user's network latency.

However, at the 50th percentile a +50ms is much more significant. Globally that represents +20%. And in Europe that's perceived as +26%.

Global p50: +20%
e.g. 230ms raised to 278ms (+48ms)
Europe p50: +26%
e.g. 190ms raised to 241ms (+51ms)

Currently open patches:

[operations/puppet@production] mediawiki::webserver: increase TLS termination keepalive_timeout
https://gerrit.wikimedia.org/r/558985

[operations/puppet@production] ATS: increase keep_alive_no_activity_timeout_out on ats-be
https://gerrit.wikimedia.org/r/558984

Krinkle added a subscriber: Volans. Edited Feb 6 2020, 5:18 PM

In an IRC conversation with @Volans we considered whether it's possible that the 50ms regression comes from a change in distribution between cache-hit and cache-miss responses (as opposed to an added cost in the ATS processing for all requests and/or all cache-hit requests).

Below is a breakdown of the p50 and p75 navtiming.responseStart (aka time-to-first-byte, or TTFB), by country and by x-cache-status. The internal XCS response header is transplanted by VCL code into a Server-Timing header which browsers expose to our JS client and is collected together with the navtiming data (VCL code, JS code).
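
As a quick manual spot check of the value being correlated here, the header can also be inspected directly with curl (example URL picked arbitrarily; the exact desc value depends on the VCL above and on which cache layer answers):

curl -sI https://en.wikipedia.org/wiki/Special:BlankPage | grep -i '^server-timing'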

Raw data
2019-11-10
SELECT
nav.event.origincountry, srv.event.description,
PERCENTILE(nav.event.responsestart, 0.50) AS responsestart_p50,
PERCENTILE(nav.event.responsestart, 0.75) AS responsestart_p75,
COUNT(*) AS count
FROM event.navigationtiming AS nav JOIN event.servertiming AS srv ON nav.event.pageviewtoken = srv.event.pageviewtoken
WHERE nav.year = 2019 AND srv.year = 2019 AND nav.month = 11 AND srv.month = 11 AND nav.day = 10 AND srv.day = 10
AND nav.event.isoversample = false GROUP BY nav.event.origincountry,srv.event.description HAVING count > 1000;

origincountry   description     responsestart_p50       responsestart_p75       count
AR      hit-front       690.0   902.0   2516
AT      hit-front       204.0   301.75  1150
AU      hit-front       422.0   718.25  1836
BE      hit-front       152.0   252.25  1060
BR      hit-front       578.0   807.75  3074
BR      pass    815.0   1119.0  2149
CA      hit-front       173.0   294.0   3461
CL      hit-front       579.5   740.0   1106
CO      hit-front       417.0   612.25  1380
CZ      hit-front       201.5   299.75  1270
DE      hit-front       170.0   274.0   9597
DE      hit-local       154.0   249.5   1447
ES      hit-front       261.0   393.0   4555
FR      hit-front       223.0   346.0   6971
FR      hit-local       170.0   312.0   1011
GB      hit-front       150.0   253.0   7446
ID      hit-front       296.0   532.0   2887
IL      hit-front       317.0   457.25  1224
IN      hit-front       464.0   913.0   12147
IR      hit-front       559.0   791.25  1802
IT      hit-front       246.0   402.0   6684
IT      hit-local       201.0   354.0   1353
JP      hit-front       359.0   526.0   10811
KR      hit-front       527.0   790.0   1137
MX      hit-front       336.0   584.0   3367
NL      hit-front       130.0   224.0   1837
PE      hit-front       512.0   722.5   1076
PH      hit-front       344.0   644.0   1537
PL      hit-front       231.0   354.0   4503
RU      hit-front       290.0   491.0   6856
RU      hit-local       244.5   435.25  1060
TW      pass    1101.0  1324.0  1261
UA      hit-front       263.0   417.0   2512
US      hit-front       189.0   338.5   21183
US      hit-local       159.0   302.25  1804
US      miss    447.5   706.25  2102
US      pass    523.0   906.5   1043
Time taken: 72.454 seconds, Fetched: 37 row(s)
2020-01-26
SELECT
nav.event.origincountry, srv.event.description,
PERCENTILE(nav.event.responsestart, 0.50) AS responsestart_p50,
PERCENTILE(nav.event.responsestart, 0.75) AS responsestart_p75,
COUNT(*) AS count
FROM event.navigationtiming AS nav JOIN event.servertiming AS srv ON nav.event.pageviewtoken = srv.event.pageviewtoken
WHERE nav.year = 2020 AND srv.year = 2020 AND nav.month = 1 AND srv.month = 1 AND nav.day = 26 AND srv.day = 26
AND nav.event.isoversample = false GROUP BY nav.event.origincountry,srv.event.description HAVING count > 1000;

origincountry   description     responsestart_p50       responsestart_p75       count
AR      hit-front       749.5   1060.0  1856
AT      hit-front       257.0   521.0   1181
AU      hit-front       449.0   757.25  1980
BR      hit-front       638.0   909.0   4212
BR      miss    796.5   1184.0  1132
CA      hit-front       213.0   368.0   3644
CH      miss    312.0   428.0   1441
CO      hit-front       490.0   767.0   1365
CZ      hit-front       248.0   526.5   1111
DE      hit-front       209.0   449.0   9357
DE      miss    562.0   790.0   2176
ES      hit-front       301.0   543.0   4668
ES      miss    627.0   917.5   1339
FR      hit-front       273.0   548.0   6313
FR      miss    605.0   880.0   1761
GB      hit-front       198.0   417.0   7434
GB      miss    488.5   736.0   1328
GR      hit-front       374.5   692.0   1076
ID      hit-front       282.0   486.0   3065
IL      hit-front       413.0   701.25  1008
IN      hit-front       477.0   1056.0  12384
IR      hit-front       677.0   1104.0  1397
IT      hit-front       312.0   622.0   7439
IT      miss    611.0   909.0   1897
JP      hit-front       366.0   552.0   13099
KR      hit-front       629.0   1156.0  1043
MX      hit-front       361.0   605.0   3173
NL      hit-front       172.0   418.5   1675
PH      hit-front       333.5   615.25  1552
PL      hit-front       274.0   524.5   4283
PL      miss    598.5   856.75  1306
RU      hit-front       385.0   729.5   6419
RU      miss    730.0   1085.0  1823
SE      hit-front       200.0   416.75  1126
TH      hit-front       222.5   365.25  1224
TR      hit-front       846.0   1086.0  1073
TW      pass    1172.0  1436.0  1237
UA      hit-front       314.0   590.0   2267
US      hit-front       222.0   405.0   24359
US      miss    420.0   630.5   4151
US      pass    606.0   1009.0  1165
Time taken: 68.421 seconds, Fetched: 41 row(s)
Comparison

Note that this queries the unfragmented navtiming data, which means it encompasses all wikis, all users (with and without session), all page views (incl some non-articles), all (Grade A) browsers, and all devices (mobile/desktop).

Below is the subset of samples from Germany (DE), France (FR), United Kingdom (GB), and United States (US), split by X-Cache-Status value and limited to rows with 1000 or more samples in the day. This means that combinations absent here (e.g. a country without a miss row) may simply have had fewer than 1000 samples in the unfiltered data.

| Date | Country | X-Cache-Status | TTFB p50 | TTFB p75 | Samples count |
| Sun 10 Nov 2019 | DE | hit-front | 170.0 | 274.0 | 9597 |
| Sun 10 Nov 2019 | DE | hit-local | 154.0 | 249.5 | 1447 |
| Sun 26 Jan 2020 | DE | hit-front | 209.0 | 449.0 | 9357 |
| Sun 26 Jan 2020 | DE | miss | 562.0 | 790.0 | 2176 |

| Date | Country | X-Cache-Status | TTFB p50 | TTFB p75 | Samples count |
| Sun 10 Nov 2019 | FR | hit-front | 223.0 | 346.0 | 6971 |
| Sun 10 Nov 2019 | FR | hit-local | 170.0 | 312.0 | 1011 |
| Sun 26 Jan 2020 | FR | hit-front | 273.0 | 548.0 | 6313 |
| Sun 26 Jan 2020 | FR | miss | 605.0 | 880.0 | 1761 |

| Date | Country | X-Cache-Status | TTFB p50 | TTFB p75 | Samples count |
| Sun 10 Nov 2019 | GB | hit-front | 150.0 | 253.0 | 7446 |
| Sun 26 Jan 2020 | GB | hit-front | 198.0 | 417.0 | 7434 |
| Sun 26 Jan 2020 | GB | miss | 488.5 | 736.0 | 1328 |

| Date | Country | X-Cache-Status | TTFB p50 | TTFB p75 | Samples count |
| Sun 10 Nov 2019 | US | hit-front | 189.0 | 338.5 | 21,183 |
| Sun 10 Nov 2019 | US | hit-local | 159.0 | 302.25 | 1804 |
| Sun 10 Nov 2019 | US | miss | 447.5 | 706.25 | 2102 |
| Sun 10 Nov 2019 | US | pass | 523.0 | 906.5 | 1043 |
| Sun 26 Jan 2020 | US | hit-front | 222.0 | 405.0 | 24,359 |
| Sun 26 Jan 2020 | US | miss | 420.0 | 630.5 | 4151 |
| Sun 26 Jan 2020 | US | pass | 606.0 | 1009.0 | 1165 |

Note that the recent data is from varnish-frontend communicating with ats-be instead of varnish-backend (which might influence how X-Cache-Status is defined). While the more recent data has more miss values, there is no significant decrease in hit values.

And, more importantly, the latency regression of ~50ms is visible even within the hit data.

Krinkle added a comment. Edited Feb 17 2020, 11:43 PM

I've updated some of the navtiming dashboards in Grafana to include a comparison line for "1 year ago" (instead of "3 months ago") because the November 2019 regression had disappeared from our graphs.

See updated Grafana dashboard (by platform) or (globally):

And Grafana dashboard, by continent:

Krinkle updated the task description. Mar 6 2020, 12:35 AM

@ema @Vgutierrez last time we talked about this a month ago, you were in the process of rolling out ATS-TLS changes. I believe the big one was T244464 which you hoped would improve performance.

Since then this has also been rolled out: T245616 as well as TLS 1.3 support on the upload cluster: T170567 as of today.

It doesn't appear that any of these changes have bridged the gap of response time performance this task is about compared to a year ago:

The 15% median regression is still very much there.

Are there any other upcoming performance improvements in the pipeline for our ATS setup in the coming months?

ema added a comment. Apr 7 2020, 5:57 AM

Are there any other upcoming performance improvements in the pipeline for our ATS setup in the coming months?

Valentin and I are currently swamped. Without opening the Traffic dashboard to avoid having a seizure early in the morning, I can tell you our top 3 priorities off the top of my head: (1) trying to bring at least some sanity back into purging (T249325), (2) fixing varnish crashes, and (3) hunting down a memory leak in ATS. There's much more, but you get the idea.

With that being said, you and your team are more than welcome to continue the work I have left off here: T241145#6032819. It's fun and doesn't require much more than time, access to our servers, a bit of C++ knowledge, and an understanding of performance issues. :-)

Gilles added a comment. Oct 9 2020, 9:06 AM

@BBlack when we last discussed the subject of this task in a meeting recently, you mentioned that replacing ats-tls (the "prime suspect" in this long-lasting regression) with something else was being considered by the Traffic team for Q2 for a variety of reasons. Like any major software stack change of that nature we can indeed hope that it will make this issue go away.

I can't find any Traffic task about a potential migration away from ats-tls and when I enquired on IRC I believe @CDanis told me that it had only been discussed briefly as a possibility but that there were no plans yet (@CDanis correct me if I'm mischaracterising what you said). There's no mention of this idea on T244538: ats-tls performance issues under production load either. I'm also not seeing anything about this in Betterworks for Q2.

What's the status of this idea? Is it being researched in some way, where can we follow updates on the matter?