
15% response start regression starting around 2019-11-11
Open, High, Public

Details

Related Gerrit Patches:
[operations/puppet@production] varnishmtail: new buckets for varnishttfb, use seconds for sum
[operations/puppet@production] varnishmtail: install varnishttfb.mtail
[operations/puppet@production] varnishmtail: add varnishttfb.mtail
[operations/puppet@production] varnishmtail: add origin server logging support
[operations/puppet@production] ATS: mark uncacheable responses as 'pass' in X-Cache-Int
[operations/puppet@production] ATS: pass uncacheable requests
[operations/puppet@production] ATS: add varnish-fe <-> ats-be TTFB histogram
[operations/puppet@production] ATS: add trafficserver_backend_client_requests_total to mtail
[operations/puppet@production] ATS: disable text@esams origin server request coalescing
[operations/puppet@production] ATS: allow to toggle request coalescing
[operations/puppet@production] ATS: add client<->ats-be interactions metrics
[operations/puppet@production] Switch to digicert-2019a in esams, temporarily
[operations/puppet@production] ATS: add milestone timing logs, update TTFetchHeaders
[operations/puppet@production] ATS: log Cache-Control and Set-Cookie responses from origin
[operations/puppet@production] ATS: re-use origin server connections for matching IPs
[operations/puppet@production] ATS: revise atsbackend.mtail ttfb buckets
[operations/puppet@production] ATS: add trafficserver_backend_connections_total
[operations/puppet@production] ATS: allow connection reuse for requests with Authorization
[operations/puppet@production] Revert "ATS: explicitly skip the cache instead of hiding CC"
[operations/puppet@production] Revert "ATS: do not coalesce uncacheable requests"
[operations/puppet@production] ATS: do not coalesce uncacheable requests
[operations/puppet@production] ATS: explicitly skip the cache instead of hiding CC
[operations/puppet@production] Revert "cache: reimage cp3064 as text_ats"

Event Timeline


Noteworthy graph found by @ema:

https://grafana.wikimedia.org/d/w4TRwaxZz/local-backend-hitrate-varnish-vs-ats?panelId=4&fullscreen&orgId=1&from=now-30d&to=now

The pattern is very clear on that graph: each server being migrated worsens the situation (the spikes are when migrations happen).

Looking at this, we see that ATS BE caches fill up fairly quickly, in a couple of days:

https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?panelId=5&fullscreen&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3050&var-layer=backend&from=now-30d&to=now

This means that it's not an issue of the cache not being warm enough.

I see these possible simple explanations for a lower hit rate under ATS:

  • There are URLs that Varnish could cache but that ATS cannot cache at all. This might be visible by looking at the URL distribution, but @ema looked at that manually and couldn't see any obvious patterns.
  • ATS has much more storage overhead, resulting in an effective reduction of storage capacity once a host is migrated.
  • ATS's hotness algorithm is much less efficient than Varnish's for our traffic, so some hot objects that should be kept aren't.

To get closer to an answer, it would be useful to compare before and after migration, or a running Varnish BE vs. an ATS BE of the same size, on:

  • Total objects stored
  • Object size distribution
  • Object age distribution

Lastly, is there any documentation on how ATS decides whether to keep an object in cache, compared to how Varnish does?
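
As a rough starting point for the first of those comparisons (total objects stored), the stock counters on each host could be sampled; the metric names below are assumptions based on standard Varnish and ATS statistics, not on anything specific to our setup:

# On a varnish-be host: number of objects currently stored (stock varnishstat counter).
varnishstat -1 -f MAIN.n_object

# On an ats-be host: cache directory entries in use (stock ATS metric, roughly comparable).
traffic_ctl metric get proxy.process.cache.direntries.used

Object size and age distributions would still need per-object data, which neither counter provides.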

Change 552825 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "cache: reimage cp3064 as text_ats"

https://gerrit.wikimedia.org/r/552825

Change 552825 merged by Ema:
[operations/puppet@production] Revert "cache: reimage cp3064 as text_ats"

https://gerrit.wikimedia.org/r/552825

Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts:

['cp3064.esams.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201911251448_ema_128022.log.

Completed auto-reimage of hosts:

['cp3064.esams.wmnet']

Of which those FAILED:

['cp3064.esams.wmnet']

Mentioned in SAL (#wikimedia-operations) [2019-11-25T15:22:58Z] <ema> cp3064 manual reboot after wmf-auto-reimage error: 'Unable to run wmf-auto-reimage-host: Failed to reboot_host' T238494

Mentioned in SAL (#wikimedia-operations) [2019-11-25T15:36:09Z] <ema> cp3064 create filesystem on /dev/nvme0n1p1 (see https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552547/) and reboot T238494

ema added a comment. Mon, Nov 25, 3:46 PM

Among the things I've checked to rule out obvious mistakes porting VCL to Lua:

  • Cookie responses without "session" or "token" are cached as expected
  • There are very few requests with non-oauth Authorization headers

Now that cp3064 has been reimaged back to varnish-be, let's pool it and see if there are obvious differences. Let's also start logging Cache-Control on ats-be as received from the origins.
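
Once that logging is in place, a quick way to eyeball the distribution of origin Cache-Control values might be something along these lines; the OriginCacheControl field name is a placeholder for whatever name the patch ends up using, and only the first token of each value is counted (enough to spot private/no-cache):

timeout 30 atslog-backend | grep -o 'OriginCacheControl:[^ ]*' | sort | uniq -c | sort -rn | head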

Change 552853 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: log Cache-Control as received from the origin

https://gerrit.wikimedia.org/r/552853

Change 552862 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: do not coalesce uncacheable requests

https://gerrit.wikimedia.org/r/552862

ema added a comment. Tue, Nov 26, 1:56 PM

Seems like the varnish re-imaging and repooling of cp3064 helped:

Interestingly, the number of cache hits on cp3064 is not at all higher than on the ATS nodes; rather, it's comparable:

My theory at this point is that what hurts performance is request coalescing for uncacheable content, which https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552862/ should partially address. The patch ensures that we disable request coalescing when we know, by looking at request data alone, that the response is going to be uncacheable. Let's see if that helps.
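
As a rough sanity check of how much traffic such a request-only criterion can cover, something like the following could be run on a varnish-fe; the header checks mirror the ones listed above and are an assumption, not the exact condition implemented in the patch:

# Share of client requests that look uncacheable from request data alone:
# an Authorization header, or a Cookie containing "session"/"token".
timeout 10 varnishncsa -n frontend -F '%{Authorization}i %{Cookie}i' |
    awk '$1 != "-" || tolower($0) ~ /session|token/ {n++} END {if (NR) printf "%d/%d (%.1f%%)\n", n, NR, 100*n/NR}'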

Change 552076 merged by Ema:
[operations/puppet@production] ATS: explicitly skip the cache instead of hiding CC

https://gerrit.wikimedia.org/r/552076

Change 552862 merged by Ema:
[operations/puppet@production] ATS: do not coalesce uncacheable requests

https://gerrit.wikimedia.org/r/552862

Change 553123 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: explicitly skip the cache instead of hiding CC"

https://gerrit.wikimedia.org/r/553123

Change 553125 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: do not coalesce uncacheable requests"

https://gerrit.wikimedia.org/r/553125

Change 553125 merged by Ema:
[operations/puppet@production] Revert "ATS: do not coalesce uncacheable requests"

https://gerrit.wikimedia.org/r/553125

Change 553123 merged by Ema:
[operations/puppet@production] Revert "ATS: explicitly skip the cache instead of hiding CC"

https://gerrit.wikimedia.org/r/553123

Change 553132 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disable coalescing for some uncacheable requests

https://gerrit.wikimedia.org/r/553132

Mentioned in SAL (#wikimedia-operations) [2019-11-26T16:01:25Z] <ema> cp3050: temporarily disable request coalescing to assess performance impact T238494

ema added a comment. Tue, Nov 26, 5:17 PM

The experiment with request coalescing disabled on cp3050 is running. Meanwhile I've noticed something likely interesting about the cost of a cache miss in our ATS setup compared to Varnish.

I wanted to compare the average TTFetchHeaders in milliseconds, over ten 10-second samples of cache misses, on both. On cp3050 (ATS) I've used:

for i in `seq 10`; do timeout 10 atslog-backend OriginStatus:200 | sed 's/TTFetchHeaders://' | awk '$5 > 0 { print $5 }' > ttfh.log ; awk '{sum+=$1} END {print sum/NR}' < ttfh.log; done

On cp3064 (Varnish):

for i in `seq 10`; do timeout 10 varnishncsa -q 'Berespstatus eq 200' -b -F "%{VSL:Timestamp:Beresp[3]}x" | awk '{print $1 * 1000.0}' > ttfh.log ; awk '{sum+=$1} END {print sum/NR}' < ttfh.log; done

Varnish seems to be doing better (min 207.629, max 220.99, avg 213.922) than ATS (min 276.688, max 310.805, avg 290.324), all in milliseconds.

Actual timings for the sake of completeness:

Varnish: 214.231, 213.997, 208.831, 212.544, 214.946, 207.629, 220.99, 218.841, 211.995, 215.216
ATS: 285.591, 284.163, 281.787, 297.535, 289.144, 294.359, 290.53, 276.688, 310.805, 292.639
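
For the record, the min/max/avg figures above can be reproduced from the raw samples with:

printf '%s\n' 214.231 213.997 208.831 212.544 214.946 207.629 220.99 218.841 211.995 215.216 |
    awk 'NR==1 {min=max=$1} {sum+=$1; if ($1<min) min=$1; if ($1>max) max=$1} END {print "min", min, "max", max, "avg", sum/NR}'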

As a rough but perhaps even more telling test:

17:07:32 ema@cp3050.esams.wmnet:~
$ (time curl -v -H "Authorization: skip-the-cache" http://localhost:3128/wiki/Main_Page -H "Host: en.wikipedia.org") 2>&1 | grep real
real    0m0.651s
17:08:07 ema@cp3064.esams.wmnet:~
$ (time curl -v -H "Authorization: skip-the-cache" http://localhost:3128/wiki/Main_Page -H "Host: en.wikipedia.org") 2>&1 | grep real
real    0m0.259s

Among other things we might want to look into: on cp3050 there are currently about 80 cache misses per second being served without connection reuse (see sudo timeout 1 atslog-backend OriginStatus:200 | grep -c ConnReuse:0). Each of those needs to establish a TCP connection to eqiad plus a TLS handshake, probably one of the reasons for the slowdown.
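
To put the ConnReuse:0 figure in context, the overall reuse ratio over a short sample can be broken down from the same log field (values other than 0/1 would also show up here if present):

# Reused vs. newly established origin connections among origin fetches over 10 seconds.
sudo timeout 10 atslog-backend OriginStatus:200 | grep -oE 'ConnReuse:[0-9]+' | sort | uniq -c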

Mentioned in SAL (#wikimedia-operations) [2019-11-27T09:25:00Z] <ema> cp3050: re-enable request coalescing after performance experiment T238494

Mentioned in SAL (#wikimedia-operations) [2019-11-27T16:06:56Z] <ema> cp3050: set proxy.config.http.server_session_sharing.match to "ip" T238494

https://grafana.wikimedia.org/d/000000143/navigation-timing?refresh=5m&orgId=1&from=now-30d&to=now&var-source=navtiming2&var-metric=responseStart&var-percentile=p50

Latency remains elevated. Do we have a status update or a better idea about the root cause? If so, have we considered reverting part of it (if possible) and/or documenting why that is complicated/undesirable? (No big deal either way, I just want to make sure any progress/decision is recorded here.)

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).
Krinkle moved this task from Inbox to Blocked or Needs-CR on the Performance-Team board.

Change 553490 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: re-use origin server connections for matching IPs

https://gerrit.wikimedia.org/r/553490

Change 553696 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: allow connection reuse for requests with Authorization

https://gerrit.wikimedia.org/r/553696

Change 553696 merged by Ema:
[operations/puppet@production] ATS: allow connection reuse for requests with Authorization

https://gerrit.wikimedia.org/r/553696

ema added a comment. Fri, Nov 29, 3:01 PM

Latency remains elevated. Do we have a status update or a better idea about the root cause? If so, have we considered reverting part of it (if possible) and/or documenting why that is complicated/undesirable? (No big deal either way, I just want to make sure any progress/decision is recorded here.)

We did not single out one specific cause, but there are multiple functional differences between our varnish-be and ats-be setups, some of which still need to be investigated in performance terms. For instance, cache miss/pass traffic on an esams varnish-be results in an HTTP request hitting another varnish-be in eqiad. The latter does compress content, hence the response body sent across the ocean is compressed. With ATS we hit the origins in eqiad directly, and cannot just send Accept-Encoding due to applayer issues such as T125938, which might be solved by now but needs investigation. Hence, for now the response body is sent uncompressed between the eqiad origins and esams ATS. Another line of work is around TCP connection reuse, which is particularly important in our ATS setup as we use TLS to connect to the origins, making the price of establishing a new connection higher than in varnishland. See https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/553490/ which I'm planning on merging next week.

Given that we do have various rational paths to follow to study and eventually fill the performance gap, we consider the benefits of ATS (no more random outages with 503 storms, no need to restart the service every 3.5 days, removal of ipsec, ...) to be significant enough not to revert the migration in order to improve performance in the short term.

Krinkle added a comment (edited). Fri, Nov 29, 9:31 PM

If we expect to fix it reasonably soon, I suppose it's indeed not worth reverting over.

I do have a gut feeling, though, that these two examples you mention (uncompressed transfers and backend connection reuse) cannot (should not?) explain the latency regression we're seeing, because they both relate to cache misses, which are only supposed to be a small proportion of traffic, generally too small to affect the 50th and 75th percentiles we monitor.

Change 553894 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add trafficserver_backend_connections_count

https://gerrit.wikimedia.org/r/553894

Change 553894 merged by Ema:
[operations/puppet@production] ATS: add trafficserver_backend_connections_total

https://gerrit.wikimedia.org/r/553894

Change 554062 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: revise atsbackend.mtail ttfb buckets

https://gerrit.wikimedia.org/r/554062

Change 554062 merged by Ema:
[operations/puppet@production] ATS: revise atsbackend.mtail ttfb buckets

https://gerrit.wikimedia.org/r/554062

Change 553490 merged by Ema:
[operations/puppet@production] ATS: re-use origin server connections for matching IPs

https://gerrit.wikimedia.org/r/553490

Mentioned in SAL (#wikimedia-operations) [2019-12-02T14:31:13Z] <ema> cp-ats: merge server_session_sharing.match=2 (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/553490/) with puppet disabled, test on cp3050 T238494

Mentioned in SAL (#wikimedia-operations) [2019-12-02T14:46:24Z] <ema> cp-ats: set server_session_sharing.match=2 everywhere (puppet re-enable and run) T238494

Mentioned in SAL (#wikimedia-traffic) [2019-12-02T16:41:03Z] <ema> cp3050: ats-be restart with proxy.config.http.server_session_sharing.pool=global T238494

Mentioned in SAL (#wikimedia-operations) [2019-12-03T05:50:03Z] <ema> cp3050: ats-be restart with proxy.config.http.server_session_sharing.pool=thread T238494

Change 552853 merged by Ema:
[operations/puppet@production] ATS: log Cache-Control and Set-Cookie responses from origin

https://gerrit.wikimedia.org/r/552853

Mentioned in SAL (#wikimedia-operations) [2019-12-03T08:35:56Z] <ema> cp3050: set cache.max_open_read_retries=-1 and proxy.config.http.cache.max_open_write_retries=1 (default values) T238494

ema added a comment. Tue, Dec 3, 10:28 AM

I do have a gut feeling, though, that these two examples you mention (uncompressed transfers and backend connection reuse) cannot (should not?) explain the latency regression we're seeing, because they both relate to cache misses, which are only supposed to be a small proportion of traffic, generally too small to affect the 50th and 75th percentiles we monitor.

Well, esams text traffic hitting the origins is between 10% and 25% depending on the time of day, which isn't really a small proportion. Although we don't have many misses, we do have loads of pass traffic due to uncacheable responses.

Anyway, as an update: yesterday, 2019-12-02 around 14:00 UTC, I merged a change that improved origin connection reuse, from ~20 cps to ~3 against the api servers and from ~50 cps to ~20 against the appservers:

We'll see if the change has any positive impact during today's esams peak traffic hours (15:00).

During yesterday's traffic meeting @BBlack and I discussed reverting the TLS cert changes mentioned earlier (T238494#5672180), something which should have a positive impact on responseStart. Hopefully we'll manage to do that this week!

Change 554291 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Switch to digicert-2019a in esams, temporarily

https://gerrit.wikimedia.org/r/554291

Mentioned in SAL (#wikimedia-operations) [2019-12-03T13:56:47Z] <ema> cp-esams: disable puppet in preparation of digicert-2019a cert switch https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554291/ T238494

Change 554291 merged by Ema:
[operations/puppet@production] Switch to digicert-2019a in esams, temporarily

https://gerrit.wikimedia.org/r/554291

Mentioned in SAL (#wikimedia-operations) [2019-12-03T14:00:19Z] <ema> cp3050: depool and switch to digicert-2019a T238494

Mentioned in SAL (#wikimedia-operations) [2019-12-03T14:06:14Z] <ema> repool cp3050 with digicert-2019a T238494

Mentioned in SAL (#wikimedia-operations) [2019-12-03T14:13:37Z] <ema> cp-esams: re-enable puppet, switch to digicert-2019a certs https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554291/ T238494

Mentioned in SAL (#wikimedia-operations) [2019-12-03T14:24:37Z] <ema> all cp-esams hosts switched to digicert-2019a certs T238494

Gilles added a comment. Wed, Dec 4, 8:24 AM

As expected, the SSL cert rollback undoes all of the TLS handshake regression:


https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1&fullscreen&panelId=54&from=now-90d&to=now&var-metric=ssl&var-location=All&var-prop=p75

But that only makes a small dent in the TTFB regression:


https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1&fullscreen&panelId=54&from=now-90d&to=now&var-metric=responseStart&var-location=All&var-prop=p75

The bulk of the extra latency associated with ATS remains (although the recent efforts by @ema seem to have brought gradual improvements as well).

TTFB/responseStart in Europe/esams is still 15% slower than it used to be prior to the ATS migration.

Gilles renamed this task from "200ms / 50% response start regression starting around 2019-11-11" to "15% response start regression starting around 2019-11-11". Wed, Dec 4, 8:26 AM

Change 554483 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add milestone timing to logs, update TTFetchHeaders

https://gerrit.wikimedia.org/r/554483

Change 554483 merged by Ema:
[operations/puppet@production] ATS: add milestone timing logs, update TTFetchHeaders

https://gerrit.wikimedia.org/r/554483

Change 554503 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add atsbackendttfb.mtail

https://gerrit.wikimedia.org/r/554503

Change 554503 merged by Ema:
[operations/puppet@production] ATS: add client<->ats-be interactions metrics

https://gerrit.wikimedia.org/r/554503

Change 554558 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: allow to toggle request coalescing

https://gerrit.wikimedia.org/r/554558

Change 554570 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add trafficserver_backend_client_requests_total to mtail

https://gerrit.wikimedia.org/r/554570

Change 554558 merged by Ema:
[operations/puppet@production] ATS: allow to toggle request coalescing

https://gerrit.wikimedia.org/r/554558

Change 554808 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disable text@esams origin server request coalescing

https://gerrit.wikimedia.org/r/554808

Change 554808 merged by Ema:
[operations/puppet@production] ATS: disable text@esams origin server request coalescing

https://gerrit.wikimedia.org/r/554808

Mentioned in SAL (#wikimedia-operations) [2019-12-05T09:54:17Z] <ema> text@esams: disable ats-be origin server request coalescing T238494

Change 554863 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add varnish-fe <-> ats-be TTFB histogram

https://gerrit.wikimedia.org/r/554863

Change 554570 abandoned by Ema:
ATS: add trafficserver_backend_client_requests_total to mtail

Reason:
Made obsolete by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554863/

https://gerrit.wikimedia.org/r/554570

Change 554863 merged by Ema:
[operations/puppet@production] ATS: add varnish-fe <-> ats-be TTFB histogram

https://gerrit.wikimedia.org/r/554863

ema added a comment (edited). Fri, Dec 6, 10:04 AM

As a way to identify more specifically where the TTFB regression comes from, and in particular to understand precisely how much ats-be contributes to it, I thought of analyzing TTFB as seen by varnish-fe when acting as a client (that is: when varnish-fe fetches objects from its backends, which can be either ats-be or varnish-be).

To this end I have gathered 60 seconds of fetches initiated by varnish-fe (miss/pass traffic from the varnish-fe point of view, in other words non-hits at the frontend layer) and printed the TTFB as well as X-Cache-Int. Note the -b argument: "only backend requests trigger log lines".

timeout 60 varnishncsa -n frontend -b -F '%{Varnish:time_firstbyte}x %{X-Cache-Int}o' > x.log

For each cache_text backend in esams I have then computed the median, p75 and p95 TTFB in milliseconds:

for host in cp3050 cp3052 cp3054 cp3056 cp3058 cp3060 cp3062 cp3064 ; do
    printf "$host "; sed -E 's/(hit|miss|pass|,|\/[0-9]+)//g' x.log | awk '{print $1*1000, $NF}' |
        grep $host | sort -n | awk '{v[NR]=$1} END {print int(v[int(NR*.5)]), int(v[int(NR*.75)]), int(v[int(NR*.95)])}'
done

The results follow; cp3064 is the only varnish-be, and the last column shows the p75 difference between ATS and Varnish, which is consistently ~60-70 ms. If we add the ~40 ms introduced by the TLS cert change (T238494#5711289), I think we get pretty close to the whole responseStart regression tracked here.

| host | p50 | p75 | p95 | software | p75 diff |
| cp3050 | 163 | 252 | 624 | ats | 69 |
| cp3052 | 165 | 251 | 480 | ats | 68 |
| cp3054 | 160 | 248 | 468 | ats | 65 |
| cp3056 | 162 | 247 | 473 | ats | 64 |
| cp3058 | 164 | 254 | 482 | ats | 71 |
| cp3060 | 162 | 255 | 514 | ats | 72 |
| cp3062 | 158 | 244 | 469 | ats | 61 |
| cp3064 | 118 | 183 | 381 | varnish | 0 |

The results above include both backend miss/pass and backend hits, but there are also very large differences when it comes to backend hits (we filter with grep hit, not perfect as that also includes remote hits for cp3064, but good enough given that it's p75):

for host in cp3050 cp3052 cp3054 cp3056 cp3058 cp3060 cp3062 cp3064 ; do
    printf "|$host "; grep hit x.log | sed -E 's/(hit|miss|pass|,|\/[0-9]+)//g'  | awk '{print $1*1000, $NF}' |
        grep $host | sort -n | awk '{v[NR]=$1} END {print "|", v[int(NR*.5)], "|", v[int(NR*.75)], "|", v[int(NR*.95)]}'; done
| host | p50 | p75 | p95 | software | p75 diff |
| cp3050 | 9.148 | 12.264 | 62.599 | ats | 11.729 |
| cp3052 | 9.418 | 12.518 | 51.579 | ats | 11.983 |
| cp3054 | 8.835 | 12.008 | 54.231 | ats | 11.473 |
| cp3056 | 8.898 | 13.75 | 53.285 | ats | 13.215 |
| cp3058 | 8.978 | 11.736 | 48.262 | ats | 11.201 |
| cp3060 | 8.792 | 12.053 | 50.347 | ats | 11.518 |
| cp3062 | 8.84 | 12.208 | 48.361 | ats | 11.673 |
| cp3064 | 0.351 | 0.535 | 0.875 | varnish | 0 |

The next step is to understand why ATS introduces this visible performance penalty. As the lowest-hanging fruit, I have tried logging TS_MILESTONE_PLUGIN_ACTIVE (time running plugin code) and TS_MILESTONE_PLUGIN_TOTAL (time spent in or waiting for plugins) on cp3050; see the docs for TSHttpTxnMilestoneGet. Out of 120k requests, only 3 had either PLUGIN_ACTIVE or PLUGIN_TOTAL greater than 0 (two requests had 1ms for both, and one had 2ms). This seems to indicate that plugins, including Lua, are not to blame.
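
For reference, a count like the one above could be produced with a filter along these lines; the PluginTotal field name is an assumption about how the milestones end up being named in our log format:

timeout 60 atslog-backend | grep -oE 'PluginTotal:[0-9]+' | awk -F: '$2 > 0 {n++} END {print n+0 "/" NR}'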

Could you generate a separate stats table for misses and passthroughs?

ema added a comment. Fri, Dec 6, 10:25 AM

Could you generate a separate stats table for misses and passthroughs?

Certainly. Non-hits (filtered with grep -v hit):

| host | p50 | p75 | p95 | software | p75 diff |
| cp3050 | 184 | 271 | 675 | ats | 42 |
| cp3052 | 187 | 271 | 507 | ats | 42 |
| cp3054 | 180 | 268 | 498 | ats | 39 |
| cp3056 | 183 | 268 | 517 | ats | 39 |
| cp3058 | 187 | 272 | 525 | ats | 43 |
| cp3060 | 185 | 276 | 554 | ats | 47 |
| cp3062 | 178 | 261 | 509 | ats | 32 |
| cp3064 | 159 | 229 | 492 | varnish | 0 |

I meant specifically misses (ATS/Varnish did a lookup and didn't find the object) vs passthroughs (ATS/Varnish merely acted as a proxy), separately. One table for each. Is it possible to separate them?

ema added a comment. Fri, Dec 6, 10:38 AM

I meant specifically misses (ATS/Varnish did a lookup and didn't find the object) vs passthroughs (ATS/Varnish merely acted as a proxy), separately. One table for each. Is it possible to separate them?

Not for ATS; at the moment they're all reported as miss. I was planning on fixing that eventually though, and this seems like the right time. I assume you're asking in order to check whether cache lookups can contribute to the performance penalty? We do log the precise CacheReadTime, so that's definitely something we can analyze.
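
A first look could reuse the percentile idiom from the tables above, assuming the field is logged as CacheReadTime:<milliseconds>:

timeout 60 atslog-backend | grep -oE 'CacheReadTime:[0-9.]+' | cut -d: -f2 |
    sort -n | awk '{v[NR]=$1} END {print "p50", v[int(NR*.5)], "p75", v[int(NR*.75)], "p95", v[int(NR*.95)]}'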

Yes, from a cache application perspective they are different tasks, and therefore the issues affecting each could have different causes. If the overhead is high for passthrough as well, that's particularly suspicious, as ATS shouldn't be doing anything other than running a regex to figure out that it's a passthrough path and then proxying things transparently, right? Or does it do some kind of decompression/recompression or any other type of processing for passthrough requests?

Change 555396 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: mark uncacheable responses as 'pass' in X-Cache-Int

https://gerrit.wikimedia.org/r/555396

Change 553132 merged by Ema:
[operations/puppet@production] ATS: pass uncacheable requests

https://gerrit.wikimedia.org/r/553132

Mentioned in SAL (#wikimedia-operations) [2019-12-06T13:41:47Z] <ema> cp2004: adding do_global_ doesn't seem to work with reload, restart ats-be T238494

Change 555396 abandoned by Ema:
ATS: mark uncacheable responses as 'pass' in X-Cache-Int

Reason:
The idea here is wrong. Thanks to hit-for-pass, Varnish can skip cache lookups and coalescing for responses that are known to be uncacheable. In ATS hit-for-pass does not exist, hence it would be wrong to label as "pass" things for which we did look up the cache.

https://gerrit.wikimedia.org/r/555396

ema added a comment. Fri, Dec 6, 4:45 PM

We can now distinguish between hit, miss, and pass in text@esams ATS too.

An important caveat when looking at these numbers is that Varnish supports hit-for-pass, essentially allowing much more traffic to be passed by remembering that responses to certain requests were uncacheable (notably, those with Cache-Control: private/no-cache). ATS does not support such a feature, hence we can only decide to pass based on the request alone. This explains the difference between the number of pass requests (column "pass n") and misses ("miss n"): the former is much higher on Varnish, while misses are higher on ATS.

| host | hit p75 | hit diff | hit n | miss p75 | miss diff | miss n | pass p75 | pass diff | pass n |
| cp3050 | 11.817 | 11.359 | 944 | 377.978 | 73.85 | 5569 | 398.128 | 144.747 | 937 |
| cp3052 | 12.193 | 11.735 | 886 | 350.65 | 46.522 | 5478 | 380.078 | 126.697 | 879 |
| cp3054 | 11.971 | 11.513 | 731 | 363.512 | 59.384 | 4547 | 397.028 | 143.647 | 846 |
| cp3056 | 14.91 | 14.452 | 784 | 382.746 | 78.618 | 4782 | 412.411 | 159.03 | 794 |
| cp3058 | 12.345 | 11.887 | 877 | 368.436 | 64.308 | 5811 | 407.02 | 153.639 | 958 |
| cp3060 | 11.849 | 11.391 | 755 | 357.015 | 52.887 | 4787 | 393.6 | 140.219 | 878 |
| cp3062 | 11.749 | 11.291 | 763 | 348.7 | 44.572 | 4969 | 377.698 | 124.317 | 816 |
| cp3064 | 0.458 | 0 | 2647 | 304.128 | 0 | 1539 | 253.381 | 0 | 3778 |

Some initial observations:

  • On cp3064 (Varnish) things look reasonable: cache hits are very cheap, misses are the most expensive (cache lookup needed) and pass is cheaper than miss
  • On ATS hits are cheaper than anything else (luckily!) but there's a ~11ms slowdown compared to Varnish which may or may not be something to worry about at this stage given the low hitrate at the backend layer
  • Pass is more expensive than miss (!) and both introduce a significant regression compared to Varnish

Next week I am planning to study the overhead introduced by ATS. I was thinking of defining a new metric, proxy_overhead = client_ttfb - origin_ttfb, where client_ttfb is the number of milliseconds between the moment ATS starts processing the request from varnish-fe and the moment the first byte of the response to varnish-fe is written, and origin_ttfb is the time between the moment ATS first tries to contact the origin (or recycle a connection) and the moment the first byte from the origin is received. The difference between the two should be a useful indicator of how much overhead ATS (or a proxy in general) introduces.
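
A minimal sketch of how that metric could be computed from the logs before wiring it into mtail, assuming the two timings end up being logged per request as ClientTTFB:<ms> and OriginTTFB:<ms> (both field names hypothetical) and always appear in that order:

# proxy_overhead = client_ttfb - origin_ttfb, per request.
timeout 60 atslog-backend | grep -oE '(ClientTTFB|OriginTTFB):[0-9.]+' | paste - - |
    awk -F'[: \t]+' '{print $2 - $4}' |
    sort -n | awk '{v[NR]=$1} END {print "p50", v[int(NR*.5)], "p75", v[int(NR*.75)]}'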

ema added a comment (edited). Mon, Dec 9, 7:58 AM

I repeated the analysis just now, capturing requests for 60s. The numbers don't look as bad. p75 TTFB in milliseconds:

| host | hit | miss | pass |
| cp3062 (ATS) | 12.385 | 266.964 | 226.537 |
| cp3064 (Varnish) | 0.421 | 268.309 | 205.125 |
| diff | 11.964 | -1.3 | 21 |

We should add this data to Prometheus to try to draw useful conclusions; see T240180.

Gilles added a comment. Mon, Dec 9, 8:37 AM

We're still seeing an extra 100-150ms on the p75 TTFB reported by clients in Europe compared to before 2019-11-11, only 20ms of which can be attributed to TLS.

Change 555909 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmtail: add origin server logging support

https://gerrit.wikimedia.org/r/555909

Change 555909 merged by Ema:
[operations/puppet@production] varnishmtail: add origin server logging support

https://gerrit.wikimedia.org/r/555909

Change 555930 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmtail: add varnishttfb.mtail

https://gerrit.wikimedia.org/r/555930

Change 555930 merged by Ema:
[operations/puppet@production] varnishmtail: add varnishttfb.mtail

https://gerrit.wikimedia.org/r/555930

Change 555953 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmtail: install varnishttfb.mtail

https://gerrit.wikimedia.org/r/555953

Change 555953 merged by Ema:
[operations/puppet@production] varnishmtail: install varnishttfb.mtail

https://gerrit.wikimedia.org/r/555953

Change 556139 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmtail: new buckets for varnishttfb, use seconds for sum

https://gerrit.wikimedia.org/r/556139

Change 556139 merged by Ema:
[operations/puppet@production] varnishmtail: new buckets for varnishttfb, use seconds for sum

https://gerrit.wikimedia.org/r/556139