The ATS hit rate is currently being hurt by the combination of Vary:Cookie and non-session cookies like WMF-Last-Access or GeoIP
|Resolved||Vgutierrez||T316338 strip non session cookies before cache lookup in ATS|
|Resolved||Vgutierrez||T316337 Phabricator was logging out users repeatedly (2022-08-26)|
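To illustrate why non-session cookies hurt the hit rate when a response carries Vary:Cookie, here is a minimal Python sketch. It is not the actual ATS configuration: the session cookie pattern, the toy cache key function, and the example cookie values are all assumptions made for illustration.

```python
import hashlib
import re

# Assumption: cookies whose name ends in "Session"/"session" or "Token"
# are treated as session cookies; everything else is considered noise.
SESSION_COOKIE_RE = re.compile(r"([sS]ession|Token)$")


def strip_non_session_cookies(cookie_header: str) -> str:
    """Return the Cookie header with only session cookies kept."""
    kept = []
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        name = pair.split("=", 1)[0]
        if SESSION_COOKIE_RE.search(name):
            kept.append(pair)
    return "; ".join(kept)


def cache_key(url: str, cookie_header: str) -> str:
    """Toy cache key for a response with Vary: Cookie.

    The whole Cookie header value becomes part of the key, so any
    difference in any cookie produces a separate cache object.
    """
    return hashlib.sha256(f"{url}\n{cookie_header}".encode()).hexdigest()


url = "https://en.wikipedia.org/wiki/Main_Page"
# Two anonymous users who differ only in non-session cookies (made-up values).
anon_a = "WMF-Last-Access=26-Aug-2022; GeoIP=NL:NH:Amsterdam"
anon_b = "WMF-Last-Access=27-Aug-2022; GeoIP=SG:::Singapore"

# Without stripping, the two requests map to different cache objects.
miss = cache_key(url, anon_a) != cache_key(url, anon_b)

# After stripping, both collapse to the same (cookie-less) cache object.
hit = cache_key(url, strip_non_session_cookies(anon_a)) == cache_key(
    url, strip_non_session_cookies(anon_b)
)
```

In this sketch a logged-in request such as `enwikiSession=abc; GeoIP=NL` keeps its session cookie after stripping, so it still gets its own cache key and is never served an anonymous user's object.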
Change 828564 abandoned by Vgutierrez:
[operations/puppet@production] trafficserver: Replace session cookies with Token=1 iff V:C isn't there
We don't need this in ATS-land, as it already caches requests with cookies by default. The lack of Vary:Cookie in the response is enough.
This seems to be working and not breaking anything :). As a direct result, the cache hit rate shows up to a 100% increase in the text cluster at the ATS layer: https://grafana.wikimedia.org/goto/acc0K6W4z?orgId=1
Images for future reference, taken from https://grafana.wikimedia.org/d/O2sTrqZVk/backend-layer-performance?orgId=1&var-site=All&var-cluster=text&from=1661385600000&to=1662840000000. I created an annotation (tagged: operations, performance) with a link to this task, to make it easier to correlate on other dashboards.
The latency improvement is huge. In Eqsin, p75 latency improved by roughly 25%, e.g. from 475ms down to 350ms compared with the same time and day a week earlier. That's a 125ms drop!
In terms of cache effectiveness, internal cache hits went up from 600 to 1200 req/s in Esams, and from 300 to 600 req/s in Eqsin.
Naturally, the reported cache hit ratio has now doubled in some regions, e.g. from 2% up to 4% in Codfw. Note that our frontend cache hit ratio is and remains way higher, at around 89% overall (up to 99.9% for ResourceLoader). This improvement is specifically at the ATS backend, our second layer of caching.
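For anyone wanting to double-check the figures quoted above, the relative changes can be recomputed with a few lines of Python. This is only a sketch over the numbers from this comment; the `relative_change` helper and the labels are made up for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures as quoted in this comment (before, after).
figures = {
    "Eqsin p75 latency (ms)": (475, 350),
    "Esams internal cache hits (req/s)": (600, 1200),
    "Eqsin internal cache hits (req/s)": (300, 600),
    "Codfw reported cache hit ratio (%)": (2, 4),
}

for label, (before, after) in figures.items():
    change = relative_change(before, after)
    print(f"{label}: {before} -> {after} ({change:+.1f}%)")
```

Note that the Codfw hit ratio going from 2% to 4% is a 100% relative increase but only 2 percentage points in absolute terms.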