ATS hitrate is currently being impacted by Vary:Cookie + non session cookies like WMF-Last-Access or GeoIP
Description
Details
Status | Assigned | Task
---|---|---
Resolved | Vgutierrez | T316338 strip non session cookies before cache lookup in ATS
Resolved | Vgutierrez | T316337 Phabricator was logging out users repeatedly (2022-08-26)
Event Timeline
Change 826785 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] trafficserver: Hide non session cookies during cache lookup
An initial test of https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785/6/modules/profile/files/trafficserver/default.lua (PS6) in cp6016 triggered T316337
Change 826866 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] varnish: Emit X-Varnish-Cluster for misc sites
Change 826866 merged by Vgutierrez:
[operations/puppet@production] varnish: Emit X-Varnish-Cluster for misc sites
Mentioned in SAL (#wikimedia-operations) [2022-08-29T08:55:40Z] <vgutierrez> test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338 T316337
Change 826785 merged by Vgutierrez:
[operations/puppet@production] trafficserver: Hide non session cookies during cache lookup
Mentioned in SAL (#wikimedia-operations) [2022-08-29T10:09:10Z] <vgutierrez> test trafficserver: Hide non session cookies during cache lookup in drmrs - T316338 T316337
Mentioned in SAL (#wikimedia-operations) [2022-08-29T12:14:20Z] <vgutierrez> rolling restart of ats-be fleet wide to apply "Hide non session cookies during cache lookup" - T316338 T316337
Reverted by https://gerrit.wikimedia.org/r/c/operations/puppet/+/827566, which lacked the Bug: line that bots need in order to report it here.
Change 828002 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] trafficserver: Hide non session cookies during cache lookup
Mentioned in SAL (#wikimedia-operations) [2022-08-31T08:12:03Z] <vgutierrez> test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338
Mentioned in SAL (#wikimedia-operations) [2022-08-31T08:20:05Z] <vgutierrez> end test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338
Mentioned in SAL (#wikimedia-operations) [2022-08-31T11:04:09Z] <vgutierrez> test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338
Change 828002 merged by Vgutierrez:
[operations/puppet@production] trafficserver: Hide non session cookies during cache lookup
Mentioned in SAL (#wikimedia-operations) [2022-08-31T12:57:22Z] <vgutierrez> test trafficserver: Hide non session cookies during cache lookup in drmrs - T316338
Mentioned in SAL (#wikimedia-operations) [2022-08-31T14:08:50Z] <vgutierrez> deploy trafficserver: Hide non session cookies during cache lookup globally - T316338
Change 828564 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] trafficserver: Replace session cookies with Token=1 iff V:C isn't there
Change 828564 abandoned by Vgutierrez:
[operations/puppet@production] trafficserver: Replace session cookies with Token=1 iff V:C isn't there
Reason:
We don't need this in ATS-land, as it already caches requests with cookies by default; the lack of Vary:Cookie in the response is enough.
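For readers unfamiliar with the problem: when a response carries Vary:Cookie, the cache only reuses a stored object if the request's Cookie header matches the one stored with it, so per-client cookies such as WMF-Last-Access or GeoIP make nearly every anonymous request look unique. A minimal sketch of the "hide non session cookies during cache lookup" idea follows, in Python rather than the Lua used by default.lua. The function name and the rule for what counts as a session cookie (name contains "session" or "token") are assumptions for illustration, not the logic of the merged patch.

```python
import re

# Assumption: a cookie is treated as a "session" cookie if its name contains
# "session" or "token" (case-insensitive). The real rule lives in default.lua.
SESSION_COOKIE_NAME = re.compile(r"session|token", re.IGNORECASE)


def cookies_for_cache_lookup(cookie_header: str) -> str:
    """Return the Cookie header value with non-session cookies removed."""
    kept = []
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        name = pair.split("=", 1)[0]
        if SESSION_COOKIE_NAME.search(name):
            kept.append(pair)
    return "; ".join(kept)


# Anonymous traffic collapses to an empty Cookie value, so all such requests
# can share one cached variant even when the response says Vary:Cookie.
anonymous = cookies_for_cache_lookup("WMF-Last-Access=29-Aug-2022; GeoIP=ES:MD:Madrid:40.4:-3.7:v4")
# Requests with a session cookie keep it and continue to vary per session.
logged_in = cookies_for_cache_lookup("WMF-Last-Access=29-Aug-2022; enwikiSession=abc123")
print(repr(anonymous))  # ''
print(repr(logged_in))  # 'enwikiSession=abc123'
```

In this sketch the filtered value is only meant for the cache lookup; whether and how the original Cookie header is restored for the origin request is not covered here.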
This seems to be working and not breaking anything :). As a direct result, the cache hit rate in the text cluster at the ATS layer shows up to a 100% increase: https://grafana.wikimedia.org/goto/acc0K6W4z?orgId=1
Images for future reference, taken from https://grafana.wikimedia.org/d/O2sTrqZVk/backend-layer-performance?orgId=1&var-site=All&var-cluster=text&from=1661385600000&to=1662840000000. I created an annotation (tagged: operations, performance) with a link to this task, to make it easier to correlate on other dashboards.
The latency improvement is huge. In Eqsin, p75 latency improved by roughly 25%, e.g. from 475ms down to 350ms compared with the same time and day a week earlier. That's a 125ms drop!
In terms of cache effectiveness, internal cache hits went up from 600 to 1200 req/s in Esams, and 300 to 600 req/s in Eqsin.
Naturally, the reported cache hit ratio has now doubled in some regions, e.g. from 2% up to 4% in Codfw. Note that our frontend cache hit ratio is and remains way higher, at around 89% overall (up to 99.9% for ResourceLoader). This improvement is specifically at the ATS backend, our second layer of caching.
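As a quick sanity check of the figures quoted above, the relative changes can be recomputed from the before/after values. The helper name is made up for illustration; the inputs are the numbers from this comment.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (negative means a drop)."""
    return (after - before) / before * 100.0


# Eqsin p75 latency, in ms
print(f"{relative_change(475, 350):.1f}%")   # -26.3%
# Internal cache hits, in req/s (Esams, then Eqsin)
print(f"{relative_change(600, 1200):.1f}%")  # 100.0%
print(f"{relative_change(300, 600):.1f}%")   # 100.0%
```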