Some frontend caching servers appear to have a way higher failed fetches error rate on upload than others
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T92298 Investigate our mitigation strategy for HTTPS response length attacks
Resolved | | Vgutierrez | T170567 Support TLSv1.3
Resolved | | Vgutierrez | T231433 Move cache upload cluster from nginx to ats-tls
Resolved | | Vgutierrez | T233205 Higher failed fetches error rate on some caching servers
Event Timeline
Change 537625 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Ensure that the origin timeout is also applied to parent servers
Change 537625 merged by Vgutierrez:
[operations/puppet@production] ATS: Ensure that the origin timeout is also applied to parent servers
Change 537630 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Avoid Proxy-Connection from spreading to varnish-fe and ats-be
Change 537630 merged by Vgutierrez:
[operations/puppet@production] ATS: Avoid Proxy-Connection from spreading to varnish-fe and ats-be
It looks like ats-tls setting Proxy-Connection to Close is messing with varnish-fe<-->ats-be connections, as can be seen in https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=frontend&from=1568203523676&to=1568808323676&refresh=1m
Mentioned in SAL (#wikimedia-operations) [2019-09-18T12:18:34Z] <vgutierrez> restarting ats-tls to avoid spreading Proxy-Connection header - T233205
Solved by preventing Proxy-Connection from spreading across varnish-fe and ats-be. Thanks for reporting the issue, @jijiki.
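
For anyone reading this later, here is a rough sketch of the idea behind the fix: a proxy layer drops Proxy-Connection (and anything listed in Connection) before forwarding a request to the next hop, so a "Close" set by one layer does not affect connections further down the stack. This is not the actual puppet/ATS change from 537630; the function name `strip_connection_headers` and the sample headers are made up for illustration.

```python
# Illustrative only: a hypothetical helper, not the ATS configuration
# that was deployed for this task.

# Header names that apply to a single hop and should not be forwarded.
ALWAYS_DROP = frozenset({"connection", "proxy-connection"})


def strip_connection_headers(headers):
    """Return a copy of `headers` without per-hop connection headers.

    Any header named in the Connection header value is also treated as
    hop-by-hop and removed.
    """
    listed = set()
    for name, value in headers.items():
        if name.lower() == "connection":
            # Connection carries a comma-separated list of header names.
            listed.update(
                token.strip().lower()
                for token in value.split(",")
                if token.strip()
            )
    drop = ALWAYS_DROP | listed
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in drop
    }


if __name__ == "__main__":
    incoming = {
        "Host": "upload.wikimedia.org",
        "Proxy-Connection": "Close",
        "Accept": "*/*",
    }
    print(strip_connection_headers(incoming))
```

With the sample request above, the forwarded headers keep Host and Accept and no longer carry Proxy-Connection.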