Higher failed fetches error rate on some caching servers
Closed, Resolved · Public

Assigned To

Authored By

	jijiki
	Sep 18 2019, 11:01 AM

Description

Some frontend caching servers appear to have a much higher failed-fetch error rate on the upload cluster than others.

Details

	Subject	Repo	Branch	Lines +/-
	ATS: Avoid Proxy-Connection from spreading to varnish-fe and ats-be	operations/puppet	production	+4 -0
	ATS: Ensure that the origin timeout is also applied to parent servers	operations/puppet	production	+1 -0


Related Objects

Status	Assigned	Task
Open	None	T92298 Investigate our mitigation strategy for HTTPS response length attacks
Resolved	Vgutierrez	T170567 Support TLSv1.3
Resolved	Vgutierrez	T231433 Move cache upload cluster from nginx to ats-tls
Resolved	Vgutierrez	T233205 Higher failed fetches error rate on some caching servers

Event Timeline

jijiki created this task. Sep 18 2019, 11:01 AM

Restricted Application added a subscriber: Aklapper. Sep 18 2019, 11:01 AM

jijiki added a parent task: T231433: Move cache upload cluster from nginx to ats-tls. Sep 18 2019, 11:01 AM

Vgutierrez moved this task from Backlog to Caching on the Traffic board. Sep 18 2019, 11:04 AM

Change 537625 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Ensure that the origin timeout is also applied to parent servers

https://gerrit.wikimedia.org/r/537625

gerritbot added a project: Patch-For-Review. Sep 18 2019, 11:06 AM

Change 537625 merged by Vgutierrez:
[operations/puppet@production] ATS: Ensure that the origin timeout is also applied to parent servers

https://gerrit.wikimedia.org/r/537625

Change 537630 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Avoid Proxy-Connection from spreading to varnish-fe and ats-be

https://gerrit.wikimedia.org/r/537630

Change 537630 merged by Vgutierrez:
[operations/puppet@production] ATS: Avoid Proxy-Connection from spreading to varnish-fe and ats-be

https://gerrit.wikimedia.org/r/537630

Maintenance_bot removed a project: Patch-For-Review. Sep 18 2019, 12:10 PM

It looks like ats-tls setting Proxy-Connection to Close is messing with varnish-fe<-->ats-be connections, as can be seen in https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=frontend&from=1568203523676&to=1568808323676&refresh=1m

Mentioned in SAL (#wikimedia-operations) [2019-09-18T12:18:34Z] <vgutierrez> restarting ats-tls to avoid spreading Proxy-Connection header - T233205

Solved by preventing Proxy-Connection from spreading across varnish-fe and ats-be. Thanks for reporting the issue, @jijiki.
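
For readers unfamiliar with the failure mode, here is a minimal sketch of the general idea: a proxy drops connection-scoped headers such as Proxy-Connection before forwarding a request to the next hop. This is not the merged Puppet/ATS change. The function name `strip_hop_by_hop` and the header set are assumptions made for illustration only.

```python
# Illustrative only: this is NOT the operations/puppet change from this task.
# Headers commonly treated as hop-by-hop, i.e. meaningful for a single
# connection and not meant to be forwarded to the next hop.
HOP_BY_HOP = frozenset({
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
    "proxy-connection",  # non-standard, but connection-scoped in practice
})


def strip_hop_by_hop(headers):
    """Return a copy of `headers` without connection-scoped fields.

    Any header named in the Connection field's token list is dropped too.
    `headers` is a plain dict mapping header names to string values.
    """
    listed = set()
    for name, value in headers.items():
        if name.lower() == "connection":
            # Connection carries a comma-separated list of header names.
            listed.update(t.strip().lower() for t in value.split(",") if t.strip())
    drop = HOP_BY_HOP | listed
    return {n: v for n, v in headers.items() if n.lower() not in drop}
```

With this sketch, a request carrying `Proxy-Connection: Close` would reach the next layer without that header, so the next layer would not treat it as an instruction to close its own connection.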

Vgutierrez mentioned this in T238509: Proxy-connection HTTP response header being sent to some users in some cases causing HTTP/2 protocol errors. Nov 18 2019, 4:58 PM
