
Occasional saturation of asw2-b-eqiad / cr port uplink and cache upload usage
Closed, ResolvedPublic

Description

There have been "port utilization over 80%" pages over the last few days, the culprit being a 10G port between asw2-b-eqiad and cr1, e.g.:

Rule:  Primary inbound port utilisation over 80%  #page Physical Interface: xe-2/0/45
Interface Description: Core: cr1-eqiad:xe-3/2/3 {#3457}
Interface Speed: 10 Gbs
Inbound Utilization: 94.5499036
Outbound Utilization: 25.91466184

and the corresponding page on the other side

Rule:  Primary outbound port utilisation over 80%  #page Physical Interface: xe-3/2/3
Interface Description: Core: asw2-b-eqiad:xe-2/0/45 {#3457}
Interface Speed: 10 Gbs
Inbound Utilization: 25.48847968
Outbound Utilization: 93.1216988
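
For scale, the utilisation percentages in the two alerts above can be converted to absolute throughput. This is only a rough sketch: the 10 Gbps link speed and the 80% paging threshold are taken from the alert text, while the function names are made up for illustration.

```python
LINK_SPEED_GBPS = 10.0        # "Interface Speed" reported by both alerts
PAGE_THRESHOLD_PERCENT = 80.0  # threshold named in the alert rule


def utilization_to_gbps(percent, link_speed_gbps=LINK_SPEED_GBPS):
    """Convert a utilisation percentage into Gbps on a link of the given speed."""
    return percent / 100.0 * link_speed_gbps


def would_page(percent, threshold=PAGE_THRESHOLD_PERCENT):
    """True when utilisation is over the paging threshold."""
    return percent > threshold


# Values copied from the two alerts (hypothetical variable names).
asw_inbound_gbps = utilization_to_gbps(94.5499036)
cr_outbound_gbps = utilization_to_gbps(93.1216988)
```

Both directions of the same link sit at roughly 9.3 to 9.5 Gbps, which leaves very little headroom on a 10G port.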

Looking at the graphs at https://librenms.wikimedia.org/device/161/ports/graphs?type=bits, the "shape" of the traffic increase seems to match the cp1107 interface stats at https://librenms.wikimedia.org/device/device=161/tab=port/port=31460/, to the point that cp1107's 10G interface is fully saturated.

I took a look at superset and the initial bump of traffic on Dec 3rd does match WME user agent starting to show up: https://superset.wikimedia.org/superset/dashboard/p/odArMWQvXlQ/

It is unclear to me (Filippo) at this time whether the latest jump in traffic (+2 Gbps) on cp1107 is also WME or some other kind of traffic.

Event Timeline

Contacted WME SRE, who kindly agreed to lower their current request parallelism and check the results.

Change #1101547 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] WIP

https://gerrit.wikimedia.org/r/1101547

Change #1101547 merged by CDanis:

[operations/puppet@production] Skip cache on WME upload.wm.o HEAD reqs

https://gerrit.wikimedia.org/r/1101547

Adding a comment to not forget:

Change #1101561 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Skip cache on all WME upload.wm.o reqs

https://gerrit.wikimedia.org/r/1101561

Mentioned in SAL (#wikimedia-operations) [2024-12-09T17:15:15Z] <cdanis> 💙cdanis@cumin1002.eqiad.wmnet ~ 🕛☕ sudo cumin 'A:cp' 'disable-puppet "cdanis testing in production I464702d8fb T381771"'

Change #1101561 merged by CDanis:

[operations/puppet@production] Skip cache on all WME upload.wm.o reqs

https://gerrit.wikimedia.org/r/1101561

Mentioned in SAL (#wikimedia-operations) [2024-12-09T17:18:26Z] <cdanis> T381771 💙cdanis@cp1107.eqiad.wmnet ~ 🕧☕ sudo run-puppet-agent --force

Mentioned in SAL (#wikimedia-operations) [2024-12-09T17:44:13Z] <cdanis> 💙cdanis@cumin1002.eqiad.wmnet ~ 🕧☕ sudo cumin 'A:cp' 'enable-puppet "cdanis testing in production I464702d8fb T381771"'

BCornwall changed the task status from Open to In Progress. Dec 9 2024, 6:32 PM
BCornwall triaged this task as High priority.
BCornwall moved this task from Backlog to Actively Servicing on the Traffic board.
CDanis closed this task as Resolved. Dec 9 2024, 10:11 PM
CDanis claimed this task.
CDanis added subscribers: BBlack, CDanis.

@BBlack and I got to the bottom of this today. It felt great.

  • Varnish internally 'upgrades' HEAD to GET when talking to ATS. Example occurrence in P71644
12:54:23	<bblack>	ok I think I understand the varnish behavior now (for upload.wm.o files >=8MB in size, anyways), which I'll try to summarize:
12:56:39	<bblack>	client HEAD on a new unique URI that no cache has seen before: varnish convert the miss to a GET towards ATS.  ATS in turn also GETs the file from Swift.  Varnish sees response to GET with Content-Length >=8MB, and then marks this URI in cache as hit-for-pass for future requests, and synthesizes a HEAD-style response to client (forwards the headers, but not the body).
12:57:25	<bblack>	second HEAD request through the same caches for the same file: Varnish sees hit-for-pass object, and so it passes the HEAD request through to ATS directly (now ATS sees HEAD instead of GET).
12:58:21	<bblack>	so, varnish does convert HEAD->GET by default at its layer.  Unless it already did that once, noticed the excessive size, and marked the URI as HFP, in which case now ATS will see it as HEAD as well.
12:58:54	<bblack>	given they're scanning they're likely to miss, though, so that's probably why most of them land at ATS already as GETs rather than HEADs
12:59:27	<bblack>	I'm still not sure why refusing to cache at the ATS layer helped, though.
  • ATS was previously caching the result of these HEADs-become-GETs, which it turns out cratered the usual hit rate you see on upload-cp. It was also using a whole lot of Lua threads and other resources inside ATS.
  • Turning off ATS's caching of these requests (patch) seems to change the teardown behavior: when Varnish decides an object is too big to be cached in Varnish and hangs up on ATS, ATS in turn hangs up on the Swift backend, saving most of the inbound bytes. This directly (if inelegantly) fixes the network rx saturation. Further evidence for this explanation is the great increase in the rate of new ATS->Swift connections on cp1107, presumably because terminated connections can't be reused.
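
The Varnish behavior summarized in the chat log above can be modelled as a small state machine. This is a toy sketch for reasoning about which method ATS sees, not Varnish's actual implementation: the class and method names are hypothetical, the object size is passed in directly rather than read from a backend response, and the branch for objects under 8MB (treated here as plainly cacheable) is an assumption, since the summary only covers files >=8MB.

```python
from dataclasses import dataclass, field

LARGE_OBJECT_BYTES = 8 * 1024 * 1024  # the ">=8MB" cutoff from the summary


@dataclass
class VarnishModel:
    """Toy model of the described miss / hit-for-pass handling (hypothetical)."""
    hit_for_pass: set = field(default_factory=set)
    cached: set = field(default_factory=set)

    def backend_method(self, client_method, uri, content_length):
        """Return the method ATS would see, or None if served from cache."""
        if uri in self.cached:
            return None  # cache hit, nothing is sent to ATS
        if uri in self.hit_for_pass:
            return client_method  # passed through unchanged, so HEAD stays HEAD
        # Cache miss: the fetch towards ATS is a GET, even for a client HEAD.
        if content_length >= LARGE_OBJECT_BYTES:
            self.hit_for_pass.add(uri)  # too large, mark the URI as hit-for-pass
        else:
            self.cached.add(uri)  # assumption: smaller objects are simply cached
        return "GET"


varnish = VarnishModel()
size = 50 * 1024 * 1024  # hypothetical 50MB file
first = varnish.backend_method("HEAD", "/example/big-file", size)
second = varnish.backend_method("HEAD", "/example/big-file", size)
```

In this model the first HEAD for a large, never-seen URI reaches ATS as a GET, and only a repeat request for the same URI reaches it as a HEAD. A scanner that mostly touches each URI once would therefore show up at ATS almost entirely as GETs, which matches the observation in the log.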

Some followups that might make nice Traffic tasks for the backlog:

  • In our ATS Lua, we used no_cache_lookup(). We should check whether using do_not_cache() instead would preserve ATS's hangups, as its other behavior might be preferable some of the time.
  • "at least if we limited it to WME UA and/or IP, it might be nice to just pass HEAD through as HEAD at all layers, uncacheable." @BBlack proposes something like P71676 which, while untested, hooks vcl_miss() so that we only trigger this behavior when we don't already have the object in cache from other users
  • Add WME IPs as an ipblock in requestctl?
  • Research whether Varnish's default behavior of upgrading HEAD to GET and caching optimistically makes sense for us in the general case. (This can probably be answered by digging through webrequest logs.)

From the Swift point-of-view, sometimes "client gave up on this read" means "because swift was overwhelmed", so if ATS could say HEAD when the client said HEAD, that would ideally be better than "GET, then tear it down"

Change #1101909 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] varnish: pass WME HEAD reqs to ATS

https://gerrit.wikimedia.org/r/1101909

Change #1101909 abandoned by Fabfur:

[operations/puppet@production] varnish: pass WME HEAD reqs to pass for ATS

https://gerrit.wikimedia.org/r/1101909