
cr1-codfw<->cr1-eqiad link saturation
Closed, Resolved · Public

Description

Today we hit 8.8Gbps and probably temporarily saturated our active link between codfw and eqiad.

Some of this was eqiad caches warming up after we repooled the eqiad edge out of concern over codfw peering/transit capacity (originally, the codfw/eqdfw link was running very hot).

A fair amount of this traffic (probably 1.5Gbps) is upload-lb@eqiad pulling content out of Swift via codfw, because we have Swift depooled in eqiad; we should talk about fixing that tomorrow (cc @fgiunchedi). But it's late in the day here and I didn't want to touch this just now, as current utilization on the link is merely warm and not burning.

There's another component that we still can't fully account for; TBD. (Internal netflow would help here, but...)

There's also some likely future work on better understanding our between-core-sites network needs, especially if we need to repool an edge during zenith (peak traffic) after it has been depooled and cold for days.
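As a rough illustration of the repool-while-cold problem: the backhaul an edge needs is roughly its egress times its cache miss ratio, and a cold cache misses far more often. A minimal back-of-envelope sketch in Python, where every number is an illustrative assumption rather than a measurement from this incident:

    # Backhaul needed by a freshly repooled (cold) edge at peak vs. a warm one.
    # All numbers are illustrative assumptions, not measurements from this task.

    def backhaul_gbps(edge_egress_gbps: float, cache_hit_ratio: float) -> float:
        """Backhaul is the share of edge egress that misses the local
        cache and must be fetched from a core site."""
        return edge_egress_gbps * (1.0 - cache_hit_ratio)

    PEAK_EGRESS = 8.0      # assumed zenith egress of one edge, Gbps
    WARM_HIT_RATIO = 0.95  # assumed hit ratio once caches are warm
    COLD_HIT_RATIO = 0.50  # assumed hit ratio right after a cold repool

    print(f"warm: {backhaul_gbps(PEAK_EGRESS, WARM_HIT_RATIO):.1f} Gbps backhaul")
    print(f"cold: {backhaul_gbps(PEAK_EGRESS, COLD_HIT_RATIO):.1f} Gbps backhaul")

Under those assumptions a cold repool at zenith needs roughly 10x the backhaul of a warm edge, which is how a transport link sized for warm-cache ratios ends up saturated.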

Event Timeline

(an update: duh, we have ~3Gbit/s of codfw-->esams traffic that is traversing eqiad)

for posterity: repooling swift@eqiad took 3.5Gbit/s off of the codfw->eqiad path.

there's a much longer discussion (recorded in #wikimedia-sre logs) about edge-egress-to-backhaul byte ratios, overall backbone network provisioning, whatever happened with HTTP compression from applayer->edge caches (cf. T125938), and possibly revisiting that with Envoy doing the compression instead of MW or Apache.
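For a sense of scale on the compression idea, a hedged sketch of the arithmetic (both inputs below are assumptions for illustration, not figures from this task):

    # Potential backhaul savings from re-enabling applayer->edge HTTP
    # response compression (e.g. with Envoy, revisiting T125938).

    def compressed_gbps(uncompressed_gbps: float, compression_ratio: float) -> float:
        """Bandwidth after compression; compression_ratio is
        original size / compressed size."""
        return uncompressed_gbps / compression_ratio

    TEXT_BACKHAUL = 2.0  # assumed Gbps of compressible (HTML/JSON) backhaul
    GZIP_RATIO = 3.5     # plausible gzip ratio on text; workload-dependent

    saved = TEXT_BACKHAUL - compressed_gbps(TEXT_BACKHAUL, GZIP_RATIO)
    print(f"~{saved:.1f} Gbps saved on the applayer->edge path")

The appeal of doing it in Envoy is that the savings apply on the wire regardless of which applayer service produced the response.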

This particular issue is resolved for now, and the action items and other ideas spawned in the discussion of it will be tracked as sub-tasks of T263275: Capacity planning for (& optimization of) transport backhaul vs edge egress

CDanis claimed this task.