
Packet Drops on Eqiad ASW -> CR uplinks
Closed, ResolvedPublic

Description

Background

While investigating poor performance for backup traffic between eqiad and codfw, it was discovered that packets were being dropped by the asw devices in eqiad (see T274234).

Average usage is well within link capacity, but it is likely we are seeing microbursts as described here:

https://kb.juniper.net/InfoCenter/index?page=content&id=KB36095

There are probably a variety of factors at play here, not least the relatively low buffer memory on our current-generation switches, the physical topology, and the fact that SPINE->LEAF links run at 40G while SPINE->CR links are only 10G.
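For reference, the discards show up as tail drops on the egress queues of the ASW uplink AEs; they can be checked with standard Junos operational commands along these lines (the interface name is just an example):

    show interfaces queue ae1
    show interfaces ae1 extensive | match "drops|errors"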

The problem has been mitigated somewhat by re-allocating as much buffer memory as possible to the active traffic classes (T284592), but it still exists. Creating this task to track further progress and to act as a parent task for any others we may create to deal with this issue.
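For context, the T284592 mitigation was along the lines of the Junos class-of-service snippet below. This is a hedged sketch only, not the exact config that was deployed; scheduler names, forwarding-class mapping and percentages are illustrative, and the exact scheduling hierarchy differs somewhat on the QFX platform:

    class-of-service {
        schedulers {
            be-sched {
                transmit-rate percent 95;
                buffer-size percent 95;   /* give the active best-effort class most of the buffer */
            }
            nc-sched {
                transmit-rate percent 5;
                buffer-size percent 5;    /* keep a small slice for network-control */
            }
        }
        scheduler-maps {
            uplink-map {
                forwarding-class best-effort scheduler be-sched;
                forwarding-class network-control scheduler nc-sched;
            }
        }
        interfaces {
            ae1 {
                scheduler-map uplink-map;
            }
        }
    }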

Related Objects

Status      Assigned
Resolved    cmooney
Resolved    cmooney
Resolved    None
Resolved    Kormat
Resolved    Jclark-ctr
Resolved    Marostegui
Resolved    Marostegui
Declined    None
Declined    None
Resolved    None
Declined    None
Declined    None
Resolved    None
Resolved    aborrero
Resolved    None
Resolved    aborrero
Resolved    None
Resolved    Andrew
Resolved    Bstorm
Resolved    aborrero
Resolved    None
Resolved    None
Resolved    ayounsi
Resolved    Jclark-ctr
Resolved    Papaul
Resolved    Cmjohnson
Resolved    ayounsi
Resolved    cmooney

Event Timeline


In terms of further mitigation one thing we could possibly do in the short-term is to change how we configure our VRRP states.

Currently we configure VRRP primary/backup status the same on every Vlan connecting to a given switch VC / row. So, for instance, all uplink traffic from asw2-a-eqiad traverses ae1 to cr1-eqiad. This is balanced in terms of ingress traffic to the CRs, as other rows send all their outbound traffic via ae2 to cr2-eqiad.

While this balances traffic across the CRs well, it means the Spine switches use their uplinks to only one CR at any given time, with the remaining links staying idle unless there is a fault.

We could, instead, alternate the master/backup VRRP status between the public/private/analytics Vlans going to a given row. This would cause some traffic from a row to go to CR1 (over one set of links) and other traffic to CR2 (over another set of links), depending on the Vlan. Where traffic goes would remain deterministic (not ECMP / hashed), but we would be using all the links and thus reducing the overall number of drops.
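To illustrate, on the CR side this would mean alternating VRRP priorities per Vlan rather than setting them uniformly, roughly like the below (a hedged sketch; interface, unit and group numbers and addresses are invented for illustration):

    /* cr1-eqiad: VRRP master for the private Vlan, backup for the public Vlan.
       cr2-eqiad would carry the mirror-image priorities. */
    interfaces {
        ae1 {
            unit 1017 {
                description "Row A private Vlan";
                family inet {
                    address 10.64.0.2/22 {
                        vrrp-group 17 {
                            virtual-address 10.64.0.1;
                            priority 150;      /* master here */
                            accept-data;
                        }
                    }
                }
            }
            unit 1001 {
                description "Row A public Vlan";
                family inet {
                    address 192.0.2.2/26 {
                        vrrp-group 1 {
                            virtual-address 192.0.2.1;
                            priority 100;      /* backup here, master on cr2-eqiad */
                            accept-data;
                        }
                    }
                }
            }
        }
    }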

@ayounsi interested in your thoughts on this. I know we discussed other options but this seems maybe a quick change / easy mitigation step in the short term.

Another change that could help here would be to move the L3 gateway for hosts to the virtual-chassis.

i.e.:

  • Set up new, routed sub-interfaces between the ASWs and CRs.
  • Announce a default to the ASWs from each CR over these.
  • Configure the ASWs for BGP multipath, so they would use both these routes, and thus ECMP across available links.
  • Remove the current GW interfaces end-devices use from the CRs, and move those IPs to Vlan/irb interfaces on the ASW VC.

Basically, in this scenario the L3 gateway for hosts becomes their directly-connected upstream switch. From there traffic gets ECMP'd to both CRs, spreading it across the available links and mitigating the drops we see now, where all traffic from a given VC goes to a single CR.
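A hedged sketch of what that could look like on the ASW virtual-chassis, with all unit numbers, addresses and AS numbers invented for illustration (an export policy announcing the Vlan subnets back to the CRs would also be needed, which is left out here):

    /* asw2-a-eqiad: routed sub-interfaces to both CRs, BGP multipath, gateway on irb */
    interfaces {
        ae1 {
            flexible-vlan-tagging;
            unit 100 {
                vlan-id 100;
                family inet {
                    address 192.0.2.1/31;    /* point-to-point to cr1-eqiad */
                }
            }
        }
        ae2 {
            flexible-vlan-tagging;
            unit 100 {
                vlan-id 100;
                family inet {
                    address 192.0.2.3/31;    /* point-to-point to cr2-eqiad */
                }
            }
        }
        irb {
            unit 1017 {
                family inet {
                    address 10.64.0.1/22;    /* gateway IP moved from the CRs to the VC */
                }
            }
        }
    }
    vlans {
        private1-a-eqiad {
            vlan-id 1017;
            l3-interface irb.1017;
        }
    }
    routing-options {
        autonomous-system 64810;             /* illustrative private ASN for the VC */
    }
    protocols {
        bgp {
            group CR-UPLINKS {
                type external;
                multipath;                   /* accept the default from both CRs -> ECMP */
                peer-as 64700;               /* illustrative ASN shared by both CRs */
                neighbor 192.0.2.0;
                neighbor 192.0.2.2;
            }
        }
    }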

AFAIK this requires an additional license on the Juniper EX/QFX VC switches though, which complicates things somewhat.

T304712: eqiad: Move links to new MPC7E linecard will give us the option of moving to 40G uplinks (instead of 4x10G) for some rows: row C as of now, and row D once T308331: eqiad: move non WMCS servers out of rack D5 is done.
This could be a good trade-off, as it's relatively inexpensive overall (and gives cleaner cabling). A downside, though, is further discrepancy between rows.

Good suggestion. The discrepancy isn't ideal but I think a little asymmetry is worth it if we can improve performance. +1

I'm going to close this task for now. The problem has been mitigated as far as possible with the equipment we currently have.

In time, replacing the switch hardware and moving to higher-bandwidth uplinks will resolve the remaining discards.