
Packet Drops on Eqiad ASW -> CR uplinks
Open, High, Public

Description

Background

While investigating poor performance for backup traffic between eqiad and codfw, it was discovered that packets were being dropped by the asw devices in eqiad (see T274234).

Average usage is well within link capacity, but it is likely we are seeing microbursts as described here:

https://kb.juniper.net/InfoCenter/index?page=content&id=KB36095

There are probably a variety of factors at play here, not least the relatively low buffer memory on our current-generation switches, the physical topology, and the fact that SPINE->LEAF links run at 40G while SPINE->CR links run at only 10G.
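
For anyone digging into this, the drops show up in the per-queue counters on the switches themselves; something along these lines (the interface name is just an example) shows tail drops accumulating even while average utilisation looks healthy:

    show interfaces queue ae1        (per-queue tail-drop / RED-drop counters)
    show interfaces ae1 extensive    (full error and drop counters for the LAG)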

The problem has been mitigated somewhat by re-allocating as much buffer memory as possible to the active traffic classes (T284592), but it still exists. Creating this task to track further progress and to act as a parent task for any others we may create to deal with this issue.
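
For reference, the T284592 change was along these lines: on these platforms the shared packet buffer is carved into partitions, and the mitigation shifts as much as possible to the lossy partition that carries our best-effort traffic. A minimal sketch (the percentages here are illustrative, not the exact values deployed):

    class-of-service {
        shared-buffer {
            ingress {
                buffer-partition lossy {
                    percent 90;    /* illustrative: most buffer to best-effort traffic */
                }
                buffer-partition lossless {
                    percent 5;
                }
                buffer-partition lossless-headroom {
                    percent 5;
                }
            }
        }
    }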

Related Objects

Status        Assigned
Open          cmooney
Open          None
Resolved      None
Resolved      Kormat
Resolved      Jclark-ctr
Resolved      Marostegui
Resolved      Marostegui
Declined      None
Declined      None
Resolved      None
Declined      None
Declined      None
Resolved      None
Resolved      aborrero
Resolved      None
Resolved      aborrero
Resolved      None
Resolved      Andrew
Resolved      Bstorm
Resolved      aborrero
Resolved      None
Resolved      None
In Progress   ayounsi

Event Timeline

Restricted Application added a subscriber: Aklapper.

In terms of further mitigation, one thing we could possibly do in the short term is change how we configure our VRRP states.

Currently we configure the VRRP primary/backup status the same on every Vlan connecting to a given switch VC / row. So, for instance, all the uplink traffic from asw2-a-eqiad traverses ae1 to cr1-eqiad. This is balanced in terms of ingress traffic to the CRs, as other rows send all their outbound traffic via ae2 to cr2-eqiad.

While this balances traffic across the CRs well, it means the Spine switches use their uplinks to only one CR at any given time, with the remaining links staying idle unless there is a fault.

We could, instead, alternate the master/backup VRRP status between the public/private/analytics Vlans going to a given row. This would cause some traffic from a row to go to CR1 (over one set of links) and other traffic to CR2 (over another set of links), depending on the Vlan. Where traffic goes would remain deterministic (not ECMP / hashed), but we would be utilizing all the links and thus reducing the overall number of drops.
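
A minimal sketch of what that would look like on cr1, with the mirror-image priorities on cr2 (Vlan IDs, addresses and group numbers are all made up for the example):

    interfaces {
        ae1 {
            unit 1001 {
                description "row A private Vlan -- cr1 is VRRP master";
                vlan-id 1001;
                family inet {
                    address 10.0.1.2/24 {
                        vrrp-group 1 {
                            virtual-address 10.0.1.1;
                            priority 200;    /* higher than cr2: this Vlan's traffic uses ae1 */
                            accept-data;
                        }
                    }
                }
            }
            unit 1002 {
                description "row A public Vlan -- cr2 is VRRP master";
                vlan-id 1002;
                family inet {
                    address 10.0.2.2/24 {
                        vrrp-group 2 {
                            virtual-address 10.0.2.1;
                            priority 100;    /* lower than cr2: this Vlan's traffic uses ae2 */
                            accept-data;
                        }
                    }
                }
            }
        }
    }

Failover behaviour would be unchanged: if either CR or its link fails, VRRP moves all Vlans to the surviving router just as it does today.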

@ayounsi interested in your thoughts on this. I know we discussed other options, but this seems like a quick change / easy mitigation step in the short term.

Another change that could help here would be to move the L3 gateway for hosts to the virtual-chassis.

i.e.:

  • Set up new, routed sub-interfaces between the ASWs and CRs.
  • Announce a default to the ASWs from each CR over these.
  • Configure the ASWs for BGP multipath, so they use both of these routes and thus ECMP across the available links (see the sketch below).
  • Remove the current GW interfaces end-devices use from the CRs, and move those IPs to Vlan/irb interfaces on the ASW VC.

Basically, in this scenario the L3 gateway for hosts becomes their directly connected upstream switch. From there traffic gets ECMP'd to both CRs, spreading it across the available links and mitigating the drops we see now, where all traffic from a given VC goes to a single CR.
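
A minimal sketch of the ASW side under that scheme (peer addresses, AS numbers and the irb unit are invented for the example):

    policy-options {
        policy-statement ECMP {
            then {
                load-balance per-packet;    /* Junos keyword; in practice per-flow hashing */
            }
        }
    }
    routing-options {
        forwarding-table {
            export ECMP;    /* needed so both BGP next-hops are installed in the FIB */
        }
    }
    protocols {
        bgp {
            group CR-UPLINKS {
                type external;
                multipath;              /* keep the default route from both CRs */
                peer-as 65001;          /* hypothetical AS shared by cr1/cr2 */
                neighbor 192.0.2.0;     /* routed sub-interface towards cr1 */
                neighbor 192.0.2.2;     /* routed sub-interface towards cr2 */
            }
        }
    }
    interfaces {
        irb {
            unit 1001 {
                family inet {
                    address 10.0.1.1/24;    /* gateway IP moved from the CRs onto the VC */
                }
            }
        }
    }

Note the forwarding-table export policy: without it Junos installs only one of the multipath next-hops in the FIB, so the ECMP gain never materialises.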

AFAIK this requires an additional license on the Juniper EX/QFX VC switches though, which complicates things somewhat.