
Packet Drops on Eqiad ASW -> CR uplinks
Closed, ResolvedPublic

Description

Background

While investigating poor performance for backup traffic between eqiad and codfw, it was discovered that packets were being dropped by the asw devices in eqiad (see T274234).

Average usage is well within link capacity, but it is likely we are seeing microbursts as described here:

https://kb.juniper.net/InfoCenter/index?page=content&id=KB36095

There are probably a variety of factors at play here, not least the relatively low buffer memory on our current-generation switches, the physical topology, and the fact that SPINE->LEAF links run at 40G while SPINE->CR links are only 10G.
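For reference, the discards show up as tail drops on the egress queues of the ASW uplink AEs; they can be checked with standard Junos operational commands along these lines (the interface name is just an example):

    show interfaces queue ae1
    show interfaces ae1 extensive | match "drops|errors"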

The problem has been mitigated somewhat by re-allocating as much buffer memory as possible to the active traffic classes (T284592), but it still exists. Creating this task to track further progress and to act as a parent task for any others we may create to deal with this issue.
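For context, the T284592 mitigation was along the lines of the Junos class-of-service snippet below. This is a hedged sketch only, not the exact config that was deployed; scheduler names, forwarding-class mapping and percentages are illustrative, and the exact scheduling hierarchy differs somewhat on the QFX platform:

    class-of-service {
        schedulers {
            be-sched {
                transmit-rate percent 95;
                buffer-size percent 95;   /* give the active best-effort class most of the buffer */
            }
            nc-sched {
                transmit-rate percent 5;
                buffer-size percent 5;    /* keep a small slice for network-control */
            }
        }
        scheduler-maps {
            uplink-map {
                forwarding-class best-effort scheduler be-sched;
                forwarding-class network-control scheduler nc-sched;
            }
        }
        interfaces {
            ae1 {
                scheduler-map uplink-map;
            }
        }
    }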

Related Objects

Status      Assigned
Resolved    cmooney
Resolved    cmooney
Resolved    None
Resolved    Kormat
Resolved    Jclark-ctr
Resolved    Marostegui
Resolved    Marostegui
Declined    None
Declined    None
Resolved    None
Declined    None
Declined    None
Resolved    None
Resolved    aborrero
Resolved    None
Resolved    aborrero
Resolved    None
Resolved    Andrew
Resolved    Bstorm
Resolved    aborrero
Resolved    None
Resolved    None
Resolved    ayounsi
Resolved    Jclark-ctr
Resolved    Papaul
Resolved    Cmjohnson
Resolved    ayounsi
Resolved    cmooney

Event Timeline


In terms of further mitigation one thing we could possibly do in the short-term is to change how we configure our VRRP states.

Currently we configure VRRP primary/backup status the same on every Vlan connecting to a given switch VC / row. So, for instance, all uplink traffic from asw2-a-eqiad traverses ae1 to cr1-eqiad. This is balanced in terms of ingress traffic to the CRs, as other rows send all their outbound traffic via ae2 to cr2-eqiad.

While this balances traffic across the CRs well, it means the Spine switches use their uplinks to only one CR at any given time, with the remaining links staying idle unless there is a fault.

We could, instead, alternate the master/backup VRRP status between the public/private/analytics Vlans going to a given row. This would cause some traffic from a row to go to CR1 (over one set of links) and other traffic to CR2 (over another set of links), depending on the Vlan. Where traffic goes would remain deterministic (not ECMP / hashed), but we would be using all the links and thus reducing the overall number of drops.
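To illustrate, on the CR side this would mean alternating VRRP priorities per Vlan rather than setting them uniformly, roughly like the below (a hedged sketch; interface, unit and group numbers and addresses are invented for illustration):

    /* cr1-eqiad: VRRP master for the private Vlan, backup for the public Vlan.
       cr2-eqiad would carry the mirror-image priorities. */
    interfaces {
        ae1 {
            unit 1017 {
                description "Row A private Vlan";
                family inet {
                    address 10.64.0.2/22 {
                        vrrp-group 17 {
                            virtual-address 10.64.0.1;
                            priority 150;      /* master here */
                            accept-data;
                        }
                    }
                }
            }
            unit 1001 {
                description "Row A public Vlan";
                family inet {
                    address 192.0.2.2/26 {
                        vrrp-group 1 {
                            virtual-address 192.0.2.1;
                            priority 100;      /* backup here, master on cr2-eqiad */
                            accept-data;
                        }
                    }
                }
            }
        }
    }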

@ayounsi interested in your thoughts on this. I know we discussed other options but this seems maybe a quick change / easy mitigation step in the short term.

Another change that could help here would be to move the L3 gateway for hosts to the virtual-chassis.

i.e.:

  • Set up new, routed sub-interfaces between the ASWs and CRs.
  • Announce a default to the ASWs from each CR over these.
  • Configure the ASWs for BGP multipath, so they would use both these routes, and thus ECMP across available links.
  • Remove the current GW interfaces end-devices use from the CRs, and move those IPs to Vlan/irb interfaces on the ASW VC.

Basically, in this scenario the L3 gateway for hosts becomes their directly-connected upstream switch. From there traffic gets ECMP'd to both CRs, spreading it across the available links and mitigating the drops we see now, where all traffic from a given VC goes to a single CR.
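A hedged sketch of what that could look like on the ASW virtual-chassis, with all unit numbers, addresses and AS numbers invented for illustration (an export policy announcing the Vlan subnets back to the CRs would also be needed, which is left out here):

    /* asw2-a-eqiad: routed sub-interfaces to both CRs, BGP multipath, gateway on irb */
    interfaces {
        ae1 {
            flexible-vlan-tagging;
            unit 100 {
                vlan-id 100;
                family inet {
                    address 192.0.2.1/31;    /* point-to-point to cr1-eqiad */
                }
            }
        }
        ae2 {
            flexible-vlan-tagging;
            unit 100 {
                vlan-id 100;
                family inet {
                    address 192.0.2.3/31;    /* point-to-point to cr2-eqiad */
                }
            }
        }
        irb {
            unit 1017 {
                family inet {
                    address 10.64.0.1/22;    /* gateway IP moved from the CRs to the VC */
                }
            }
        }
    }
    vlans {
        private1-a-eqiad {
            vlan-id 1017;
            l3-interface irb.1017;
        }
    }
    routing-options {
        autonomous-system 64810;             /* illustrative private ASN for the VC */
    }
    protocols {
        bgp {
            group CR-UPLINKS {
                type external;
                multipath;                   /* accept the default from both CRs -> ECMP */
                peer-as 64700;               /* illustrative ASN shared by both CRs */
                neighbor 192.0.2.0;
                neighbor 192.0.2.2;
            }
        }
    }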

AFAIK this requires an additional license on the Juniper EX/QFX VC switches though, which complicates things somewhat.

T304712: eqiad: Move links to new MPC7E linecard will give us the option of moving to 40G uplinks (instead of 4x10G) for some rows: row C as of now, and row D once T308331: eqiad: move non WMCS servers out of rack D5 is done.
This could be a good trade-off, as it's relatively inexpensive overall (and gives cleaner cabling). A downside, though, is further discrepancy between rows.

Good suggestion. The discrepancy isn't ideal but I think a little asymmetry is worth it if we can improve performance. +1

I'm going to close this task for now. The problem has been mitigated as far as possible with the equipment we currently have.

In time, replacing the switch hardware and moving to higher-bandwidth uplinks will resolve the remaining discards.