Background
While investigating poor performance for backup traffic between eqiad and codfw, we discovered that the asw devices in eqiad were dropping packets (see T274234).
Average usage is well within link capacity, so the drops are most likely caused by microbursts, as described here:
https://kb.juniper.net/InfoCenter/index?page=content&id=KB36095
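To illustrate how drops can occur while average usage stays low, here is a minimal sketch of a single tail-drop queue. The traffic pattern, link rate and buffer sizes are hypothetical values chosen for illustration, not measurements from the asw devices.

```python
def simulate_queue(arrivals, drain_per_tick, buffer_size):
    """Simulate one egress queue with tail drop.

    arrivals: bytes arriving in each tick (hypothetical traffic pattern).
    drain_per_tick: bytes the link can transmit per tick.
    buffer_size: bytes the queue can hold between ticks.
    Returns (dropped_bytes, offered_load), where offered_load is the
    average arrival rate as a fraction of link capacity.
    """
    queued = 0
    dropped = 0
    for arriving in arrivals:
        queued += arriving
        # The port transmits as much as it can during this tick.
        queued -= min(queued, drain_per_tick)
        # Anything that does not fit in the buffer is dropped.
        if queued > buffer_size:
            dropped += queued - buffer_size
            queued = buffer_size
    offered_load = sum(arrivals) / (drain_per_tick * len(arrivals))
    return dropped, offered_load


# One 10,000-byte burst followed by 99 quiet ticks of 100 bytes each.
arrivals = [10_000] + [100] * 99

small_drops, load = simulate_queue(arrivals, drain_per_tick=1_000, buffer_size=2_000)
large_drops, _ = simulate_queue(arrivals, drain_per_tick=1_000, buffer_size=9_000)

print(f"offered load: {load:.1%}")
print(f"drops with 2,000-byte buffer: {small_drops}")
print(f"drops with 9,000-byte buffer: {large_drops}")
```

In this toy model the average load is under 20% of capacity, yet the small buffer drops most of the burst while the larger one absorbs it. That is the reasoning behind giving the queue we actually use more buffer space.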
Mitigation - Buffer Allocation
By default, the switches reserve 50% of buffer memory for a "lossless" traffic class that we don't use. It appears we can re-partition this space so that the majority of it is dedicated to best-effort traffic instead.
The buffer re-partitioning should be rolled out to all EX/QFX switches across the network. It is scheduled as follows:
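As a rough sketch of what re-partitioning means in terms of usable space, the snippet below splits a shared buffer by percentage. The total buffer size, the class names and the "after" percentages are assumptions for illustration only; they are not the values deployed on the switches.

```python
def partition_buffer(total_bytes, percentages):
    """Split a shared buffer between traffic classes by percentage.

    percentages maps a class name to an integer share; shares must sum to 100.
    """
    if sum(percentages.values()) != 100:
        raise ValueError("partition percentages must sum to 100")
    return {name: total_bytes * pct // 100 for name, pct in percentages.items()}


TOTAL_BUFFER = 12_000_000  # hypothetical shared buffer size in bytes

# Default-style split: half the buffer reserved for the unused lossless class.
before = partition_buffer(TOTAL_BUFFER, {"lossless": 50, "best-effort": 50})

# Hypothetical re-partition giving most of the space to best-effort.
after = partition_buffer(TOTAL_BUFFER, {"lossless": 10, "best-effort": 90})

print("before:", before)
print("after: ", after)
```

With these illustrative numbers, the space available to best-effort traffic grows from 6,000,000 to 10,800,000 bytes without any change to the hardware.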
Eqiad - Completed without issue.
Row | Date | Time | Task | Status |
---|---|---|---|---|
D | | | | Complete - No Issues |
C | | | | Complete - No Issues |
B | | | | Complete - No Issues |
A | | | | Complete - No Issues |
CODFW - TBC
To be scheduled following DC switchover back to eqiad in September.
CloudSW - Eqiad
The same outbound drops also occur at a non-trivial rate on multiple interfaces of the "cloudsw" devices that connect WMCS endpoints in eqiad. The following two switches require the same change: