While investigating poor performance for backup traffic between eqiad and codfw, we discovered that the asw devices in eqiad were dropping packets (see T274234).
Average usage is well within link capacity, but it is likely we are seeing microbursts as described here:
Mitigation - Buffer Allocation
By default, the switches reserve 50% of buffer memory for a "lossless" traffic class that we don't use. It appears we can re-partition the buffer to dedicate most of it to best-effort traffic instead.
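To illustrate why a larger best-effort share could help even when average usage is low, here is a minimal sketch of a single burst hitting a single egress port. It is a toy model, not a description of how the switches actually manage memory: the link speed, total buffer size, burst size and the 90% share are all hypothetical values chosen for the example, and the function names are ours. Real devices share buffer space dynamically across ports, which this ignores.

```python
# Toy model: one egress port, one burst, a fixed best-effort buffer share.
# All numbers below are hypothetical and chosen for illustration only.

LINK_BPS = 10_000_000_000        # assumed 10 Gbit/s egress link
TOTAL_BUFFER_BYTES = 12_000_000  # assumed total packet buffer


def best_effort_buffer(total_bytes: int, percent: int) -> int:
    """Bytes of buffer available to best-effort traffic at a given share."""
    return total_bytes * percent // 100


def dropped_bytes(burst_bytes: int, burst_us: int, link_bps: int, buffer_bytes: int) -> int:
    """Bytes dropped when a burst arrives over burst_us microseconds."""
    drained = link_bps * burst_us // 8_000_000  # bytes the link transmits during the burst
    backlog = max(0, burst_bytes - drained)     # bytes that have to queue
    return max(0, backlog - buffer_bytes)       # bytes that do not fit in the buffer


def average_bps(burst_bytes: int, bursts_per_second: int) -> int:
    """Average offered load in bit/s if the burst repeats at the given rate."""
    return burst_bytes * 8 * bursts_per_second


if __name__ == "__main__":
    burst = 8_000_000  # assumed: 8 MB arriving within 1 ms, once per second
    print("average load (bit/s):", average_bps(burst, 1))
    for percent in (50, 90):
        buf = best_effort_buffer(TOTAL_BUFFER_BYTES, percent)
        lost = dropped_bytes(burst, 1000, LINK_BPS, buf)
        print(f"{percent}% best-effort share: {lost} bytes dropped")
```

With these assumed numbers the average load is 64 Mbit/s, under 1% of the link, yet the 50% share drops 750,000 bytes of the burst while the 90% share absorbs it completely.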
This change should be rolled out to all EX/QFX switches across the network. The rollout is provisionally scheduled as follows:
|D||Complete - No Issues|
|C||Complete - No Issues|
|B||Tues July 27th 2021||15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)||T286061|
|A||Thurs July 29th 2021||15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)||T286032|