While investigating poor performance for backup traffic between eqiad and codfw, we discovered that packets were being dropped by the asw devices in eqiad (see T274234).
Average usage is well within link capacity, but it is likely we are seeing microbursts as described here:
Several factors are probably at play, not least the relatively low buffer memory on our current-generation switches, the physical topology, and the fact that SPINE->LEAF links run at 40G while SPINE->CR links run at 10G, so a burst arriving at 40G has to be buffered before it can leave on a 10G link.
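To illustrate why a speed mismatch can cause drops even when average usage is low, here is a rough back-of-the-envelope sketch. The buffer size used (1 MB) is purely a hypothetical figure for illustration, not a measured value for our switches, and the function name is made up for this example.

```python
def burst_absorb_time_us(buffer_bytes, ingress_gbps, egress_gbps):
    """Microseconds a buffer can absorb a line-rate burst before it overflows.

    The buffer fills at the difference between ingress and egress rates.
    """
    excess_bps = (ingress_gbps - egress_gbps) * 1e9
    if excess_bps <= 0:
        # Egress keeps up with ingress, so the buffer never fills.
        return float("inf")
    return buffer_bytes * 8 / excess_bps * 1e6


# Hypothetical: 1 MB of buffer available to the 10G egress port,
# with a burst arriving at 40G line rate.
print(burst_absorb_time_us(1_000_000, 40, 10))
```

Under those assumed numbers the buffer would fill in roughly 267 microseconds, which is far too short to show up in interface counters averaged over seconds or minutes.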
The problem has been mitigated somewhat by re-allocating as much buffer memory as possible to the active traffic classes (T284592), but drops still occur. This task tracks further progress and acts as a parent for any other tasks we create to deal with the issue.