While investigating T200563, I realized that some wdqs servers were dropping a significant number of packets. A quick look at the Prometheus metric node_network_receive_drop shows that a number of other server groups also see significant packet drops (ganeti, elasticsearch, restbase, thumbor, ...).
After a quick chat with @fgiunchedi, a proposed check could be:
- check that the number of packets dropped over a period P is below a threshold T
- run that check at a relatively low frequency F
- proposed starting values: P = 24h, T = 1k packets, F = once every 6h
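
To make the proposal concrete, here is a minimal sketch of the check logic in Python. It is only an illustration, not the actual implementation: the function names are hypothetical, the samples are assumed to be raw counter values collected over the period P, and the PromQL string assumes the check would be expressed with Prometheus' `increase()` function over the metric named above.

```python
from typing import Sequence

# Assumed default taken from the proposal above (T = 1k packets).
DEFAULT_THRESHOLD = 1000


def counter_increase(samples: Sequence[float]) -> float:
    """Return the total increase of a counter across the given samples.

    A decrease between two consecutive samples is treated as a counter
    reset (for example after a reboot), so counting restarts from zero.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def check_ok(samples: Sequence[float], threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Return True if the drops over the sampled period are below the threshold."""
    return counter_increase(samples) < threshold


def promql_expr(metric: str = "node_network_receive_drop",
                period: str = "24h",
                threshold: int = DEFAULT_THRESHOLD) -> str:
    """Build a PromQL expression that matches hosts violating the check."""
    return f"increase({metric}[{period}]) >= {threshold}"
```

With these defaults, `promql_expr()` yields an expression that returns one series per host and interface whose drop count over the last 24h is not below 1k. Running it every 6h would match the proposed frequency F.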
Since this is a fleet-wide check, feedback is welcome before we implement this.
It would also make sense to apply the same check to node_network_transmit_drop, so that dropped packets on the transmit side are caught as well.