Yesterday, we got a page for (times in UTC):
18:27:46 <+jinxer-wm> FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
On inspection, and thanks to a recommendation by @CDanis, the suggestion was that this was internal traffic related to the Analytics cluster. network_flows_internal takes a while to catch up, so it took some time to see which hosts were involved, but it looks like they were an-worker1165.eqiad.wmnet and 10.64.157.4. You can see it in Turnilo at https://w.wiki/A4ym.
@Dzahn noticed that there were kafka-jumbo restarts around the same time, and the alert pointed to port 50010, which shows up in Puppet as the Hadoop datanode port.
The alert self-resolved, which makes sense as the traffic started going down. I am filing this task so that we can work out what happened here and, if needed, prevent it from happening again.
Thanks!