
Primary outbound port utilisation over 80% alert muted
Closed, ResolvedPublic

Description

I've muted the Primary outbound port utilisation over 80% alert; since 21:21:05 on 2024-02-25 this has been repeatedly firing and resolving for cr2-codfw.

Looking at Superset, the extra traffic is largely from AS32934 (Facebook), but the people available didn't feel confident working out which IP ranges to rate-limit with requestctl (which can't filter on as_number). So we decided to mute the alert rather than have it keep paging on a Sunday evening, on the basis that if this becomes a wider issue, more pages will fire.
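For reference, the prefixes announced by AS32934 can be pulled programmatically, which would give candidate CIDR ranges for a per-range rate-limit. A minimal sketch, assuming the RIPEstat announced-prefixes endpoint and its usual JSON layout (not something we ran at the time):

```python
# Sketch: list prefixes announced by AS32934 (Facebook/Meta) so they could be
# fed to a per-range rate-limit. Assumes the RIPEstat "announced-prefixes" API
# and its usual response shape; adjust if the endpoint or schema differs.
import requests

ASN = "AS32934"
URL = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={ASN}"

resp = requests.get(URL, timeout=30)
resp.raise_for_status()

prefixes = [p["prefix"] for p in resp.json()["data"]["prefixes"]]

# Split into v4/v6 so they can be handled separately if needed.
v4 = [p for p in prefixes if ":" not in p]
v6 = [p for p in prefixes if ":" in p]

print(f"{len(v4)} IPv4 and {len(v6)} IPv6 prefixes announced by {ASN}")
for p in sorted(v6):
    print(p)
```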

This will want re-enabling on Monday, though, I suspect.
Update 2024-02-26: Alert enabled again.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 1006472 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Block requests from "facebookexternalhit" UA

https://gerrit.wikimedia.org/r/1006472

Change 1006472 merged by Ayounsi:

[operations/puppet@production] Block requests from "facebookexternalhit" UA

https://gerrit.wikimedia.org/r/1006472

Mentioned in SAL (#wikimedia-operations) [2024-02-26T08:51:21Z] <XioNoX> deploy "facebookexternalhit" varnish 403 - T358455

The peering link to DE-CIX on cr2-codfw was saturating; the patch above was deployed to fix the immediate issue.

We also have a PNI to Meta on cr2-eqdfw, which wasn't working as expected because BGP sessions to Meta were also configured on cr2-codfw. This is now fixed.

However, the active/passive 10G links between codfw and eqdfw mean that pushing 10G to Meta from codfw would cause an even larger outage, as all the transit/peering links on cr2-eqdfw would be impacted.

Multiple paths forward from here:

  • Get in touch with FB
  • Keep the current 403
  • Fine tune a rate-limit
  • Increase link capacity

Mentioned in SAL (#wikimedia-operations) [2024-02-26T09:23:51Z] <Emperor> unmute the outbound port utilisation over 80% alert T358455

FWIW, the bulk of the bandwidth usage is generated by only 15 IPv6 addresses, mostly downloading .webm videos; see the already-filtered Superset dashboard.
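To illustrate the kind of aggregation behind that finding (the real analysis was done in Superset against the webrequest data), a sketch over a hypothetical TSV export with client IP, response size and URI path columns:

```python
# Sketch: find the client IPs responsible for the most .webm bytes.
# Assumes a hypothetical TSV export (client_ip, response_size, uri_path);
# the actual numbers come from the Superset dashboard linked above.
import csv
from collections import Counter

bytes_by_ip = Counter()

with open("webrequest_sample.tsv", newline="") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        if row["uri_path"].endswith(".webm"):
            bytes_by_ip[row["client_ip"]] += int(row["response_size"])

# Top 15 IPs by bytes served for .webm objects.
for ip, total in bytes_by_ip.most_common(15):
    print(f"{ip}\t{total / 1e9:.1f} GB")
```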

This would best be fixed by extending the haproxy bwlim work done in T317799; we've talked about having per-ASN limits in addition to the existing, partially-deployed per-file-URI limits.
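Conceptually, a per-ASN limit is just the same byte budget keyed on the client's origin AS instead of the file URI. A minimal token-bucket sketch of the accounting (this is not haproxy's bwlim filter, and ip_to_asn() is a hypothetical lookup, e.g. backed by a prefix-to-ASN table):

```python
# Sketch: per-ASN bandwidth accounting with a simple token bucket.
# Illustrates the idea behind extending the bwlim work to per-ASN limits;
# the rate and burst values below are illustrative, not proposed settings.
import time
from collections import defaultdict

RATE_BYTES_PER_SEC = 125_000_000  # ~1 Gbit/s per ASN (illustrative value)
BURST_BYTES = RATE_BYTES_PER_SEC  # allow up to one second of burst

_buckets = defaultdict(lambda: {"tokens": BURST_BYTES, "ts": time.monotonic()})

def allow(asn: int, nbytes: int) -> bool:
    """Return True if `nbytes` may be sent to clients in `asn` right now."""
    b = _buckets[asn]
    now = time.monotonic()
    # Refill tokens for the time elapsed since the last decision.
    b["tokens"] = min(BURST_BYTES, b["tokens"] + (now - b["ts"]) * RATE_BYTES_PER_SEC)
    b["ts"] = now
    if b["tokens"] >= nbytes:
        b["tokens"] -= nbytes
        return True
    return False  # over the per-ASN budget: delay or throttle the response

# Usage: allow(ip_to_asn(client_ip), len(response_chunk)) before each write,
# where ip_to_asn() is a hypothetical client-IP-to-ASN lookup.
```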