
Return traffic to eqiad WMCS triggering FNM
Closed, Resolved · Public

Description

Since Fastnetmon was deployed, we have had a few false positives like:

Possible DDoS to 185.15.56.1

185.15.56.1 being WMCS gateway's main IP.
The reason is that something in WMCS is periodically and heavily downloading something from the Internet.
There is nothing wrong with that, but the return traffic is high enough to trigger FNM.

(optional) It might be useful for WMCS to check that this spike is not saturating anything in their infra (just in case)

Then we can either increase the global FNM thresholds (easy), e.g. https://github.com/wikimedia/puppet/blob/production/modules/fastnetmon/templates/fastnetmon.conf.erb#L52

Or (more complex, and it introduces snowflakes) set up custom thresholds for that IP (or any IPs in the WMCS range), see https://github.com/pavel-odintsov/fastnetmon/blob/master/src/fastnetmon.conf#L262
This is probably better in the long run (e.g. different thresholds for LVS VIPs vs. regular servers).

Or (less preferred) whitelist that IP (or range) to not be monitored (cf. https://github.com/pavel-odintsov/fastnetmon/blob/master/src/fastnetmon.conf#L43 )

Details

Related Gerrit Patches:
[operations/puppet@production] Fastnetmon: add thresholds overrides

Event Timeline

ayounsi triaged this task as Low priority. (Dec 15 2019, 11:58 AM)
ayounsi created this task.
Restricted Application added a subscriber: Aklapper. (Dec 15 2019, 11:58 AM)
Krinkle updated the task description. (Dec 15 2019, 8:19 PM)

FWIW I think if the current thresholds are good at detecting DDoS we should explicitly whitelist WMCS ranges with say 1.5x the current thresholds and see how far that gets us.

aborrero added a subscriber: aborrero.

FWIW I think if the current thresholds are good at detecting DDoS we should explicitly whitelist WMCS ranges with say 1.5x the current thresholds and see how far that gets us.

+1 to this.

For the record, that link/NIC is 10G, in case it matters for any configuration about thresholds.

The last fastnetmon warning was this:

  • event start 2019-12-16 04:43 UTC
  • event end 2019-12-16 05:15 UTC
  • average incoming traffic: 606 mbps (for 185.15.56.1)
  • average network incoming traffic: 6 mbps (for 185.15.56.0/24)

This more or less matches what we have in Grafana.

https://grafana.wikimedia.org/d/000000571/cloudvps-eqiad1?orgId=1&from=1576470794630&to=1576473413515

Worth noting that previous fastnetmon alerts for this IP seem to be for relatively similar average numbers.
Could the problem be with the averages? According to Prometheus, our average may be something like 300 Mbps or 400 Mbps.

Anyway, I have previously seen other peaks of several Gbps (up to 4 Gbps), but they now appear to be lost to the averaging / resolution limits of old metrics.
That was before we had fastnetmon I guess.
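To illustrate the averaging point above, here is a minimal sketch of how coarse-resolution retention can hide a short burst. The sample values are made up for illustration (a 400 Mbps baseline with a two-minute 4000 Mbps burst) and are not taken from Prometheus or Grafana; `downsample` is a hypothetical helper, not how either tool is implemented.

```python
def downsample(samples_mbps, factor):
    """Average consecutive groups of `factor` samples into one bucket each."""
    buckets = []
    for start in range(0, len(samples_mbps), factor):
        chunk = samples_mbps[start:start + factor]
        buckets.append(sum(chunk) / len(chunk))
    return buckets


# Illustrative one-minute samples in Mbps (made up): a 400 Mbps baseline
# with a 4000 Mbps burst lasting two minutes.
samples = [400] * 28 + [4000] * 2 + [400] * 30

fine = downsample(samples, 1)     # full resolution keeps the burst
coarse = downsample(samples, 30)  # 30-minute buckets flatten it

print(max(fine))    # 4000.0
print(max(coarse))  # 640.0
```

With 30-minute buckets the same burst shows up as a 640 Mbps peak, which is in the same range as the averages reported in the alert.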

Thanks for the feedback! The next question is about scope. We can either:

1/ Hardcode that exception (and maybe a few others down the road) in the configuration (or better, in Hiera), and call it a day

hostgroup = wmcs_eqiad_gw:185.15.56.1/32
wmcs_eqiad_gw_enable_ban = on
wmcs_eqiad_gw_ban_for_pps = on
wmcs_eqiad_gw_ban_for_bandwidth = on
wmcs_eqiad_gw_ban_for_flows = off
wmcs_eqiad_gw_threshold_pps = 200000
wmcs_eqiad_gw_threshold_mbps = 1000
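If the exception ends up in Hiera, the stanza above could be rendered from data rather than written by hand. The following is only a sketch in Python of that idea: `render_hostgroup` and the shape of `overrides` are hypothetical, and the actual patch uses a Puppet ERB template, not this code.

```python
def render_hostgroup(name, cidrs, threshold_pps, threshold_mbps, ban_for_flows=False):
    """Return fastnetmon.conf lines for one per-hostgroup threshold override."""
    def on_off(flag):
        return "on" if flag else "off"

    return [
        f"hostgroup = {name}:{','.join(cidrs)}",
        f"{name}_enable_ban = on",
        f"{name}_ban_for_pps = on",
        f"{name}_ban_for_bandwidth = on",
        f"{name}_ban_for_flows = {on_off(ban_for_flows)}",
        f"{name}_threshold_pps = {threshold_pps}",
        f"{name}_threshold_mbps = {threshold_mbps}",
    ]


# Hypothetical Hiera-like data; the values are the ones proposed in this task.
overrides = {
    "wmcs_eqiad_gw": {
        "cidrs": ["185.15.56.1/32"],
        "threshold_pps": 200000,
        "threshold_mbps": 1000,
    },
}

for group_name, settings in overrides.items():
    print("\n".join(render_hostgroup(group_name, **settings)))
```

Adding another exception later would then only mean adding one more entry to `overrides`.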

2/ Automatically generate hostgroups based on Puppet's modules/network/data/data.yaml
For example one for all LVS VIPs

I can imagine a final state where we have different thresholds for:

  • LVS (high mbps/pps)
  • AuthDNS (high pps, low mbps)
  • Special exceptions (eg. WMCS gateway, Tor)
  • All other regular public facing hosts (tbd)
  • Network infrastructure (low pps/low mbps)

But that might be overkill instead of just defining a few exceptions.
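As a rough sketch of what option 2 implies, the snippet below maps an IP to a hostgroup by prefix membership, so that each class could later get its own thresholds. Everything here is an assumption for illustration: the group names, the prefixes (documentation ranges, except the WMCS gateway IP from this task), the "most specific prefix wins" rule, and the `regular` fallback. It does not reflect the real layout of modules/network/data/data.yaml.

```python
import ipaddress

# Hypothetical hostgroups. Only 185.15.56.1/32 comes from this task; the
# other prefixes are documentation ranges (RFC 5737) standing in for real ones.
HOSTGROUPS = {
    "wmcs_eqiad_gw": ["185.15.56.1/32"],
    "lvs": ["192.0.2.0/27"],
    "authdns": ["198.51.100.53/32"],
    "network_infra": ["203.0.113.0/24"],
}


def hostgroup_for(ip, hostgroups=HOSTGROUPS, default="regular"):
    """Return the group whose most specific prefix contains `ip`."""
    address = ipaddress.ip_address(ip)
    best_name, best_len = default, -1
    for name, cidrs in hostgroups.items():
        for cidr in cidrs:
            network = ipaddress.ip_network(cidr)
            if address in network and network.prefixlen > best_len:
                best_name, best_len = name, network.prefixlen
    return best_name


print(hostgroup_for("185.15.56.1"))  # wmcs_eqiad_gw
print(hostgroup_for("185.15.56.2"))  # regular
```

Hosts that match no listed prefix fall through to the default group, which corresponds to the "all other regular public facing hosts" class in the list above.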

Thoughts?

On our side, I think either 1) or 2) would work just fine.

Perhaps start with 1) and see if 2) makes sense for a next iteration.

I guess parsing past alerts can help answer this question? If the number of exceptions that need to be defined is small enough, it might not be worth investing in 2) right now.

Good point! Looking at past alerts, only that one was a false positive.

Option 2 would allow us to have stricter thresholds, but I agree that it's outside the scope here.

+1 to doing #1 and revisiting if it becomes a problem again.

Change 559125 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Fastnetmon: add thresholds overrides

https://gerrit.wikimedia.org/r/559125

Change 559125 merged by Ayounsi:
[operations/puppet@production] Fastnetmon: add thresholds overrides

https://gerrit.wikimedia.org/r/559125

ayounsi closed this task as Resolved. (Dec 20 2019, 1:22 PM)
ayounsi claimed this task.

All good!