
varnish filtering: should we automatically update public_cloud_nets
Open, Medium, Public

Description

Currently we have a Hiera key abuse_networks['public_cloud_nets'] which is used in Varnish to provide some rate limiting. As IP allocations for these big cloud providers change somewhat frequently, I wonder if we should put something in place to automate refreshing this data. The current data suggests it was "generated on 2019-12-30".

Event Timeline

jbond triaged this task as Medium priority. Dec 17 2020, 2:48 PM
jbond created this task.

FWIW, AWS allows subscribing to notifications when the list changes; see https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html#subscribe-notifications
Google Cloud Compute is a bit less structured; see https://cloud.google.com/vpc/docs/vpc#manually_created_subnet_ip_ranges (under the "Restricted ranges" paragraph there is a link to ipranges/goog.txt).

Volans renamed this task from "varnihs filtering: should we automaticly update public_cloud_nets" to "varnish filtering: should we automatically update public_cloud_nets". Dec 17 2020, 2:57 PM

Two other options:

  • Define a list of ASNs and get the matching prefixes from BGP (or an API like RIPEstat)
  • Define a list of ASNs and get the matching prefixes from the MaxMind DBs

I like the second as we already have the tooling around it, and it doesn't require regularly fetching data from URLs that could change or break. (A minimal sketch of this approach follows.)
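
To make the MaxMind option concrete, here is a minimal sketch assuming the GeoLite2 ASN database in its CSV form; the CSV filenames and the example ASNs are assumptions for illustration, not the production tooling:

```python
import csv

# Example ASN selection (an assumption, not the production list):
# 16509 = Amazon, 8075 = Microsoft, 396982 = Google Cloud.
CLOUD_ASNS = {16509, 8075, 396982}

def asn_prefixes(csv_path: str, asns: set[int]) -> list[str]:
    """Return every network whose autonomous_system_number is in asns."""
    with open(csv_path, newline="") as fh:
        return [row["network"] for row in csv.DictReader(fh)
                if int(row["autonomous_system_number"]) in asns]

if __name__ == "__main__":
    # The GeoLite2 ASN download ships IPv4 and IPv6 blocks as separate CSVs.
    for path in ("GeoLite2-ASN-Blocks-IPv4.csv", "GeoLite2-ASN-Blocks-IPv6.csv"):
        for net in asn_prefixes(path, CLOUD_ASNS):
            print(net)
```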

A downside, for example with Google, is that it will most likely include crawler IPs.

> A downside, for example with Google, is that it will most likely include crawler IPs.

I'm also worried about cases where the ASN IP space includes things like all their MXes, or their corporate workstation IP space as well. This is true of multiple cloud providers.

We might have to implement a few different scrape approaches...

There is this script for AWS that @ema pointed me towards:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/utils/vcl_ec2_nets.py

I took some inspiration from that and made this for Azure after a chat on IRC:

https://phabricator.wikimedia.org/P15965

But it's a bit unwieldy, given how Microsoft exposes the data. And I am probably wildly outside the code style and library conventions we should use.

A few points:

  • Microsoft exposes the data, but they don't provide consistent URLs for the JSON files, so you need to parse their HTML to find them, which is a pain (a rough sketch follows this list).
  • They do provide some metadata on the ranges, and those with a "systemService" tag appear to be the ones used by internal Azure functions (rather than customer VMs etc.)
  • I've tried to remove those, but there may be overlap between some of these and ranges that appear elsewhere. There are definitely overlapping ranges in the dataset.
  • The data is just for Azure, so it should not include any IP space announced by Microsoft but used for something else (to Chris's point).
  • The IPv6 ranges are there too, but, copying the AWS script, I didn't print them.
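
As a rough sketch of the approach the list above describes, assuming the public "Azure IP Ranges and Service Tags" download page and its JSON schema (both of which Microsoft may change):

```python
import re
import requests

DOWNLOAD_PAGE = "https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519"

def azure_customer_prefixes() -> set[str]:
    """Scrape the download page for the dated JSON URL, then keep the
    prefixes of entries without a systemService tag (i.e. not internal
    Azure functions)."""
    html = requests.get(DOWNLOAD_PAGE, timeout=30).text
    # The JSON filename carries a date stamp, so find it in the page body.
    match = re.search(
        r'https://download\.microsoft\.com/[^"]*ServiceTags_Public[^"]*\.json', html)
    if match is None:
        raise RuntimeError("could not find the ServiceTags JSON URL in the page")
    data = requests.get(match.group(0), timeout=30).json()
    prefixes: set[str] = set()
    for value in data["values"]:
        props = value["properties"]
        if props.get("systemService"):  # tagged entries look internal; skip
            continue
        prefixes.update(props["addressPrefixes"])
    return prefixes

if __name__ == "__main__":
    for prefix in sorted(azure_customer_prefixes()):
        print(prefix)
```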

Nice work :)

I have just noticed that this script outputs a format designed for Varnish; however, we now generate this ACL in Puppet based on the abuse_networks block in /srv/private/hieradata/common.yaml. We should probably update the script to output YAML (sketch below), or to update /srv/private/hieradata/common.yaml directly.
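
A minimal sketch of the YAML output, with the caveat that the real abuse_networks structure lives in the private repo, so the key layout here is hypothetical:

```python
import yaml

# Output of the fetch/aggregate step; illustrative values only.
prefixes = ["192.0.2.0/24", "2001:db8::/32"]

# Hypothetical key layout: adjust to match the actual abuse_networks
# block in the private repo.
print(yaml.safe_dump({"abuse_networks": {"public_cloud_nets": sorted(prefixes)}},
                     default_flow_style=False))
```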

> I took some inspiration from that and made this for Azure after a chat on IRC:
> https://phabricator.wikimedia.org/P15965

To make review easier it would be useful to upload this to Gerrit, perhaps as utils/azure_networks.py. In the meantime I have left some suggestions on the paste.

> But it's a bit unwieldy, given how Microsoft exposes the data.

Yes, it's insane that they don't have an API for this.

> And I am probably wildly outside the code style and library conventions we should use.

Once it's in Gerrit, CI and Riccardo will surely pick up most of these :)

> • Microsoft exposes the data, but they don't provide consistent URLs for the JSON files, so you need to parse their HTML to find them, which is a pain.

I have suggested a potentially easier tag to search for, but YMMV.

> • They do provide some metadata on the ranges, and those with a "systemService" tag appear to be the ones used by internal Azure functions (rather than customer VMs etc.)
> • I've tried to remove those, but there may be overlap between some of these and ranges that appear elsewhere. There are definitely overlapping ranges in the dataset.

I think we can live with a few minor false positives; the rate limiting put in place for these IP ranges is fairly light.

> • The data is just for Azure, so it should not include any IP space announced by Microsoft but used for something else (to Chris's point).
> • The IPv6 ranges are there too, but, copying the AWS script, I didn't print them.

We can support IPv6 in the abuse_networks YAML block, so no need to filter these out.

This script is a good start; we also need to think about how we update the YAML file in the private repo. In the first instance I think a script which you run manually would be a good start. We can then integrate the AWS script and think about whether we should automate this. Also, thinking out loud: is this something we could/should add to Netbox and then generate the YAML structures from there, using e.g. the Netbox/Puppet integration?

Thanks @jbond, I appreciate the feedback.

Your improvements to the script look great. Nice work on the parsing, much cleaner than mine, and the single loop with a separate set of exclusions makes perfect sense.

One thing I do think we should include is some sort of IP aggregation, which I notice isn't in the updated script. Running it with aggregation yields 1540 IPv4 prefixes, as against 4132 without (there is a high level of redundancy/overlap in the data, as mentioned). So reducing it will help performance-wise; a stdlib sketch follows.
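
For reference, aggregation needs nothing beyond the standard library; a minimal sketch:

```python
import ipaddress

def aggregate(prefixes):
    """Collapse overlapping/adjacent networks into the minimal covering
    set, keeping IPv4 and IPv6 separate as collapse_addresses requires."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    v4 = ipaddress.collapse_addresses(n for n in nets if n.version == 4)
    v6 = ipaddress.collapse_addresses(n for n in nets if n.version == 6)
    return [str(n) for n in v4] + [str(n) for n in v6]

# Two adjacent /25s collapse into one /24:
assert aggregate(["192.0.2.0/25", "192.0.2.128/25"]) == ["192.0.2.0/24"]
```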

I'm not sure if Netbox is the right place to *store* this data, but happy to discuss. You folks know better how we use the different tools, data sources, etc. For now I'll update the script with your improvements and submit it to Gerrit so it's there for further discussion.

> One thing I do think we should include is some sort of IP aggregation

Completely agree; it's an oversight that it's missing.

> I'm not sure if Netbox is the right place to *store* this data, but happy to discuss.

I honestly don't know either. @Volans, @ayounsi?

> I'm not sure if Netbox is the right place to *store* this data, but happy to discuss. You folks know better how we use the different tools, data sources, etc. For now I'll update the script with your improvements and submit it to Gerrit so it's there for further discussion.

AFAICT right now those live in the private Puppet repository, and we don't have a standard way to update them programmatically.
I personally don't see Netbox as the right place for those, at least as prefixes, for a couple of reasons:

  • Adding all the large public clouds' prefixes means adding hundreds of prefixes that will pollute the UI and require always filtering when looking at it.
  • The only "sane" way I see to add this data to Netbox is to add the public clouds as tenants and then the prefixes as prefixes. But that would mean having a lot of data that we don't own.

The other option, using Netbox anyway but not storing them as prefixes, could be to use a config context or a custom script/plugin that caches the data, so that we can poll it without re-fetching it each time, with some timer that refreshes it.

I think the best place to store the data depends on where we need to use it. If it's used only in the CDN configuration via Puppet, for now it's probably better to keep it in Hiera; if we need that data on the network devices, and hence in Homer too, then maybe Netbox is a better place and we can add it to the things we want to be able to read in Puppet from Netbox.

> I personally don't see Netbox as the right place for those, at least as prefixes

Ack. I think in that case the best way forward for now is to create a script one runs manually to update the Puppet private repo. Longer term, I think working on T270618 to create a more generic strategy for managing the block list would be the best place to spend effort.

I still think filtering public clouds on their ASN (with a MaxMind DB) is the most sustainable path until T270618.
Having to maintain multiple scripts for multiple providers is quickly going to be a hassle, and that's only for providers that share their IP ranges one way or another, which is not the case for most of them. Furthermore, those lists don't seem great at separating DNS, corp, and MX ranges from customer IPs, which will require manual curation.

As the blocking is only at the Varnish layer, I don't think MXes are an issue (nor DNS), and corporate workstations now use IPv6.
Even then, rate limiting (or blocking) their traffic in case of an attack from their network doesn't seem harsh to me. The returned message needs to be more verbose than just "Too Many Requests".
Then, if *really* needed, managing a whitelist of a few entries (e.g. crawler UAs) will be much easier than a blacklist of thousands of entries.

I fear we could be quite disappointed about "corporate workstations" being on IPv6 if we went to look ;) Either way, I assume we want this list for both v4 and v6? So that won't make any real difference?

I'd totally agree though: eyeball networks are what we want to avoid, and neither MS, AWS, nor Google represents those, excluding their own internal offices maybe. So any "collateral damage" from doing it on an ASN-wide basis is fairly small. And doing it on that basis will definitely produce a much smaller filter list, which has obvious benefits.

That said, if we do want to be more precise, I don't believe it's that tricky. The three largest cloud providers all publish such lists and are committed to doing so AFAIK. Even Microsoft, although for whatever reason they don't provide a static link, make the data freely available. The other two give it to you directly, so it's not a massive challenge (a fetch sketch follows the links):

AWS: https://ip-ranges.amazonaws.com/ip-ranges.json
Google: https://www.gstatic.com/ipranges/cloud.json
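
A minimal fetch sketch for those two stable URLs; the JSON field names follow the published formats, and the EC2-only filter for AWS is an assumption about which ranges represent customer compute:

```python
import requests

def aws_prefixes() -> list[str]:
    """Fetch AWS's published ranges, keeping only EC2-tagged entries."""
    data = requests.get("https://ip-ranges.amazonaws.com/ip-ranges.json", timeout=30).json()
    v4 = [p["ip_prefix"] for p in data["prefixes"] if p["service"] == "EC2"]
    v6 = [p["ipv6_prefix"] for p in data["ipv6_prefixes"] if p["service"] == "EC2"]
    return v4 + v6

def gcp_prefixes() -> list[str]:
    """Fetch Google Cloud's published ranges; each entry carries either
    an ipv4Prefix or an ipv6Prefix key."""
    data = requests.get("https://www.gstatic.com/ipranges/cloud.json", timeout=30).json()
    return [p.get("ipv4Prefix") or p["ipv6Prefix"] for p in data["prefixes"]]

if __name__ == "__main__":
    print(len(aws_prefixes()), "AWS prefixes;", len(gcp_prefixes()), "GCP prefixes")
```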

> Furthermore, those lists don't seem great at separating DNS, corp, and MX ranges from customer IPs, which will require manual curation.

I'm not sure what makes you say that? None of that should be included, as these are supposed to be Azure/AWS/GCP-only lists, not Microsoft/Amazon/Google-wide.

Search is implementing a temporary reactive solution to https://phabricator.wikimedia.org/T284479, but will need the issue here, regarding an automatically maintained list of public cloud IPs, resolved before we can implement a better long-term solution that doesn't depend on manual reactive maintenance.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

Although it does not do what we need, some logic to download the lists from multiple clouds can be gathered from this project: https://github.com/nccgroup/cloud_ip_ranges/blob/master/cloud_ip_ranges.py

Brandon also just pointed me to `git grep netmapper` (in the puppet repo) and https://gerrit.wikimedia.org/g/operations/software/varnish/libvmod-netmapper, which may be a better way to automatically update these lists in Varnish directly (i.e. move away from the abuse_networks Hiera key); a hedged sketch of the data file follows.
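
If we go the netmapper route, the update job would just need to regenerate a JSON file on disk. A hedged sketch, assuming netmapper consumes a map from vendor label to CIDR list (check libvmod-netmapper's documentation for the exact schema it expects):

```python
import json

def write_public_clouds(nets_by_vendor: dict[str, list[str]], path: str) -> None:
    """Dump vendor -> sorted network list as a JSON object."""
    with open(path, "w") as fh:
        json.dump({v: sorted(nets) for v, nets in nets_by_vendor.items()}, fh, indent=2)

# Illustrative values only: real data would come from the per-vendor fetchers.
write_public_clouds({"aws": ["192.0.2.0/24"], "gcp": ["2001:db8::/32"]},
                    "public-clouds.json")
```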

Change 769132 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] (WIP) C:varnish: Add automatic cloud nets update

https://gerrit.wikimedia.org/r/769132

Change 769410 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] O:external_clouds_vendors: New module for fetching cloud networks

https://gerrit.wikimedia.org/r/769410

Change 769464 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:varnish: Load public-clouds.json via netmapper

https://gerrit.wikimedia.org/r/769464

Change 769469 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] varnish: create rate limit keyed on the cloud provider

https://gerrit.wikimedia.org/r/769469

Change 769410 merged by Jbond:

[operations/puppet@production] O:external_clouds_vendors: New module for fetching cloud networks

https://gerrit.wikimedia.org/r/769410

Change 769667 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] C:varnish: add the external cloud vendors file to the cache clusters

https://gerrit.wikimedia.org/r/769667

Change 769667 merged by Giuseppe Lavagetto:

[operations/puppet@production] C:varnish: add the external cloud vendors file to the cache clusters

https://gerrit.wikimedia.org/r/769667

Change 769132 abandoned by Jbond:

[operations/puppet@production] C:varnish: Add the external_cloud_vendors module to the cache clusters

Reason:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/769667

https://gerrit.wikimedia.org/r/769132

Change 769464 merged by Giuseppe Lavagetto:

[operations/puppet@production] C:varnish: Load public-clouds.json via netmapper

https://gerrit.wikimedia.org/r/769464

Change 775360 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] external_clouds_vendors: Add Linode

https://gerrit.wikimedia.org/r/775360

Change 775360 merged by RLazarus:

[operations/puppet@production] external_clouds_vendors: Add Linode

https://gerrit.wikimedia.org/r/775360

Change 779145 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] sretest: Uninstall external_clouds_vendors

https://gerrit.wikimedia.org/r/779145

Change 779146 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] sretest: Remove absented external_clouds_vendors

https://gerrit.wikimedia.org/r/779146

Change 779145 merged by RLazarus:

[operations/puppet@production] sretest: Uninstall external_clouds_vendors

https://gerrit.wikimedia.org/r/779145

Change 779146 merged by RLazarus:

[operations/puppet@production] sretest: Remove absented external_clouds_vendors

https://gerrit.wikimedia.org/r/779146