
varnish filtering: should we automatically update public_cloud_nets
Open, Medium, Public

Description

Currently we have a hiera key abuse_networks['public_cloud_nets'] which is used in varnish to provide some rate limiting. As IP allocations for these big cloud providers change somewhat frequently, I wonder if we should put something in place to automate refreshing this data. The current data suggests it was "generated on 2019-12-30".

Event Timeline

jbond triaged this task as Medium priority.Dec 17 2020, 2:48 PM
jbond created this task.
Restricted Application added a subscriber: Aklapper. Dec 17 2020, 2:48 PM

FWIW, AWS lets you subscribe to notifications when the list changes; see https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html#subscribe-notifications
Google Cloud Compute is a bit less structured; see https://cloud.google.com/vpc/docs/vpc#manually_created_subnet_ip_ranges (under the "Restricted ranges" paragraph there is a link to ipranges/goog.txt).
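
For illustration, a minimal sketch of consuming the published AWS list, assuming the documented ip-ranges.json layout (a "prefixes" list whose entries carry "ip_prefix" and "service", plus "ipv6_prefixes" with "ipv6_prefix") and keeping only the EC2 ranges:

```python
#!/usr/bin/env python3
"""Sketch: fetch the AWS IP ranges and keep only the EC2 prefixes."""
import requests

AWS_URL = 'https://ip-ranges.amazonaws.com/ip-ranges.json'


def ec2_prefixes():
    """Return sorted IPv4 and IPv6 prefix lists tagged with the EC2 service."""
    data = requests.get(AWS_URL, timeout=30).json()
    v4 = {p['ip_prefix'] for p in data['prefixes'] if p['service'] == 'EC2'}
    v6 = {p['ipv6_prefix'] for p in data['ipv6_prefixes'] if p['service'] == 'EC2'}
    return sorted(v4), sorted(v6)


if __name__ == '__main__':
    v4, v6 = ec2_prefixes()
    print('\n'.join(v4 + v6))
```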

Volans renamed this task from varnihs filtering: should we automaticly update public_cloud_nets to varnish filtering: should we automatically update public_cloud_nets .Dec 17 2020, 2:57 PM

2 other options:

  • Define a list of ASNs and get the matching prefixes from BGP (or API like RIPE stats)
  • Define a list of ASNs and get the matching prefixes from MaxMind DBs

I like the 2nd as we already have the tooling around it, and it doesn't require regularly fetching data from URLs that could change/break.

A downside, for example with Google, is that it will most likely include crawler IPs.

A downside, for example with Google, is that it will most likely include crawler IPs.

I'm also worried about cases where the ASN IP space includes things like all their MXes, or their corporate workstation IP space as well. This is true of multiple cloud providers.

We might have to implement a few different scrape approaches...

There is this script for AWS that @ema pointed me towards:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/utils/vcl_ec2_nets.py

I took some inspiration from that and made this for Azure after a chat on irc:

https://phabricator.wikimedia.org/P15965

But it's a bit unwieldy, given how Microsoft expose the data. And I am probably wildly outside the code style and the choice of libraries I should be using.

A few points:

  • Microsoft expose the data, but they don't provide consistent URLs for the JSON files, so you need to parse their HTML to get them, which is a pain.
  • They do provide some metadata on the ranges, and those with a "systemService" tag appear to be the ones used by internal Azure functions (rather than customer VMs etc.); see the sketch after this list.
  • I've tried to remove those, but there may be overlap between some of these and ranges that appear elsewhere. There are definitely overlapping ranges in the dataset.
  • The data is just for Azure, so it should not include any IP space announced by Microsoft but used for something else (to Chris's point).
  • The IPv6 ranges are there too, but copying from the AWS script I didn't print them.
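
For illustration, a minimal sketch of the filtering described above; the field names ("values", "properties", "systemService", "addressPrefixes") are my reading of the Azure service-tags JSON and should be checked against the paste (P15965), which remains the authoritative version:

```python
import json


def azure_customer_prefixes(path):
    """Return Azure prefixes not tagged with a systemService (assumed schema)."""
    with open(path) as f:
        data = json.load(f)
    keep, drop = set(), set()
    for entry in data['values']:
        props = entry['properties']
        prefixes = props.get('addressPrefixes', [])
        if props.get('systemService'):
            # Ranges carrying a systemService tag appear to belong to internal
            # Azure services rather than customer VMs; collect them separately
            # so they can be excluded.
            drop.update(prefixes)
        else:
            keep.update(prefixes)
    # Note: this only removes exact matches; overlapping ranges (which the
    # dataset definitely contains) would need aggregation to handle properly.
    return sorted(keep - drop)
```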

Nice work :)

I have just noticed that this script outputs a format designed for varnish; however, we now generate this ACL in Puppet based on the abuse_networks block in /srv/private/hieradata/common.yaml. We should probably update the script to output YAML, or to update /srv/private/hieradata/common.yaml directly.
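
As a rough illustration of that, and assuming the hiera layout implied by the task description (an abuse_networks mapping with a public_cloud_nets list; the real schema lives in the private repo), the script could emit something like:

```python
import yaml


def to_hiera_yaml(prefixes):
    """Render prefixes as an abuse_networks hiera snippet (assumed layout)."""
    return yaml.safe_dump(
        {'abuse_networks': {'public_cloud_nets': sorted(prefixes)}},
        default_flow_style=False,
    )


# Documentation-only example ranges (RFC 5737 / RFC 3849).
print(to_hiera_yaml(['192.0.2.0/24', '2001:db8::/32']))
```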

I took some inspiration from that and made this for Azure after a chat on irc:

https://phabricator.wikimedia.org/P15965

To make review easier it would be useful to upload this to Gerrit, perhaps as utils/azure_networks.py. In the meantime I have sent some suggestions on the paste.

But it's a bit unwieldy, given how Microsoft expose the data.

Yes, it's insane that they don't have an API for this.

And I am probably wildly outside the code style and the choice of libraries I should be using.

Once it's in Gerrit, CI and Riccardo will surely pick up most of these :)

  • Microsoft expose the data, but they don't provide consistent URLs for the JSON files, so you need to parse their HTML to get them, which is a pain.

I have suggested a potentially easier tag to search for, but YMMV.

  • They do provide some metadata on the ranges, and those with a "systemService" tag appear to be the ones used by internal Azure functions (rather than customer VMs etc.)
  • I've tried to remove those, but there may be overlap between some of these and ranges that appear elsewhere. There are definitely overlapping ranges in the dataset.

I think we can live with a few minor false positives; the rate limiting put in place for these IP ranges is fairly light.

  • The data is just for Azure, so it should not include any IP space announced by Microsoft but used for something else (to Chris's point).
  • The IPv6 ranges are there too, but copying from the AWS script I didn't print them.

We can support IPv6 in the abuse_networks YAML block, so there is no need to filter these out.

This script is a good start; we also need to think about how we update the YAML file in the private repo. In the first instance I think a script which you run manually would be a good start. We can then integrate the AWS script and think about whether we should automate this. Also, thinking out loud: is this something we could/should add to Netbox and then generate the YAML structures from there using e.g. the Netbox/Puppet integration?

Thanks jbond, appreciate the feedback.

Your improvements to the script look great. Nice work on the parsing, much cleaner than my attempt, and the single loop and the separate set of exclusions make perfect sense.

One thing I do think we should include is some sort of IP aggregation, which I notice isn't in the updated script. Running it with aggregation yields 1540 IPv4 prefixes, versus 4132 without (there is a high level of redundancy/overlap in the data, as mentioned), so reducing the list will help performance-wise.
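
For reference, the aggregation step is cheap with the standard library; a minimal sketch using Python's ipaddress.collapse_addresses, which merges adjacent and overlapping networks (per address family):

```python
import ipaddress


def aggregate(prefixes):
    """Collapse overlapping/adjacent prefixes, keeping v4 and v6 separate."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    v4 = ipaddress.collapse_addresses(n for n in nets if n.version == 4)
    v6 = ipaddress.collapse_addresses(n for n in nets if n.version == 6)
    return [str(n) for n in v4] + [str(n) for n in v6]


# Two adjacent /24s plus a covered /25 collapse to a single /23.
print(aggregate(['192.0.2.0/24', '192.0.3.0/25', '192.0.3.0/24']))
# -> ['192.0.2.0/23']
```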

I'm not sure if Netbox is the right place to *store* this data, but happy to discuss. You folks know better how we use the different tools, data sources, etc. For now I'll update the script with your improvements and submit it to Gerrit so it's there for further discussion.

One thing I do think we should include is some sort of IP aggregation

Completely agree, it's an oversight that it was missed.

I'm not sure if Netbox is the right place to *store* this data, but happy to discuss.

I honestly don't know either; @Volans, @ayounsi?

I'm not sure if Netbox is the right place to *store* this data, but happy to discuss. You folks know better how we use the different tools, data sources, etc. For now I'll update the script with your improvements and submit it to Gerrit so it's there for further discussion.

AFAICT right now those live in the private Puppet repository, and we don't have a standard way to update them programmatically.
I personally don't see Netbox as the right place for those, at least as prefixes, for a couple of reasons:

  • Adding all the large public clouds' prefixes means adding hundreds of prefixes that will pollute the UI and require filtering every time we look at it.
  • The only "sane" way I see to add this data to Netbox is to add the public clouds as tenants and then the prefixes as prefixes. But that would mean holding a lot of data that we don't own.

The other option, using Netbox anyway but not storing them as prefixes, could be to use a config context or a custom script/plugin that caches the data so that we can poll it without refreshing it, plus a timer that refreshes it.

I think that where best to store the data depends on where we need to use it. If it's only needed in the CDN configuration via Puppet, for now it's probably better to keep it in hiera; if we need that data on the network devices, and hence in Homer too, then maybe Netbox is a better place and we can add it to the things we want to be able to read in Puppet from Netbox.

I personally don't see Netbox as the right place for those, at least as prefixes

Ack. I think in that case the best way forward for now is to create a script one runs manually to update the Puppet private repo. Longer term, I think working on T270618 to create a more generic strategy for managing the block list would be where to best spend effort.

I still think filtering public clouds on their ASN (with the MaxMind DB) is the most sustainable path until T270618.
Having to maintain multiple scripts for multiple providers is quickly going to be a hassle, and that's only for the providers that share their IP ranges one way or another, which is not the case for most of them. Furthermore, those lists don't seem great at separating DNS, corp, and MX ranges from customer IPs, which will require manual curation.

As the blocking is only at the varnish layer, I don't think MXs are an issue (nor DNS), and corporate workstations now use IPv6.
Even then, rate limiting (or blocking) their traffic in case of an attack from their network doesn't seem harsh to me. The returned message just needs to be more verbose than "Too Many Requests".
And if *really* needed, managing a whitelist of a few entries (e.g. crawler UAs) will be much easier than a blacklist of thousands of entries.

I fear we could be quite disappointed about "corporate workstations" being on IPv6 if we went to look ;) Either way, I assume we want this list for both v4 and v6? So that won't make any real difference.

I'd totally agree though that eyeball networks are what we want to avoid catching, and neither MS/AWS/GOOGLE represent those, excluding maybe their own internal offices. So any "collateral damage" from doing it on an ASN-wide basis is fairly small. And doing it on that basis will definitely produce a much smaller filter list, which has obvious benefits.

That said, if we do want to be more precise, I don't believe it's that tricky to do. The three largest cloud providers all publish such lists and are committed to doing so AFAIK. Even Microsoft, although for whatever reason they don't provide a static link, make the data freely available. The other two give it to you directly, so it's not a massive challenge:

AWS: https://ip-ranges.amazonaws.com/ip-ranges.json
Google: https://www.gstatic.com/ipranges/cloud.json
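
For completeness, a minimal sketch of reading the Google list; the key names ("prefixes", "ipv4Prefix", "ipv6Prefix") are my reading of cloud.json and worth double-checking:

```python
import requests

GCP_URL = 'https://www.gstatic.com/ipranges/cloud.json'


def gcp_prefixes():
    """Return GCP ranges from cloud.json (assumed key names)."""
    data = requests.get(GCP_URL, timeout=30).json()
    # Each entry carries either an ipv4Prefix or an ipv6Prefix key.
    return sorted(p.get('ipv4Prefix') or p.get('ipv6Prefix') for p in data['prefixes'])
```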

Furthermore, those lists don't seem great at separating DNS, corp, and MX ranges from customer IPs, which will require manual curation.

I'm not sure what makes you say that? None of that should be included, as these are supposed to be Azure/AWS/GCP-only lists, not Microsoft/Amazon/Google.

Search is implementing a temporary reactive solution to https://phabricator.wikimedia.org/T284479, but we will need the issue here (an automatically maintained list of public cloud IPs) resolved before we can implement a better long-term solution that doesn't depend on manual reactive maintenance.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

Although it does not do what we need, some logic to download the lists from multiple clouds can be gathered from this project: https://github.com/nccgroup/cloud_ip_ranges/blob/master/cloud_ip_ranges.py

Brandon also just pointed me to git grep netmapper (in the puppet repo) and https://gerrit.wikimedia.org/g/operations/software/varnish/libvmod-netmapper, which may be a better way to automatically update these lists in varnish directly (i.e. move away from the abuse_networks hiera key).

Change 769132 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] (WIP) C:varnish: Add automatic cloud nets update

https://gerrit.wikimedia.org/r/769132

Change 769410 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] O:external_clouds_vendors: New module for fetching cloud networks

https://gerrit.wikimedia.org/r/769410

Change 769464 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:varnish: Load public-clouds.json via netmapper

https://gerrit.wikimedia.org/r/769464

Change 769469 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] varnish: create rate limit keyed on the cloud provider

https://gerrit.wikimedia.org/r/769469

Change 769410 merged by Jbond:

[operations/puppet@production] O:external_clouds_vendors: New module for fetching cloud networks

https://gerrit.wikimedia.org/r/769410

Change 769667 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] C:varnish: add the external cloud vendors file to the cache clusters

https://gerrit.wikimedia.org/r/769667

Change 769667 merged by Giuseppe Lavagetto:

[operations/puppet@production] C:varnish: add the external cloud vendors file to the cache clusters

https://gerrit.wikimedia.org/r/769667

Change 769132 abandoned by Jbond:

[operations/puppet@production] C:varnish: Add the external_cloud_vendors module to the cache clusters

Reason:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/769667

https://gerrit.wikimedia.org/r/769132

Change 769464 merged by Giuseppe Lavagetto:

[operations/puppet@production] C:varnish: Load public-clouds.json via netmapper

https://gerrit.wikimedia.org/r/769464

Change 775360 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] external_clouds_vendors: Add Linode

https://gerrit.wikimedia.org/r/775360

Change 775360 merged by RLazarus:

[operations/puppet@production] external_clouds_vendors: Add Linode

https://gerrit.wikimedia.org/r/775360

Change 779145 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] sretest: Uninstall external_clouds_vendors

https://gerrit.wikimedia.org/r/779145

Change 779146 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] sretest: Remove absented external_clouds_vendors

https://gerrit.wikimedia.org/r/779146

Change 779145 merged by RLazarus:

[operations/puppet@production] sretest: Uninstall external_clouds_vendors

https://gerrit.wikimedia.org/r/779145

Change 779146 merged by RLazarus:

[operations/puppet@production] sretest: Remove absented external_clouds_vendors

https://gerrit.wikimedia.org/r/779146

I found a bit of time to play with some of the above-mentioned solutions; these are my findings.

Get data from RIPE

That's totally possible and easy using RIPE's API. I've sent a patch that fetches all prefixes for a given AS number: https://gerrit.wikimedia.org/r/c/operations/puppet/+/956955

PRO: easy, automatically updated
CON: depends on an external API that could be subject to changes, issues, etc.
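
For reference, a minimal sketch of that kind of fetch against RIPEstat's announced-prefixes endpoint; the patch in Gerrit above is the authoritative version, this is just my reading of the API:

```python
import requests

RIPESTAT_URL = 'https://stat.ripe.net/data/announced-prefixes/data.json'


def announced_prefixes(asn):
    """Return the prefixes currently announced by the given AS number."""
    resp = requests.get(RIPESTAT_URL, params={'resource': f'AS{asn}'}, timeout=60)
    resp.raise_for_status()
    return sorted(p['prefix'] for p in resp.json()['data']['prefixes'])


# Example: all prefixes announced by AS16509 (one of Amazon's ASNs).
# print(announced_prefixes(16509))
```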

Get data from MaxMind DB

AFAICT there is no MaxMind DB that offers the mapping we need, but we can extract the relevant information from the DBs that we have. Unfortunately the Python library doesn't offer a way to iterate the whole database, but there is a GitHub issue (open since 2016) with a proposed patch from Faidon (from 2018) that still works as expected.
With that patch, in 2.5 minutes and ~750MB of RAM I was able to get a mapping of ASN -> set(prefixes), and with an additional minute and ~200MB of RAM I got the CIDRs merged, as we do in the existing methods of gathering this kind of data.
For more detailed information on the extracted data see this paste (NDA only): P52496.

This is probably still too much data to load into varnish, but it could be used on demand when we need to get prefixes for ASNs that do not publish them.

PRO: not too complex, uses local data and doesn't depend on external services
CON: requires more computational time, but it will be just twice a week

Questions

I see that Varnish has the capability of reading mmdb databases (see https://docs.varnish-software.com/varnish-enterprise/vmods/geolocation/), and I was wondering if that could be used or would be too slow.
If it could be used, it would simplify things a lot.
Surely the officially published prefix lists are qualitatively better, but they cover just a handful of cases, while this approach is global.

The problem with getting them by ASN is that there may be "collateral damage" sometimes, i.e. if you pull the routes for Google's ASN you'll get GCP routes and also those used by YouTube. Similar story for Microsoft services that aren't Azure, etc. It may or may not be an issue depending on the particular company. Luckily most of these networks do not have "eyeball" networks mixed in, so it may not matter.

In general the API endpoints used in the script you posted are the best way to get just the ranges used by the cloud endpoints.

Change 769469 abandoned by Jbond:

[operations/puppet@production] C:varnish: create rate limit keyed on the cloud provider

Reason:

we can now do this with requestctl

https://gerrit.wikimedia.org/r/769469