We currently import a number of cloud / ISP IP ranges using the cron job based on the script in operations/puppet
modules/external_clouds_vendors/files/fetch_external_clouds_vendors_nets.py
which either uses IP ranges published by the providers themselves, or queries the RIPE database.
I'd like to expand this approach to import all, or at least a large subset, of the ISP IP ranges by querying MaxMind's database, and then make them available as filters in requestctl.
Ideally, we would store most of these ISPs as ipblock objects under isp/<name>, but separate out the cloud providers under cloud/<something> via a configured list of names.
In haproxy, we would then map each source IP to whether it belongs to a cloud provider or an ISP, and to the corresponding name.
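As a sketch of what the haproxy side could look like (the file path, map contents, and header name here are hypothetical; the prefixes are RFC 5737 documentation ranges):

```
# /etc/haproxy/provider.map (hypothetical) -- "CIDR  value" pairs,
# looked up with longest-prefix match by the map_ip converter
192.0.2.0/24      known-client/googlebot
198.51.100.0/22   cloud/aws
203.0.113.0/24    isp/example-isp
```

```
# haproxy.cfg fragment (sketch): tag each request with the provider
# of its source IP, if any
http-request set-header x-requestctl-prov %[src,map_ip(/etc/haproxy/provider.map)]
```

Note that map_ip resolves overlaps by longest prefix, not by scope, so the "known-clients before cloud before isp" precedence would have to be enforced by whatever generates the map file.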
My proposal is to add a new header, tentatively called x-requestctl-prov, whose content is a single tag for matches in the known-clients, cloud, and isp scopes, checked in that order. So, for example, a request from the googlebot IP space will carry:
x-requestctl-prov: known-client/googlebot
because it will match the known-clients map and we won't do any further matching.
A request from an unknown client in AWS will look like:
x-requestctl-prov: cloud/aws
and so on.
Eventually, this header will supplant the current use of ipblocks for known clients and of the X-Public-Cloud header in requestctl and elsewhere.
There are a couple of questions to answer:
- Can we efficiently map all ISPs using haproxy maps?
- If not, can we map the top N ISPs by the traffic volume we see?
- If not, can we do any of the above in varnish using netmaps?
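To get a first feel for the sizing questions, we could count prefixes per organization in MaxMind's ASN data and see how many map entries the top N ISPs would require. The sketch below assumes the GeoLite2-ASN-Blocks CSV layout (columns network, autonomous_system_number, autonomous_system_organization); the sample rows are synthetic:

```python
import csv
import io
from collections import Counter

def top_n_orgs(csv_text: str, n: int) -> list[tuple[str, int]]:
    """Count network prefixes per organization and return the N largest.

    The sum of their counts approximates the number of haproxy map
    entries needed to cover those ISPs.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["autonomous_system_organization"]] += 1
    return counts.most_common(n)

# Tiny synthetic sample in the assumed GeoLite2 ASN CSV layout:
SAMPLE = """network,autonomous_system_number,autonomous_system_organization
192.0.2.0/24,64496,ExampleNet
198.51.100.0/24,64497,OtherISP
198.51.100.0/25,64497,OtherISP
203.0.113.0/24,64496,ExampleNet
203.0.113.128/25,64496,ExampleNet
"""
```

Here `top_n_orgs(SAMPLE, 1)` returns `[("ExampleNet", 3)]`; run against the real dataset, the same aggregation would tell us whether a full or top-N map fits within haproxy's comfortable map sizes.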