
Map ISPs in Maxmind db, used in turnilo/superset, to use in requestctl rule
Closed, Resolved (Public)

Description

We currently import quite a few cloud / ISP IP ranges using a cronjob based on the script in operations/puppet

modules/external_clouds_vendors/files/fetch_external_clouds_vendors_nets.py

which uses either IP ranges published by the providers themselves, or queries the RIPE database.

I'd like to expand the approach to import either all, or a large subset, of the ISP IP ranges by querying MaxMind's database, and then make them available as filters in requestctl.

Ideally, we would store most of these ISPs as ipblock objects under isp/<name>, but we would separate the cloud providers to go under cloud/<something> via a configured list of names.

We would then map each source IP in HAProxy to whether it belongs to a cloud or an ISP, and to the corresponding name.

My proposal is to add a new header, which we might call x-requestctl-prov, whose content is a single tag for the first match in the known-clients, cloud, and isp scopes, checked in that order. So for example, a request from the googlebot IP space will carry:
x-requestctl-prov: known-client/googlebot
because it will match the known-client map and we won't do any further matching.
A request from an unknown client in AWS will look like:
x-requestctl-prov: cloud/aws

and so on.
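
The matching order described above can be illustrated with a minimal Python sketch. It is not how HAProxy would implement it; the `SCOPES` table, the function name `requestctl_prov`, and the networks (RFC 5737 documentation ranges) are all made up for illustration. The point is only that the first scope that matches wins, even when a later scope also contains the address.

```python
import ipaddress

# Hypothetical sample data, ordered by precedence: known-client, cloud, isp.
# The networks are documentation ranges, not real provider allocations.
SCOPES = [
    ("known-client", {"googlebot": ["192.0.2.0/24"]}),
    ("cloud", {"aws": ["198.51.100.0/24", "192.0.2.0/25"]}),
    ("isp", {"my-isp": ["203.0.113.0/24"]}),
]


def requestctl_prov(src_ip):
    """Return '<scope>/<name>' for the first scope containing src_ip, else None."""
    addr = ipaddress.ip_address(src_ip)
    for scope, entries in SCOPES:
        for name, cidrs in entries.items():
            if any(addr in ipaddress.ip_network(cidr) for cidr in cidrs):
                # Stop at the first matching scope; later scopes are not checked.
                return f"{scope}/{name}"
    return None


if __name__ == "__main__":
    for ip in ("192.0.2.10", "198.51.100.7", "203.0.113.9", "10.0.0.1"):
        print(ip, requestctl_prov(ip))
```

In this sample data 192.0.2.10 also falls inside a "cloud" range, but it is tagged known-client/googlebot because that scope is checked first.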

Eventually, use of this header will supplant the current use of ipblocks for known-clients and the X-Public-Cloud header in requestctl and elsewhere.

There are a couple of questions to answer:

  • Can we efficiently map all ISPs using haproxy maps?
  • If not, can we map the top-N in terms of traffic we see?
  • If not, can we do any of the above in varnish using netmaps?

Event Timeline

I'm currently performing some tests on HAProxy to check how long a lookup takes using a file map built from the MaxMind GeoIP-ISP database, where the first column is the network and the second is the (normalized) ISP name.

So far there appears to be no lookup penalty, but I still have to repeat the test fetching multiple keys instead of just one, and under a higher traffic load.

I have, however, noticed a longer startup time when the HAProxy process restarts, probably because it has to verify the entries in the map file (eg. if there are errors in the subnet format it refuses to start with an error).

The map file generated this way is ~42M in size and has 1,257,755 lines
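
Since a malformed subnet makes HAProxy refuse to start, the generated file could be checked before it is shipped. The following is a minimal sketch under assumptions: the helper name `find_invalid_entries` is hypothetical, and Python's `ipaddress` parser is not guaranteed to accept exactly the same set of networks as HAProxy, so this only catches obviously broken entries.

```python
import ipaddress


def find_invalid_entries(lines):
    """Return (line_number, text) for map entries that look malformed.

    An entry is expected to be '<network> <value>' separated by whitespace.
    Blank lines and lines starting with '#' are skipped.
    """
    bad = []
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split(None, 1)
        if len(fields) != 2:
            # Missing value column.
            bad.append((lineno, stripped))
            continue
        try:
            # strict=False tolerates host bits being set (e.g. 10.1.2.3/8).
            ipaddress.ip_network(fields[0], strict=False)
        except ValueError:
            bad.append((lineno, stripped))
    return bad


if __name__ == "__main__":
    sample = ["10.0.0.0/8\tisp/a", "300.1.1.1/24\tisp/b", "10.0.0.0/33 isp/z"]
    for lineno, text in find_invalid_entries(sample):
        print(f"line {lineno}: {text!r}")
```

A check like this could run in the job that builds the map, so a bad entry is caught there rather than at HAProxy restart on a cache host.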

If this is considered feasible a possible workflow would be:

  • [puppetserver] [already present]: systemd timer to download required maxmind geoip databases
  • [puppetserver] [already present]: systemd timer to download various configured cloud ip lists
  • [puppetserver] [todo]: timer to build from maxmind and other sources a unique HAProxy map file with format <subnet>\t<prefix>/<name>, eg. 1.2.3.4 isp/my-isp or 2.3.4.5 cloud/my-cloud. Map file is located under puppetserver (private?) fileserver dir
  • [cache host] puppet fetches map file and saves in usual HAProxy configuration directory
  • [cache host] HAProxy directive to add request header (or other actions) based on the map, eg.
http-request set-var(txn.requestctl_prov) src,map_ip(/etc/haproxy/mapfile.lst)
http-request set-header X-Requestctl-Prov %[var(txn.requestctl_prov)]
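
The map-building step in the workflow above could look roughly like the following Python sketch. Everything here is an assumption for illustration: the function names, the shape of the input (pairs of network and raw ISP name, however they are extracted from MaxMind), and the normalization rule (lowercase, non-alphanumerics collapsed to "-", capped at 64 bytes). The `cloud_names` set stands in for the configured list that decides which names go under cloud/ rather than isp/.

```python
import re

MAX_NAME_BYTES = 64  # assumed cap on the normalized name length


def normalize_name(name):
    """Lowercase, collapse non-alphanumeric runs to '-', and cap the length."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    # The slug is pure ASCII, so characters and bytes coincide here.
    return slug[:MAX_NAME_BYTES].rstrip("-")


def build_map_lines(records, cloud_names):
    """Turn (network, raw_name) pairs into '<subnet>\\t<prefix>/<name>' lines."""
    lines = []
    for network, raw_name in records:
        name = normalize_name(raw_name)
        if not name:
            # Nothing usable left after normalization; skip the entry.
            continue
        prefix = "cloud" if name in cloud_names else "isp"
        lines.append(f"{network}\t{prefix}/{name}")
    return lines


if __name__ == "__main__":
    records = [("192.0.2.0/24", "My ISP, Inc."), ("198.51.100.0/24", "AWS")]
    print("\n".join(build_map_lines(records, cloud_names={"aws"})))
```

Keeping the cloud/isp decision in one configured set means reclassifying a provider only requires a change to that list, not to the extraction code.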

Other possibilities to achieve the same result are:

Fabfur changed the task status from Open to In Progress. May 8 2025, 8:04 AM

Change #1146970 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] haproxy: use maxmind lua bindings to lookup client ISP

https://gerrit.wikimedia.org/r/1146970

Change #1146970 merged by Fabfur:

[operations/puppet@production] haproxy: use maxmind lua bindings to lookup client ISP

https://gerrit.wikimedia.org/r/1146970

Change #1150591 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value

https://gerrit.wikimedia.org/r/1150591

Change #1150690 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: enable maxmind isp lookup on cp7001

https://gerrit.wikimedia.org/r/1150690

Change #1150591 merged by Fabfur:

[operations/puppet@production] haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value

https://gerrit.wikimedia.org/r/1150591

Change #1150690 merged by Fabfur:

[operations/puppet@production] hiera: enable maxmind isp lookup on cp7001

https://gerrit.wikimedia.org/r/1150690

Change #1150724 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: re-enable maxmind lookup on cp7001

https://gerrit.wikimedia.org/r/1150724

Change #1150724 merged by Fabfur:

[operations/puppet@production] hiera: re-enable maxmind lookup on cp7001

https://gerrit.wikimedia.org/r/1150724

Change #1150727 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] cache: fixed maxmind lua fetcher script

https://gerrit.wikimedia.org/r/1150727

Change #1150727 merged by Fabfur:

[operations/puppet@production] cache: fixed maxmind lua fetcher script

https://gerrit.wikimedia.org/r/1150727

Change #1151011 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] haproxy: truncate isp name to 64 bytes

https://gerrit.wikimedia.org/r/1151011

Change #1151011 merged by Fabfur:

[operations/puppet@production] haproxy: truncate isp to 64 bytes, lowecase and change header name

https://gerrit.wikimedia.org/r/1151011