
Create a mechanism that allows fetching geolocation and subnet data for IP addresses
Open, High, Public

Description

Extensions that deal with a user's IP address (such as CheckUser and MediaWiki-extensions-LoginNotify) would benefit if we could show the "geolocation" of an IP address along with information about its "IP subnet". For instance, the IP address 8.8.8.8 belongs to Google Inc., is located in Mountain View, CA, and falls within the IPv4 subnet 8.8.8.0/24.

  • In the case of CheckUser, if I retrieve results for a user who has edits from 8.8.8.1, 8.8.8.2, 8.8.8.3, 8.8.8.11, etc., I would love to know that they all come from the same IPv4 subnet. Currently, CheckUser does not provide that information, and I have to retrieve it using third-party resources.
  • In the case of LoginNotify, this extension warns me if I log in from a "new" IP address. It assumes that an IP address is "known" if it is within the same /24 subnet as my other known IPs. But this assumption is inaccurate: many IPs belong to wider subnets such as /16 or /21, and I would like not to receive false notifications in those cases.

There exists at least one service provider (https://ipinfo.io) that provides all the information we need for this task through an API (geolocation, ISP name, IP subnet, etc.). However, we would need to obtain a license from them (both for legal reasons and because IP subnet data is not available for free through the API).

Alternatively, we can obtain the data in a single dump (not through an API). No matter how the data is retrieved, we would like to have an extension that simplifies obtaining, re-obtaining (e.g. every three months), storing, and returning this data.

Data source, retrieval, and updates are to be discussed in the parent task. This task focuses on the technical aspects (DB schema, classes and methods, etc.).

Event Timeline

For the backend, I don't think we need an extension. The steps I see are:

  1. Get ops to install the free MaxMind geolocation database on MediaWiki app servers (the free version, so we can expose it publicly to all users)
  2. Get https://github.com/maxmind/GeoIP2-php security reviewed (Security-Team-Reviews) as an optional dependency of MediaWiki core
  3. Write a small class around the geoip2 library in MediaWiki core and use it in RangeContributions.
  4. Use that class in CheckUser, and anywhere else as needed.

If that sounds reasonable, we probably want to get ops to signoff on the first step before going ahead with the rest.
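
For concreteness, here is a minimal sketch of what the wrapper in step 3 could look like, assuming the GeoIP2-php Reader API (the class name, method names, and database path here are illustrative, not a settled design):

```php
use GeoIp2\Database\Reader;
use GeoIp2\Exception\AddressNotFoundException;

class GeoIPLookup {
    /** @var Reader */
    private $cityReader;

    public function __construct( string $cityDbPath ) {
        // e.g. '/usr/share/GeoIP/GeoLite2-City.mmdb' (hypothetical path)
        $this->cityReader = new Reader( $cityDbPath );
    }

    /**
     * @param string $ip
     * @return array|null Country/city info, or null if the IP is unknown
     */
    public function lookup( string $ip ): ?array {
        try {
            $record = $this->cityReader->city( $ip );
        } catch ( AddressNotFoundException $e ) {
            return null; // IP not present in the database
        }
        return [
            'country' => $record->country->name,
            'countryCode' => $record->country->isoCode,
            'city' => $record->city->name, // may be null
        ];
    }
}
```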

See also https://tools.wmflabs.org/whois/gateway.py, which pulls from an array of providers that don't appear to be paid.

If we want to be able to use this for batch lookups, like on Special:RangeContributions or Special:CheckUser, then I don't think calling an external service will work. There is also a privacy issue in telling an external service which IPs we're looking up. We already have the MaxMind/GeoIP stuff puppetized, and there are decent-looking PHP libraries for it, so I think it's a bit more realistic.

BBlack added a subscriber: BBlack.Sep 6 2017, 1:58 PM

So, just to re-confirm a couple points above:

  1. I'm pretty sure we don't want to be passing user PII, including IPs, to third parties, so yes the lookup should be in-house.
  2. I'm pretty sure that yes, you can do this sort of thing if you stick to the (less-often-updated + less-accurate) GeoLite2 database, as opposed to the commercial variants we license.

However, note that there are only three GeoLite2 databases available: Country, City, and ASN. Both of the use-cases mentioned at the top seem focused on the idea of getting accurate subnet information as an improvement over "assume IPv4/24". The Country and City databases wouldn't really help with that problem at all, although I could see related use-cases: CheckUser could use this geolocation information to tag edits, and LoginNotify could inform you that you're logging in from a previously unseen locality.

Only the ASN data comes close to solving the subnetting problem, but even then it's not a great solution. ASNs are in many cases larger in scope than the kind of association you're looking for, as many organizations purchase un-portable subnets from upstream providers. For example, there might be thousands of distinct subnets belonging to distinct entities/businesses across the US that would all look up to the same Comcast ASN, as they've all purchased un-portable ranges from Comcast as their provider. In that same ASN there might also be individual home subscribers hundreds of miles apart from each other in geographic terms.

In any case, it's possible this whole ticket is a victim of an X-Y Problem. Maybe rewind to the initial problems and assumptions and broaden the search for solutions a bit?

Huji added a comment.EditedSep 6 2017, 2:42 PM
  1. I'm pretty sure we don't want to be passing user PII, including IPs, to third parties, so yes the lookup should be in-house.

Yes, but currently (and for the past 15 years, or however long CU has existed) we have inevitably passed user IPs to third parties, whenever we have run an IP WHOIS anywhere online. Of course, the IP is not tied to the user (from the third party's perspective, you are just running WHOIS on one or more IPs), but in theory the third party could assume that if I run WHOIS on 5 different IP addresses within a short time, they are probably related to each other somehow. (The third party has no way to know it has to do with a Wikipedia user either, as it does not know who I am; I am just a random person running a bunch of IP WHOIS queries.)

So, IMHO, we should strive to do an in-house WHOIS query (hence my proposal to create an extension to help with that) but we cannot "require" this until we have a fully functional solution that addresses our needs (plural).

However, note that there are only three GeoLite2 databases available: Country, City, and ASN. Both of the use-cases mentioned at the top seem focused on the idea of getting accurate subnet information as an improvement over "assume IPv4/24".

You are completely correct. I need to know the subnet, as often this information helps me find the "open proxy" range I need to block. This is one of those "needs" I just mentioned above. "Assume /24" does not qualify as good enough. If we make an extension that often gives me inaccurate location and unreliable subnet information, I will end up continuing to do IP WHOIS online, which as I said above is inevitable until we have a satisfactory solution for it.

In any case, it's possible this whole ticket is a victim of an X-Y Problem. Maybe rewind to the initial problems and assumptions and broaden the search for solutions a bit?

I don't think so. This is a task about an extension. It focuses on creating the classes and methods + creating a DB schema.

Whether we fill in that DB with low-quality or high-quality data, free or paid data, etc. is beyond this task. In the case of WMF, it is more within the scope of Security-Team-Reviews, and in the case of non-WMF wikis it is up to the wiki owners to choose where to find the data, how often to update it, etc. (Let's not forget MW is not made just for WMF.)

But the purpose of this task is for instance to agree that we need "city", "country", "ISP" and "subnet" fields in our DB schema.

Huji added a comment.Sep 6 2017, 2:49 PM

For the backend, I don't think we need an extension.

Whether to make it an extension or a core function is a minor point, IMHO. But I would rather make it an extension, as I am much more hopeful of getting this idea to production on WMF wikis that way than through adding a dependency to core. (The latter will take much longer, IMHO, and this is a feature we needed yesterday, or since 2008 for that matter; see T18068.)

The steps I see are:

  1. Get ops to install the free MaxMind geolocation database on MediaWiki app servers (the free version, so we can expose it publicly to all users)
  2. Get https://github.com/maxmind/GeoIP2-php security reviewed (Security-Team-Reviews) as an optional dependency of MediaWiki core

See my comments immediately above. I think we need a parent task here for your numbers 1 and 2 above, and I will create it shortly.

Huji renamed this task from Create an extension that allows fetching geolocation and range data for IP addresses to Create an extension that allows fetching geolocation and subnet data for IP addresses.Sep 6 2017, 3:02 PM
Huji updated the task description.
Huji updated the task description.Sep 6 2017, 3:05 PM
Huji added a comment.Sep 6 2017, 3:57 PM

Another issue to consider: if we expose the geo-IP and subnet information to the end users of WMF wikis and the source of the data is something we need a license for, users could, at least in theory, abuse this and try to download the whole dataset from our servers. This means we need a way to throttle how often a user retrieves IP WHOIS data from our servers. The last thing you want is for someone to write a script that calls Special:RangeContributions for every /32 range and retrieves all of our data for free.

Which is yet another reason why I think an extension might be a better idea. There are so many levels of complexity to this (such as the issue of throttling) that I don't think we can easily deploy this as a change to MW core.
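
To illustrate the throttling idea, MediaWiki's existing rate-limit machinery could plausibly be reused. A minimal sketch, where 'geoip-lookup' is a hypothetical action key (not an existing one):

```php
// In site configuration: cap lookups per user and per anonymous IP.
$wgRateLimits['geoip-lookup'] = [
    'user' => [ 100, 3600 ], // 100 lookups per hour per account
    'ip'   => [ 20, 3600 ],  // 20 lookups per hour per IP
];

// In the special page code path:
if ( $user->pingLimiter( 'geoip-lookup' ) ) {
    // pingLimiter() returns true once the limit is exceeded
    throw new ThrottledError();
}
```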

Hm, would it be helpful to focus on geolocation and subnet as different problems? Or should we theoretically be getting them from the same data source? I think the geolocation information is still useful to have, even without subnet info.

Another issue to consider: if we expose the geo-IP and subnet information to the end users of WMF wikis and the source of the data is something we need a license for, users could, at least in theory, abuse this and try to download the whole dataset from our servers. This means we need a way to throttle how often a user retrieves IP WHOIS data from our servers. The last thing you want is for someone to write a script that calls Special:RangeContributions for every /32 range and retrieves all of our data for free.

Earlier I suggested using a freely available data set without any license restrictions. As you point out, using something that requires a license is inherently complicated with regards to user access and throttling and I would rather avoid it entirely by using a freely usable dataset.

Which is yet another reason why I think an extension might be a better idea. There are so many levels of complexity to this (such as the issue of throttling) that I don't think we can easily deploy this as a change to MW core.

This is probably something we can discuss more once we've figured out where the data is coming from, but in my experience, if a feature needs to integrate well with core features (like special pages), then it would likely do better in MediaWiki core. And the deployment process for new core features is much easier than for extensions.

Huji added a comment.Sep 6 2017, 5:57 PM

Hm, would it be helpful to focus on geolocation and subnet as different problems?
...
Earlier I suggested using a freely available data set without any license restrictions. As you point out, using something that requires a license is inherently complicated with regards to user access and throttling and I would rather avoid it entirely by using a freely usable dataset.

Correct me if I am wrong, but the subnet information is freely available, no? It does make sense to keep the subnet and location data in two separate tables and update them at different frequencies, though.

Huji added a comment.Sep 6 2017, 6:24 PM

Regarding the subnet data (or, as we should more appropriately call it, the ASN data, short for Autonomous System Number), I was thinking that we could create a table like this:

subnet | ISP | start_ip_hex | end_ip_hex
8.8.8.0/24 | Google Inc. | 08080800 | 080808FF

Assuming we index the start and end IP hexes, a lookup can be fairly easy.
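
As a sketch, such a lookup through MediaWiki's database abstraction could look like this, assuming a hypothetical geoip_asn table with the columns above and core's IP::toHex() helper:

```php
$hex = IP::toHex( '8.8.8.11' ); // '0808080B'
$dbr = wfGetDB( DB_REPLICA );
$row = $dbr->selectRow(
    'geoip_asn', // hypothetical table name
    [ 'subnet', 'isp' ],
    [
        // find the block containing this address
        'start_ip_hex <= ' . $dbr->addQuotes( $hex ),
        'end_ip_hex >= ' . $dbr->addQuotes( $hex ),
    ],
    __METHOD__
);
if ( $row ) {
    // e.g. $row->subnet === '8.8.8.0/24', $row->isp === 'Google Inc.'
}
```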

As for location information, I think we can adopt MaxMind's schema:

subnet | start_ip_hex | end_ip_hex | continent | country_iso_code | country_name | city_name
8.8.0.0/18 | 08080000 | 08083FFF | North America | US | United States | (none)

I am dropping time_zone, metro_code, and subdivisions 1 & 2 (which generally map to things like states, counties, provinces, etc) as neither of the use cases mentioned in the task would need this level of detail.

Both examples above are based on MaxMind free data. This also illustrates that we *must* keep the ASN and location data separate, as the subnet definition in the location data may be inconsistent with that in ASN data.

I think MaxMind's free data would do the job for now, so let's start with that.

Only the ASN data comes close to solving the subnetting problem, but even then it's not a great solution. ASNs are in many cases larger in scope than the kind of association you're looking for, as many organizations purchase un-portable subnets from upstream providers. For example, there might be thousands of distinct subnets belonging to distinct entities/businesses across the US that would all look up to the same Comcast ASN, as they've all purchased un-portable ranges from Comcast as their provider. In that same ASN there might also be individual home subscribers hundreds of miles apart from each other in geographic terms.
In any case, it's possible this whole ticket is a victim of an X-Y Problem. Maybe rewind to the initial problems and assumptions and broaden the search for solutions a bit?

Regarding the subnet, I think the use case is that @Huji (and others) want to be able to make rangeblocks that will cover the IPs available to a specific user, but at the same time not be larger than necessary, to minimize collateral damage.

DannyH added a subscriber: DannyH.Oct 19 2017, 6:48 PM
Huji set the point value for this task to 5.Oct 26 2017, 2:00 PM
Huji added a project: User-Huji.
MaxSem added a subscriber: MaxSem.Oct 26 2017, 6:15 PM

I don't think this will work:

  • GeoIP databases aren't as accurate as people think.
  • They are even less accurate outside the US.
  • Relying on GeoIP-based locations can create serious problems.
  • MaxMind databases are in English only. We don't have any means to translate them.
MaxSem removed the point value for this task.Oct 26 2017, 6:15 PM
Huji added a comment.Oct 26 2017, 8:26 PM

I don't think this will work:

  • GeoIP databases aren't as accurate as people think.
  • They are even less accurate outside the US.
  • Relying on GeoIP-based locations can create serious problems.
  • MaxMind databases are in English only. We don't have any means to translate them.

I can think of even more reasons why this information can be wrong. For instance, the IP can be a proxy and the intruder's actual location may differ from what the IP shows.

But that has not stopped companies like Google, Facebook, etc. from using GeoIP information for this exact use case, and it shouldn't stop us. We can never have perfect data (many IPs are located in countries where political, financial, or privacy reasons don't allow an accurate location mapping), but we should not let perfect be the enemy of good.

Huji added a comment.Oct 26 2017, 8:31 PM

@Legoktm as you know, MaxMind's ASN data currently does not include the subnet. I asked them to add it to the ASN data, and they responded after a few weeks, saying that there are three types of subnet information, and asking which one we want:

1. The subnet value seen in the ISP/ASN CSV files. This is really a factor of how we build the binary database. It may line up with how an ISP is announcing their IP addresses, but isn't guaranteed to do so.
2. The prefixes announced by ISPs via BGP. I would expect the subnets shown in the ASN CSV file to have a better chance of aligning with the prefixes ISPs are actually announcing, but that also isn't guaranteed.
3. The subnets that show up in WHOIS data, either from direct allocations by an RIR or reallocation by an ISP.

They think #3 is the best choice for us. Do you agree?

Huji renamed this task from Create an extension that allows fetching geolocation and subnet data for IP addresses to Create a mechanism that allows fetching geolocation and subnet data for IP addresses.Nov 1 2017, 12:53 AM

@Huji, yep #3 sounds good to me too.

Huji added a comment.Nov 1 2017, 2:09 AM

@Legoktm For the record, because #3 is not already in their CSV data, the folks at MaxMind said it will take months for them to introduce that data. I asked for #1 to be added as an interim solution until they can introduce #3. They said that can probably be done on a shorter timeline (weeks). I am still awaiting a timeline from them.

Tgr added a subscriber: Tgr.Jan 26 2018, 5:42 AM
Harej added a subscriber: Harej.Jan 29 2018, 8:21 PM

Can you clarify whether this is proposed as a new extension or as an enhancement to the existing CheckUser functionality?

Krinkle added a subscriber: Krinkle.Mar 8 2018, 3:46 AM

Change 376451 had a related patch set uploaded (by Huji; owner: Legoktm):
[mediawiki/core@master] Introduce GeoIPLookup and Special:AboutIP

https://gerrit.wikimedia.org/r/376451

Huji added a comment.Mar 8 2018, 4:10 AM

Some screenshots using the latest *free* MaxMind data:

[screenshots omitted]

The following caveats exist:

  1. MaxMind data is incomplete. For instance, for the IP 72.229.28.185 you get all four elements, but for 8.4.4.8 you don't get a city.
  2. MaxMind data is practically impossible to localize. We can likely translate continents and countries, but there are just too many cities, and maintaining a localization for their names is not feasible.
  3. MaxMind free data does not provide the subnet information (so you would not know that 72.229.28.185 belongs to the subnet 72.229.0.0/17, which would have been extremely useful for CheckUser or range blocks).

Nevertheless, this provides enough information for us to accomplish our goals, which are (a) to have a special page to get information about IPs, and (b) to use that information to provide helpful details to users in the Echo notifications about failed logins (the notification is yet to be built).

Tgr added a comment.Mar 8 2018, 7:49 AM

MaxMind data is practically impossible to localize. We can likely translate continents and countries, but there are just too many cities, and maintaining a localization for their names is not feasible.

If this is a big deal you can look them up on Wikidata.

How many names would need a localization, anyway? I don't know if Wikidata would be considered fit for this purpose.

Huji added a comment.Mar 9 2018, 12:59 AM

There are only 7 continents, so that's easy to translate :)

In the latest MaxMind database, there are 250 distinct ISO codes for countries. I would argue even that is easy to translate (we may already have it in Wikidata).

When it gets to cities, it gets ugly. In the latest MaxMind City database (en locale), we have 88355 unique city names, and it is not a clean data set either. We have city names like Šilalė, Štúrovo, İnegöl, Éclépens (all of which contain characters never used in a city's name) or Unorganized Territory of East Central Franklin (which is not even a real city name).

@Huji that is a utf-8 name shown as latin1. Seems there is an extra or missing conversion somewhere.

Huji added a comment.Mar 9 2018, 2:04 AM

@Platonides You are correct: the BOM was missing in the CSV file. After fixing it, we no longer have those nonstandard characters. But we still have 88354 cities.

We may try to retrieve a localisation from Wikidata if available, or simply fall back to the given name if absent. Statistically, if we don't have a localisation on Wikidata, it is unlikely that such a city will be seen much. BTW, I found this related FAQ. I don't know whether it applies in this case, but it would be a huge help.
As for the subnet, are there any updates? Anyway, if we can rely on it being implemented eventually, it won't be a problem to wait.

Huji added a comment.EditedMar 11 2018, 1:12 AM

@Daimona, you are actually right! I did not think about Wikidata. The location names used by GeoIP are from GeoNames and therefore include the GeoNames ID. This is also often available in Wikidata (see https://www.wikidata.org/wiki/Q2113430 for instance). So if we query Wikidata's database to fetch all pages which have a P1566 claim, we should be able to get a list of all pages for which we know the GeoNames ID. At that point, I can compare the list against what comes from MaxMind and determine how many of the cities have a Wikidata entry (hopefully most or all). And ultimately, we can use Wikidata to get a localized name for the city.

The issue is I have no idea how Wikibase works; I cannot even write a query that gives me a list of all pages with a P1566 claim plus the value assigned to their P1566 property. Can you help me?

Alas, I can't do it. I have never dealt with Wikidata queries, so I'm sorry, but I can't help you.

Tgr added a comment.Mar 12 2018, 3:43 AM

You could probably use Special:Whatlinkshere (or the API equivalent).

Huji added a comment.Mar 12 2018, 3:13 PM

@Tgr that only gives me a list of pages for which a claim exists for the GeoNames property. It doesn't give me the value of that property, though. That's the part I have no idea how to fetch.
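
For reference, the Wikidata Query Service can return both the item and its P1566 value in a single SPARQL query. A minimal sketch against the public endpoint (the full set of P1566 claims is large, hence the LIMIT; a production fetch would page through results or use dumps):

```php
$sparql = 'SELECT ?item ?geoNamesId WHERE { ?item wdt:P1566 ?geoNamesId } LIMIT 10';
$url = 'https://query.wikidata.org/sparql?format=json&query=' . rawurlencode( $sparql );
$data = json_decode( file_get_contents( $url ), true );
foreach ( $data['results']['bindings'] as $row ) {
    // e.g. http://www.wikidata.org/entity/Q30 => 6252001
    echo $row['item']['value'], ' => ', $row['geoNamesId']['value'], "\n";
}
```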

@Huji: How are you planning on addressing the localization issue? i.e. localizing the place names?

Huji added a comment.EditedMar 12 2018, 11:58 PM

@kaldari with GeoIP2 we can crosswalk from an IP to a GeoNames ID; this ID can be mapped to a city using Wikidata. So the idea is to get the GeoNames ID of the location and do a lookup in Wikidata to find a localized name for the city (falling back to English if none is found).

Before programming this, I want to determine for how many of the GeoNames IDs that appear in the GeoIP data we have a page on Wikidata, and, for each of those, how many localizations can be retrieved from Wikidata. That requires writing a query, as I asked in https://lists.wikimedia.org/pipermail/cloud/2018-March/000244.html
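
A sketch of the crosswalk-with-fallback logic described above; $geoNamesToItem is assumed to be a precomputed reverse map (GeoNames ID => Wikidata item ID), and lookupWikidataLabel() is a hypothetical helper (e.g. backed by the wbgetentities API):

```php
function localizedPlaceName( int $geoNamesId, string $lang,
    string $englishName, array $geoNamesToItem
): string {
    if ( isset( $geoNamesToItem[$geoNamesId] ) ) {
        // e.g. 6252001 => 'Q30'; fetch its label in the user's language
        $label = lookupWikidataLabel( $geoNamesToItem[$geoNamesId], $lang );
        if ( $label !== null ) {
            return $label;
        }
    }
    return $englishName; // fall back to MaxMind's English name
}
```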

Where does the GeoIP2 data come from? Is that included in the WMF's current GeoIP service?

Huji added a comment.EditedMar 13 2018, 12:24 PM

MaxMind. See https://gerrit.wikimedia.org/r/#/c/417474/

My understanding is WMF already uses MaxMind in some other places too (but not in MediaWiki's code)

How reliable is MaxMind? I recall from some enwiki discussions that it sometimes defaults to an incorrect "default" location.

@Huji: I pinged legal to see if they have any concerns.

@JEumerus: Last time I checked, which was about 6 years ago, their city-level location information was pretty unreliable. Not sure how good it is now though.

Huji added a comment.EditedMar 15 2018, 2:58 AM

I did an analysis of MaxMind's city-level data, in conjunction with Wikidata's data on those locations.

MaxMind city-level data for IPv4

The rows in the data are defined at the subnet level (the first row is for 1.0.0.0/24, the next for 1.0.1.0/24, etc.). There are 2.7 million rows of data. For each row, two geographical bits are available: a location (city/province/country) associated with that subnet, and a country associated with the IP registrar. MaxMind also provides an "accuracy radius" (in kilometers) for its location information for that particular subnet.

The table below shows the frequency of subnets based on the accuracy radius of the location MaxMind has for them:

accuracy radius (km) | count
1 | 262189
5 | 234549
10 | 233862
20 | 290300
50 | 368897
100 | 192045
200 | 286577
500 | 191806
1000 | 601220
NA | 1287

Let's say an accuracy radius of 50 km or more is absolutely undesirable. By that standard, 62% of MaxMind's city-level data is of poor accuracy.

In fact, in 22% of cases, the specific location provided for an IPv4 subnet is a country. For instance, the location returned for the 216.255.128.0/19 subnet is United States, with an accuracy of 1000 km; note that it specifies a lat/long of 37.7510, -97.8220 for this subnet, and if you make the mistake of using this lat/long (without paying attention to the attested 1000 km accuracy) you will get a location near Wichita, which just happens to be in the center of the US mainland.

Conclusion so far: we can use MaxMind data, but we should show city-level data if and only if the attested accuracy is at or below some reasonable threshold, say 20 km. In other cases, we should just show the IP and the country.
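
Expressed as code, that gating rule might look like this sketch, where $record is a GeoIp2\Model\City result and 20 km is the threshold proposed above:

```php
$accuracy = $record->location->accuracyRadius; // kilometres, may be null
if ( $accuracy !== null && $accuracy <= 20 && $record->city->name !== null ) {
    // accurate enough: show city-level detail
    $display = $record->city->name . ', ' . $record->country->name;
} else {
    // otherwise fall back to country only
    $display = $record->country->name;
}
```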

MaxMind city-level data for IPv6

It is structured similarly to the IPv4 data and has 2.0 million rows. The table looks different, with most subnets having a location with an accuracy of 100 kilometers (which is bad, but not terrible).

accuracy radius (km) | count
1 | 245066
5 | 128964
10 | 122699
20 | 124676
50 | 60809
100 | 1314562
200 | 24551
500 | 12010
1000 | 6527

On the other hand, in 62% of cases, the only information available is a country (no city).

Conclusion so far: between IPv4 and IPv6 subnets, I think we can reliably use MaxMind's "city" level information only for about 30-40% of subnets. In the other cases, we should only use country-level data.

Distinct location codes

The distinct list of GeoNames IDs used for IPv4 and IPv6 combined contains 103006 entries. Of these, about 73% can be found in Wikidata.

If we restrict our analysis just to those cases for which location accuracy is less than 50 kilometers, we will end up with a unique list of 67860 locations, and a Wikidata entry can be found for 77% of those locations.

  • Caveat one: What MaxMind gives you is a GeoNames ID. There is no way to directly query Wikidata based on the value of the P1566 property in an efficient manner. So we need to create a reverse lookup that allows walking from a GeoNames ID (e.g. 6252001) to a Wikidata entity (in this case, Q30), so that we can then provide the name of that location in some language (in this case, there are 280 Wikipedia pages associated with Q30, so we can theoretically localize Q30 into more than 200 languages). This reverse lookup has to be updated regularly, hence calling it a "caveat".
  • Caveat two: Q30 is a bad example, because it has Wikipedia pages in hundreds of languages associated with it. I'm sure that if we look at all those locations with a Wikidata page, most of them are only available in one or a few languages.

Overall conclusions:

MaxMind can be used to fetch a relatively accurate location for about 30-40% of IP subnets, and with some work, Wikidata can be used to localize about 75% of these into at least a few languages.

For the remaining 60-70% of IP subnets, we should only use country-level data, which I'm sure can be easily localized to many languages using Wikidata.

Important note: while MaxMind's overall accuracy is poor, note that MaxMind includes data about (arguably) every IP address; this includes consumer IPs (from which our users come) as well as non-consumer IPs (like those used by companies, servers, etc.). There is no way to distinguish the two groups (that I know of). Hopefully, the majority of the poor-accuracy data belongs to the latter category.

@kaldari checked in with us; Legal approves display of location information related to unsuccessful logins.

Huji changed the task status from Open to Stalled.Apr 4 2018, 3:33 PM

After several months of following up with the MaxMind folks, I don't have any hope that they will incorporate subnet data into their free dataset. Marking this as stalled until either MaxMind resolves it on their end or we find a different data source.

Huji added a comment.Apr 4 2018, 3:34 PM

We should also explore whether there is a hacky way to get the data from MaxMind's free dataset indirectly (for example, does the City data have a subnet ID?).

Change 417474 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/vendor@master] Add geoip2

https://gerrit.wikimedia.org/r/417474

Restricted Application added a project: Community-Tech.Jul 22 2018, 1:06 AM
0x010C added a subscriber: 0x010C.Oct 31 2018, 5:56 PM
Huji added a comment.Jan 8 2019, 7:14 PM

@Daimona I wonder if you could make this happen. I pushed it forward a bunch, but was not skilled enough to get it to the finish line (and also had difficulty attracting reviews).

@Huji Uhm I could give it a look. Are we blocked on something, given that the task is stalled?

Harej removed a subscriber: Harej.Jan 8 2019, 7:49 PM
Huji changed the task status from Stalled to Open.Jan 10 2019, 7:14 PM

No.

I rebased the patch and bumped geoip version to 2.9.0. I'm also tracking this task, although I'm afraid there's not much I can do.

Daimona moved this task from Backlog to Under review on the User-Daimona board.Jan 11 2019, 9:20 AM