
Create a mechanism that allows fetching geolocation and subnet data for IP addresses
Open, High, Public

Description

Extensions that have to do with a user's IP (such as CheckUser and MediaWiki-extensions-LoginNotify ) would benefit if we could show the "geolocation" of an IP address along with information about its "IP subnet". For instance, the IP address 8.8.8.8 belongs to Google Inc. and is located in Mountain View, CA, and it belongs to the IPv4 subnet 8.8.8.0/24.

  • In case of CheckUser, if I retrieve results for a user and he has edits from 8.8.8.1, 8.8.8.2, 8.8.8.3, 8.8.8.11, etc, I would love to know that they are all from the same IPv4 subnet. Currently, CheckUser does not provide that information and I have to retrieve the data using third party resources.
  • In case of LoginNotify, this extension warns me if I log in from a "new" IP address. It assumes that an IP address is "known" if it is within the same /24 subnet of my other known IPs. But this assumption is inaccurate as many IPs belong to wider subnets like /16 or /21 and I would like not to get false notifications in such circumstances.
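Both use cases above come down to asking whether a set of addresses falls inside one network. A minimal sketch using Python's standard `ipaddress` module; the helper name `same_subnet` and the second 72.229.x.x address are made up for illustration, and the /17 prefix is the one quoted for 72.229.28.185 later in this thread:

```python
import ipaddress


def same_subnet(ips, prefix):
    """Return True if every address in `ips` falls inside a single
    network of the given prefix length."""
    networks = {
        ipaddress.ip_network(f"{ip}/{prefix}", strict=False) for ip in ips
    }
    return len(networks) == 1


# CheckUser case: the four addresses from the description share one /24.
checkuser_ips = ["8.8.8.1", "8.8.8.2", "8.8.8.3", "8.8.8.11"]
print(same_subnet(checkuser_ips, 24))

# LoginNotify case: a fixed /24 treats the second (hypothetical) address
# as "new", while the wider /17 allocation covers both.
login_ips = ["72.229.28.185", "72.229.30.7"]
print(same_subnet(login_ips, 24), same_subnet(login_ips, 17))
```

The point of the sketch is only that the prefix length has to come from real subnet data; hard-coding /24 gives the false notifications described above.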

There exists at least one service provider (https://ipinfo.io) that provides all the information we need for this task through an API (geolocation, ISP name, IP subnet, etc.). However, we need to obtain a license from them (both for legal reasons and because IP subnet data is not available for free through the API).

Alternatively, we can obtain the data in a single dump (not through an API). No matter how the data is retrieved, we would like to have an extension that simplifies obtaining, re-obtaining (e.g. every three months), storing, and returning this data.

Data source, retrieval and update are to be discussed in the parent task. This task focuses on technical aspects (DB schema, classes and methods, etc.).

Details

Related Gerrit Patches:
mediawiki/core : master : Introduce GeoIPLookup and Special:AboutIP
mediawiki/vendor : master : Add geoip2/geoip2 2.9.0

Event Timeline

DannyH added a subscriber: DannyH.Oct 19 2017, 6:48 PM
Huji set the point value for this task to 5.Oct 26 2017, 2:00 PM
Huji added a project: User-Huji.
MaxSem added a subscriber: MaxSem.Oct 26 2017, 6:15 PM

I don't think this will work:

  • GeoIP databases aren't as accurate as people think.
  • They are even less accurate outside the US
  • Relying on GeoIP based locations can create serious problems
  • Maxmind databases are in English only. We don't have any means to translate them.
MaxSem removed the point value for this task.Oct 26 2017, 6:15 PM
Huji added a comment.Oct 26 2017, 8:26 PM

I don't think this will work:

  • GeoIP databases aren't as accurate as people think.
  • They are even less accurate outside the US
  • Relying on GeoIP based locations can create serious problems
  • Maxmind databases are in English only. We don't have any means to translate them.

I can think of even more reasons why this information can be wrong. For instance, the IP can be a proxy and the intruder's actual location may differ from what the IP shows.

But that has not stopped companies like Google, Facebook, etc. from using GeoIP information for the exact same use case. And it shouldn't stop us. We can never have perfect data (many IPs are located in countries where political, financial, or privacy reasons don't allow an accurate location mapping). But we should not let perfect be the enemy of good.

Huji added a comment.Oct 26 2017, 8:31 PM

@Legoktm as you know, MaxMind's ASN data currently does not include the subnet. I asked them to add it to the ASN data, and they responded after a few weeks, saying that there are three types of subnet information, and asking which one we want:

1. The subnet value seen in the ISP/ASN CSV files. This is really a factor of how we build the binary database. It may line up with how an ISP is announcing their IP addresses, but isn't guaranteed to do so.
2. The prefixes announced by ISP via BGP. I would expect the subnets shown in the ASN CSV file to have a better chance at aligning with what prefixes ISP's are actually announcing, but that also isn't guaranteed.
3. The subnets that show up in WHOIS data, either from direct allocations by an RIR or reallocation by an ISP.

They think #3 is the best choice for us. Do you agree?

Huji renamed this task from Create an extension that allows fetching geolocation and subnet data for IP addresses to Create a mechanism that allows fetching geolocation and subnet data for IP addresses.Nov 1 2017, 12:53 AM

@Huji, yep #3 sounds good to me too.

Huji added a comment.Nov 1 2017, 2:09 AM

@Legoktm For the record, because #3 is not already in their CSV data, the folks at MaxMind said it will take months for them to introduce that data. I asked for #1 to be added, as an interim solution until they can introduce #3. They said that can probably be done in shorter term (weeks). I am still awaiting a timeline from them.

Tgr added a subscriber: Tgr.Jan 26 2018, 5:42 AM
Harej added a subscriber: Harej.Jan 29 2018, 8:21 PM

Can you clarify whether this is proposed as a new extension or as an enhancement to the existing CheckUser functionality?

Krinkle added a subscriber: Krinkle.Mar 8 2018, 3:46 AM

Change 376451 had a related patch set uploaded (by Huji; owner: Legoktm):
[mediawiki/core@master] Introduce GeoIPLookup and Special:AboutIP

https://gerrit.wikimedia.org/r/376451

Huji added a comment.Mar 8 2018, 4:10 AM

Some screenshots using the latest *free* MaxMind data:


The following caveats exist:

  1. MaxMind data is incomplete. For instance for IP 72.229.28.185 you get all four elements, but for 8.4.4.8 you don't get a city.
  2. MaxMind data is practically impossible to localize. We can likely translate continents and countries, but there are just too many cities and maintaining a localization for their names is not feasible.
  3. MaxMind free data does not provide the subnet information (so you would not know that 72.229.28.185 belongs to the subnet 72.229.0.0/17, which would have been extremely useful in case of CheckUser or range blocks).

Nevertheless, this provides enough information for us to accomplish our goals, which are (a) to have a special page to get information about IPs, and (b) to use that information to provide helpful details to users in the Echo notifications about failed logins (the notification is yet to be built).

Tgr added a comment.Mar 8 2018, 7:49 AM

MaxMind data is practically impossible to localize. We can likely translate continents and countries, but there are just too many cities and maintaining a localization for their names is not feasible.

If this is a big deal you can look them up on Wikidata.

How many names would need a localization, anyway? I don't know if Wikidata would be considered fit for this purpose.

Huji added a comment.Mar 9 2018, 12:59 AM

There are only 7 continents, so that's easy to translate :)

In the latest MaxMind database, there are 250 distinct ISO codes for countries. I would argue even that is easy to translate (we may already have it in Wikidata).

When it gets to cities, it gets ugly. In the latest MaxMind City database (en locale), we have 88355 unique city names, and it is not a clean data set either. We have city names like Šilalė, Štúrovo, İnegöl, Éclépens (all of which contain characters never used in a city's name) or Unorganized Territory of East Central Franklin (which is not even a real city name).

@Huji that is a utf-8 name shown as latin1. Seems there is an extra or missing conversion somewhere.

Huji added a comment.Mar 9 2018, 2:04 AM

@Platonides You are correct, the BOM was missing in the CSV file. Fixing it, we don't have those nonstandard characters anymore. But we still have 88354 cities.
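A minimal sketch of the encoding problem discussed above: UTF-8 bytes misread as Latin-1 turn a name like Éclépens into garbage, and the damage is reversible. The CSV column names and the ID value here are hypothetical, not taken from the MaxMind files; `utf-8-sig` is the standard Python codec that decodes UTF-8 and drops a leading byte-order mark if one is present.

```python
import csv
import io

name = "Éclépens"

# UTF-8 bytes interpreted as Latin-1 produce two characters per accented letter.
garbled = name.encode("utf-8").decode("latin-1")

# Reversing the wrong step recovers the original name.
repaired = garbled.encode("latin-1").decode("utf-8")

# Hypothetical CSV content that starts with a byte-order mark.
raw = "\ufeffgeoname_id,city_name\n1,Éclépens\n".encode("utf-8")

# Decoding with "utf-8-sig" works with or without the mark and keeps
# the first header name clean.
rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))
print(rows[0]["city_name"])
```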

We may try to retrieve the localisation from Wikidata if available, or simply stick with the given name if absent. On a statistical basis, if we don't have a localisation on Wikidata, it is less likely that such a city will come up often. BTW, I found this related FAQ. I don't know if it may apply in this case, but it would be a huge help.
As for the subnet, are there any updates? Anyway, if we may rely on the fact that it'll be implemented, it won't be a problem to wait.

Huji added a comment.EditedMar 11 2018, 1:12 AM

@Daimona, you are actually right! I did not think about Wikidata. The location names used by GeoIP are from GeoNames and therefore include the GeoNames ID. This is also often available in Wikidata (see https://www.wikidata.org/wiki/Q2113430 for instance). So if we query Wikidata's database to fetch all pages which have a P1566 claim, we should be able to get a list of all pages for which we know the GeoNames ID. At that point, I can compare the list against what comes from MaxMind, and determine for how many of the cities we have a Wikidata entry (hopefully most or all). And ultimately, we can use Wikidata to get a localized name for the city.

The issue is I have no idea how Wikibase works; I cannot even write a query that gives me a list of all pages with a P1566 claim + the value assigned to their P1566 property. Can you help me?

Alas, I can't do it. I never dealt with wikidata queries, so I'm sorry but I can't help you.

Tgr added a comment.Mar 12 2018, 3:43 AM

You could probably use Special:Whatlinkshere (or the API equivalent).

Huji added a comment.Mar 12 2018, 3:13 PM

@Tgr that only gives me a list of pages for which a claim exists for the GeoNames property. It doesn't give me what the value of that property is though. That's the part I have no idea how to fetch.

@Huji: How are you planning on addressing the localization issue? i.e. localizing the place names?

Huji added a comment.EditedMar 12 2018, 11:58 PM

@kaldari with GeoIP2 we can crosswalk from an IP to a GeoNames ID; this ID can be mapped to a city using Wikidata. So the idea is to get the GeoNames ID of the location and do a lookup in Wikidata to find a localized name for the city (and fall back to English if none is found).

Before programming this, I want to determine for how many of the GeoNames IDs that appear in the GeoIP data do we have a page on Wikidata, and for each of those, how many localizations can be retrieved from Wikidata. That requires writing a query as I asked in https://lists.wikimedia.org/pipermail/cloud/2018-March/000244.html
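The fallback step described above (use the localized name, otherwise English) can be sketched as follows. This is only an illustration: the function name `localized_name` and the sample label dictionary are assumptions, and how the labels are fetched from Wikidata is left out.

```python
def localized_name(labels, lang, fallback="en", default=None):
    """Pick a label for `lang`, falling back to `fallback`, then `default`.

    `labels` maps language codes to names, e.g. as collected from a
    Wikidata entity's labels.
    """
    for code in (lang, fallback):
        label = labels.get(code)
        if label:
            return label
    return default


# Illustrative sample labels for one location.
labels = {"en": "United States", "fr": "États-Unis"}

print(localized_name(labels, "fr"))  # a French label exists
print(localized_name(labels, "fa"))  # no Persian label here, falls back to English
```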

Where does the GeoIP2 data come from? Is that included in the WMF's current GeoIP service?

Huji added a comment.EditedMar 13 2018, 12:24 PM

MaxMind. See https://gerrit.wikimedia.org/r/#/c/417474/

My understanding is WMF already uses MaxMind in some other places too (but not in MediaWiki's code)

How reliable is MaxMind? I recall from some enwiki discussions that it sometimes defaults to an incorrect "default" location.

@Huji: I pinged legal to see if they have any concerns.

@JEumerus: Last time I checked, which was about 6 years ago, their city-level location information was pretty unreliable. Not sure how good it is now though.

Huji added a comment.EditedMar 15 2018, 2:58 AM

I did an analysis of MaxMind's city-level data, in conjunction with Wikidata's data on those locations.

MaxMind city-level data for IPv4

The rows in the data are defined at subnet level (first row is for 1.0.0.0/24, next row for 1.0.1.0/24, etc). There are 2.7 million rows of data. For each row, two geographical bits are available: a location (city/province/country) associated with that subnet, and a country associated with the IP registrar. MaxMind also provides an "accuracy radius" (in kilometers) for its location information for that particular subnet.

Table below shows the frequency of subnets based on the accuracy radius of the location MaxMind has for them:

accuracy radius (km)    count
1                       262189
5                       234549
10                      233862
20                      290300
50                      368897
100                     192045
200                     286577
500                     191806
1000                    601220
NA                      1287

Let's say an accuracy radius of 50 km or more is undesirable. In that sense, 62% of MaxMind's city-level data is of poor accuracy.
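The 62% figure can be reproduced from the IPv4 frequency table. A short worked check, assuming the NA rows stay in the denominator (leaving them out changes the result only in the second decimal place):

```python
# IPv4 subnet counts by accuracy radius in km, copied from the table above.
# None stands for the "NA" row.
ipv4_counts = {
    1: 262189, 5: 234549, 10: 233862, 20: 290300, 50: 368897,
    100: 192045, 200: 286577, 500: 191806, 1000: 601220, None: 1287,
}

total = sum(ipv4_counts.values())
poor = sum(
    count
    for radius, count in ipv4_counts.items()
    if radius is not None and radius >= 50
)

share = poor / total
print(f"{poor} of {total} subnets ({share:.1%}) have a radius of 50 km or more")
```

This gives 1640545 of 2662732 rows, which rounds to the 62% quoted in the comment.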

In fact, in 22% of the cases, the specific location provided for an IPv4 subnet is a country. For instance, the location returned for the 216.255.128.0/19 subnet is United States, with accuracy of 1000km; note that it specifies a lat/long of 37.7510,-97.8220 for this subnet, and if you make the mistake of using this lat/long (without paying attention to the attested 1000km accuracy) then you will get a location near Wichita, which just happens to be in the center of the US mainland.

Conclusion so far: we can use MaxMind data, but we should show city-level data if and only if the attested accuracy radius is within some reasonable threshold, say 20km or less. In other cases, we should just show the IP and the country.
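The rule proposed in that conclusion can be sketched as a small function. Everything here is an assumption for illustration: the `GeoRecord` type, its field names, and the sample records are hypothetical and are not the API of the geoip2 library or of the Gerrit patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeoRecord:
    """Hypothetical container for the fields discussed in this thread."""
    country: Optional[str]
    city: Optional[str]
    accuracy_radius_km: Optional[int]


def display_location(rec: GeoRecord, max_radius_km: int = 20) -> str:
    """Show the city only when the attested accuracy radius is within the
    threshold; otherwise fall back to the country alone."""
    city_is_trustworthy = (
        rec.city is not None
        and rec.accuracy_radius_km is not None
        and rec.accuracy_radius_km <= max_radius_km
    )
    if city_is_trustworthy and rec.country:
        return f"{rec.city}, {rec.country}"
    return rec.country or "Unknown"


# A country-level record with a 1000 km radius never shows a city.
print(display_location(GeoRecord("United States", None, 1000)))
# A made-up precise record shows both city and country.
print(display_location(GeoRecord("United States", "New York", 5)))
```

Keeping the threshold as a parameter leaves room to tune the 20km cut-off later without touching callers.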

MaxMind city-level data for IPv6

It is structured similarly to the IPv4 data, and has 2.0 million rows. The table looks different, with most subnets having a location with accuracy of 100 kilometers (which is bad, but not terrible).

accuracy radius (km)    count
1                       245066
5                       128964
10                      122699
20                      124676
50                      60809
100                     1314562
200                     24551
500                     12010
1000                    6527

On the other hand, in 62% of cases, the only information available is a country (no city).

Conclusion so far: between IPv4 and IPv6 subnets, I think we can reliably use MaxMind's "city" level information only for about 30-40% of subnets. In the other cases, we should only use country-level data.
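One way to arrive at a figure in the 30-40% range is to count the subnets whose accuracy radius is 20 km or less in both tables. This is a sketch under that assumed cut-off; the comment does not say exactly how the range was derived.

```python
# Subnet counts by accuracy radius in km, copied from the two tables above.
ipv4 = {1: 262189, 5: 234549, 10: 233862, 20: 290300, 50: 368897,
        100: 192045, 200: 286577, 500: 191806, 1000: 601220}
ipv4_na = 1287  # IPv4 rows with no accuracy radius
ipv6 = {1: 245066, 5: 128964, 10: 122699, 20: 124676, 50: 60809,
        100: 1314562, 200: 24551, 500: 12010, 1000: 6527}


def share_within(counts, max_radius_km, extra_rows=0):
    """Fraction of rows whose accuracy radius is at most max_radius_km."""
    good = sum(c for radius, c in counts.items() if radius <= max_radius_km)
    return good / (sum(counts.values()) + extra_rows)


ipv4_share = share_within(ipv4, 20, extra_rows=ipv4_na)
ipv6_share = share_within(ipv6, 20)
print(f"IPv4: {ipv4_share:.1%}, IPv6: {ipv6_share:.1%}")
```

With these counts the IPv4 share comes out a little above 38% and the IPv6 share a little above 30%, which brackets the stated range.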

Distinct location codes

The distinct list of GeoNames IDs used for IPv4 and IPv6 combined is of size 103006. Of these, about 73% can be found in Wikidata.

If we restrict our analysis just to those cases for which location accuracy is less than 50 kilometers, we will end up with a unique list of 67860 locations, and a Wikidata entry can be found for 77% of those locations.

  • Caveat one: What MaxMind gives you is a GeoNames ID. There is no way you can directly query Wikidata based on the value of the P1566 property in an efficient manner. So we need to create a reverse lookup that would allow walking from a GeoNames ID (e.g. 6252001) to a Wikidata entity (in this case, Q30) so that we can then use that to provide the name of that location in some language (in this case, there are 280 Wikipedia pages associated with Q30 so we can theoretically localize Q30 in more than 200 languages). This reverse lookup has to be updated regularly, hence calling it a "caveat".
  • Caveat two: Q30 is a bad example because it has Wikipedia pages in hundreds of languages associated with it. I'm sure if we look at all those locations with a Wikidata page, most of them are only available in one or few languages.
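The reverse lookup in caveat one can be sketched as a dictionary built from a SPARQL result set. The query string uses the `wdt:P1566` property path available on the Wikidata Query Service, and the sample below follows the standard SPARQL JSON results layout; the function name, the variable names `?item` and `?geonames`, and the single sample row are assumptions for illustration. Fetching every P1566 statement in one request may well time out on the public endpoint, so a real job would probably need batching or a dump.

```python
# Query that pairs each entity with its GeoNames ID (P1566).
GEONAMES_QUERY = "SELECT ?item ?geonames WHERE { ?item wdt:P1566 ?geonames }"


def build_reverse_lookup(sparql_json):
    """Map GeoNames ID (string) -> Wikidata entity ID (e.g. 'Q30')."""
    lookup = {}
    for row in sparql_json["results"]["bindings"]:
        # Entity URIs look like http://www.wikidata.org/entity/Q30
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        lookup[row["geonames"]["value"]] = qid
    return lookup


# Sample response containing only the pair mentioned in caveat one.
sample = {
    "results": {
        "bindings": [
            {
                "item": {"type": "uri",
                         "value": "http://www.wikidata.org/entity/Q30"},
                "geonames": {"type": "literal", "value": "6252001"},
            }
        ]
    }
}

lookup = build_reverse_lookup(sample)
print(lookup["6252001"])
```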

Overall conclusions:

MaxMind can be used to fetch a relatively accurate location for about 30-40% of IP subnets, and with some work, Wikidata can be used to localize about 75% of these into at least a few languages.

For the remaining 60-70% of IP subnets, we should only use country-level data, which I'm sure can be easily localized to many languages using Wikidata.

Important note: While overall MaxMind's accuracy is poor, note that MaxMind includes data about (arguably) every IP address; this includes consumer IPs (from which our users come) as well as non-consumer IPs (like those used by companies, by servers, etc.). There is no way to distinguish the two groups (that I know of). Hopefully, the majority of poor accuracy data belongs to the latter category.

@kaldari checked in with us; Legal approves display of location information related to unsuccessful logins.

Huji changed the task status from Open to Stalled.Apr 4 2018, 3:33 PM

After several months of following up with MaxMind folks, I don't have any hope that they will incorporate subnet data into their free dataset. Marking it as stalled, until either MaxMind resolves it on their end or we find a different data source.

Huji added a comment.Apr 4 2018, 3:34 PM

We should also explore whether there is a hacky way to get the data from MaxMind's free data set indirectly (for example, does the City data have a subnet ID?).

Change 417474 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/vendor@master] Add geoip2

https://gerrit.wikimedia.org/r/417474

Restricted Application added a project: Community-Tech.Jul 22 2018, 1:06 AM
0x010C added a subscriber: 0x010C.Oct 31 2018, 5:56 PM
Huji added a comment.Jan 8 2019, 7:14 PM

@Daimona I wonder if you could make this happen. I pushed it forward a bunch, but was not skilled enough to get it to the finish line (and also, had difficulty attracting reviews).

@Huji Uhm I could give it a look. Are we blocked on something, given that the task is stalled?

Harej removed a subscriber: Harej.Jan 8 2019, 7:49 PM
Huji changed the task status from Stalled to Open.Jan 10 2019, 7:14 PM

No.

I rebased the patch and bumped geoip version to 2.9.0. I'm also tracking this task, although I'm afraid there's not much I can do.

Daimona moved this task from Backlog to Under review on the User-Daimona board.Jan 11 2019, 9:20 AM
AronManning added a comment.EditedNov 7 2019, 9:15 AM

Besides subnet (whois) and geolocation, presenting Proxy/VPN data would be useful for CU. Example: https://www.ipqualityscore.com/free-ip-lookup-proxy-vpn-test/lookup/86.187.160.157
This data is no more reliable than Geoloc data, yet it is a good clue for determining the likelihood of a proxy user.

Geoloc DBs are sometimes incorrect. There is a service that compares results from 3 providers: https://www.iplocation.net/?query=86.187.160.157
We should consider using multiple databases and listing all results when they don't match, like: (Poole, UK | London, UK | Sheffield, UK)
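A minimal sketch of that suggestion, assuming each provider's answer has already been reduced to a display string; the function name `summarize_locations` is made up for illustration.

```python
def summarize_locations(results):
    """Collapse per-provider location strings into one display value.

    Identical answers are shown once; disagreeing answers are listed
    side by side in the order the providers returned them.
    """
    distinct = list(dict.fromkeys(r for r in results if r))
    if not distinct:
        return "unknown"
    if len(distinct) == 1:
        return distinct[0]
    return "(" + " | ".join(distinct) + ")"


print(summarize_locations(["Poole, UK", "London, UK", "Sheffield, UK"]))
print(summarize_locations(["London, UK", "London, UK", None]))
```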

This information will be used in T237593 [Epic] CheckUser 2.0: Compare

kaldari added a subscriber: Reedy.Nov 7 2019, 8:41 PM

@Huji - Has geoip2 been security reviewed by the WMF yet? I saw that @Reedy had looked over the patch, but not sure if anyone's done a formal security review. If not, let me know if I can help with that.

Reedy added a comment.Nov 7 2019, 9:06 PM

@Huji - Has geoip2 been security reviewed by the WMF yet? I saw that @Reedy had looked over the patch, but not sure if anyone's done a formal security review. If not, let me know if I can help with that.

We haven't, no. It was mostly trying to help get the patches into a shape that CI liked.

Change 417474 had a related patch set uploaded (by Reedy; owner: Huji):
[mediawiki/vendor@master] Add geoip2/geoip2 2.9.0

https://gerrit.wikimedia.org/r/417474

Reedy added a comment.Nov 8 2019, 4:01 AM

If this is something we want to move ahead with... Per MediaWiki-extension-requests workboard column "Does This Need To Be an Extension?" we should probably decide that and work out moving ahead

@Niharika - This is something that AHT team wants, correct? Huji's already done some work to add the geoip2 library and some related MaxMind components as a dependency to core (which actually dates back to our work on LoginNotify), so we could go ahead and request an initial security review, if this seems like it would be useful to you (for CentralAuth and/or LoginNotify).

@Niharika - This is something that AHT team wants, correct? Huji's already done some work to add the geoip2 library and some related MaxMind components as a dependency to core (which actually dates back to our work on LoginNotify), so we could go ahead and request an initial security review, if this seems like it would be useful to you (for CentralAuth and/or LoginNotify).

Thumbs up for requesting a security review.

Huji added a comment.Nov 8 2019, 10:55 PM

@Huji - Has geoip2 been security reviewed by the WMF yet? I saw that @Reedy had looked over the patch, but not sure if anyone's done a formal security review. If not, let me know if I can help with that.

Reedy and Legoktm have helped with this from a coding perspective. No security review has been done (and I also give a thumbs up to doing one).