
Investigate services to fetch underlying information about IP addresses [8 Hours]
Closed, Resolved · Public · Apr 17 2020

Description

Goal

IP addresses reveal a whole host of important information that helps users decide their next steps when patrolling or performing other activities on the wikis. The aim of this feature is to surface some of the information that a WHOIS lookup on an IP address provides next to the IP in the wiki interface itself, so users are able to perform their duties without disruption when IPs are masked.

This information includes (but may not be limited to):

  • Location information: city, state, country
  • Owner information: company/institution that owns an IP address
  • Membership of the IP address in online blacklists
  • Whether the IP address is known to be a VPN or Tor node
  • Size of the IP block owned by the company
Acceptance criteria

If that doesn't suffice, then...

  • Investigate services we can utilize to fetch this information and make a recommendation list based on criteria such as:
    • Data available
    • Accuracy of data
    • Frequency of data update
    • Limits on access, if any
    • Translation of data, if available
    • Ease of access

This investigation does not require us to look through the privacy policy or ToU of each website. We will follow up on that once we know which services we want to use.

Details

Due Date
Apr 17 2020, 4:00 AM

Event Timeline

Niharika triaged this task as Medium priority. Mar 25 2020, 10:46 PM
Niharika created this task.
ARamirez_WMF renamed this task from "Investigate services to fetch underlying information about IP addresses" to "Investigate services to fetch underlying information about IP addresses [8 Hours]". Apr 8 2020, 4:57 PM

Whether the IP address is known to be a VPN or Tor node

This is a tricky one. One of the reasons I developed ipcheck was that there was an over-reliance on IPQS, which has a hair trigger when it comes to VPN/Tor/Proxy. There are often many factors to consider when making this determination. I've had fairly decent success using machine learning to interpret ipcheck results, but especially given the current crisis I simply don't have the free time to work on it.

ASN / whois filtering is a very good way to go as far as VPN / Webhosts go, a la: Non-blocked compute hosts. There are a bunch of good ways to auto-filter azure, aws, and google cloud as well.
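
To illustrate the kind of range-based filtering described above for compute hosts: several providers publish their address ranges in machine-readable form (AWS, for example, at https://ip-ranges.amazonaws.com/ip-ranges.json). A minimal sketch, assuming that published JSON layout; this is not ipcheck's code, just one way to auto-flag a provider's ranges:

```
import ipaddress
import json
import urllib.request

# AWS publishes its address space as JSON; other providers have equivalents.
AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

def load_aws_networks():
    """Return the IPv4 prefixes AWS currently announces."""
    with urllib.request.urlopen(AWS_RANGES_URL) as resp:
        data = json.load(resp)
    return [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]

def is_aws_ip(ip, networks):
    """True if the address falls inside any published AWS prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

networks = load_aws_networks()
print(is_aws_ip("192.0.2.1", networks))  # documentation-range address: expected False
```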

Whether the IP address is known to be a VPN or Tor node

This is a tricky one. One of the reasons I developed ipcheck was that there was an over-reliance on IPQS, which has a hair trigger when it comes to VPN/Tor/Proxy. There are often many factors to consider when making this determination. I've had fairly decent success using machine learning to interpret ipcheck results, but especially given the current crisis I simply don't have the free time to work on it.

ASN / whois filtering is a very good way to go as far as VPN / Webhosts go, a la: Non-blocked compute hosts. There are a bunch of good ways to auto-filter azure, aws, and google cloud as well.

Thanks @SQL. This is helpful information.

ARamirez_WMF changed the subtype of this task from "Task" to "Deadline". Apr 16 2020, 2:36 AM

Thanks @SQL. This is helpful information.

No problem, @Niharika . I use external services to accomplish almost everything in this task, so if there's any way I can help, let me know.

During the course of the investigation, I learned a great deal about how IP addresses are allocated. I'm certainly not an expert in this, and I may (read: very likely) still be misinformed, but hopefully I can convey at least what I've learned. If anything in here is wrong, I apologize in advance; it is an oversight.

WHOIS

Allocation of IP addresses is done by IANA, which is part of ICANN. IANA delegates to five regional registries that maintain the IP addresses for their respective regions; most of North America is handled by ARIN. A registry will assign an autonomous system number (ASN) to an "entity" (usually an ISP), and that entity can then be assigned blocks of IPs.

What people refer to when they say "WHOIS" (at least in terms of IP address allocation) is the public data that is available from these registries.

As an example, if I want to look up my IP address, I could start at the IANA WHOIS. That won't give me any details except that this IP address is administered by ARIN. I could then go to the ARIN WHOIS, which will tell me that my IP is part of the NET-142-196-0-0-1 block, which is assigned to ASN 33363. I believe there can be overlapping blocks (which can have different ASNs), and their data model also allows multiple ASNs to be attached to a single block.

If that wasn't confusing enough, a block like NET-142-196-0-0-1 can be assigned an organization, in this case CC-3518. This is more for organizational purposes and does not necessarily represent a legal entity; it is how an IP block gets assigned to a "country", though it is possible for a block to span more than one country. Just to complicate things further, this data model is specific to ARIN and may differ slightly for the other registries.

The registries do not dictate how ASNs distribute IP addresses. As an example, my ISP is free to assign the IP one digit away from mine (in the same block) to someone else on the other side of the country. That typically does not happen (because of the way networking equipment is usually set up), but nothing actually prevents it, and it may already be the case for mobile networks.

To recap: your IP is in one (or more) block(s), which may have an attached "organization" (not necessarily a legal entity), which is assigned to one (or more) legal entities (ASNs), which are administered by one of five regional registries under a single global organization. (Whew!)
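
For what it's worth, this registry data can also be fetched programmatically: RDAP is the JSON-based successor to WHOIS, and the rdap.org bootstrap service redirects a query to whichever of the five registries is responsible for the address. A rough sketch (the field names follow the standard RDAP IP response; the bootstrap service and its rate limits are an assumption to verify):

```
import json
import urllib.request

def rdap_lookup(ip):
    """Fetch the RDAP record for an IP; rdap.org redirects to the right registry."""
    with urllib.request.urlopen(f"https://rdap.org/ip/{ip}") as resp:
        return json.load(resp)

record = rdap_lookup("142.196.0.1")
# Standard RDAP fields: the covering block, its handle/name, and the
# registrant entities (the "organization" discussed above).
print(record.get("startAddress"), "-", record.get("endAddress"))
print(record.get("handle"), record.get("name"))
```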

So, of the acceptance criteria, what can you get from this public data?

  • Location information: city, state, country
  • Owner information: company/institution that owns an IP address
  • Membership of the IP address in online blacklists
  • Whether the IP address is known to be a VPN or Tor node
  • Size of the IP block owned by the company

Anything beyond that comes from either freely licensed or proprietary datasets.

Wikidata

Of course, this data isn't localized at all. Thankfully, Wikidata can help a bit with that. They have added an autonomous system number property (P3797). We can therefore do a search for the ASN I am on:
https://www.wikidata.org/wiki/Special:Search/haswbstatement:P3797=33363
Likewise, the ISO 3166-1 alpha-2 code property (P297) can return the proper country:
https://www.wikidata.org/wiki/Special:Search/haswbstatement:P297=US
though the coverage for ASNs needs to be improved.
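
The same Special:Search queries can be made through the MediaWiki search API, so a tool could resolve an ASN to its Wikidata item and then read a label in the reader's language. A sketch against the public APIs (the ASN is the one from the example above; error handling omitted):

```
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def api_get(params):
    params["format"] = "json"
    with urllib.request.urlopen(API + "?" + urllib.parse.urlencode(params)) as resp:
        return json.load(resp)

def asn_to_item(asn):
    """Find the Wikidata item carrying P3797 (autonomous system number) = asn."""
    result = api_get({"action": "query", "list": "search",
                      "srsearch": f"haswbstatement:P3797={asn}"})
    hits = result["query"]["search"]
    return hits[0]["title"] if hits else None  # a Q-id, if coverage exists

def label(qid, lang="de"):
    """Fetch the item's label in the requested language, if one has been added."""
    result = api_get({"action": "wbgetentities", "ids": qid,
                      "props": "labels", "languages": lang})
    return result["entities"][qid]["labels"].get(lang, {}).get("value")

item = asn_to_item(33363)
print(item, label(item) if item else "no item found")
```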

Wikidata also has properties for blocks of IP addresses (IPv4 routing prefix and IPv6 routing prefix), but the coverage is very poor, and I'm not convinced the data model they are being used in makes sense. They are applied directly to ASNs (and other organizations with static IPs) to list out the blocks that apply to them. I think it might make more sense to create an item for each IP address block and attach the relevant data to that block (much like WHOIS does), but at a certain point we may be duplicating a lot of the data that is already in WHOIS. On the other hand, it's all public, and there may be a benefit to standardizing the data model; it would also allow contributors to provide additional (sourced) data on the blocks themselves.

It might be useful to create a bot to harvest data from WHOIS and move it into Wikidata, even if it's simply to make it easier to access for ourselves. Alternatively, making some sort of web service to do that "on the fly" would be helpful.

GeoIP

This leads us to the concept of GeoIP. I started this investigation looking through T174553, which focuses on integrating the "free" (as in speech and as in beer) MaxMind (maxmind.com) database (the company also has a higher quality proprietary dataset that they license). I've left several comments on the attached patch. To summarize, there are some performance issues that need to be resolved. Most notably, the database needs to be normalized and imported into the MediaWiki database on a weekly basis.
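
For a sense of what that dataset actually returns per IP (independent of how the patch imports it), the binary GeoLite2-City.mmdb can be read directly with MaxMind's official Python reader; the path below is a local download, and the fields shown are the ones most relevant to the criteria above.

```
import geoip2.database  # pip install geoip2

# Read location fields straight from a downloaded GeoLite2-City.mmdb file.
with geoip2.database.Reader("/path/to/GeoLite2-City.mmdb") as reader:
    response = reader.city("142.196.0.1")
    print(response.country.iso_code)                 # e.g. "US"
    print(response.subdivisions.most_specific.name)  # state/province, may be None
    print(response.city.name)                        # may be None for many IPs
    print(response.location.accuracy_radius)         # MaxMind's own uncertainty, in km
```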

However, this brings up a lot of issues with GeoIP in general that I think are worth addressing. I think this was best summarized in this comment:

I don't think this will work:

  • GeoIP databases aren't as accurate as people think.
  • They are even less accurate outside the US
  • Relying on GeoIP based locations can create serious problems
  • Maxmind databases are in English only. We don't have any means to translate them.

As you may recall from the WHOIS section of this comment, the public data does not provide a location. Even the country isn't always completely accurate as it's the country the ASN has organized the IP block in, not necessarily where they are using it.

Effectively, any location data on an IP is an educated(?) guess.

NOTE: I had some more thoughts on this which I added in T248525#6089249

The last point in that comment can be resolved, because once again, Wikidata comes to the rescue. Most GeoIP services return the GeoNames ID of the location which can be searched for in Wikidata:
https://www.wikidata.org/wiki/Special:Search/haswbstatement:P1566=4155751

However, I do not think it is wise to legitimize the use of a guess, one where we have basically no insight into how that guess was formed. We effectively cannot provide a source for this information, nor does the dataset itself qualify as a reliable source.

Regardless of my concerns, I took a stab at comparing some of the top GeoIP companies that I could find. This obviously is not an exhaustive list. I subjectively gave each service a 1 (worst) to 5 (best) rating for each category based on their marketing materials (what was on their website and public documentation). This is not an assessment of data quality, pricing, or any other indication that one is better than the other; it only indicates where they at least say their strengths and weaknesses are.

There were two services that stuck out to me:

  • IPinfo (ipinfo.io) - This company provides most of the data we are looking for in a highly normalized way. They would also give us some free requests in exchange for attributing the data to them. Attribution isn't required by their license, but honestly it might be a good idea anyway.
  • Auth0 Signals (auth0.com/signals) - Auth0 recently completed the purchase of a company that specialized in threat assessment based on IP address. This dataset appears to mostly be an aggregation of public (and private?) blacklists. The good thing about this dataset is that it cites its sources (from what I can tell). I imagine the "score" of the API will also improve as they integrate the product into Auth0's identity services.
Proposed Resolution

When it comes to surfacing information about IP addresses, the only "trustworthy" information is what is found in WHOIS. I think it might be worth normalizing and localizing this data, either on demand or with a bot on Wikidata.

Beyond that, I think it's important to try to understand what users are trying to do with the IP address. Perhaps what they are trying to do is something that can't reliably and safely be done.

If the community would like additional data, I think it would be wise, as pointed out in T174553#5643405, to use multiple datasets. I think it makes sense to use multiple web services and attribute the data to the dataset and any references it provides. This way, the community will be able to determine which datasets are better, and we can alter the configuration to remove ones that are not. For instance, one wiki may find IPinfo more reliable while another finds MaxMind more reliable. Using their web services will allow us to deliver each dataset to the respective wiki without needing to import the data into the MediaWiki database.

It's possible that all of the wikis will coalesce around a single, high-quality dataset; if that is the case, then we can remove the other services and make performance improvements (like importing the dataset, etc.).
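
One way to picture the "multiple attributed web services" idea is a thin aggregation layer in which every returned field is tagged with the dataset it came from, so wikis can judge (and disable) sources independently. This is an illustrative sketch only; the provider names and fields are made up.

```
from typing import Callable, Dict

# Each provider is a callable: ip -> {field: value}. Real ones would call the
# vendors' APIs; these are placeholders for the shape of the idea.
Provider = Callable[[str], Dict[str, str]]

def aggregate(ip: str, providers: Dict[str, Provider]) -> Dict[str, Dict[str, str]]:
    """Collect fields from every configured provider, keyed by field then source."""
    results: Dict[str, Dict[str, str]] = {}
    for source, lookup in providers.items():
        try:
            for field, value in lookup(ip).items():
                results.setdefault(field, {})[source] = value
        except Exception:
            continue  # one failing provider should not break the whole panel
    return results

providers: Dict[str, Provider] = {
    "example-geo": lambda ip: {"country": "US", "city": "Orlando"},
    "example-threat": lambda ip: {"proxy": "no"},
}
print(aggregate("142.196.0.1", providers))
```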

@dbarratt Thank you! This was really informative for me!

To recap: your IP is in one (or more) block(s), which may have an attached "organization" (not necessarily a legal entity), which is assigned to one (or more) legal entities (ASNs), which are administered by one of five regional registries under a single global organization. (Whew!)

Jeez.

As you may recall from the WHOIS section of this comment, the public data does not provide a location. Even the country isn't always completely accurate as it's the country the ASN has organized the IP block in, not necessarily where they are using it.

Effectively, any location data on an IP is an educated(?) guess.

Hmm.

However, I do not think it is wise to legitimize the use of a guess, one in which we have basically no insights into how that guess was formed. We effectively cannot provide a source for this information, nor does the dataset itself qualify as a reliable source.

Regardless of my concerns, I took a stab at comparing some of the top GeoIP companies that I could find.

Awesome.

This obviously is not an exhaustive list. I subjectively gave each service a 1 (worst) to 5 (best) rating for each category based on their marketing materials (what was on their website and public documentation). This is not an indication of data quality, pricing, or any other indication that one is better than the other. It's only an indication of where they at least say their strengths and weaknesses are.

There were two services that stuck out to me:

  • IPinfo (ipinfo.io) - This company provided most of the data we are looking for in a highly normalized way. They would also give us some free requests for attributing the data to them. Attribution isn't required by their license, but honestly might be a good idea anyways.
  • Auth0 Signals (auth0.com/signals) - Auth0 recently completed the purchase of a company that specialized in threat assessment based on IP address. This dataset appears to mostly be an aggregation of public (and private?) blacklists. The good thing about this dataset is that it cites its sources (from what I can tell). I imagine the "score" of the API will also improve as they integrate the product into Auth0's identity services.
Proposed Resolution

To surface information about IP addresses, the only "trustworthy" information is that which is found in the WHOIS. I think it might be worth normalizing and localizing this data either on-demand or with a bot on Wikidata.

Beyond that, I think it's important to try to understand what they are trying to do with the IP address. Perhaps what they are trying to do is something that can't reliably and safely be done.

The community is known to access an IP's location often:

  • to understand which country, and hence which cultural background, someone might be coming from
  • to assess whether or not they might know much about a given topic/article.

To my understanding, for smaller countries knowing just the country would suffice, but for, say, the USA or China, someone might want to dig deeper into the city/state information.
Knowing whether an IP is considered trustworthy by other organizations (blacklists) and whether the IP is deliberately behaving suspiciously (VPN/Tor) is helpful while patrolling, to quickly assess whether an IP needs to be blocked or not.
I think most people accessing this information know that it is never 100% accurate. We heard that several times when we pitched these tools. The way they currently deal with this is either to have a single website they trust or to check multiple websites. One commonly used tool is ipcheck on Toolforge, which coalesces this information from several websites.

If the community would like additional data, I think it would be wise, as pointed out in T174553#5643405, to use multiple datasets. I think it makes sense to use multiple web services and attribute the data to the dataset and any references it provides. This way, the community will be able to determine which datasets are better and we can alter the configuration to remove ones that are not. For instance, a wiki may find IPinfo to be more reliable and another finds MaxMind is more reliable. Using their webservices will allow us to deliver each dataset to the respective wiki without the need to import the data into the MediaWiki database.

It's possible that all of the wikis coalesce around a single, high quality dataset, if that is the case then we can remove the other services and make performance improvements (like importing the dataset, etc.).

My gut feeling is that we will have to go the route of using multiple services to surface this information and attribute the data to those services, like you mention. We should discuss this more in a team meeting and figure out what part, if any, Wikidata plays here. On one hand, in the long run it would be helpful to have our own reliable dataset for this information, but at the same time it is going to be a burden on our editors to keep this data up to date. We wouldn't want it to fall into unmaintained territory.

I think most people accessing this information know that it is never 100% accurate. We heard that several times when we pitched these tools. The way they currently deal with this is either by having a single website they trust that they go to or checking multiple websites. One commonly used tool is ipcheck on ToolForge which coalesces this information from several websites.

Ironically I can't access the tool with my staff account, but I was able to use my personal account. We may need to update the access of the tool to allow (WMF) accounts.

Yes, this does do what I was kind of thinking we'd end up with anyway. I think it would be nice if this were easier for users to consume. I feel like this tool (in its current design) gives you all the data (which is helpful!) but at the expense of simplicity. Our solution(s) should, I imagine, strive for simplicity.

My gut feeling is that we will have to go the route of using multiple services to surface this information and attribute the data to those services, like you mention.

Totally. If they are using these datasets anyway, we might as well make them easier for our users to use/consume.

I had a thought that we should be careful to indicate that consensus among the datasets does not imply accuracy (nor does a lack of consensus imply inaccuracy). For instance, if five services believe my IP is in NYC, that doesn't make it accurate. I'm not sure how we could do this, but it's important to keep in mind.

We should discuss this more in a team meeting and figure out what part, if any, Wikidata plays here. On one hand, in the long run it would be helpful to have our own reliable dataset for this information, but at the same time it is going to be a burden on our editors to keep this data up to date. We wouldn't want it to fall into unmaintained territory.

Yeah, I was wondering if, at a minimum, we could utilize Wikidata only for localization. Basically, they would use Wikidata to translate the data, but we wouldn't actually use it as a dataset itself. Effectively, they would use it the same way that content translation or translatewiki is used today. We could even integrate this experience into our tools (by allowing them to provide a translated label directly from the tool).

Thanks for putting this together. Lots of stuff to consider.

It's looking more and more like we'll be collating data from a few different sources. A question I have about that is whether we want to store the collated data somewhere so that we don't always have to rebuild it. How would that work with caching or when the data needs to change? Is it even a valid concern?

I just worry that hitting 4 APIs and pulling the data together for hundreds of IPs in an investigation could get very expensive.

Thanks for putting this together. Lots of stuff to consider.

It's looking more and more like we'll be collating data from a few different sources. A question I have about that is whether we want to store the collated data somewhere so that we don't always have to rebuild it. How would that work with caching or when the data needs to change? Is it even a valid concern?

I just worry that hitting 4 APIs and pulling the data together for hundreds of IPs in an investigation could get very expensive.

The way I handle it @ ipcheck is to cache it. I've found that much of the data involved here does not change very frequently. I believe that my current expiration is ~2 weeks, and might actually be too short. I do provide a mechanism to manually invalidate the cache and re-request from each API.

Monetary cost is a concern as well where API hits are concerned. At the moment, I have a grant from the Foundation for ~$600 USD/year for only 50,000 queries per month, and that's just for IPQS. I believe I have more detailed costs for all of the APIs I use saved somewhere. Spoiler: it adds up really fast.
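
The ~2-week caching described above is easy to sketch in isolation: keep each provider response for a fixed TTL and expose a manual purge that forces a fresh API call. This is a generic illustration only, not ipcheck's actual implementation.

```
import time

TTL_SECONDS = 14 * 24 * 3600  # roughly the two-week expiration described above

class CachedLookup:
    def __init__(self, fetch, ttl=TTL_SECONDS):
        self._fetch = fetch   # the function that actually hits the external API
        self._ttl = ttl
        self._cache = {}      # ip -> (fetched_at, result)

    def get(self, ip, force_refresh=False):
        entry = self._cache.get(ip)
        if entry and not force_refresh and time.time() - entry[0] < self._ttl:
            return entry[1]
        result = self._fetch(ip)
        self._cache[ip] = (time.time(), result)
        return result

    def purge(self, ip):
        """Manual invalidation, forcing a fresh API call on the next get()."""
        self._cache.pop(ip, None)
```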

I think most people accessing this information know that it is never 100% accurate. We heard that several times when we pitched these tools. The way they currently deal with this is either by having a single website they trust that they go to or checking multiple websites. One commonly used tool is ipcheck on ToolForge which coalesces this information from several websites.

Ironically I can't access the tool with my staff account, but I was able to use my personal account. We may need to update the access of the tool to allow (WMF) accounts.

You're welcome to raise an issue at the issue tracker (or, even my talkpage on enwiki), and I'd be happy to discuss it in either place. This doesn't seem like the right place to do so right now.

Effectively, any location data on an IP is an educated(?) guess.

Upon further reflection, I don't think I was being very generous in my initial assessment. While it's true that they may represent a "guess", I do not mean to imply that the data is therefore not accurate.

Take, for example, a product like Google Maps. While the data probably originally came from public or freely licensed datasets, nowadays the company drives a car down every known road to get images and mapping data. In some ways, this method is more accurate than the public / freely licensed alternatives.

I would appreciate more transparency from these vendors on how they derive their answers.

If I may rephrase what I said: effectively, any location data on an IP is not so much an educated(?) guess as it is proprietary (with the notable exception of MaxMind Lite, but the company says it is less accurate than their proprietary alternative).

It's looking more and more like we'll be collating data from a few different sources. A question I have about that is whether we want to store the collated data somewhere so that we don't always have to rebuild it. How would that work with caching or when the data needs to change? Is it even a valid concern?

I assume you are talking about GeoIP services rather than the public WHOIS data?

Since the dataset is proprietary, we can't actually store the data per the licensing policy. However, the individual data points are "facts" and cannot be copyrighted so we are free to store/display the data for an individual record, but not the whole or any aggregation derived from the whole (if I'm understanding this correctly).

I just worry that hitting 4 APIs and pulling the data together for hundreds of IPs in an investigation could get very expensive.

I think we may need to not allow filtering of the data that relies on proprietary datasets. We can certainly show the location for all the IPs on the page, but filtering/sorting on that data would require retrieving all of them, which is problematic (unless we can get an exemption to the licensing policy).

You're welcome to raise an issue at the issue tracker (or, even my talkpage on enwiki), and I'd be happy to discuss it in either place. This doesn't seem like the right place to do so right now.

Thanks! Will do, I just wanted to mention it in case other staff reading this thread were unable to access the tool. :)

Thanks! Will do, I just wanted to mention it in case other staff reading this thread were unable to access the tool. :)

And here it is! https://github.com/SQL-enwiki/ipcheck/issues/29

We discussed this task today in our meeting. Takeaways:

  • @Niharika to check with Legal/Security about using the two services that David brought up.
  • @Tchanders to write an investigation task for building an MVP for exploring the services. Niharika will add product-y questions we would like to answer with the help of the MVP.
  • We'll have to kick-off discussions with Security about using external APIs in an extension.
  • We'll need to look into accuracy of these services.
  • We'll need to look into budget money for using these services.

Thanks for putting this together. Lots of stuff to consider.

It's looking more and more like we'll be collating data from a few different sources. A question I have about that is whether we want to store the collated data somewhere so that we don't always have to rebuild it. How would that work with caching or when the data needs to change? Is it even a valid concern?

I just worry that hitting 4 APIs and pulling the data together for hundreds of IPs in an investigation could get very expensive.

The way I handle it @ ipcheck is to cache it. I've found that much of the data involved here does not change very frequently. I believe that my current expiration is ~2 weeks, and might actually be too short. I do provide a mechanism to manually invalidate the cache and re-request from each API.

Monetary cost is a concern as well where API hits are concerned. At the moment, I have a grant from the Foundation for ~$600 USD/year for only 50,000 queries per month, and that's just for IPQS. I believe I have more detailed costs for all of the API's I use saved somewhere. Spoiler: It adds up really fast.

@SQL do you ever hit the limit of 50k per month?
Seeing the detailed costs for the APIs would be awesome if you can dig them up.

@Niharika, I've emailed you directly with the costs and limits - and explained why in the email.

@dbarratt, thank you!

@Niharika, I've emailed you directly with the costs and limits - and explained why in the email.

Perfect. Thank you so much!

Marking this ticket as resolved. Follow up in T251602: Draft: Build a prototype for the IP info feature.

Just in case you weren't aware, we do have MaxMind stuff in production too, including a dataset that WMF pays for - https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/puppetmaster/manifests/geoip.pp#L23-L24 and https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/puppetmaster/manifests/geoip.pp#L31-L42

It should have things like their version of geolocation, "owners"/ISPs, etc.

Might be worth talking to SRE to see what exactly the WMF has, and whether we can reuse it (I don't have any idea of the scope of our contracts with them), rather than necessarily having to buy/use something else

Being able to use it on WMF-controlled servers is potentially one thing, but I suspect the agreements obviously wouldn't allow this to be installed on random hosts in Cloud for more public usage (hence the reason SQL probably got the grant for that service)... But it might be fine for MW app servers to be able to query it for CheckUser, as long, of course, as we're not creating a free public API to look up stuff.

It might not cover everything, but it might do some stuff.

Very minimal docs for this are on https://wikitech.wikimedia.org/wiki/Geolocation

To look up data by hand, log in to mwlog1001 or mwmaint1002 and run mmdblookup --file /usr/share/GeoIP/GeoIP2-City.mmdb --ip <IP> (see the MaxMind documentation for the returned data structure) or, if you just want a single field, something like mmdblookup --file /usr/share/GeoIP/GeoIP2-City.mmdb --ip <IP> country names en.

^ people with shell access can obviously easily poke at that

Just food for thought

@Reedy Thank you! I had no idea any of that existed. I'll poke around to see what data it provides. Looking at this doc linked from the wikitech page, it seems like it should cover most of what we want.

I'm curious how that doesn't violate the licensing policy?

But yeah I think if we're already using/paying for a dataset we should probably use that. :)

I'm curious how that doesn't violate the licensing policy?

But yeah I think if we're already using/paying for a dataset we should probably use that. :)

I'm not sure how it would.

Whereas the mission of the Wikimedia Foundation is to "empower and engage people around the world to collect and develop educational content under a free content license,"

Is mostly talking about the content...

https://foundation.wikimedia.org/wiki/Resolution:Wikimedia_Foundation_Guiding_Principles

As an organization, we strive to use open source tools over proprietary ones, although we use proprietary or closed tools (such as software, operating systems, etc.) where there is currently no open-source tool that will effectively meet our needs.