Draft: Build a prototype for the IP info feature
Closed, Resolved · Public

Description

Build a prototype of the proposed IP info feature to test the technical feasibility of the planned feature and to discover any pitfalls. If successful, we will develop the prototype into the full feature; if unsuccessful, we will iterate on the product using the lessons learnt.

The full feature will provide critical information about an IP address and will be available to trusted users on a wiki. It will collate information from several services, which may include paid-for services with proprietary data (see T248525 for more information). It has two major components: (1) a tooltip showing basic information, which appears on hovering over an IP address; (2) a special page showing more detailed information.

Outline of the prototype

We expect the main technical challenges to concern:

  • Accessing the services while preserving privacy
  • Performance of the tooltip, given the amount of data we'd be requesting

Therefore, for the prototype, we will build the tooltip, pulling in data from whichever service(s) we can gain access to soonest. (@dbarratt, which service(s) would be best for this?)

Further details:

  • An extension allows trusted users to see the tooltip when hovering over an IP address link
  • The tooltip displays basic information such as location & map, organization, and Tor/VPN/blacklist information (depending on what's available from the services we choose to start with)
  • The user must first click to agree that they are viewing the information in order to fight vandalism

@Prtksxna which mockup would you recommend working from for the prototype?

Measures for success

To be discussed with @Niharika

Event Timeline

Therefore, for the prototype, we will build the tooltip, pulling in data from whichever service(s) we can gain access to soonest. (@dbarratt, which service(s) would be best for this?)

That's a great question. Most of these services offer a free tier (with very basic data), and with some of them (like ipinfo.io) you don't even have to sign up. As an example:

curl --header "Accept: application/json" https://ipinfo.io/8.8.4.4

though getting more details requires creating (at least) a free account.

For the purposes of a prototype, I think it's totally fine to display the basic info and expand it later. Also note that all of them are going to give us an API key that will need to remain "secret" (as in, not exposed to the client).
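
As a minimal sketch of keeping that key server-side (all names here are hypothetical; it assumes a small Node/Express proxy with the token in an environment variable, not anything we actually run):

import express from "express";

const app = express();

// Hypothetical proxy route: the browser asks us, we ask ipinfo.io, and the
// IPINFO_TOKEN environment variable (placeholder name) never leaves the server.
app.get("/ipinfo/:ip", async (req, res) => {
  const token = process.env.IPINFO_TOKEN;
  const resp = await fetch(
    `https://ipinfo.io/${encodeURIComponent(req.params.ip)}?token=${token}`,
    { headers: { Accept: "application/json" } }
  );
  // Relay the JSON to the client without ever exposing the token.
  res.status(resp.status).json(await resp.json());
});

app.listen(3000);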

There are some details (like the GeoNames ID) that will only be added to Wikimedia's response (upon request), but I think it's fine if we display the results in English for now.

I have a slight preference for ipinfo.io since they seem more "developer friendly" (I realize that is 100% subjective), but feel free to use whatever is easiest!

Just FYI, a cursory glance at the documentation indicates that the free plan does not include the "connection type", and only really exposes data available via whois. Real, usable data begins with the $249/mo "standard" plan.

Signed up for an account, to confirm:

$ curl ipinfo.io/8.8.8.8?token=<nope>
{
  "ip": "8.8.8.8",
  "hostname": "dns.google",
  "city": "Mountain View",
  "region": "California",
  "country": "US",
  "loc": "37.3860,-122.0838",
  "org": "AS15169 Google LLC",
  "postal": "94035",
  "timezone": "America/Los_Angeles"
}

Thanks @SQL. This is helpful to know. I have reached out to IPInfo.io to see if they can set us up with some trial accounts with access to the real data while we explore building this feature.

As a heads up:

Just in case you weren't aware, we do have MaxMind stuff in production too, including some datasets that WMF pays for: https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/puppetmaster/manifests/geoip.pp#L23-L24 and https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/puppetmaster/manifests/geoip.pp#L31-L42

It should have things like their version of geolocation, "owners"/ISPs, etc.

Might be worth talking to SRE to see what exactly the WMF has, and whether we can reuse it (I don't have any idea of the scope of our contracts with them), rather than necessarily having to buy/use something else.

Being able to use it on WMF-controlled servers is potentially one thing, but I suspect the agreements obviously wouldn't allow this to be installed on random hosts in the cloud for more public usage (hence the reason SQL probably got the grant for that service)... But it might be fine for the MW app servers to be able to query it for CheckUser, as long as, of course, we're not creating a free public API to look up stuff.

It might not cover everything, but it might do some of it.

Very minimal docs for this are on https://wikitech.wikimedia.org/wiki/Geolocation

To look up data by hand, log in to mwlog1001 or mwmaint1002 and run mmdblookup --file /usr/share/GeoIP/GeoIP2-City.mmdb --ip <IP> (see here for documentation of the returned data structure) or, if you just want a single field, something like mmdblookup --file /usr/share/GeoIP/GeoIP2-City.mmdb --ip <IP> country names en.

^ people with shell access can obviously easily poke at that
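
For code rather than the shell, the same .mmdb file can (I believe) be read with the maxmind npm package; a sketch, mirroring the country-name lookup above:

import maxmind, { CityResponse } from "maxmind";

// Sketch: programmatic equivalent of the mmdblookup commands above,
// reading the same database file.
async function lookupCountryName(ip: string): Promise<string | undefined> {
  const reader = await maxmind.open<CityResponse>(
    "/usr/share/GeoIP/GeoIP2-City.mmdb"
  );
  // Same field as `mmdblookup ... country names en`.
  return reader.get(ip)?.country?.names?.en;
}

lookupCountryName("8.8.8.8").then(console.log);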

Just food for thought

@Reedy Are we adding the geolocation data to the request headers that end up at MediaWiki? If not, could we? @dmaza and I were talking, and I wonder if it would be better for us to record the incoming request headers (like we do for User-Agent) instead of attempting to make a database that we can query all the time. This would actually solve a lot of potential performance problems as well. Though if not, we could do the lookup when an edit is made rather than when someone is trying to look up the data.

@Reedy Are we adding the Geolocation data to the request headers that end up at MediaWiki? If not, could we?

I know we do set a GeoIP cookie (in Varnish AFAIK) and send that back to the client. I don't know offhand if we send anything in the form of headers passed to MW.

It doesn't seem out of the realms of possibility to do something like that. But you'd really need to ask Traffic/SRE
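
On that cookie: a client-side sketch of reading it, assuming a colon-separated value format (country:region:city:lat:lon:version) — that format is an assumption worth verifying with Traffic/SRE:

// Sketch: parse the GeoIP cookie Varnish sets for clients. The field
// layout assumed here (country:region:city:lat:lon:version) should be
// verified before relying on it.
function readGeoIpCookie(): Record<string, string> | null {
  const match = document.cookie.match(/(?:^|;\s*)GeoIP=([^;]*)/);
  if (!match) return null;
  const [country, region, city, lat, lon, version] =
    decodeURIComponent(match[1]).split(":");
  return { country, region, city, lat, lon, version };
}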

It doesn't seem out of the realms of possibility to do something like that. But you'd really need to ask Traffic/SRE

Thanks! @Niharika I think we should definitely do that. Ideally some service would add a Wikidata ID (or GeoNames ID) for the location and a Wikidata ID (or an ASN) for the ISP to the request that is being sent to MediaWiki (and we'd store that somewhere).

Alternatively, we could do all that inside MediaWiki when an edit is made. :)

@dbarratt Would you like to open a phabricator ticket that we can use to discuss this with Traffic/SRE? That might be the most expedient way to proceed on this.

@dbarratt Would you like to open a phabricator ticket that we can use to discuss this with Traffic/SRE? That might be the most expedient way to proceed on this.

Done! T251933

Thanks David!

Here are some thoughts I had about prototyping the IP Info feature, from a conversation I had this morning with Prateek.

Questions we want answered from the prototype (non-exhaustive):

User facing
  • Is the surfaced information useful?
  • How do they use this data?
  • How does it fit within their workflow?
  • How intuitive is the UI?
  • Some sense of how often we should expect this feature to be used (hard to know, but we can try)
  • Doing some experimentation to figure out what works well and what doesn't; iterate and experiment again.
Technical questions
  • Comparing the different APIs: what data do they return, how fast and how reliable are they, pricing, etc. We can build off of the spreadsheet David created and add more details and caveats as we discover them.
    • Currently looking at IPInfo.io, Auth0 Signals, and MaxMind (contingent on T251933)
    • Doing the above comparison before we get into contracts with any service would be ideal. We can do this with their free credits; IPInfo has offered us some. Aida and I can coordinate to get some for Auth0 Signals.
  • What's a reasonable expectation for the time a user will have to spend waiting for the results to be fetched?
    • Will this vary by page? History versus recent changes etc.
    • Thinking about design considerations that we might need to build in depending on data fetch time.
  • Can we do caching on our end to save time?
  • Can we hold on to data that a user has queried for in a session for the duration of the session?
  • Can we get translated data?

What else?

  • What's a reasonable expectation for the time a user will have to spend waiting for the results to be fetched?

If we query on-demand, I imagine it will take less than a second to make the request and get a result back. If the result is already cached by the third party (which is likely), then I imagine it will be sub-100ms. (This is just a guess.)

If we make requests to more than one service, we should make them concurrently (either in the browser, or with curl on the server).
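
For instance (a sketch; the second endpoint is a placeholder for whichever other service we pick):

// Sketch: fan out to several IP-info services at once instead of
// sequentially; allSettled keeps one slow or failing service from
// sinking the others.
async function fetchIpInfo(ip: string) {
  const endpoints = [
    `https://ipinfo.io/${ip}/json`,           // ipinfo.io
    `https://signals.example.test/v1/${ip}`,  // placeholder second service
  ];
  return Promise.allSettled(
    endpoints.map((url) =>
      fetch(url, { headers: { Accept: "application/json" } })
        .then((r) => r.json())
    )
  );
}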

If the data is looked up on action (as described in T251933), then that data will be in the database (like the IP) and we can deliver it at the same time as everything else (which I think would be preferable?). This would allow us to query based on that data and also make aggregate queries. The downside is that we can't use an external service in that scenario; it would need to be a database that is available at the time of action (but apparently one already is?).

  • Will this vary by page? History versus recent changes etc.

No.

  • Thinking about design considerations that we might need to build in depending on data fetch time.

If it requires an external request, it won't be instant and could fail, so I would account for that. :) If the data is already in the database (T251933), it will be instant.

  • Can we do caching on our end to save time?

Yes? We could either cache the results in an HTTP cache like Varnish (which should happen automatically if we use RESTBase) or we could use the object cache (like Redis). Knowing how long we can cache the external data might be the only challenge here.

  • Can we hold on to data that a user has queried for in a session for the duration of the session?

We could technically hold it for as long as we are allowed to retain it. I don't imagine it will change often. I imagine these services will give us a Cache-Control header that will instruct us on how long we can cache their data for (if not, we should probably ask for that).
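
Putting the last two answers together as a sketch (all names illustrative; in MediaWiki this would live in the object cache rather than an in-process Map): cache each lookup keyed by IP, with the TTL taken from the service's Cache-Control max-age when present.

// Sketch: honour the upstream Cache-Control max-age, with an assumed
// one-day fallback when the service doesn't send one.
const cache = new Map<string, { data: unknown; expires: number }>();
const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000;

async function cachedLookup(ip: string): Promise<unknown> {
  const hit = cache.get(ip);
  if (hit && hit.expires > Date.now()) return hit.data;

  const resp = await fetch(`https://ipinfo.io/${ip}/json`);
  const data = await resp.json();

  // Use the service's retention hint if it provides one.
  const maxAge = /max-age=(\d+)/.exec(resp.headers.get("cache-control") ?? "");
  const ttlMs = maxAge ? Number(maxAge[1]) * 1000 : DEFAULT_TTL_MS;
  cache.set(ip, { data, expires: Date.now() + ttlMs });
  return data;
}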

  • Can we get translated data?

Yes? From my research, they can either give you some translated place names (though not very many, and I think they come from GeoNames) with no translated ISPs, or they can give you the GeoNames ID & the ASN. I think it would be better to get all the translated names from Wikidata (matching on the GeoNames IDs & ASNs), which would also give us a place to send translators if they encounter a place or ISP that is not translated.
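
A sketch of that Wikidata matching for place names (P1566 is Wikidata's GeoNames ID property; the query shape and the example ID below are illustrative):

// Sketch: resolve a GeoNames ID to a label in the user's language via the
// Wikidata SPARQL endpoint.
async function labelForGeoNamesId(geonamesId: string, lang: string) {
  const query = `
    SELECT ?label WHERE {
      ?place wdt:P1566 "${geonamesId}" ;
             rdfs:label ?label .
      FILTER(LANG(?label) = "${lang}")
    }`;
  const url =
    "https://query.wikidata.org/sparql?format=json&query=" +
    encodeURIComponent(query);
  const resp = await fetch(url, {
    headers: { Accept: "application/sparql-results+json" },
  });
  const json = await resp.json();
  return json.results.bindings[0]?.label?.value as string | undefined;
}

// e.g. labelForGeoNamesId("5375480", "fr") — ID assumed to be Mountain View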

As far as the prototype goes, I would focus on English-only for now since it's a lot easier. :)

Niharika claimed this task.
Niharika added a parent task: T285977: IP Info.

We did this with a gadget.