Page MenuHomePhabricator

Geo-coding UDF
Closed, ResolvedPublic

Description

Capturing from email:

I’d like to hear some requirements (user stories?) for this UDF from researchers (Oliver, Aaron, etc.). I’m envisioning:

  • A UDF to choose requestor IP given an IP and a comma delimited X-Forwarded-For field.
  • A UDF to return an object with as much geocoded information as possible given an IP.

Then one could do something like:

select a.geo[‘country’], a.geo[‘region’] from (

      select geocode(choose_requestor_ip(ip, x_forwarded_for)) as geo
      from webrequest …
) a;

Oliver:

Honestly, if we're going to build geolookup stuff that factors in XFFs, it seems like this might be a good area to work out exactly how to identify our SSL terminators: we can factor that into whether we trust XFFs, and also use it for things like pageviews-level geolocation.

Other thoughts:
It might be best for us to have distinct geolookups for country and "all", the latter structured like your example. Reasoning: a lot of the time we're just going to want country, and if you just want country, we have country-specific files, which are faster to run through than parsing down to the city level and then throwing things away. But, you know: grandmas, eggs, etc ;p.
I'd really like to UDF the "extract app UUIDs" use case. It's trivial (heck, you can do it in hive!) I just don't know how to structure it.
Other things I've been building/using that we'll need invariably at some point: MCC code extraction, and something to split language.project out of uri_host (factoring in the possible .m. and .zero. intermediary subdomains). I think that's about it for me! Happy to provide commentary on use-cases, ranges of possible values, etc.

EZ:

I would love to see lat/long and city names included in the results.
MaxMind can give both.
Actually lat/long is always there.
City names may be an extra call, and thus computationally more expensive.
Obviously these data will be sensitive (like the original ip address), but hugely useful for better maps/analyses than country level.

To be sure, this is totally different from Los Alamos project which aims for drilling down below country level but with publicly publishable data in mind.

Oliver:
Lat/long isn't there in the country-level files I've used - or the MM examples.[0] I really wish they'd settle on an API and format already ;p. Agreed on city names, hence my suggestion to split out two different functions.

[0] https://www.maxmind.com/en/geoip2-country or https://www.maxmind.com/GeoIPCountrySample.csv

Event Timeline

Tnegrin assigned this task to Ottomata.
Tnegrin raised the priority of this task from to Medium.
Tnegrin updated the task description. (Show Details)
Tnegrin changed Security from none to None.

from Dec 9, 2014 RDStandup: have Reid review code

I've spent the week integrating the new maxmind API. If we're going to build geolocation UDFs, we should absolutely be using the new API, not the old one: the V2 files have muuuuuch better IPV6 resolution! See https://github.com/maxmind/GeoIP2-java

I notified Reid of the existence of this thread.

Should we add Reid to this ticket?

Ok, here is what we probably want:

  1. A class in refinery-core containing geocoding methods. Something like:
    1. getCountryCode(ipAddress) - returns two letter country code.
    2. getGeocodedData(ipAddress) - returns map with geocoded data, e.g. { country_code: US, latitude: ... }
    3. These methods should work with both IPv4 and IPv6 addresses.
  2. Tests in refinery-core for each of these.
  3. UDF classes in refinery-hive:
    1. GeocodedCountryUDF - this will likely be an instance of org.apache.hadoop.hive.ql.exec.UDF;
    2. GeocodedData - this will likely be an instance of org.apache.hadoop.hive.ql.udf.generic.GenericUDF, as it is more complex.

The end goal is for users to be able to create functions in Hive to do:

select geocoded_country(ip) from webrequest ...

and

select a.geodata[‘country’], a.geodata[‘region’] from (

      select geocoded_data(ip) as geodata
      from webrequest …
) a;

(Class names and code structure here are not strict, we might want to change them as we start development and are better informed.)

This should use the new GeoIP2 Maxmind API.

There is a preliminary work in progress change for GeocodedCountryUDF here: https://gerrit.wikimedia.org/r/#/c/164264. We can use this or abandon it in favor of a new one. It might be easier to start a new change.

We should track the requestor-ip-from-XFF UDF as a separate ticket.

That makes sense. Are we going to use the v1 or v2 maxmind API?

"This should use the new GeoIP2 Maxmind API." :D Why not, eh?

Folks, thanks for adding me to this discussion. I am attaching our draft proposal to let you know what we are thinking. Any feedback is most welcome.

This looks awesome!

Quick question. Would Maxmind's geo database hierarchies be sufficient for what you need? This ticket is about using Maxmind's API to allow for geocoding of IP addresses.

I'm not familiar with Maxmind. We had been assuming the use of some database to map IP addresses to point locations, which we would then map to a location hierarchy. If the database emits nodes in a location hierarchy directly, that's fine too.

It would be nice if that hierarchy matched some open geographic dataset such as Natural Earth.

Interesting thought. We've used the maxmind db exclusively for a long time so I think it's part of the MVP, we should think about additional databases.

we could have a parameter in the UDF or rename it "maxmind_geocoded_country(ip)" or something like that.

we could have a parameter in the UDF or rename it "maxmind_geocoded_country(ip)" or something like that.

wouldn't it make more sense to have a field that stores the name and version of the DB we're using? This will allow us to reconstruct which requests were processed via a specific DB and handle upgrades (which may change the DB definition) gracefully. Or are you thinking that the same HTTP request could be tagged based on multiple DBs.

wouldn't it make more sense to have a field that stores the name and version of the DB we're using?

At this time the code is not pluggable to any db and assumes maxmind return types. I think that noting what db version was used when analyzing the data is best done on the dataset itself, not the code. Ideally code should not change as we upgrade db versions. The maxmind db is not part of the UDF, rather the UDF uses the database (which will always be subjected to frequent updates).

As toby said if we want to allow to plug different dbs code needs to be changed quite a bit.

DarTar, I do like that idea though. When I incorporate this UDF into the refined table, I'll see if I can tag the line with the version of the db as well.

Oh forgot I said we would do that. Hm.
CCing Joseph.