Capturing from email:
I’d like to hear some requirements (user stories?) for this UDF from researchers (Oliver, Aaron, etc.). I’m envisioning:
- A UDF to choose requestor IP given an IP and a comma delimited X-Forwarded-For field.
- A UDF to return an object with as much geocoded information as possible given an IP.
Then one could do something like:
select a.geo[‘country’], a.geo[‘region’] from ( select geocode(choose_requestor_ip(ip, x_forwarded_for)) as geo from webrequest … ) a;
Oliver:
Honestly, if we're going to build geolookup stuff that factors in XFFs, it seems like this might be a good area to work out exactly how to identify our SSL terminators: we can factor that into whether we trust XFFs, and also use it for things like pageviews-level geolocation.
Other thoughts:
It might be best for us to have distinct geolookups for country and "all", the latter structured like your example. Reasoning: a lot of the time we're just going to want country, and if you just want country, we have country-specific files, which are faster to run through than parsing down to the city level and then throwing things away. But, you know: grandmas, eggs, etc ;p.
I'd really like to UDF the "extract app UUIDs" use case. It's trivial (heck, you can do it in hive!) I just don't know how to structure it.
Other things I've been building/using that we'll need invariably at some point: MCC code extraction, and something to split language.project out of uri_host (factoring in the possible .m. and .zero. intermediary subdomains). I think that's about it for me! Happy to provide commentary on use-cases, ranges of possible values, etc.
EZ:
I would love to see lat/long and city names included in the results.
MaxMind can give both.
Actually lat/long is always there.
City names may be an extra call, and thus computationally more expensive.
Obviously these data will be sensitive (like the original ip address), but hugely useful for better maps/analyses than country level.
To be sure, this is totally different from Los Alamos project which aims for drilling down below country level but with publicly publishable data in mind.
Oliver:
Lat/long isn't there in the country-level files I've used - or the MM examples.[0] I really wish they'd settle on an API and format already ;p. Agreed on city names, hence my suggestion to split out two different functions.
[0] https://www.maxmind.com/en/geoip2-country or https://www.maxmind.com/GeoIPCountrySample.csv