Hive is awesome, and the geo tags are even more awesome, thanks! Sadly, I ran into a problem: the uri_host field seems to contain tons of junk entries, so every query needs an extra filter to exclude them. Could the webrequest table contain only "valid" entries, meaning requests for hosts that our production cluster actually handled in some way?
The "weird" entries include random sites, random strings, blanks, and numbers. A sample query that lists them:
SELECT DISTINCT uri_host FROM wmf.webrequest WHERE uri_host NOT LIKE '%wik%' AND year=2015 AND month=4 AND day=10 AND hour=0;