Hive is awesome, and geo tags is even more awesome, thanks! Sadly I ran into a problem - seems uri_host field contains tons of junk entries that require each query to additionally filter for some unusual stuff. Could the webrequest only contain "valid" entries - something that our production cluster actually handled in some way?
Sample "weird" entries - random sites, random strings, blanks, numbers.
select distinct uri_host FROM wmf.webrequest WHERE NOT uri_host LIKE '%wik%' AND year=2015 AND month=4 AND day=10 AND hour=0;
Here are some of the most frequent cases, those that might actually cause wrong results (if filtered/processed incorrectly), rather than just annoyance (like a random web sites):
- www.Wikipedia.org (weird casing)
- Commons.Wikimedia.org:80 (weird casing + port)
- varnishcheck (healthcheck)
- 188.8.131.52, 184.108.40.206 (internal IPs e.g.)
- 220.127.116.11:80, 18.104.22.168:80 (ip with ports)
- many 10.* ips (for some reason geo location identifies those as British...?)
Other than the mixed casing and phab, average download size is around 1.3kB, so they are probably either errors or redirects