Hive is awesome, and geo tags is even more awesome, thanks! Sadly I ran into a problem - seems uri_host field contains tons of junk entries that require each query to additionally filter for some unusual stuff. Could the webrequest only contain "valid" entries - something that our production cluster actually handled in some way?
Sample "weird" entries - random sites, random strings, blanks, numbers.
select distinct uri_host FROM wmf.webrequest WHERE NOT uri_host LIKE '%wik%' AND year=2015 AND month=4 AND day=10 AND hour=0;
Here are some of the most frequent cases, those that might actually cause wrong results (if filtered/processed incorrectly), rather than just annoyance (like a random web sites):
- www.Wikipedia.org (weird casing)
- Commons.Wikimedia.org:80 (weird casing + port)
- varnishcheck (healthcheck)
- 198.35.26.96, 208.80.154.224 (internal IPs e.g.)
- 198.35.26.96:80, 198.35.26.96:80 (ip with ports)
- many 10.* ips (for some reason geo location identifies those as British...?)
- phab.wmfusercontent.org
Other than the mixed casing and phab, average download size is around 1.3kB, so they are probably either errors or redirects