Page MenuHomePhabricator

Junk in wmf.webrequest.uri_host field
Closed, ResolvedPublic

Description

Hive is awesome, and geo tags is even more awesome, thanks! Sadly I ran into a problem - seems uri_host field contains tons of junk entries that require each query to additionally filter for some unusual stuff. Could the webrequest only contain "valid" entries - something that our production cluster actually handled in some way?

Sample "weird" entries - random sites, random strings, blanks, numbers.

select distinct uri_host FROM wmf.webrequest WHERE NOT uri_host LIKE '%wik%' AND year=2015 AND month=4 AND day=10 AND hour=0;

Here are some of the most frequent cases, those that might actually cause wrong results (if filtered/processed incorrectly), rather than just annoyance (like a random web sites):

  • www.Wikipedia.org (weird casing)
  • Commons.Wikimedia.org:80 (weird casing + port)
  • varnishcheck (healthcheck)
  • 198.35.26.96, 208.80.154.224 (internal IPs e.g.)
  • 198.35.26.96:80, 198.35.26.96:80 (ip with ports)
  • many 10.* ips (for some reason geo location identifies those as British...?)
  • phab.wmfusercontent.org

Other than the mixed casing and phab, average download size is around 1.3kB, so they are probably either errors or redirects

Event Timeline

Yurik raised the priority of this task from to Needs Triage.
Yurik updated the task description. (Show Details)
Yurik added a project: Analytics.
Yurik subscribed.
This comment was removed by Yurik.
Yurik set Security to None.

At the very least, please make uri_host lower case, and remove the redundant port :80 string at the end, as it only obfuscates some results. Thanks!

@Yurik, are the new normalized fields taking care of this? Can we resolve this task?

Yurik claimed this task.

It seems like its much better now, so closing. Thanks!!!