
Junk in wmf.webrequest.uri_host field
Closed, Resolved · Public

Description

Hive is awesome, and the geo tags are even more awesome, thanks! Sadly, I ran into a problem: the uri_host field seems to contain tons of junk entries, so every query needs extra filters for unusual values. Could webrequest contain only "valid" entries, that is, requests that our production cluster actually handled in some way?

Sample "weird" entries (random sites, random strings, blanks, numbers) can be listed with:

select distinct uri_host FROM wmf.webrequest WHERE NOT uri_host LIKE '%wik%' AND year=2015 AND month=4 AND day=10 AND hour=0;

Here are some of the most frequent cases, the ones that might actually cause wrong results if filtered or processed incorrectly, rather than being a mere annoyance (like random web sites):

  • www.Wikipedia.org (weird casing)
  • Commons.Wikimedia.org:80 (weird casing + port)
  • varnishcheck (healthcheck)
  • 198.35.26.96, 208.80.154.224 (e.g. internal IPs)
  • 198.35.26.96:80, 198.35.26.96:80 (IPs with ports)
  • many 10.* IPs (for some reason geolocation identifies those as British...?)
  • phab.wmfusercontent.org

Other than the mixed-casing and phab entries, the average download size is around 1.3 kB, so these are probably either errors or redirects.
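To make the categories above concrete, here is a minimal Python sketch that sorts raw uri_host values into rough buckets. The function name classify_host and the bucket labels are hypothetical, chosen only for illustration; this is not how webrequest is actually processed, and IPv6 hosts are not handled.

```python
import ipaddress


def classify_host(uri_host: str) -> str:
    """Return a rough, illustrative category for a raw uri_host value."""
    host = uri_host.strip().lower()
    if not host:
        return "blank"
    # Drop a trailing ":port" when the part after the last colon is numeric.
    name, sep, port = host.rpartition(":")
    if sep and port.isdigit():
        host = name
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        pass  # Not an IP literal; fall through to the name checks.
    else:
        return "private_ip" if ip.is_private else "public_ip"
    # Values without a dot (e.g. "varnishcheck") are not real domain names.
    if "." not in host:
        return "bare_name"
    return "hostname"


if __name__ == "__main__":
    for raw in ["www.Wikipedia.org", "Commons.Wikimedia.org:80",
                "varnishcheck", "198.35.26.96:80", "10.1.2.3", ""]:
        print(repr(raw), "->", classify_host(raw))
```

A query could then keep only the "hostname" bucket, or report the other buckets separately instead of silently mixing them into per-site counts.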

Event Timeline

Yurik created this task. Apr 12 2015, 5:34 AM
Yurik updated the task description.
Yurik raised the priority of this task to Needs Triage.
Yurik added a project: Analytics.
Yurik added a subscriber: Yurik.
Restricted Application added a subscriber: Aklapper. Apr 12 2015, 5:34 AM
Yurik added a comment. Apr 12 2015, 6:39 AM
This comment was removed by Yurik.
Yurik updated the task description. Apr 13 2015, 2:51 AM
Yurik set Security to None.
Yurik added a comment. Apr 14 2015, 4:48 PM

At the very least, please make uri_host lowercase and strip the redundant :80 port suffix, as it only obscures some results. Thanks!
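The requested normalization is small enough to sketch. The Python below is only an illustration of the rule asked for here (lowercase, then drop a trailing :80); normalize_host is a hypothetical name, and the sample values are taken from the task description rather than from real query output.

```python
from collections import Counter


def normalize_host(uri_host: str) -> str:
    """Lowercase a host and drop a redundant default-HTTP ':80' suffix."""
    host = uri_host.strip().lower()
    if host.endswith(":80"):
        host = host[: -len(":80")]
    return host


if __name__ == "__main__":
    raw_hosts = ["www.Wikipedia.org", "www.wikipedia.org",
                 "Commons.Wikimedia.org:80", "commons.wikimedia.org"]
    # Without normalization these would be counted as four distinct hosts.
    print(Counter(normalize_host(h) for h in raw_hosts))
```

Other ports are left untouched on purpose, since only :80 is redundant for plain HTTP.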

Restricted Application added a subscriber: StudiesWorld. Dec 7 2015, 6:37 PM

@Yurik, are the new normalized fields taking care of this? Can we resolve this task?

Yurik closed this task as Resolved. Dec 7 2015, 9:36 PM
Yurik claimed this task.

It seems like it's much better now, so closing. Thanks!!!