
Junk in wmf.webrequest.uri_host field
Closed, Resolved · Public


Hive is awesome, and the geo tags are even more awesome, thanks! Sadly, I ran into a problem: the uri_host field seems to contain tons of junk entries, so every query needs additional filtering for unusual values. Could webrequest contain only "valid" entries, i.e. requests that our production cluster actually handled in some way?

Sample "weird" entries - random sites, random strings, blanks, numbers.

select distinct uri_host FROM wmf.webrequest WHERE NOT uri_host LIKE '%wik%' AND year=2015 AND month=4 AND day=10 AND hour=0;

Here are some of the most frequent cases, those that might actually cause wrong results (if filtered/processed incorrectly), rather than just annoyance (like random web sites):

  • (weird casing)
  • (weird casing + port)
  • varnishcheck (healthcheck)
  •, (internal IPs e.g.)
  •, (ip with ports)
  • many 10.* IPs (for some reason geolocation identifies those as British...?)

Other than the mixed casing and phab, the average download size is around 1.3 kB, so these are probably either errors or redirects.
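For anyone post-processing query results, the internal-IP cases listed above could be filtered out along these lines. This is only a rough sketch, not how the webrequest pipeline works: the function names `strip_port` and `is_internal_ip` are made up for illustration, the sample hosts in the usage note are invented values, and only IPv4-style "host:port" strings are handled.

```python
import ipaddress


def strip_port(uri_host):
    """Drop a trailing ':<digits>' port, if present (hypothetical helper)."""
    host, sep, port = uri_host.rpartition(":")
    return host if sep and port.isdigit() else uri_host


def is_internal_ip(uri_host):
    """Return True if uri_host is a private/loopback IP, with or without a port.

    Hostnames, blanks, and random strings are not IP literals, so they
    return False here and would need separate handling.
    """
    host = strip_port(uri_host.strip())
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        return False
```

With invented sample values, `is_internal_ip("10.64.0.1:80")` and `is_internal_ip("10.64.0.1")` are both true, while a regular hostname such as `en.wikipedia.org` or a blank string is not flagged.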

Event Timeline

Yurik created this task. Apr 12 2015, 5:34 AM
Yurik raised the priority of this task to Needs Triage.
Yurik updated the task description.
Yurik added a project: Analytics.
Yurik added a subscriber: Yurik.
Yurik updated the task description. Apr 13 2015, 2:51 AM
Yurik set Security to None.
Yurik added a comment. Apr 14 2015, 4:48 PM

At the very least, please make uri_host lower case, and remove the redundant port :80 string at the end, as it only obfuscates some results. Thanks!
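To make the request concrete, the normalization being asked for amounts to something like the sketch below. It is an illustration only: `normalize_uri_host` is a hypothetical name, and nothing here reflects how the field is actually populated.

```python
def normalize_uri_host(uri_host):
    """Lowercase the host and drop a redundant ':80' suffix (hypothetical helper)."""
    host = uri_host.strip().lower()
    # Only the default HTTP port is removed; other ports (e.g. ':8080') are kept.
    if host.endswith(":80"):
        host = host[: -len(":80")]
    return host
```

For example, `normalize_uri_host("EN.Wikipedia.ORG:80")` gives `en.wikipedia.org`, so mixed-case and port-suffixed variants of the same host collapse into a single value when grouping.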


@Yurik, are the new normalized fields taking care of this? Can we resolve this task?

Yurik closed this task as Resolved. Dec 7 2015, 9:36 PM
Yurik claimed this task.

It seems like it's much better now, so closing. Thanks!!!