Page MenuHomePhabricator

Create new normalized uri_host field in refined webrequest table {hawk} [13 pts]
Closed, ResolvedPublic

Description

uri_host sometimes contains upper case letters and port 80. Neither information is of much use (we can track bugs via the raw requests without dealing with the clean webrequest data), but it often breaks regexes used to parse it (people might forget the uppercase or that port 80 could be part of the URI. Please remove it. Thanks!!!

Related to T95836.

Event Timeline

Yurik raised the priority of this task from to Needs Triage.
Yurik updated the task description. (Show Details)
Yurik added a project: Analytics.
Yurik added subscribers: Yurik, Ottomata, Ironholds.
Yurik set Security to None.
Ottomata renamed this task from Make all wmf.webrequest.uri_host lower case, and remove ":80" at the end to Create new normalized uri_host field in refined webrequest table..Apr 14 2015, 5:19 PM

I don't think we should alter the data in this field, but I would be fine with adding a new field that contained normalized uri_host. 'uri_host_normalized'? 'uri_host_n'? In general I don't like abbreviations, but I could see this being something that more fields might want.

Should we do this at all?

Ottomata added a subscriber: JAllemandou.

I don't think we should alter the data in this field, but I would be fine with adding a new field that contained normalized uri_host. 'uri_host_normalized'? 'uri_host_n'? In general I don't like abbreviations, but I could see this being something that more fields might want.

Should we do this at all?

funny dup.

I think the new field should only contain "proper" uri hosts, like those controlled by wmf. All else should be blank.

I use this to convert, which does not catch all bad hosts

if (lower(uri_host) RLIKE '^([a-z0-9-]+\\.)*[a-z]*wik[it][a-z]*\\.[a-z]+(:80)?$',
  COALESCE(regexp_extract(lower(uri_host), '^([^:]+)(:80)?$', 1), ''),
  '-') uri_host,

I filed this for a similar usecase - https://phabricator.wikimedia.org/T98257. That would be to convert the uri_hosts to one of these - https://github.com/wikimedia/operations-mediawiki-config/blob/master/all.dblist. This doesn't differentiate between mobile and desktop - but webrequest_source should have that information I think.

We have a UDF that extracts the project part of the host (en.m.wikipedia.org --> en.wikipedia for instance).
A "project" filed will soon be added in the refined webrequest table using that UDF.
Would that answer the need?

@JAllemandou, not fully - ideally I would like any "non-normalized" hosts such as wikpedia.bg to be shown as '-', and only our proper normalized urls to be in that field (en.zero.wikipedia.org), or break it into multiple subfields {"lang":"en", "subdomain":"zero", "project":"wikipedia"}. Of course if you want to provide a standard way to parse/normalize it via a UDF, that is also workable, as long as I can fully filter and group by on those subfields.

Filtering for correct hosts is done on our common use cases using the is_pageview flag.
I am not sure we want to add another domain formatting field.
@Ottomata: preferences ?

is_pageview will help, but this could be useful for non pageviews too, so it might be nice to have. Yurik, do you really need bad host/projects removed from this? Couldn't you just whitelist a bunch? If we did normalize, I think I'd rather just do transformations (lowercasing, removing :80, etc.) than attempting to whitelist during refinement. However, a UDF that contained a whitelist that could be used for filtering after the fact would be good. E.g. where uri_host_normalized in whiltelisted_uri_hosts() or something.

As for multiple subfields, I could see that be very useful. Joseph, since we are not yet refining the project field, maybe we should slightly reconsider what Yurik suggests. Maybe a map with the fields Yurik suggests would be more useful.

kevinator renamed this task from Create new normalized uri_host field in refined webrequest table. to Create new normalized uri_host field in refined webrequest table {hawk} [13 pts].May 26 2015, 6:52 PM
kevinator triaged this task as Medium priority.

So, I think we could add a field lowering the host, removing the port if any, and splitting the domain into a map like that:
domains_map = {domain: "wikipedia", subdomains: ["en", "m"]} for instance.

So, I think we could add a field lowering the host, removing the port if any, and splitting the domain into a map like that:
domains_map = {domain: "wikipedia", subdomains: ["en", "m"]}
for instance.
What do you think ?

That would make the implementation much easier, as we wouldn't have to code in WMF project/subdomain/mobile/zero/whateverelse specific logic. But, it seems like this might be less useful for @Yurik and @madhuvishy, so I'd like them to comment.

Lower and 80 is good, but i think we should really solve the common problem
here of unusual domains - let's only allow something that would be useful -
allow (langcode).(m|zero|nothing).(any-of-wmf-projects).org

Everything else should be stored as an empty value/null/-.

This would easily allow us to analyze usage by domain/language/project,
without overwhelming with bogus stuff being sent in the host header or
redirect domains such as wikipedia.bg.

As per one of @Ottomata previous comments, do we whitelist here or somewhere else ?
Also, the definition (langcode).(m|zero|nothing).(any-of-wmf-projects).org is not correct -->

  • some projects don't have langcode (commons|meta|incubator|species|outreach to be precise)
  • subdomains can be (m|mobile|wap|zero)

I don't mind whitelisting here or somewhere else, but I need consensus :)

This is exactly my point -- I don't know all the valid domains, and neither
are most of researchers. The rule is basically: only allow domains that do
not redirect to other domains for all traffic, nor show an error for all
traffic.

Without a whitelist, @Yurik's rule isn't codeable. Maybe we should make something as simple as we can without a whitelist, but still take into account specific field names, rather than just splitting subdomains into an array. I'm not quite sure what this would be, but I betcha both @Ironholds and @Halfak would have advice.

Hey you two! Do you know of any documentation out there of the possible forms of WMF domains?

:( Nope. Hopefully @Ironholds has seen something like that.

I would have no objections to making multiple fields instead of one --
language, sub domain, project. Than we can drop the www for any domain that
uses it in normalized format

Having dedicated fields instead of a generic one is the difficult part.
For instance the language one is not present in every project, etc.
For me this goes back to the question "What is a correct domain for the foundation and how do we identify it?"
I think it requires a bunch of analysis to ensure that we cover most of the cases, and know about our error rate.

I'm...pretty confused by all of this. I came up with the logic for this problem literally four months ago, while on the analytics team, with an accompanying mental model that avoids cerebral segfaults around languages that aren't languages.

We don't call it a language, we call the lowest-level ("en.") subdomain a "project" and the higher level ("wikipedia.org") a project class. And then we instantiate a three-element map, "project" (string), "project_class" (string) "mobile" (bool, false by default):

  1. Strip out ":.*" to remove port.
  2. Separate by periods.
  3. If the first chunk is "www", drop.
  4. If the second (remaining) chunk is wap, mobile, m, or zero, "mobile" is true. Drop second remaining chunk.
  5. If there are >2 remaining chunks, no idea what this is. NULL both.
  6. If there is 1 remaining chunk, that's the project class. Set project to NULL.
  7. If there are 2 remaining chunks, project is the first and project_class is the second.

We avoid brittle regexes and we get what we want.

I think that was adapted from my original code but not actually said original code (it might be, but I don't use switches ;p)

Did you actually implement? This thing only returns project, but not project class. There is more than just 'mobile' too. What about zero, etc?

Then this was definitely _not_ my original code because I wouldn't have forgotten zero!

Sorry, take that back, it does return project class. This is basically a normalized domain as is. e.g. en.wikipedia, etc.

Well, I mean, it seems that this is attempting to return a normalized domain, independent of subdomains like m and zero. Did you actually write something that does what you say? I can't remember.

I did write that, expecting to use is_zero and access_method to know whether zero or mobile.

Iszero is based on the presence of zero tag in xanalytics, not the zero sub
domain in URL, from what I understood.

Change 215628 had a related patch set uploaded (by Joal):
Add host normalization udf and needed functions

https://gerrit.wikimedia.org/r/215628

@kevinator why is getting data out of the UA field a factor in host normalisation? Host normalisation does nothing to the data you'd need for that heuristic set.

Change 215628 merged by Madhuvishy:
Add host normalization udf and needed functions

https://gerrit.wikimedia.org/r/215628

Change 217229 had a related patch set uploaded (by Joal):
Add 2 fields and 1 update to refine table

https://gerrit.wikimedia.org/r/217229

Change 217229 merged by Ottomata:
Add 2 fields and 1 field update to refine table

https://gerrit.wikimedia.org/r/217229

verified done. It is in the refined logs now. Example struct for normalized_host:
{"project_class":"wikipedia","project":"pt","qualifiers":["m"],"tld":"org"}