Mobile redirects drop provenance parameters
Open, Medium, Public

Description

I think this falls under the Readers Web domain, but feel free to send me elsewhere if not. In short, when readers open a desktop Wikipedia link on a mobile device and are redirected to the mobile version, the parameters in the URL are not carried over to the resulting pageview in our webrequest logs, which greatly complicates some analyses that the research team does. I assume it's a bug, but I don't know enough to tell whether there's a reason for the behavior.

Details

Certain sites/apps provide a wprov=<value> parameter in Wikipedia links so we can see when someone uses that link (examples: https://wikitech.wikimedia.org/wiki/Provenance). The current example I'm working on is YouTube, which provides a wprov=yicw1 parameter every time someone clicks on one of their Wikipedia fact-checking links. For example, a YouTube search for the Kecksburg UFO incident will include this link (at least in the US): https://en.wikipedia.org/wiki/Kecksburg_UFO_incident?wprov=yicw1.

That link always points to the desktop article, even when the user is on mobile. Everything works as we would hope if someone clicks on that link on desktop: we see a 200 OK HTTP status (or 304) and a pageview in the webrequest logs with wprov=yicw1 in the uri_query field and x_analytics. The issue is that when someone clicks on that link on mobile, they instead trigger a 302 redirect to https://en.m.wikipedia.org/wiki/Kecksburg_UFO_incident. That 302 redirect has the wprov information associated with it, but it is not considered a pageview in our webrequest logs, so it is missing the pageview_info fields and has incorrect information for access_method (desktop instead of mobile). The resulting 200 OK for the mobile pageview is then missing the wprov information (both from uri_query and x_analytics) because it is evidently stripped in the mobile redirect. This happens in about 95% of cases for YouTube given how dominant mobile is for them. We can do some workarounds (searching the uri_path field instead of pageview_info for the title), but they are hacky at best and far from obvious. Related: the correct referrer information is passed on to the 200, but that referrer information is missing for about half of the YouTube-originated pageviews (because web browsers, apps, etc. make referrers complicated), so doing analyses based on referrer does not solve the issue either.
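For reference, the workaround mentioned above can be sketched roughly as follows. This is only an illustration: the row is a plain dict whose keys mirror the webrequest fields named in this task (http_status, uri_path, uri_query), the assumption that uri_query starts with "?" is mine, and the helper name `wprov_title_from_redirect` is made up for this example rather than taken from any existing pipeline.

```python
from urllib.parse import parse_qs, unquote


def wprov_title_from_redirect(row):
    """Return (title, wprov) for a 302 row that carries wprov, else None.

    `row` is a hypothetical dict with webrequest-like keys; the real
    table has many more fields and this helper is illustrative only.
    """
    if str(row.get("http_status")) != "302":
        return None
    # Assumed format: uri_query looks like "?wprov=yicw1"
    params = parse_qs(row.get("uri_query", "").lstrip("?"))
    wprov = params.get("wprov", [None])[0]
    path = row.get("uri_path", "")
    if wprov is None or not path.startswith("/wiki/"):
        return None
    # pageview_info is empty on redirects, so take the title from the path
    title = unquote(path[len("/wiki/"):])
    return title, wprov


example = {
    "http_status": "302",
    "uri_path": "/wiki/Kecksburg_UFO_incident",
    "uri_query": "?wprov=yicw1",
}
print(wprov_title_from_redirect(example))
```

Note that this still leaves access_method wrong for these rows, which is part of why the workaround is unsatisfying.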

The question is whether this can reasonably be fixed, where "fixed" means that the wprov information is preserved at least in the x_analytics field for the redirected 200 pageview, and ideally in the uri_query field too. I might be missing something related to cache performance or privacy or something else, but because the referrer information is preserved through the mobile redirect, I expect the uri_query parameters could be as well. I care most specifically about the wprov parameter, but presumably other URL parameters should be preserved as well.

Event Timeline

some analyses we do

Who is "we"?

Is this basically the same as T66318 many years ago? :)

Who is "we"?

Research team -- more clarity added to description. But it's not research-team-specific. Anyone who does analyses that depend on the wprov parameter or other URL parameters could easily be missing a lot of data -- e.g., all of the videos created for raising Wikipedia awareness used wprov parameters, as do articles shared from the Wikipedia apps.

Is this basically the same as T66318 many years ago? :)

Not sure, though it looks like the same effect. I'm more concerned with the analytics side, so the fix might be different (e.g., preserving the information in a different part of the request than the URL).

This task needs project tag(s), otherwise nobody will see it (apart from subscribed folks).
However, I am not sure which project tags this falls under. Maybe Wikimedia-Apache-configuration?

Thanks @Aklapper -- I was told informally that Readers Web was where to start, which is why I looped in @ovasileva, but nobody I asked was certain.

@Isaac - unfortunately I haven't had a chance to look through this yet - tagging with our backlog for the time being so that it doesn't get lost

Thanks @ovasileva -- no worries, I just didn't want to bother a bunch of other teams if it fell under Readers Web

Jdlrobson added a subscriber: Jdlrobson.

@Isaac thanks for flagging. The redirect from the desktop site to the mobile site is handled by Varnish (last time I checked), so this is likely an ops task (with our input).

Interestingly, the mobile redirect code in varnish doesn't strip any parameters. The problem is that the analytics-side VCL code that consumes the wprov parameter also removes it from the URL (so that only analytics sees it, but it doesn't affect caching or get sent to MediaWiki, etc), so it's not present anymore at the time the redirect happens.

The mobile redirect code is at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/varnish/templates/text-frontend.inc.vcl.erb#101
While the analytics wprov code is at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/varnish/templates/analytics.inc.vcl.erb#128

There are probably other such cases aside from the mobile redirect, since basically anything will fail to copy that parameter after it's been removed early (MW-level redirects, or other traffic ones like the http->https case and others).

We could easily restore it manually in the mobile redirect case (tacking it back on from the data stored temporarily in the X-WMF-PROV header). It might be a little trickier, but more correct, to do so for all such cases and/or for all redirects.
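To make the mobile-redirect option concrete, here is a rough Python model of the logic (the real change would live in VCL, not Python). It assumes that X-WMF-PROV holds the bare wprov value, which I have not verified against the VCL, and the function name and the simplistic host rewrite are purely illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def mobile_redirect_location(url, headers):
    """Build the mobile redirect target, re-adding wprov if it was stripped.

    Assumes (for illustration) that X-WMF-PROV holds the bare wprov value.
    """
    parts = urlsplit(url)
    # Naive host rewrite: en.wikipedia.org -> en.m.wikipedia.org
    lang, rest = parts.netloc.split(".", 1)
    host = f"{lang}.m.{rest}"
    query = parse_qsl(parts.query, keep_blank_values=True)
    prov = headers.get("X-WMF-PROV")
    # Only tack wprov back on if it is no longer present in the URL
    if prov and not any(key == "wprov" for key, _ in query):
        query.append(("wprov", prov))
    return urlunsplit((parts.scheme, host, parts.path, urlencode(query), ""))


stripped_url = "https://en.wikipedia.org/wiki/Kecksburg_UFO_incident"
print(mobile_redirect_location(stripped_url, {"X-WMF-PROV": "yicw1"}))
```

Doing this for every redirect type would mean applying the same step wherever a Location header is generated, which is the trickier part.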

Before we go all the way down that path, we should probably check in with Analytics (ping @Nuria) on whether this is the right thing to do. It seems like it's probably removed for a good reason (as it's meant to be counted once on "entry" from the foreign site), and it might mess up their statistical accounting in some way if wprov is copied from the initial request to subsequent redirects (causing multiple counts for one actual case). Maybe the best approach would be to detect that this is a redirect case first, then hide the wprov parameter from analytics for the 3xx itself, then paste it back into the redirected-to URI so that it's still only counted once (although this runs the risk that we lose wprov information when the link is occasionally not followed for whatever reason).
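The "count once" idea can be sketched like this, again as a Python model rather than real VCL. The function name and the set of redirect statuses are assumptions for illustration; 304 is deliberately left out of that set because, per the task description, a 304 is still a pageview.

```python
# Statuses treated as redirects in this sketch (an assumption, not the VCL's list)
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def provenance_for_analytics(status, wprov):
    """Return the wprov value analytics should record for this response.

    Redirect responses record nothing; the value is expected to be carried
    on to the redirected-to URL and recorded on the final response instead.
    """
    if wprov is None or status in REDIRECT_STATUSES:
        return None
    return wprov


# Hypothetical request chain: desktop 302, then mobile 200
chain = [(302, "yicw1"), (200, "yicw1")]
recorded = [provenance_for_analytics(status, wprov) for status, wprov in chain]
counted = [value for value in recorded if value is not None]
print(counted)
```

The trade-off noted above still applies: if the client never follows the redirect, nothing is recorded for that entry.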

Dzahn triaged this task as Medium priority. Jun 4 2020, 9:24 AM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

Data Engineering folks, this ticket needs some input from you 😊

cc @Milimetric and @Ottomata who probably know the most about the current behaviour regarding wprov and mobile vs desktop view recording.

@BBlack: this was never our pipeline. It looks like @dr0ptp4kt's original idea was to remove wprov so it doesn't fragment the cache. We don't particularly care one way or another; it doesn't affect our datasets directly. But obviously, if the mechanism chosen here creates duplicate data, we should consider what we could add to the duplicate requests so they can be filtered out later. Personally, I think it's way overdue that we just instrument pageviews in a declarative way instead of parsing them out of webrequest.

I did find one dataset where wprov is used, by Product-Analytics, so perhaps @mpopov would want to chime in: https://codesearch.wmcloud.org/analytics/?q=wprov&i=nope&files=&excludeFiles=&repos=

(but as you can see from that search, nothing else refers to it specifically)