Mobile redirects drop provenance parameters
Open, Medium, Public, 9 Estimated Story Points

Description

When readers open a desktop Wikipedia link on a mobile device and are redirected to the mobile version, provenance (wprov) parameters in the URL are not carried over to the resulting mobile pageview in our webrequest logs. While this is intended behavior in Varnish (to avoid fragmenting caches), it greatly complicates analysis of reader traffic referred from external platforms like YouTube or TikTok. This appears to be unique to the wprov parameters. The goal of this task is to decide whether this should be addressed at the Varnish level (where the parameters are removed) or at the Analytics level (through the more complex queries described below).

Details

Certain sites/apps provide a wprov=<value> parameter in Wikipedia links so we can see when someone uses that link (examples). The current example I'm working on is YouTube, which provides a wprov=yicw1 parameter every time someone clicks on one of their Wikipedia fact-checking links. For example, a YouTube search for the Kecksburg UFO incident will include this link (at least in the US): https://en.wikipedia.org/wiki/Kecksburg_UFO_incident?wprov=yicw1.
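
For anyone who wants to pull the value out of a link like the one above, here is a minimal sketch using only the Python standard library. The function name extract_wprov is made up for illustration and is not part of any existing pipeline.

```python
from urllib.parse import urlsplit, parse_qs

def extract_wprov(url):
    """Return the wprov value from a URL, or None if the parameter is absent."""
    query = urlsplit(url).query
    values = parse_qs(query).get("wprov")
    return values[0] if values else None

# Desktop link as provided by the external platform: wprov is present.
print(extract_wprov("https://en.wikipedia.org/wiki/Kecksburg_UFO_incident?wprov=yicw1"))
# Mobile URL after the redirect: no wprov left to extract.
print(extract_wprov("https://en.m.wikipedia.org/wiki/Kecksburg_UFO_incident"))
```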

That link is always to the desktop article, even when the user is on mobile. Everything works as we would hope if someone clicks on that link on desktop: we see a 200 OK HTTP status (or 304) and a pageview in the webrequest logs with wprov=yicw1 in the uri_query field and x_analytics.

The issue is that when someone clicks on that link on mobile, they instead trigger a 302 redirect to https://en.m.wikipedia.org/wiki/Kecksburg_UFO_incident. That 302 redirect has the wprov information associated with it, but it is not considered a pageview in our webrequest logs, so it is missing the pageview_info fields and has incorrect information for access_method (desktop instead of mobile). The resulting 200 OK for the mobile pageview is then missing the wprov information (both from uri_query and x_analytics) because it is stripped after first being seen by Varnish. This happens in about 95% of cases for YouTube given how dominant mobile is for them.

We can do some workarounds (searching the uri_path field instead of pageview_info for the title) but they are hacky at best and far from obvious. Related: the correct referrer information is passed on to the 200, but that referrer information is missing for about half of the YouTube-originated pageviews (because of web browsers, apps, etc. making referrers complicated), so doing analyses based on referrer does not solve the issue either.
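
To make the workaround concrete, here is a rough sketch of the row-level logic in plain Python. It is an illustration under assumptions, not the query we actually run: rows are modelled as dicts, http_status is assumed to be a string field alongside the uri_path, uri_query and access_method fields named above, "mobile web" is assumed to be the access_method label for the mobile site, and every 302 carrying wprov is assumed to be a mobile redirect that the reader followed.

```python
from urllib.parse import parse_qs, unquote

PAGEVIEW_STATUSES = {"200", "304"}
WIKI_PREFIX = "/wiki/"

def wprov_of(uri_query):
    """Return the wprov value from a uri_query string such as '?wprov=yicw1'."""
    values = parse_qs(uri_query.lstrip("?")).get("wprov")
    return values[0] if values else None

def attribute(row):
    """Return (title, access_method, wprov), or None if the row has no usable wprov."""
    wprov = wprov_of(row["uri_query"])
    if wprov is None or not row["uri_path"].startswith(WIKI_PREFIX):
        return None
    # The title comes from uri_path because 302 rows have no pageview_info.
    title = unquote(row["uri_path"][len(WIKI_PREFIX):])
    if row["http_status"] in PAGEVIEW_STATUSES:
        return (title, row["access_method"], wprov)
    if row["http_status"] == "302":
        # Assumption: this 302 is the desktop-to-mobile redirect and was followed,
        # so the logged access_method (desktop) is overridden.
        return (title, "mobile web", wprov)
    return None
```

The override in the 302 branch is exactly the kind of non-obvious step that makes this approach fragile: it overcounts whenever a redirect is not followed or is not a mobile redirect at all.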

Event Timeline

some analyses we do

Who is "we"?

Is this basically the same as T66318 many years ago? :)

Who is "we"?

Research team -- more clarity added to description. But it's not research-team-specific. Anyone who does analyses that depend on the wprov parameter or other URL parameters could easily be missing a lot of data -- e.g., all of the videos created for raising Wikipedia awareness used wprov parameters, articles shared from the Wikipedia apps.

Is this basically the same as T66318 many years ago? :)

Not sure, though it looks like the same effect. I'm more concerned with the analytics side though, so the fix might be different (e.g., preserving the information in a different part of the request than the URL).

This task needs project tag(s), otherwise nobody will see this task (apart from subscribed folks).
However, I am not sure which project tags this falls under. Maybe Wikimedia-Apache-configuration?

Thanks @Aklapper -- I was told informally that Readers Web was where to start, which is why I looped in @ovasileva, but nobody I asked was certain.

@Isaac - unfortunately I haven't had a chance to look through this yet - tagging with our backlog for the time being so that it doesn't get lost

Thanks @ovasileva -- no worries, I just didn't want to bother a bunch of other teams if it fell under Readers Web

Jdlrobson added a subscriber: Jdlrobson.

@Isaac thanks for flagging. The redirect from the desktop site to the mobile site is handled in Varnish (last time I checked), so this is likely an ops task (with our input).

Interestingly, the mobile redirect code in Varnish doesn't strip any parameters. The problem is that the analytics-side VCL code that consumes the wprov parameter also removes it from the URL (so that only analytics sees it, but it doesn't affect caching or get sent to MediaWiki, etc.), so it's no longer present by the time the redirect happens.

The mobile redirect code is at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/varnish/templates/text-frontend.inc.vcl.erb#101
While the analytics wprov code is at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/varnish/templates/analytics.inc.vcl.erb#128
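
As a rough illustration of why the ordering matters, here is a small Python model of the two steps. This is not the VCL itself: consume_wprov and mobile_redirect are hypothetical stand-ins, the host rewrite is simplified to Wikipedia domains, and the X-WMF-PROV header name is the one mentioned further down in this comment.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def consume_wprov(url, headers):
    """Model of the analytics step: stash wprov in a header, then drop it from the URL."""
    parts = urlsplit(url)
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "wprov":
            headers["X-WMF-PROV"] = value
        else:
            kept.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(kept)))

def mobile_redirect(url):
    """Model of the redirect step: same path and query, mobile host."""
    parts = urlsplit(url)
    host = parts.netloc.replace(".wikipedia.org", ".m.wikipedia.org", 1)
    return urlunsplit(parts._replace(netloc=host))

headers = {}
url = "https://en.wikipedia.org/wiki/Kecksburg_UFO_incident?wprov=yicw1"
# The analytics step runs first, so the redirect target is built from a URL
# that no longer contains wprov; the value survives only in the header.
location = mobile_redirect(consume_wprov(url, headers))
print(location)
print(headers)
```

Calling mobile_redirect(url) directly on the original URL keeps the parameter, which matches the observation that the redirect code itself strips nothing.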

There are probably other such cases aside from the mobile redirect, since basically anything will fail to copy that parameter after it's been removed early (MW-level redirects, or other traffic ones like the http->https case and others).

We could easily restore it manually in the mobile redirect case (tacking it back on from the data stored temporarily in the X-WMF-PROV header). It might be a little trickier, but more correct, to do so for all such cases and/or for all redirects.
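
In Python terms, the "tack it back on" idea would look roughly like the sketch below. The real change would be in VCL; restore_wprov is a hypothetical helper, and the only thing taken from the existing setup is the X-WMF-PROV header name.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def restore_wprov(location, headers):
    """Append wprov from the stashed header to a redirect target, if one was stashed."""
    wprov = headers.get("X-WMF-PROV")
    if not wprov:
        return location
    parts = urlsplit(location)
    # Drop any wprov already present so the parameter is never duplicated.
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != "wprov"]
    params.append(("wprov", wprov))
    return urlunsplit(parts._replace(query=urlencode(params)))

print(restore_wprov("https://en.m.wikipedia.org/wiki/Kecksburg_UFO_incident",
                    {"X-WMF-PROV": "yicw1"}))
```

One caveat with this sketch: urlencode re-encodes the other query parameters, so a real implementation would want to append to the raw query string rather than rebuild it.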

Before we go all the way down that path, we should probably check in with Analytics (ping @Nuria) whether this is the right thing to do. It seems like it's probably removed for a good reason (as it's meant to be counted once on "entry" from the foreign site), and it might mess up their statistical accounting in some way if wprov is copied from the initial request to subsequent redirects (causing multiple counts for one actual case). Maybe the best approach would be to figure out that this is a redirect case first, and then hide the wprov parameter from analytics for the 3xx itself, then paste it back into the redirected-to URI so that it's still only counted once (although this runs the risk that we lose wprov information when the link is occasionally not followed for whatever reason).
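
To illustrate the double-counting concern, here is a toy Python sketch of counting wprov values with and without the redirect responses. The record shape (a dict with status and wprov) and the count_wprov function are made up for the example and do not reflect how analytics actually counts.

```python
from collections import Counter

# Redirect statuses only; 304 Not Modified is deliberately not in this set.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}

def count_wprov(records, skip_redirects=True):
    """Count wprov values, optionally ignoring redirect responses."""
    counts = Counter()
    for record in records:
        wprov = record.get("wprov")
        if wprov is None:
            continue
        if skip_redirects and record["status"] in REDIRECT_STATUSES:
            continue
        counts[wprov] += 1
    return counts

# One reader, one click: a 302 on desktop, then a 200 on mobile with wprov restored.
followed = [{"status": 302, "wprov": "yicw1"}, {"status": 200, "wprov": "yicw1"}]
print(count_wprov(followed, skip_redirects=False))  # both responses are counted
print(count_wprov(followed))                        # only the final response is counted

# The risk noted above: if the redirect is never followed, nothing is counted.
not_followed = [{"status": 302, "wprov": "yicw1"}]
print(count_wprov(not_followed))
```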

Dzahn triaged this task as Medium priority. Jun 4 2020, 9:24 AM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

Data Engineering folks, this ticket needs some input from you 😊

Before we go all the way down that path, we should probably check in with Analytics (ping @Nuria) whether this is the right thing to do. It seems like it's probably removed for a good reason (as it's meant to be counted once on "entry" from the foreign site), and it might mess up their statistical accounting in some way if wprov is copied from the initial request to subsequent redirects (causing multiple counts for one actual case). Maybe the best approach would be to figure out that this is a redirect case first, and then hide the wprov parameter from analytics for the 3xx itself, then paste it back into the redirected-to URI so that it's still only counted once (although this runs the risk that we lose wprov information when the link is occasionally not followed for whatever reason).

cc @Milimetric and @Ottomata who probably know the most about the current behaviour regarding wprov and mobile vs desktop view recording.

@BBlack: this was never our pipeline. It looks like @dr0ptp4kt's original idea was to remove wprov so it doesn't fragment the cache. We don't particularly care one way or another; it doesn't affect our datasets directly. But obviously if the mechanism chosen here creates duplicate data, we should consider what we could add to the duplicate requests so they can be filtered out later. Personally, I think it's way overdue that we just instrument pageviews in a declarative way instead of parsing them out of webrequest.

I did find one dataset where wprov is used, by Product-Analytics, so perhaps @mpopov, would want to chime in: https://codesearch.wmcloud.org/analytics/?q=wprov&i=nope&files=&excludeFiles=&repos=

(but as you can see from that search, nothing else refers to it specifically)

Moving back to incoming, this is not an Ops Week task.

Thank you @Milimetric for the ping! I missed this earlier in the month.

I did find one dataset where wprov is used, by Product-Analytics, so perhaps @mpopov, would want to chime in: https://codesearch.wmcloud.org/analytics/?q=wprov&i=nope&files=&excludeFiles=&repos=

Yes we currently rely on wprov to measure Wikipedia Preview usage (T261949). We (broad "we") went with the webrequest solution because at the time we couldn't send events from third-party sites that used the feature. Neil, Kate, and Nuria worked with Legal and Security back in 2020 to draft changes to the Privacy Policy language and for a while those changes were in limbo. I see that our Privacy Policy language finally (as of June 2021) allows us to send events from third-party sites.

The other thing is, even if the Inuka team updated the library to send events and the wprov mechanism was disabled altogether, our usage stats would be severely negatively impacted because (1) we would depend on users of this feature (third-party sites) to upgrade to the new Event Platform-using version of the library, and (2) intake-analytics.wikimedia.org is already on several ad blocker lists.

That's just some context and what I know of the situation. I have no horse in the race so I can't really speak about what could or should be done with how Wikipedia Preview usage is measured. Inuka is without a Product Manager, @nshahquinn-wmf is on sabbatical until end of June, so perhaps @SBisson can offer his thoughts here.

Thanks all for the input on this task and @BBlack especially for digging up what was happening. I finally updated the task description to reflect what I think is the current understanding of the situation but let me know if anything seems off.

I'm very intrigued @Milimetric about your comment about reinstrumenting pageviews in a declarative way (that sounds like it could help with some of our work around differential privacy too) though I assume that's a large large project.

A newer use-case / consideration around these provenance parameters: the Partnerships team has been working more closely with external platforms and is interested in having better data about the traffic that comes to Wikipedia from these non-search platforms -- e.g., TikTok, YouTube, Traveloka, etc. Tagging @KinneretG, who has been leading this, and the relevant task (T305620). This would likely lead to increased usage of wprov parameters, given the issues inherent with referer data not being handled well between mobile apps (raising the priority of addressing this issue). The initial discussion leaned towards trying to find a long-term solution for storing these externally-referred pageviews in a more accessible form -- e.g., incorporating this data into pageview_hourly or storing it separately like referrer_daily for search referrals. If that's the case, the more complicated analytics logic is probably okay because it would happen in a single canonical place and not need to be replicated correctly for every individual analysis.

All to say, I don't think we have a solution for this yet but I'm okay with it remaining on hold while a more general solution for our non-search, external-platform traffic is discussed.

I'm very intrigued @Milimetric about your comment about reinstrumenting pageviews in a declarative way (that sounds like it could help with some of our work around differential privacy too) though I assume that's a large large project.

@Isaac: lots of us pay lots of technical debt *, on an ongoing basis, until we make this shift. Building an endpoint and a schema is simple, getting people instrumentation code is simple, and deploying all this is medium-ish, so not "large". And it's just the kind of cross-team collaboration that we *have* to get good at. So I remain convinced that it's the right way forward here. I'm just not sure what the process is to convince others, so I burden you, dear reader, with my rant. Thanks for listening.

* this task is an example, the partnership conversations that have been aimless for years, the chrome user agent changes, differential privacy, etc.

Just for the record: as @mpopov said above, Inuka is using wprov as the main source of data on Wikipedia Preview, so it would be essential to consult me/them before making any moves to deprecate it.

More broadly, even if we do switch to event-based page view counting, I'm confident there will still be cases where we need to tag certain data so it can be found in webrequest. So I think it's important to continue supporting wprov.

Realizing I never linked any code for this in case folks wanted to work with the data but here's an example where I'm trying to grab both sources: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/wprov-extraction/add-wprov-data.py?ref_type=heads#L53

Okay, so it's been a few years now and this bug still exists and impacts the quality of our analyses substantially (especially for Future Audiences experiments that are aimed at mobile users) and we're not any closer to instrumenting pageviews.

I think that instead of talking about a complete revamp of our entire pageview pipeline as an eventual (if it ever even happens) indirect solution to this problem we should just directly solve this problem.

@lbowmaker: Any chance this could be picked up this quarter after all?

Hi team - @lbowmaker asked if I could take a look at this and provide some context. I was having a think on this, and I'd like to ponder up to a few more days and provide some thoughts.