
Mobile redirects drop provenance parameters
Closed, Resolved · Public · 9 Estimated Story Points

Description

When readers open a desktop Wikipedia link on a mobile device and are redirected to the mobile version, provenance (wprov) parameters in the URL are not passed along in our webrequest logs. While this is intended behavior in Varnish (to not fragment caches), it greatly complicates analysis of reader traffic being referred from external platforms like Youtube or TikTok. This appears to be unique to the wprov parameters. The goal of this task is to decide whether this should be addressed at the Varnish level (where the parameters are removed) or at the Analytics level (through more complex queries described below).

Details

Certain sites/apps provide a wprov=<value> parameter in Wikipedia links so we can see when someone uses that link (examples). The current example I'm working on is Youtube, which provides a wprov=yicw1 parameter every time someone clicks on one of their Wikipedia fact-checking links. For example, a Youtube search for the Kecksburg UFO incident will include this link (at least in the US): https://en.wikipedia.org/wiki/Kecksburg_UFO_incident?wprov=yicw1.

That link is always to the desktop article, even when the user is on mobile. Everything works as we would hope if someone clicks on that link on desktop: we see a 200 OK HTTP status (or 304) and a pageview in the webrequest logs with wprov=yicw1 in the uri_query field and x_analytics. The issue is that when someone clicks on that link on mobile, they instead trigger a 302 redirect to https://en.m.wikipedia.org/wiki/Kecksburg_UFO_incident. That 302 redirect has the wprov information associated with it, but it is not considered a pageview in our webrequest logs, so it is missing the pageview_info fields and has incorrect information for access_method (desktop instead of mobile). The resulting 200 OK for the mobile pageview is then missing the wprov information (both from uri_query and x_analytics) because it is stripped after first being seen by Varnish. This happens in about 95% of cases for Youtube given how dominant mobile is for them. We can do some workarounds (searching the uri_path field instead of pageview_info for the title) but they are hacky at best and far from obvious. Related: the correct referrer information is passed on to the 200, but that referrer information is missing for about half of the Youtube-originated pageviews (because web browsers, apps, etc. make referrers complicated), so doing analyses based on referrer does not solve the issue either.
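To make the workaround concrete: since the 302 rows carry wprov but no pageview_info, an analyst can recover the article title from uri_path instead. A minimal sketch (the helper name and the non-article check are illustrative, not part of the webrequest tooling):

```python
from urllib.parse import unquote

def title_from_uri_path(uri_path):
    """Hypothetical helper: recover an article title from a /wiki/<Title>
    path, for matching 302 redirect rows (which keep wprov but lack
    pageview_info) against their mobile 200 pageviews. Returns None
    for non-article paths."""
    prefix = "/wiki/"
    if not uri_path.startswith(prefix):
        return None
    # Titles are percent-encoded and use underscores in place of spaces.
    return unquote(uri_path[len(prefix):]).replace("_", " ")

# A 302 redirect row lacks pageview_info, so the title comes from uri_path:
print(title_from_uri_path("/wiki/Kecksburg_UFO_incident"))
# Kecksburg UFO incident
```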

Event Timeline

some analyses we do

Who is "we"?

Is this basically the same as T66318 many years ago? :)

Who is "we"?

Research team -- more clarity added to description. But it's not research-team-specific. Anyone who does analyses that depend on the wprov parameter or other URL parameters could easily be missing a lot of data -- e.g., all of the videos created for raising Wikipedia awareness used wprov parameters, articles shared from the Wikipedia apps.

Is this basically the same as T66318 many years ago? :)

Not sure, though it looks like the same effect. I'm more concerned with the analytics side, so the fix might be different (e.g., preserving the information in a different part of the request than the URL)

This task needs project tag(s) otherwise nobody will see this task (apart from subscribed folks).
However I am not sure which project tags this falls under. Maybe Wikimedia-Apache-configuration ?

Thanks @Aklapper -- I was told informally that Readers Web was where to start, which is why I looped in @ovasileva, but nobody I asked was certain.

@Isaac - unfortunately I haven't had a chance to look through this yet - tagging with our backlog for the time being so that it doesn't get lost

Thanks @ovasileva -- no worries, I just didn't want to bother a bunch of other teams if it fell under Readers Web

Jdlrobson subscribed.

@Isaac thanks for flagging. The redirecting of the desktop to the mobile site uses Varnish (last time I checked), so this is likely an ops task (with our input).

Interestingly, the mobile redirect code in varnish doesn't strip any parameters. The problem is that the analytics-side VCL code that consumes the wprov parameter also removes it from the URL (so that only analytics sees it, but it doesn't affect caching or get sent to MediaWiki, etc), so it's not present anymore at the time the redirect happens.

The mobile redirect code is at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/varnish/templates/text-frontend.inc.vcl.erb#101
While the analytics wprov code is at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/varnish/templates/analytics.inc.vcl.erb#128
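In rough Python terms (purely illustrative; the real logic is VCL in the files linked above), the stripping step behaves like this: the wprov value is stashed aside (ultimately into the X-WMF-PROV header) and removed from the URL, so the later mobile-redirect step never sees it.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def consume_wprov(url):
    """Python re-telling (for illustration only) of what the analytics
    VCL does: capture the wprov value and strip it from the URL, so
    downstream steps -- like the mobile redirect -- never see it."""
    parts = urlsplit(url)
    kept, wprov = [], None
    for key, value in parse_qsl(parts.query):
        if key == "wprov":
            wprov = value  # this is what would land in X-WMF-PROV
        else:
            kept.append((key, value))
    stripped = urlunsplit(parts._replace(query=urlencode(kept)))
    return stripped, wprov

url, prov = consume_wprov(
    "https://en.wikipedia.org/wiki/Kecksburg_UFO_incident?wprov=yicw1")
print(url)   # https://en.wikipedia.org/wiki/Kecksburg_UFO_incident
print(prov)  # yicw1
```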

There are probably other such cases aside from the mobile redirect, since basically anything will fail to copy that parameter after it's been removed early (MW-level redirects, or other traffic ones like the http->https case and others).

We could easily restore it manually in the mobile redirect case (tacking it back on from the data stored temporarily in the X-WMF-PROV header). It might be a little trickier but more-correct to do so for all such cases and/or for all redirects.

Before we go all the way down that path, we should probably check in with Analytics (ping @Nuria) on whether this is the right thing to do. It seems like it's probably removed for a good reason (as it's meant to be counted once on "entry" from the foreign site), and it might mess up their statistical accounting in some way if wprov is copied from the initial request to subsequent redirects (causing multiple counts for one actual case). Maybe the best approach would be to figure out that this is a redirect case first, and then hide the wprov parameter from analytics for the 3xx itself, then paste it back into the redirected-to URI so that it's still only counted once (although this runs the risk that we lose wprov information when the link is occasionally not followed for whatever reason).

Dzahn triaged this task as Medium priority. Jun 4 2020, 9:24 AM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

Data Engineering folks, this ticket needs some input from you 😊

Before we go all the way down that path, we should probably check in with Analytics (ping @Nuria) on whether this is the right thing to do. It seems like it's probably removed for a good reason (as it's meant to be counted once on "entry" from the foreign site), and it might mess up their statistical accounting in some way if wprov is copied from the initial request to subsequent redirects (causing multiple counts for one actual case). Maybe the best approach would be to figure out that this is a redirect case first, and then hide the wprov parameter from analytics for the 3xx itself, then paste it back into the redirected-to URI so that it's still only counted once (although this runs the risk that we lose wprov information when the link is occasionally not followed for whatever reason).

cc @Milimetric and @Ottomata who probably know the most about the current behaviour regarding wprov and mobile vs desktop view recording.

@BBlack: this was never our pipeline. It looks like @dr0ptp4kt's original idea was to remove wprov so it doesn't fragment the cache. We don't particularly care one way or another, it doesn't affect our datasets directly. But obviously if the mechanism chosen here creates duplicate data, we should consider what we could add to the duplicate requests so they can be filtered out later. Personally, I think it's way overdue that we just instrument pageviews in a declarative way instead of parsing them out of webrequest.

I did find one dataset where wprov is used, by Product-Analytics, so perhaps @mpopov, would want to chime in: https://codesearch.wmcloud.org/analytics/?q=wprov&i=nope&files=&excludeFiles=&repos=

(but as you can see from that search, nothing else refers to it specifically)

Moving back to incoming, this is not an Ops Week task.

Thank you @Milimetric for the ping! I missed this earlier in the month.

I did find one dataset where wprov is used, by Product-Analytics, so perhaps @mpopov, would want to chime in: https://codesearch.wmcloud.org/analytics/?q=wprov&i=nope&files=&excludeFiles=&repos=

Yes we currently rely on wprov to measure Wikipedia Preview usage (T261949). We (broad "we") went with the webrequest solution because at the time we couldn't send events from third-party sites that used the feature. Neil, Kate, and Nuria worked with Legal and Security back in 2020 to draft changes to the Privacy Policy language and for a while those changes were in limbo. I see that our Privacy Policy language finally (as of June 2021) allows us to send events from third-party sites.

The other thing is, even if the Inuka team updated the library to send events and the wprov mechanism was disabled altogether, our usage stats would be severely negatively impacted because (1) we would depend on users of this feature (third-party sites) to upgrade to the new Event Platform-using version of the library, and (2) intake-analytics.wikimedia.org is already on several ad blocker lists.

That's just some context and what I know of the situation. I have no horse in the race so I can't really speak about what could or should be done with how Wikipedia Preview usage is measured. Inuka is without a Product Manager, @nshahquinn-wmf is on sabbatical until end of June, so perhaps @SBisson can offer his thoughts here.

Thanks all for the input on this task and @BBlack especially for digging up what was happening. I finally updated the task description to reflect what I think is the current understanding of the situation but let me know if anything seems off.

I'm very intrigued @Milimetric about your comment about reinstrumenting pageviews in a declarative way (that sounds like it could help with some of our work around differential privacy too) though I assume that's a large large project.

A newer use-case / consideration around these provenance parameters: the Partnerships team has been working more closely with external platforms and is interested in having better data about the traffic that comes to Wikipedia from these non-search platforms -- e.g., TikTok, Youtube, Traveloka, etc. Tagging @KinneretG who has been leading this and the relevant task (T305620). This would likely lead to increased usage of wprov parameters given the issues inherent with referer data not being handled well between mobile apps (raising the priority of addressing this issue). The initial discussion leaned towards trying to find a long-term solution for storing these externally-referred pageviews in a more accessible form -- e.g., incorporating this data into pageview_hourly or storing it separately like referrer_daily for search referrals. If that's the case, the more complicated analytics logic is probably okay because it would happen in a single canonical place and not need to be replicated correctly for every individual analysis.

All to say, I don't think we have a solution for this yet but I'm okay with it remaining on hold while a more general solution for our non-search, external-platform traffic is discussed.

I'm very intrigued @Milimetric about your comment about reinstrumenting pageviews in a declarative way (that sounds like it could help with some of our work around differential privacy too) though I assume that's a large large project.

@Isaac: lots of us pay lots of technical debt *, on an ongoing basis, until we make this shift. Building an endpoint and a schema is simple, getting instrumentation code to people is simple, and deploying all this is medium-ish, so not "large". And it's just the kind of cross-team collaboration that we *have* to get good at. So I remain convinced that it's the right way forward here. I'm just not sure what the process is to convince others, so I burden you, dear reader, with my rant. Thanks for listening.

* this task is an example, the partnership conversations that have been aimless for years, the chrome user agent changes, differential privacy, etc.

Just for the record: as @mpopov said above, Inuka is using wprov as the main source of data on Wikipedia Preview, so it would be essential to consult me/them before making any moves to deprecate it.

More broadly, even if we do switch to event-based page view counting, I'm confident there will still be cases where we need to tag certain data so it can be found in webrequest. So I think it's important to continue supporting wprov.

Realizing I never linked any code for this in case folks wanted to work with the data but here's an example where I'm trying to grab both sources: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/wprov-extraction/add-wprov-data.py?ref_type=heads#L53

Okay, so it's been a few years now and this bug still exists and impacts the quality of our analyses substantially (especially for Future Audiences experiments that are aimed at mobile users) and we're not any closer to instrumenting pageviews.

I think that instead of talking about a complete revamp of our entire pageview pipeline as an eventual (if it ever even happens) indirect solution to this problem we should just directly solve this problem.

@lbowmaker: Any chance this could be picked up this quarter after all?

Hi team - @lbowmaker asked if I could take a look at this and provide some context. I've been having a think on this, and I'd like to ponder for up to a few more days and then provide some thoughts.

Originally, the thought was to be able to simply count relative volume of these types of inbound taps/clicks. Although we want fidelity on whether a link actually resolves to a page (and I know there are Phabricator comments about this here and elsewhere), often a simple count is sufficient to know if there's any traction whatsoever. I see that it's considered desirable to have a definite mapping of bona fide pageviews or previews (or other things of that nature) to these wprov values - makes sense.

My suggestion is to handle this in a vein similar to what @BBlack proposed, but with a twist:

  1. Rewrite redirects to have a different key in the URL, such as rprov, to establish that it was based on a redirect, but carry the value forward. Apply similar VCL rules to rprovs as to wprovs.
  2. Ensure that mdot subdomains remove this rprov value from the browser's URL bar. You'll notice that in many cases the wprov value is stripped from the URL on desktop, because of search-related code. We really ought to have this URL bar behavior on both desktop web and mobile web, and apply it to both wprov and rprov.

Why another key? This would be helpful for diagnosing possible rates of drop-off associated with bad redirects (faulty network, target page unavailable, etc.), and would be an opportunity to work with external linkers to have them update their links as appropriate.

@Isaac @mpopov @BBlack and any Data Platform Engineering folks monitoring this ticket: how would you feel about this?

Some additional context for the curious:

There's a long-running set of discussions about consolidating the FQDNs (e.g., instead of both en.wikipedia.org and en.m.wikipedia.org we just have en.wikipedia.org), so suggesting that an external linker point at an mdot domain is in some ways less than ideal, because we may consolidate to one domain per project-language pair at some point. But realistically, in mobile phone contexts, for the foreseeable future we're probably going to want them to link direct to an mdot URL anyway (it doesn't make sense to have an intervening redirect). And, suppose we start sending requests in the opposite direction, from mdot to a domain without mdot in it: we'd probably want to do the same thing of taking any wprov URLs on mdot domains and rewriting them to rprov on the target URL that isn't mdot (and ask any external linkers to update their stuff).

One idea with the provenance URLs was that if people have those URLs on external sites/apps and they are then re-used elsewhere, it's okay: in some sense we can point back to the very first use of a provenance URL as the reason why traffic is being destined for a particular page, because after all the backlink wouldn't exist if it hadn't started somewhere. It's easy to argue in another direction, but that was the basic idea. I don't know if I ever wrote that down, although I've mentioned it in some conversations, and thought it may be worth sharing here.

Remember, when users use the share sheet in their browsers, in general the canonical URL is the thing that will get shared, and that won't have a provenance parameter. But, when users aren't using the share sheet, they're much more likely to copy a URL bearing a provenance parameter. In the case of a desktop article, there's a decent chance the provenance parameter will have been removed by the search code as I mentioned, but I imagine there are some contexts where this isn't true (e.g., if the JS doesn't execute). In mobile web browsers, it's a reasonable enough behavior for users to go to the URL bar and copy it so they can paste it somewhere else (indeed, the share sheet targets can be glitchy, so this may be the most likely way for a person to be able to get a link sent to someone).

I had considered whether we may want to put something like rprov onto the end of a URL following a pound/hash/octothorpe (e.g., https://en.wikipedia.org/wiki/Cat#rprov=something1), but this still requires JS eventing to be able to count things - the server doesn't see the stuff after the pound symbol, by design of HTTP and hyperlinks. So we'd be introducing potential difficulty in counting things simply. I like the general idea of instrumenting pageviews and the like with JS somehow, so that it can work in concert with server-observed traffic to understand what is real (from the perspective of a JS-capable browser) and what is more likely bot traffic (although obviously a lot of bots can and do run JS with a headless browser), and for other properties like Dan has described. For this narrow solution, though, I think we probably need to focus on putting the stuff in the URL so our server-side VCL can spot it and it ultimately goes into the data lake.
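The point about fragments can be seen with a quick urllib check (illustrative only): the fragment is parsed client-side and is never part of the HTTP request line, so the server would never observe an rprov placed after the #.

```python
from urllib.parse import urlsplit

url = "https://en.wikipedia.org/wiki/Cat#rprov=something1"
parts = urlsplit(url)

# An HTTP request would carry only the path (and query string);
# everything after '#' stays in the browser.
print(parts.path)      # /wiki/Cat
print(parts.query)     # (empty -- rprov never reaches the server this way)
print(parts.fragment)  # rprov=something1
```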

I had also considered whether we might want some sort of pseudorandom value for being able to correlate the desktop URL where a redirect occurs and the mobile URL where the user lands, so that we can more definitively say "this redirect came from this request". I think that's an interesting problem to solve on a more general basis, but for this narrow solution, I think we probably don't want to solve for the bigger thing. It would be something of a scope creep for the problem at hand, in a way.

Very excited to see this gaining some traction (thanks @mpopov and @dr0ptp4kt)! Commenting on the analytics side of things (I don't know enough about Varnish to comment on implementation details):

Rewrite redirects to have a different key in the URL, such as rprov, to establish that it was based on a redirect, but carry the value forward.

I like the idea of not burying that the redirect happened, but I'm thinking about how to reach a state where analysis is simple and doesn't require knowledge of this Varnish magic and clauses that check for different variants of the wprov key. If it's not harder, could I suggest that we keep the mobile redirects as wprov still but perhaps add a second key when the redirect happens that's like rprov=1 or something like that? So an analyst can just do a simple where x_analytics_map['wprov'] = "<my value>" instead of where (x_analytics_map['wprov'] = "<my value>" OR x_analytics_map['rprov'] = "<my value>"). And if they care about the redirects, they can add in an AND x_analytics_map['rprov'] = 1 clause. Hopefully that would save folks from still accidentally excluding legitimate pageviews. My other, perhaps naive, question is whether the http_status codes are enough to tell us when this redirect is happening and whether we need to carry forward anything wprov-specific or can just infer when a redirect happened through them?
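The analyst ergonomics of this proposal can be sketched with a toy predicate over x_analytics_map (wprov is a real webrequest key; rprov is the proposal's, not an existing field): wprov always matches under its own key, and rprov=1 is a purely optional narrowing.

```python
def matches_wprov(x_analytics_map, value, redirects_only=False):
    """Toy filter illustrating the proposal: wprov is carried forward
    under its own key on redirected pageviews, and rprov=1 merely
    marks that a redirect happened."""
    if x_analytics_map.get("wprov") != value:
        return False
    if redirects_only and x_analytics_map.get("rprov") != "1":
        return False
    return True

# A mobile pageview that arrived via the desktop->mobile redirect
# still matches the plain wprov filter:
row = {"wprov": "yicw1", "rprov": "1"}
print(matches_wprov(row, "yicw1"))                       # True
print(matches_wprov(row, "yicw1", redirects_only=True))  # True
```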

+1 to Isaac's proposed solution of carrying wprov forward as wprov but also setting rprov=1 in case of a redirect to simplify analysis.

Okay, if I understand correctly, then the idea would be to...

  1. Continue "allowing" tagging of wprov for non-200 HTTP responses. It's mainly important that people don't accidentally count those as pageviews when they're not pageviews (i.e., they should be using is_pageview or something similarly precise). It's useful to be able to quickly zoom in on these sorts of requests anyway, so even for a 30x response it is nice to have.
  2. If there's a 30x response for a redirect from desktop to mobile web and the URL came bearing a wprov, add that same wprov parameter name-value pair and also add the parameter name-value pair rprov=1 to the target redirect URL (that's the thing that will be emitted in the Location: header).
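A sketch of the URL rewrite in the second step above, in Python rather than VCL for readability (the real change would live in the Varnish templates linked earlier; the function name is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def tag_redirect_target(location, wprov):
    """Append wprov=<value> and rprov=1 to a redirect's Location URL,
    per the proposal above. Illustrative only; the production change
    would be VCL, not Python."""
    parts = urlsplit(location)
    query = parse_qsl(parts.query)
    query += [("wprov", wprov), ("rprov", "1")]
    return urlunsplit(parts._replace(query=urlencode(query)))

print(tag_redirect_target(
    "https://en.m.wikipedia.org/wiki/Kecksburg_UFO_incident", "yicw1"))
# https://en.m.wikipedia.org/wiki/Kecksburg_UFO_incident?wprov=yicw1&rprov=1
```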

Do I understand correctly?

Regarding the question:

whether the http_status codes are enough to tell us when this redirect is happening and whether we need to carry forward anything wprov-specific or can just infer when a redirect happened through them?

They should be sufficient for determining when a redirect is happening. I have to say there could be some edge case I haven't thought of, but nothing jumps to mind. Given that the Referer header may not be present from the UA for the GET of the URL in the Location: header of the redirect, making sure the wprov name-value parameter pair and the rprov=1 name-value parameter pair are added to the target URL would be crucial if we want to cheaply ascertain that it was a redirect from a provenance URL; if we don't really care about rprov, we could just forgo it, I guess. This said, I may have misunderstood the question. Could you add some positive and negative cases of concern here if anything comes to mind?

As an additional matter, technically a redirect can happen on the same domain or it can happen between separate domains, and it may not involve any Varnish redirect logic now or introduced in the future (of course the desktop-to-mobile web redirect is the most well known of the redirects, and it's Varnish-based).

Here's a basic case using just 'en.m.wikipedia.org', where we know a redirect to another domain is unlikely, meaning that the target domain of the redirect is itself 'en.m.wikipedia.org', just for the ease of being able to hand check some things.

hive (wmf)> select count(1), http_status from webrequest where year = 2024 and month = 3 and day = 21 and webrequest_source = 'text' and
> uri_host = 'en.m.wikipedia.org' and x_analytics_map['wprov'] is not null group by
> http_status;
_c0	http_status
4	404
1	301
8	302
9	304
1912	200

hive (wmf)>  select uri_path, uri_query, http_method,  http_status from webrequest where year = 2024 and month = 3 and day = 21 and webrequest_source = 'text' and
> uri_host = 'en.m.wikipedia.org' and x_analytics_map['wprov'] is not null and http_status > 200;
...

This does happen, for example when a user requests an article title that resolves to a redirect to a section of a different article, when the first character of the article title needs to be title-cased, when search-y things like an ns0=1 namespace parameter are added onto a search URL string for the redirect target, and so on.

This most likely would occur because MediaWiki-side logic determines that it's necessary, but could happen from other parts of the stack. Technically, the MediaWiki-side logic (or other parts of the stack) should be responsible for carrying along any other URL parameters, so I think we'd want to exclude a consideration for that from the VCL (that is, the Varnish code should only care about desktop to mobile web redirects). The magnitude of these from looking at one day of en.m.wikipedia.org isn't ultra compelling right now, I think, for searching out all cases and accommodating them in Varnish code. The complexities involving # URL fragments can also be tricky. Lemme know though if you think it's worth handling. We can always add this later if we see a strong need for it, too. Interested to hear your thoughts.

Okay, if I understand correctly, then the idea would be to...

  1. Continue "allowing" tagging of wprov for non-200 HTTP responses. It's mainly important that people don't accidentally count those as pageviews when they're not pageviews (i.e., they should be using is_pageview or something similarly precise). It's useful to be able to quickly zoom in on these sorts of requests anyway, so even for a 30x response it is nice to have.
  2. If there's a 30x response for a redirect from desktop to mobile web and the URL came bearing a wprov, add that same wprov parameter name-value pair and also add the parameter name-value pair rprov=1 to the target redirect URL (that's the thing that will be emitted in the Location: header).

Do I understand correctly?

Yep, that sounds correct! Thank you!

@nshahquinn-wmf: As Business Data Steward for unique & devices and pageviews, does this solution sound good?

Okay, if I understand correctly, then the idea would be to...

  1. Continue "allowing" tagging of wprov for non-200 HTTP responses. It's mainly important that people don't accidentally count those as pageviews when they're not pageviews (i.e., they should be using is_pageview or something similarly precise). It's useful to be able to quickly zoom in on these sorts of requests anyway, so even for a 30x response it is nice to have.
  2. If there's a 30x response for a redirect from desktop to mobile web and the URL came bearing a wprov, add that same wprov parameter name-value pair and also add the parameter name-value pair rprov=1 to the target redirect URL (that's the thing that will be emitted in the Location: header).

Yes, this seems like a good solution to me!

Some small, weakly-held ideas:

  • I don't see much need to have a specific tag like rprov=1. As @Isaac said, we can break down the wprov-tagged requests by status code. @dr0ptp4kt I see your point that it would help us distinguish between the Varnish desktop-to-mobile redirection and other types of redirects. However, I think that will be an uncommon use case, which could still be mostly accomplished by looking at the request URL. Still, it doesn't do any harm either, so I don't mind it.
  • If we're going to have rprov=1, it would be slightly clearer to name it wprov_redir=1 or even wprov_redirected=1, since the length doesn't actually matter. MediaWiki happily produces query strings like title=Foo&date-range-to=2024-04-01&tagfilter=visualeditor&action=history.
LucasWerkmeister subscribed.

I believe this task can now be closed (not sure which status is best, let’s go with Resolved for now); thanks to T214998: RFC: Serve mobile and desktop variants through the same URL (unified mobile routing), opening https://en.wikipedia.org/wiki/Kecksburg_UFO_incident?wprov=yicw1 on a mobile device (tested in Firefox on Android) will now stay on the en.wikipedia.org domain (with the ?wprov= parameter intact) and display the mobile site there.