When readers open a desktop Wikipedia link on a mobile device and are redirected to the mobile version, the provenance (wprov) parameters in the URL do not carry over to the resulting pageview in our webrequest logs. While this is intended behavior in Varnish (to avoid fragmenting caches), it greatly complicates analysis of reader traffic referred from external platforms like YouTube or TikTok. This appears to be unique to the wprov parameters. The goal of this task is to decide whether this should be addressed at the Varnish level (where the parameters are removed) or at the Analytics level (through the more complex queries described below).
Details
Certain sites/apps provide a wprov=<value> parameter in Wikipedia links so we can see when someone uses that link (examples). The current example I'm working on is YouTube, which adds a wprov=yicw1 parameter every time someone clicks on one of their Wikipedia fact-checking links. For example, a YouTube search for the Kecksburg UFO incident will include this link (at least in the US): https://en.wikipedia.org/wiki/Kecksburg_UFO_incident?wprov=yicw1.
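To make the parameter concrete, here is a minimal Python sketch that pulls the wprov value out of a link like the one above. The helper name `extract_wprov` is hypothetical and only for illustration; it uses nothing beyond the standard library.

```python
from urllib.parse import urlsplit, parse_qs


def extract_wprov(url):
    """Return the wprov value from a URL, or None if the parameter is absent."""
    query = urlsplit(url).query
    values = parse_qs(query).get("wprov")
    return values[0] if values else None


link = "https://en.wikipedia.org/wiki/Kecksburg_UFO_incident?wprov=yicw1"
print(extract_wprov(link))  # prints "yicw1"
```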
That link always points to the desktop article, even when the user is on mobile. Everything works as we would hope if someone clicks on that link on desktop: we see a 200 OK HTTP status (or 304) and a pageview in the webrequest logs with wprov=yicw1 in the uri_query field and in x_analytics.

The issue is that when someone clicks on that link on mobile, they instead trigger a 302 redirect to https://en.m.wikipedia.org/wiki/Kecksburg_UFO_incident. That 302 redirect has the wprov information associated with it, but it is not considered a pageview in our webrequest logs, so it is missing the pageview_info fields and has incorrect information for access_method (desktop instead of mobile). The resulting 200 OK for the mobile pageview is then missing the wprov information (both from uri_query and x_analytics) because the parameter is stripped after Varnish first sees it. This happens in about 95% of cases for YouTube given how dominant mobile is for them.

We can do some workarounds (searching the uri_path field instead of pageview_info for the title), but they are hacky at best and far from obvious. Related: the correct referrer information is passed on to the 200, but that referrer information is missing for about half of the YouTube-originated pageviews (because web browsers, apps, etc. make referrers complicated), so analyses based on referrer do not solve the issue either.
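As a rough illustration of the uri_path workaround, the following Python sketch counts wprov-tagged requests per title over a handful of made-up records. The record layout is an assumption: uri_path, uri_query and access_method are the fields named above, while http_status as a string column, the leading "?" in uri_query, and the "/wiki/" path prefix are assumptions about the log format. Treating every wprov-tagged 302 as a mobile redirect is also an assumption that would need checking against real data. All helper names are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qs, unquote


def wprov_of(record):
    """Read wprov from uri_query (assumed to look like '?wprov=yicw1')."""
    values = parse_qs(record["uri_query"].lstrip("?")).get("wprov")
    return values[0] if values else None


def title_from_path(uri_path):
    """Recover the page title from uri_path, since 302s lack pageview_info."""
    prefix = "/wiki/"  # assumption: standard article path prefix
    if uri_path.startswith(prefix):
        return unquote(uri_path[len(prefix):])
    return None


def count_wprov_hits(records, wprov):
    """Count desktop pageviews and (assumed) mobile redirects per title."""
    counts = Counter()
    for record in records:
        if wprov_of(record) != wprov:
            continue
        title = title_from_path(record["uri_path"])
        if title is None:
            continue
        if record["http_status"] in ("200", "304"):
            counts[(title, "desktop pageview")] += 1
        elif record["http_status"] == "302":
            # Assumption: a wprov-tagged 302 is the desktop-to-mobile redirect.
            counts[(title, "mobile redirect")] += 1
    return counts


# Hypothetical sample records, not real log data.
sample = [
    {"uri_path": "/wiki/Kecksburg_UFO_incident", "uri_query": "?wprov=yicw1",
     "http_status": "200", "access_method": "desktop"},
    {"uri_path": "/wiki/Kecksburg_UFO_incident", "uri_query": "?wprov=yicw1",
     "http_status": "302", "access_method": "desktop"},
    {"uri_path": "/wiki/Kecksburg_UFO_incident", "uri_query": "",
     "http_status": "200", "access_method": "mobile web"},
]

for key, n in sorted(count_wprov_hits(sample, "yicw1").items()):
    print(key, n)
```

The third sample record stands in for the mobile 200 that follows the redirect: it carries no wprov, so it is skipped, which is exactly why the 302 has to be counted in its place.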