
Mark mobile Wikipedia Preview page summary requests using wprov parameter
Closed, ResolvedPublic

Description

Wikipedia Preview already includes a wprov URL query parameter when it requests page summaries, so that we can count those requests as a proxy for the number of viewed previews. However, currently, we cannot distinguish previews shown on mobile devices from those shown on computers.

To address this, these requests should also be tagged when Wikipedia Preview considers the device a mobile device (based on having touch capabilities; this is different from MediaWiki's definition, but should produce close enough results). In this case, we should add an X-Analytics HTTP header with the value wikipedia-preview-is-touch=1. It isn't possible to use a query parameter because that will cause cache fragmentation; the wprov parameter is an exception because there is special cache logic that immediately moves the value into the X-Analytics header.
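To illustrate the cache-fragmentation point, here is a minimal Python sketch of the idea of lifting wprov out of the URL so that requests differing only in that parameter share one cache key. It is not the actual cache logic; the function name `split_wprov` and the example URL are assumptions made for illustration.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def split_wprov(url: str) -> Tuple[str, Optional[str]]:
    """Return (url_without_wprov, wprov_value) for a request URL.

    Hypothetical helper: it only models the idea that the wprov value is
    taken out of the query string (so it no longer affects the cache key)
    and carried separately, e.g. in X-Analytics.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # First wprov value, if the parameter is present at all.
    wprov = next((value for key, value in pairs if key == "wprov"), None)
    # Every other query parameter stays in the cache key.
    remaining = [(key, value) for key, value in pairs if key != "wprov"]
    cache_url = urlunsplit(parts._replace(query=urlencode(remaining)))
    return cache_url, wprov


# Example URL is an assumption; both variants map to the same cache key.
base = "https://en.wikipedia.org/api/rest_v1/page/summary/Ivory"
desktop_key, desktop_tag = split_wprov(base + "?wprov=wppw1")
touch_key, touch_tag = split_wprov(base + "?wprov=wppw1t")
```

Any other query parameter would remain in the URL, which is why adding a new one would split the cache.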

Event Timeline

SBisson edited projects, added Inuka-Team (Kanban); removed Inuka-Team.
SBisson moved this task from Backlog to Dev on the Inuka-Team (Kanban) board.

@nshahquinn-wmf when I add the header, strict-origin-when-cross-origin kicks in and the browser blocks the requests. Do you know anything about that?

If we can't solve this, would we consider distinct wprov for mobile and desktop?

As I understand it, X-Analytics is an HTTP response header, not an HTTP request header. Response headers are set server-side and travel from the backend service (e.g. MediaWiki) through our CDN edge and then on to the external client. Varnishkafka picks up these values, asynchronously, from the CDN edge.

As a request header, afaik, this header does not currently exist and would have no effect, even if browsers allowed such headers to be sent across origins by default. I believe this restriction is by design, as it would otherwise be too easy to inject invalid or confusing data into our Analytics pipelines. Right now, the value is considered trusted and not in need of much validation, which is possible because it only comes from responses, which are naturally trusted afaik.

@nshahquinn-wmf when I add the header, strict-origin-when-cross-origin kicks in and the browser blocks the requests. Do you know anything about that?

I don't know anything about this, but I assume it's something we can't easily change!

If we can't solve this, would we consider distinct wprov for mobile and desktop?

Yes, I think we should go ahead and do that. Not sure why I didn't consider it in the first place! It would have the additional benefit that we can apply the distinction to pageviews as well as previews, and thus have exactly comparable numbers. What do you think about changing the wppw1 to wppw1t if it's a touch device?

As I understand it, X-Analytics is an HTTP response header, not an HTTP request header. Response headers are set server-side and travel from the backend service (e.g. MediaWiki) through our CDN edge and then on to the external client. Varnishkafka picks up these values, asynchronously, from the CDN edge.

As a request header, afaik, this header does not currently exist and would have no effect, even if browsers allowed such headers to be sent across origins by default. I believe this restriction is by design, as it would otherwise be too easy to inject invalid or confusing data into our Analytics pipelines. Right now, the value is considered trusted and not in need of much validation, which is possible because it only comes from responses, which are naturally trusted afaik.

I'm not sure this is true. wikitech:X-Analytics#Keys lists a few keys where the "origin" is "client". For example, it seems like the pageview header exists so the client can tell the backend to count the request as a pageview; that wouldn't make much sense being set on a response.

I think the issue here is just that we haven't opted into allowing custom headers on cross-origin requests, which makes sense since Wikipedia Preview is our first product intended to be embedded on third-party sites.
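For context on the cross-origin point: under CORS, a request header outside the small safelisted set triggers a preflight, and the server has to name that header in Access-Control-Allow-Headers before the browser sends the real request. The Python sketch below models only that check. The function names are hypothetical and the rules are simplified (the real safelist also restricts the values of Content-Type and Range, and the `*` wildcard only applies to requests without credentials).

```python
from typing import Iterable, List, Mapping

# CORS-safelisted request header names (value restrictions are ignored here).
SAFELISTED = {"accept", "accept-language", "content-language", "content-type", "range"}


def headers_requiring_preflight(headers: Mapping[str, str]) -> List[str]:
    """Names of request headers that would force a CORS preflight."""
    return sorted(name for name in headers if name.lower() not in SAFELISTED)


def preflight_allows(requested: Iterable[str], allow_headers: str) -> bool:
    """True if an Access-Control-Allow-Headers value covers every requested name."""
    allowed = {item.strip().lower() for item in allow_headers.split(",") if item.strip()}
    return "*" in allowed or all(name.lower() in allowed for name in requested)


# The header value is the one proposed in the task description.
custom = headers_requiring_preflight(
    {"Accept": "application/json", "X-Analytics": "wikipedia-preview-is-touch=1"}
)
# The allow-list below is a made-up server response, not Wikimedia's actual one.
blocked = not preflight_allows(custom, "Content-Type, Accept-Language")
```

A query parameter such as wprov avoids this entirely, because it does not add a request header.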

In T297172#7554717, @nshahquinn-wmf wrote:
What do you think about changing the wppw1 to wppw1t if it's a touch device?

Consider it done.
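The agreed change can be sketched as follows. This is an illustrative Python model, not the Wikipedia Preview code (which runs in the browser): the function names are hypothetical, the summary endpoint path is an assumption, and touch detection is reduced to a boolean supplied by the caller.

```python
from urllib.parse import quote, urlencode

WPROV_DEFAULT = "wppw1"  # existing value mentioned in this task
WPROV_TOUCH = "wppw1t"   # proposed value for touch devices


def wprov_for(is_touch: bool) -> str:
    """Pick the wprov value based on whether the device is treated as touch."""
    return WPROV_TOUCH if is_touch else WPROV_DEFAULT


def summary_url(lang: str, title: str, is_touch: bool) -> str:
    """Build a page-summary request URL tagged with the appropriate wprov.

    The REST path used here is an assumption for illustration.
    """
    path = quote(title.replace(" ", "_"), safe="")
    query = urlencode({"wprov": wprov_for(is_touch)})
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{path}?{query}"


touch_request = summary_url("en", "Ivory", is_touch=True)
desktop_request = summary_url("en", "Ivory", is_touch=False)
```

Because the same value can be appended to the link that opens the article, previews and pageviews can be split by device type in the same way.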

As this goes into code review and deployment, you can already update your query. I've done 3 API requests just now and have followed the link to Wikipedia on 2 of those previews. Let me know if you see that data.

hueitan subscribed.

PR merged.
Should we update the WordPress plugin too in the same ticket?

@SBisson thanks for this! The pull request looks good to me.

I checked the data and actually didn't find your requests there. During that hour, I only see 9 requests with the new parameter (wppw1t), all of them (1) to the API, (2) from a single Microsoft Azure IP address, and (3) about 6 minutes after your comment. That seems like the CI on the pull request.

If I look at the requests with the old parameter (wppw1), I see only one that matches your location during that hour: an API request for "Timeline of the far future" on the English Wikipedia.

I'm not really sure what's going on.

@nshahquinn-wmf I just triggered 4 requests at 2pm EST for the Ivory article on enwiki: 2 API requests (with and without the "t") and 2 views of the article (with and without the "t"). So all combinations should be there. They all should have "https://wikimedia.github.io/" as the referrer.

Let me know what you see. Still waiting for your cue for releasing.

nshahquinn-wmf renamed this task from Mark mobile Wikipedia Preview page summary requests using the X-Analytics header to Mark mobile Wikipedia Preview page summary requests using wprov parameter. Dec 8 2021, 7:49 PM

@SBisson I can't see the data for that hour yet since there's a delay of a couple hours in refining it. I should be able to do the check after another hour.

@SBisson I don't see any requests with that referer, those articles, or the new parameter when I search from 18:00 to 20:00 UTC today (13:00 to 15:00 EST). I do see 46 other requests with the old parameter, though. I'm not sure what's going on.

FWIW, here's the query:

SELECT *
FROM wmf.webrequest
WHERE
    x_analytics_map["wprov"] IN ('wppw1', 'wppw1t')
    AND webrequest_source = "text"
    AND year = 2021
    AND month = 12
    AND day = 8
    AND hour IN (18, 19)

I just tried broadening my search to all requests for the summary of the ivory page during those two hours (with or without a wprov parameter). I still don't see anything that looks like you. Maybe your browser had cached those requests?

I just repeated the same 4 requests at 1:30 pm EST with cache definitely disabled.

I wasn't able to find those requests, but I was finally able to find a duplicate set of requests that Stephane made on Friday. Because of how weird this has been, I'm planning to repeat the test with another engineer this week just to confirm that everything is working before we release this new code.

As I understand it, X-Analytics is an HTTP response header, not an HTTP request header. […]

I'm not sure this is true. wikitech:X-Analytics#Keys lists a few keys where the "origin" is "client". For example, it seems like […] the client can tell the backend to count the request as a pageview; that wouldn't make much sense being set on a response.

Interesting, so there is a bit of nuance here. I based my comment on the code, not the documentation. The documentation indeed mentions two attributes (preview=1 and pageview=1) as being for "clients". Looking at the Varnish code again, I overlooked one conditional branch where we do indeed inspect X-Analytics as a request header. However, this inspection is very specific to those two attributes. It is not a general evaluation of the entire header value, and no other attribute keys or attribute values of X-Analytics can be set this way. Only exactly pageview=1 and preview=1 are looked for, extracted, copied out, and appended to the authorized server-side response header. (Source code).

To check whether these are in active use, I used Codesearch to look for them in all Git repos, and it looks like preview=1 is no longer in use. It is still mentioned in the mediawiki/extensions/Popups.git repository (Source code) but only in tests and unused code for the old TextExtract API, not for the REST API that we use in production. It is also mentioned in the analytics/refinery.git repository, but it seems ineffective there: the only thing it does is reject the request as a pageview, whereas the TextExtract API URL pattern already wouldn't qualify as a pageview, since it is rejected both for containing "api.php" and for not carrying any of the query parameters needed to know which page is being viewed.

The pageview=1 attribute is working and in use (Source code).

In short: It is an HTTP response header, and most attributes like title, special, wprov, etc cannot be set client-side. There is an exemption specifically for "preview=1" (unused, appears ineffective) and "pageview=1" (in use).
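The exemption described above amounts to an allowlist of exact key=value pairs. The Python sketch below models that behaviour only. It is not the Varnish code, the function names are hypothetical, and it assumes the header value is a semicolon-separated list of key=value pairs.

```python
from typing import List, Tuple

# Only these exact pairs are honoured when sent by a client (per the summary above).
ALLOWED_CLIENT_PAIRS = {("pageview", "1"), ("preview", "1")}


def parse_x_analytics(value: str) -> List[Tuple[str, str]]:
    """Split 'k1=v1;k2=v2' into (key, value) pairs, skipping malformed parts."""
    pairs = []
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep and key:
            pairs.append((key, val))
    return pairs


def client_pairs_to_copy(request_header: str) -> str:
    """Return the part of a client-sent X-Analytics value that would be kept."""
    kept = [
        f"{key}={val}"
        for key, val in parse_x_analytics(request_header)
        if (key, val) in ALLOWED_CLIENT_PAIRS
    ]
    return ";".join(kept)


# A client-sent wprov or touch flag is dropped; only pageview=1 survives.
kept = client_pairs_to_copy("pageview=1;wprov=wppw1t;wikipedia-preview-is-touch=1")
```

Under this model, the header proposed in the task description would have been ignored even if the browser had sent it.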

nshahquinn-wmf moved this task from QA to Dev on the Inuka-Team (Kanban) board.

I triggered another series of requests myself, and I was able to find them in webrequest without a problem. I'm now comfortable saying this is consistently working, so this will be resolved as soon as one of Inuka's engineers releases the code.

In short: It is an HTTP response header, and most attributes like title, special, wprov, etc cannot be set client-side. There is an exemption specifically for "preview=1" (unused, appears ineffective) and "pageview=1" (in use).

Thanks for digging into this and writing a detailed explanation! I've made a couple of edits to the Wikitech page to capture some of this information.

@nshahquinn-wmf this was released today. I'll let you resolve when you judge appropriate.

Since the release, our webrequest data does show some requests with the new wppw1t parameter coming from third-party sites:

time           wppw1t_requests
2022-01-04     9
2022-01-05     2
2022-01-06     3
2022-01-08    14
2022-01-09     2

Full analysis in this notebook on GitHub.

I'm calling this resolved.