Distinguish Wikipedia Preview content requests and clickthroughs in the webrequest logs
Closed, Resolved (Public)

Description

We want to tag the following types of Wikipedia Preview webrequests in the logs, so they can be counted by an Oozie job:

  1. API requests for preview content, triggered when a user hovers or taps on a Wikipedia Preview link
  2. Regular requests for a Wikipedia article, triggered when a user clicks the link in the preview

The tag must not be applied to other types of webrequests generated by Wikipedia Preview. Ideally, the tag should be the same for both types of requests, but this isn't strictly necessary.

Options include:

  1. Adding a value to the X-Analytics header, which contains key-value pairs formatted as in this example: ns=-1;special=Userlogin;WMF-Last-Access=09-May-2017;WMF-Last-Access-Global=09-May-2017;https=1. These key-value pairs are parsed into a map field in webrequest, so they'll be easy to access. However, we cannot change headers for requests generated by links.
  2. Adding a query parameter to the URL. This is not as easy to access; webrequest copies the query portion of the URL into a dedicated field, but doesn't parse it. This is the preferred option. We will try it first and reconsider if it doesn't work.

We will also need to be able to detect which partner site a request is coming from. The options are:

  1. Relying on the referrer field. This will be set automatically, but it would create difficulties if some partners have content on multiple domains or subdomains. The referrer field isn't specially parsed. This is the preferred option. We will try it first and reconsider if it doesn't work.
  2. Using the tag (whether in X-Analytics or the query string) to indicate the partner. This will require a little bit of manual configuration for each partner, but will produce cleaner data.

Event Timeline

AMuigai triaged this task as Medium priority.Sep 9 2020, 11:41 AM
SBisson claimed this task.Sep 10 2020, 2:46 PM
SBisson updated the task description.
SBisson moved this task from Ready for Dev to Dev on the Inuka-Team (Kanban) board.

@nshahquinn-wmf do you have a preference for the special query parameter name and value? It really doesn't matter technically; it can be ?foo=bar, but there might be some standards like ?wp_prov=project1 or something.

@SBisson I checked with my team and it turns out there is actually a standard which will simplify things a lot for us!

If we add wprov=foo to the query string, the Varnish servers will automatically move it into the X-Analytics header, preventing cache fragmentation and making it nicely queryable in the process. Also, that means we should be able to put it directly in the header for the API request and have it end up in the same place, but I don't think anyone else has done that so we should probably check first.

Docs are here: https://wikitech.wikimedia.org/wiki/Provenance

Varnish code for wprov and X-Analytics here: https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/analytics.inc.vcl.erb#L128
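The behavior described in this comment (wprov is taken out of the query string and recorded in X-Analytics) can be illustrated with a short Python sketch. This is only an illustration of the described effect, not a port of the linked VCL; the function name and the exact handling of other parameters are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def move_wprov_to_x_analytics(url: str, x_analytics: str = "") -> tuple[str, str]:
    """Return (url without wprov, X-Analytics value with wprov appended)."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    wprov_values = [value for key, value in params if key == "wprov"]
    remaining = [(key, value) for key, value in params if key != "wprov"]

    # Dropping wprov from the URL means tagged and untagged requests
    # share one cache key, which is what avoids cache fragmentation.
    cleaned_url = urlunsplit(parts._replace(query=urlencode(remaining)))

    if wprov_values:
        tag = "wprov=" + wprov_values[0]
        x_analytics = f"{x_analytics};{tag}" if x_analytics else tag
    return cleaned_url, x_analytics


cleaned, header = move_wprov_to_x_analytics(
    "https://en.wikipedia.org/wiki/Cat?wprov=foo", "https=1"
)
```

Here `cleaned` no longer carries the parameter and `header` gains a `wprov=foo` pair, so the tag becomes available through the parsed X-Analytics map.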

Nuria added a subscriber: Nuria.Sep 14 2020, 6:29 PM

Please be so kind as to consider event solutions that are built for this purpose; we have been working on MEP for years and this is a perfect use case for it. X-Analytics is a solution that long predates the event framework and, to be perfectly honest, we cannot support its usage in cases like this.

Also, that means we should be able to put it directly in the header for the API request and have it end up in the same place, but I don't think anyone else has done that so we should probably check first.

mmm, no, things do not work that way. Please see my comment above, this is a perfect use case for events.

mpopov added a subscriber: mpopov.Sep 15 2020, 2:19 PM

Please be so kind as to consider event solutions that are built for this purpose; we have been working on MEP for years and this is a perfect use case for it.

To reiterate from our chat on IRC, because it doesn't seem like this was clear: an event solution was considered and was indeed the immediate go-to solution for this, but it is currently blocked pending approval from Legal and policy for sending events to the MEP intake service from other websites.

Nuria added a comment.Sep 15 2020, 2:21 PM

approval from Legal and policy for sending events to MEP intake service from other websites.

There seems to be some confusion here; we have been receiving and processing events from other websites for years. It is probably worth bringing this fact to the attention of the legal team and clearing up any misunderstandings.

SBisson closed this task as Declined.Sep 22 2020, 1:29 PM

We won't do that.

See T256398 for the new plan.

SBisson reopened this task as Open.Sep 24 2020, 1:42 PM

Just to confirm: this uses the query parameter (rather than the header) for both the preview fetches and the links, right? That seems like a good approach, since it avoids the uncertainty about how the header would be handled.

That's right. It includes wprov=wppw1 when calling the summary API and linking to the article on Wikipedia.
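A minimal Python sketch of what this amounts to: the same wprov=wppw1 parameter appended to both the summary request and the article link. The helper names and the exact URL shapes are assumptions for illustration (the Wikipedia Preview client itself is not written in Python); only the parameter name and value come from this thread.

```python
from urllib.parse import quote, urlencode

WPROV = "wppw1"  # tag value mentioned in this thread


def summary_url(lang: str, title: str) -> str:
    """URL for the page summary REST endpoint, tagged with wprov."""
    path = quote(title.replace(" ", "_"), safe="")
    query = urlencode({"wprov": WPROV})
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{path}?{query}"


def article_url(lang: str, title: str) -> str:
    """URL for the article link shown in the preview, tagged with wprov."""
    path = quote(title.replace(" ", "_"), safe="")
    query = urlencode({"wprov": WPROV})
    return f"https://{lang}.wikipedia.org/wiki/{path}?{query}"


api_request = summary_url("en", "Albert Einstein")
clickthrough = article_url("en", "Albert Einstein")
```

Using one shared value for both URLs matches the description's preference that the tag be the same for content requests and clickthroughs.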

SBisson moved this task from Code Review to QA on the Inuka-Team (Kanban) board.Sep 29 2020, 2:27 PM
Jpita added a subscriber: Jpita.

Should Neil test this?

The goal of this task is simply to add a query parameter to 2 URLs. You can test it if you want.

Jpita added a comment.Sep 30 2020, 3:32 PM


I see this query parameter but don't see the x-analytics part

The task description says that the x-analytics part is not in scope unless it becomes clear that it is needed because the query parameter approach doesn't work.

AMuigai closed this task as Resolved.Oct 5 2020, 10:51 AM