Search Metrics - Read traffic generated by Search
Open, High, Public

Description

See parent task for details.

Initially, the output should specify the ratios as:

  1. Number of user pageviews associated with Wikimedia search divided by each of the following
  2. Number of user pageviews associated with external search (by way of the referer_data field of webrequest; bonus points for breakout by referer_name when it is external search-based) divided by each of the following
  3. Number of Related Articles fetches via the edge (mobile web only) divided by the following (n.b. this will serve as a good-enough proxy for impressions for now)

Preferred:

  • internally referred daily pageviews
  • inferred monthly users, based on pageview_actor hashing approach
  • inferred daily users, based on pageview_actor hashing approach
  • estimated daily unique devices, based on unique_devices_per_domain_daily
  • total daily user pageviews

And if queries will run successfully:

  • internally referred monthly pageviews
  • estimated monthly unique devices, based on unique_devices_per_domain_monthly
  • total monthly user pageviews
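
The ratio layout described above (each numerator divided by each denominator) can be sketched as follows. This is only an illustration: the function name `search_ratios`, the dictionary keys, and all counts are made up here and do not come from the notebook or from any real query output.

```python
from typing import Dict, Optional


def search_ratios(
    numerators: Dict[str, int],
    denominators: Dict[str, int],
) -> Dict[str, Dict[str, Optional[float]]]:
    """Divide every numerator by every denominator.

    A zero denominator yields None rather than raising, so a day with
    missing data does not abort the whole report.
    """
    return {
        num_name: {
            den_name: (num / den if den else None)
            for den_name, den in denominators.items()
        }
        for num_name, num in numerators.items()
    }


# Hypothetical single-day counts, for illustration only.
numerators = {
    "wikimedia_search_pageviews": 50,
    "external_search_pageviews": 400,
    "related_articles_fetches": 20,
}
denominators = {
    "internally_referred_daily_pageviews": 200,
    "total_daily_user_pageviews": 1000,
}

ratios = search_ratios(numerators, denominators)
for num_name, by_denominator in ratios.items():
    for den_name, value in by_denominator.items():
        print(f"{num_name} / {den_name} = {value}")
```

The same function would apply to the monthly denominators if those queries run successfully; only the input dictionaries change.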

Details

Title | Reference | Author | Source Branch | Dest Branch
Cirrus metrics calculations | repos/search-platform/notebooks!4 | ebernhardson | search-metrics | main
Event Timeline

If we want this to be directly comparable to page views, then I imagine this should be implemented as a classifier against the web requests table. We would miss a few narrow cases with cross-domain search results (sister-search), but I suspect the referrer attached to the page views is sufficient to classify page views as from-search or not.

Started to look into this a bit closer. We will probably need to do custom work for each endpoint we want to classify. To start with:

  • Special:Search. These can perhaps be directly classified by looking for pageviews with wprov=srpw1_x (x >= 0). There might be limitations, but this looks pretty straightforward.
  • Autocomplete. These pass wprov=acrw1_x (x >= -1), but the wprov is only set on the Special:Search request. Due to the UI flow, that request isn't considered a page view, and isn't a referer to the page view either. We could consider having Special:Search proxy the wprov value forward in redirects. Alternatively, we could count the redirects as page views, which is probably correct most of the time, but we might want to do some light analysis to verify that assumption.
  • Related Articles. As far as I can tell, nothing in the webrequest log allows us to tell if a page view came from related articles. The traditional way to address this would be to add a wprov parameter to the related articles links.
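
As a rough illustration of the wprov-based classification described in the list above, here is a minimal sketch. The function name `classify_wprov` and the returned labels are hypothetical, and the sketch assumes the full request URL (or at least its query string) is available as a plain string; the real work would run as a query against the webrequest data rather than in plain Python.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Patterns taken from the discussion above:
#   Special:Search results use wprov=srpw1_x with x >= 0
#   autocomplete uses wprov=acrw1_x with x >= -1
_FULLTEXT = re.compile(r"srpw1_([0-9]+)")
_AUTOCOMPLETE = re.compile(r"acrw1_(-1|[0-9]+)")


def classify_wprov(url: str) -> Optional[str]:
    """Return 'fulltext', 'autocomplete', or None for a request URL.

    The labels are illustrative names, not values used by any
    existing Wikimedia table.
    """
    query = urlsplit(url).query
    for value in parse_qs(query).get("wprov", []):
        if _FULLTEXT.fullmatch(value):
            return "fulltext"
        if _AUTOCOMPLETE.fullmatch(value):
            return "autocomplete"
    return None


print(classify_wprov("https://en.wikipedia.org/wiki/Foo?wprov=srpw1_0"))
print(classify_wprov("https://en.wikipedia.org/w/index.php?search=foo&wprov=acrw1_-1"))
print(classify_wprov("https://en.wikipedia.org/wiki/Foo"))
```

Using `fullmatch` keeps unrelated wprov values (for example, ones that merely contain `srpw1_` as a substring) from being counted as search traffic.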

One caveat of all of these is that they depend on JavaScript running in the browser. In general that will mean we undercount by some percentage, but probably not by enough to be important.

Change #1018766 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/RelatedArticles@master] Add a wprov parameter for measuring impact

https://gerrit.wikimedia.org/r/1018766

Worked through most of this and can compute single-day stats that seem plausible with a notebook (on stat1007, ticket number prefixed to file name in my home dir). Will come back to it once the other metrics are figured out and extend this to calculate 90 days of dailies and offer monthly and ~quarterly numbers over those daily stats. To follow up on the above:

  • Special:Search: wprov looks to work for this on desktop. We aren't seeing any Special:Search traffic on mobile, but I'm sure there is at least some, so there is still a bit of a data problem. Might ignore mobile here as the numbers are certainly going to be small.
  • autocomplete: desktop is counted using the wprov. For mobile we ended up counting by looking at the searchToken query param we attach to the referer of mobile autocomplete clickthroughs.
  • related articles: The patch will need to be reviewed and merged, but the same general analysis will apply and shouldn't be too hard to write. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/RelatedArticles/+/1018766
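
The mobile autocomplete approach mentioned above (looking for a searchToken query parameter on the referer) might look roughly like the sketch below. The helper name `referer_has_search_token`, the row layout, and the example URLs are assumptions made for illustration; the actual notebook may well do this differently.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlsplit


def referer_has_search_token(referer: str) -> bool:
    """True when the referer URL carries a non-empty searchToken parameter.

    parse_qs drops blank values by default, so 'searchToken=' with no
    value is treated as absent.
    """
    query = urlsplit(referer).query
    return bool(parse_qs(query).get("searchToken"))


# Hypothetical pageview rows; field names and values are illustrative.
rows: List[Dict[str, str]] = [
    {
        "access_method": "mobile web",
        "referer": "https://en.m.wikipedia.org/w/index.php?search=foo&searchToken=abc123",
    },
    {
        "access_method": "mobile web",
        "referer": "https://en.m.wikipedia.org/wiki/Main_Page",
    },
    {
        "access_method": "desktop",
        "referer": "https://en.wikipedia.org/w/index.php?search=foo&searchToken=def456",
    },
]

mobile_autocomplete_pageviews = sum(
    1
    for row in rows
    if row["access_method"] == "mobile web"
    and referer_has_search_token(row["referer"])
)
print(mobile_autocomplete_pageviews)
```

Desktop rows are excluded here because, per the notes above, desktop autocomplete is counted through wprov instead.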

@EBernhardson I updated the AC to capture the essence of the IRC discussion and what we went over in Etherpad.

Change #1018766 merged by jenkins-bot:

[mediawiki/extensions/RelatedArticles@master] Add a wprov parameter for measuring impact

https://gerrit.wikimedia.org/r/1018766

@EBernhardson I had duplicated the verbiage "estimated daily unique devices, based on unique_devices_per_domain_monthly" (emphasis on incorrect "monthly" in Preferred section), but have now updated the Preferred section to say "estimated daily unique devices, based on unique_devices_per_domain_daily" to correct this glitch. I think you have this covered already, but just wanted to make sure the edit was obvious.

Four tickets were combined into a single ticket covering two calculations, both found in the patch above:

  • T358349 - number of searches
  • T358350 - successful searches
  • T358351 - read traffic generated by search
  • T358352 - number of user sessions using search