
Search Metrics - Number of user sessions using search
Open, High, Public

Description

See parent task for details.

Although it will be imperfect, the output should report the following ratios:

  1. Number of users having a search session, divided by each of the denominators listed below
  2. Where it can be ascertained, number of search sessions, divided by each of the denominators listed below

Preferred:

  • inferred daily users, based on pageview_actor hashing approach
  • estimated daily unique devices, based on unique_devices_per_domain_daily
  • daily user pageviews

If the queries run successfully:

  • inferred monthly users, based on pageview_actor hashing approach
  • estimated monthly unique devices, based on unique_devices_per_domain_monthly
  • monthly user pageviews
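
The ratios requested above can be sketched as follows. This is only an illustration of the arithmetic: the function name, the dictionary keys, and the numbers in the usage example are all hypothetical, and the real inputs would come from queries against the datasets named in the lists.

```python
def search_ratios(search_users, search_sessions, denominators):
    """Divide the search counts by each available denominator.

    search_users:    number of users having a search session
    search_sessions: number of search sessions, or None if it cannot be ascertained
    denominators:    mapping of denominator name -> count (hypothetical keys)
    """
    ratios = {}
    for name, value in denominators.items():
        if not value:
            # Skip denominators that are missing or zero (e.g. a query that did not run).
            continue
        entry = {"search_users": search_users / value}
        if search_sessions is not None:
            entry["search_sessions"] = search_sessions / value
        ratios[name] = entry
    return ratios


# Made-up numbers, purely to show the shape of the output.
example = search_ratios(
    search_users=50,
    search_sessions=80,
    denominators={
        "inferred_daily_users": 1000,
        "daily_unique_devices": 800,
        "daily_user_pageviews": 4000,
    },
)
```

The same function would apply to the monthly denominators if those queries complete.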

Details

Title: Cirrus metrics calculations
Reference: repos/search-platform/notebooks!4
Author: ebernhardson
Source Branch: search-metrics
Dest Branch: main

Event Timeline

dr0ptp4kt moved this task from needs triage to Current work on the Discovery-Search board.

If we want a very simple count, we currently record a weak fingerprint of the browser, which is basically a hash of the IP address and the username. Because of the way this data is collected, it does not include cached results, which are primarily short autocompletes and related articles. This can be counted over whatever time dimension we want. The downside is that it's not directly comparable to anything. It is an absolute number, and its direction over time would be meaningful, but as a standalone data point it would be hard to say these sessions represent x% of all unique devices.

If we want to relate the metric to unique devices, then we would plausibly need to implement it in the same general way that unique devices is done, so that we are counting the same thing. At a very general level, this is implemented via the WMF-Last-Access cookie and various machinery to always set it to today and calculate the unique devices from the resulting web request logs. We have some of the same caching concerns as MediaWiki: we can't simply emit these cookies from the backend and expect everything to work correctly. We can likely follow the lead of everything that was put in place for WMF-Last-Access, but we might want to check with the relevant parties (Varnish SREs, the unique devices metric owner) that extending this implementation is sensible.
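
As a rough sketch of the last-access idea described above: if the cookie is always set to today's date, a request that arrives without the cookie, or with a date earlier than today, is the first one seen from that device today. The record layout (`date`, `last_access`) and the function name are hypothetical, and this is not the production unique devices computation. In particular, it does not handle clients that never send cookies back, which this naive count would include on every request.

```python
from datetime import date


def count_first_requests(requests, day):
    """Count requests on `day` that look like a device's first request of that day.

    Each request is a dict with (hypothetical) keys:
      "date":        the date the request was made
      "last_access": the date carried by the last-access cookie, or None if absent
    """
    count = 0
    for req in requests:
        if req["date"] != day:
            continue  # only consider requests made on the day being measured
        last_access = req.get("last_access")
        if last_access is None or last_access < day:
            # No cookie, or a cookie from an earlier day: first request today.
            count += 1
    return count


day = date(2024, 3, 1)
sample = [
    {"date": day, "last_access": None},               # new device: counted
    {"date": day, "last_access": date(2024, 2, 28)},  # returning device: counted
    {"date": day, "last_access": day},                # already seen today: not counted
]
first_requests_today = count_first_requests(sample, day)
```

A search-specific variant would need its own cookie, set only on search requests.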

Another related question is whether we should distinguish by endpoint. Separating, for example, autocomplete, full-text, and related articles seems sensible. At a minimum, related articles should be separated into a different group. A difficulty there is that the existing mechanism used for unique devices requires a unique cookie for each case to be tracked. I'm doubtful we want to add many new cookies for the tracking use case.

Adam suggested taking an easier way out and using the actor_signature definition of a unique device. This hashes together a couple of values in the web request to create a fingerprint. The absolute number won't really be comparable to the overall unique devices metric, but we can calculate a % of actor_signatures and assume that it's in the same ballpark.
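
To illustrate the fingerprinting idea, here is a minimal sketch that hashes a few request values into one signature. The choice of fields, the separator, and the hash function are assumptions made for the example, not the actual actor_signature definition.

```python
import hashlib


def actor_signature(ip, user_agent, accept_language, uri_host):
    """Hash a few request values into a single fingerprint string.

    The fields used here are assumptions for illustration only.
    """
    # Join with a separator that is unlikely to appear in the values, so that
    # ("ab", "c") and ("a", "bc") do not produce the same input string.
    raw = "\x1f".join([ip, user_agent, accept_language, uri_host])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


sig = actor_signature("203.0.113.7", "Mozilla/5.0", "en-US", "en.wikipedia.org")
```

Requests that share all of the hashed values collapse into one actor, so the count is an approximation of devices rather than an exact one.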

I slightly redefined the metric to make it easy to calculate. The metric calculated is the number of actors performing at least 1 page view attributed to a search clickthrough. This excludes actors that search but never click through. That is perhaps a slight undercount, but it still seems like a useful and reasonable metric. This allowed reusing the classifications in T358351 for detecting whether a page view is attributed to a search, making this a simple problem of grouping over the actor signature and flagging whether any of their page views came from search. There is now also a plausible Jupyter notebook for calculating the stats; it can be found on stat1007 in my home dir, prefixed with the ticket number. It only calculates a single day. I will return and calculate the 90-day sample after we have nailed down the remaining metrics.
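
The grouping described above can be sketched in plain Python. The record keys (`actor_signature`, `from_search`) and the function name are hypothetical; the notebook itself may be structured quite differently.

```python
from collections import defaultdict


def search_actor_share(pageviews):
    """Group page views by actor and flag actors with any search-attributed view.

    Each page view is a dict with (hypothetical) keys:
      "actor_signature": fingerprint of the actor
      "from_search":     True if the page view is attributed to a search clickthrough
    Returns (actors_using_search, total_actors, share).
    """
    used_search = defaultdict(bool)
    for pv in pageviews:
        # An actor is flagged once any one of its page views came from search.
        used_search[pv["actor_signature"]] |= pv["from_search"]
    total = len(used_search)
    searchers = sum(used_search.values())
    share = searchers / total if total else 0.0
    return searchers, total, share


sample = [
    {"actor_signature": "a", "from_search": False},
    {"actor_signature": "a", "from_search": True},
    {"actor_signature": "b", "from_search": False},
    {"actor_signature": "c", "from_search": True},
]
searchers, total, share = search_actor_share(sample)
```

In this made-up sample, actors "a" and "c" are flagged, so 2 of 3 actors used search. Actors that searched but never clicked through would not appear as searchers.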

Updated AC to say daily where it incorrectly said monthly within the Preferred section. It already said "estimated daily unique devices" so was hopefully sufficiently clear, but still. Sorry!

Four tickets were combined into a single ticket with two calculations, which can be found in the patch above:

  • T358349 - number of searches
  • T358350 - successful searches
  • T358351 - read traffic generated by search
  • T358352 - number of user sessions using search