Re-examine how internal search referrals are handled by Clickstream
Open, Medium, Public

Description

Goal

My ideal outcome: every pageview of an article that comes from an internal Wikimedia search is labeled with something like internal-search in the clickstream dataset.
For more background, see parent task: T289532

Current Situation

My current understanding of search referrals on Wikipedia is that they come from two places (both of which I'd like to be able to identify and aggregate into internal-search):

  • Special:Search
    • This seems to show up as a referer of the form ...wikipedia.org/w/index.php?search=<keyword>&title=Special:Search... with a referer_class of internal, so it is filtered out here and labeled as other-internal.
    • It's not clear to me whether it's fair to assume that all other-internal traffic is Special:Search, or whether that label also covers other entry points.
    • It also seems that Special:Search adds a wprov parameter, though it's unclear to me whether that's permanent and can be depended on to identify search referrals.
  • the search box on any article:
    • The referer will match the page where the search was made. If that page is more or less random, I assume all of these referrals get filtered out for not meeting the minimum count of 10 referrals needed to show up in the dataset.
    • I don't see any wprov parameters associated with this.
    • This also generates a call to the search API, though I'm not sure that's helpful at all: the call presumably doesn't show up in the wmf.pageview_actor table that is used for generating the clickstream data.
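
To make the two signals above concrete, here is a minimal Python sketch of how a referer could be checked for the Special:Search pattern, and how a URL could be checked for a wprov parameter. The function names are hypothetical and not part of the clickstream code; the host check assumes Wikipedia domains only, and the wprov check only tests for the parameter's presence, since which values it takes is one of the unknowns here.

```python
from urllib.parse import urlsplit, parse_qs


def is_special_search_referer(referer: str) -> bool:
    """Heuristic (assumption): referer looks like a Special:Search results page."""
    parts = urlsplit(referer)
    host = parts.hostname or ""
    # Assumption: only Wikipedia hosts are in scope for this check.
    if not (host == "wikipedia.org" or host.endswith(".wikipedia.org")):
        return False
    query = parse_qs(parts.query, keep_blank_values=True)
    # Matches ...?search=<keyword>&title=Special:Search... as described above.
    return "search" in query and "Special:Search" in query.get("title", [])


def has_wprov(url: str) -> bool:
    """True if the URL carries a wprov query parameter, whatever its value."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return "wprov" in query


# Example URLs are made up for illustration; "example" is a placeholder value.
print(is_special_search_referer(
    "https://en.wikipedia.org/w/index.php?search=cats&title=Special:Search"))
print(is_special_search_referer("https://en.wikipedia.org/wiki/Cat"))
print(has_wprov("https://en.wikipedia.org/wiki/Cat?wprov=example"))
```

This only covers the URL form quoted above. Localized aliases of the special page name, or other URL shapes for the same page, would not match and would need separate handling.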

Open questions:

  • What's the balance of Special:Search vs. search box -- i.e. can a basic solution only include one of these or does it really need both to be accurate?
  • Is that a complete list of internal search points on Wikipedia? Are there others that we should be aware of if we expand to non-Wikipedia wikis such as Wikitech, Metawiki, or Wikidata?
  • What solutions could we have for identifying these search referrals? Expanded/permanent usage of the wprov parameter? A full new subquery that aggregates all internal search traffic? Something else?
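
As a rough illustration of the "aggregate all internal search traffic" option, the toy Python sketch below shows why search-box referrals may vanish under the minimum count of 10, and how relabeling them before counting would keep them. The row layout and the is_search flag are assumptions made for illustration (for example, a flag derived from a wprov parameter if that turns out to be dependable); this is not the actual clickstream query.

```python
from collections import Counter

MIN_COUNT = 10  # minimum referral count mentioned in the description


def aggregate(rows, relabel_search):
    """Count (prev, curr) pairs and drop pairs below MIN_COUNT.

    rows: iterable of (prev, curr, is_search) tuples; is_search is a
    hypothetical flag marking pageviews that came from an internal search.
    """
    counts = Counter()
    for prev, curr, is_search in rows:
        if relabel_search and is_search:
            # Collapse every search origin into one label before counting.
            prev = "internal-search"
        counts[(prev, curr)] += 1
    return {pair: n for pair, n in counts.items() if n >= MIN_COUNT}


# Toy data: 12 searches for "Cat", each made from a different article.
rows = [(f"Article_{i}", "Cat", True) for i in range(12)]

print(aggregate(rows, relabel_search=False))
print(aggregate(rows, relabel_search=True))
```

Without relabeling, each (article, Cat) pair has a count of 1 and is dropped. With relabeling, the 12 referrals collapse into a single (internal-search, Cat) pair that clears the threshold.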

Resources:

Event Timeline

odimitrijevic moved this task from Incoming (new tickets) to Datasets on the Data-Engineering board.

I'm going to remove this task from the Backlog lane of the Research board given that there is no task for Research here, yet. Please reach out to us with a subtask and add Research back. We would be happy to look into prioritizing supporting you at that point.