
Cleanup wikidata autocomplete logs
Closed, Declined (Public)


While working on the MRR evaluation I noticed a variety of clickthroughs whose queries look unlikely to have been typed by a human. Here are a few example autocomplete queries that led to clickthroughs:

  • Australian Citizen
  • Alumni Oxonienses: the Members of the University of Oxford, 1715-1886
  • Czech International Badminton Championships
  • Board of Intermediate and Secondary Education

These look to be only a few percent of the logs, but they break the assumption that every prefix of the submitted query is one we can usefully improve results for.
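To illustrate the assumption being broken, here is a minimal Python sketch of how a submitted query might be expanded into prefixes. The helper name `all_prefixes` is hypothetical and is not taken from the actual log-processing code; it only shows why a pasted query is a problem.

```python
def all_prefixes(query: str) -> list[str]:
    """Return every non-empty prefix of the query, shortest first.

    Hypothetical helper: it assumes the log processing treats each
    prefix of the clicked query as something a user typed.
    """
    return [query[:i] for i in range(1, len(query) + 1)]


# A pasted query still yields one prefix per character, even though
# the user may never have been shown results for any of them.
prefixes = all_prefixes("Australian Citizen")
print(len(prefixes), prefixes[:3])
```

If the query was pasted rather than typed, none of the intermediate prefixes were ever shown autocomplete results, so treating them as training examples would be misleading.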

A few approaches could be taken:

  • Each displayed search result could be logged. Instead of assuming every prefix is useful, only use prefixes that were actually displayed in the browser
  • We could track previously shown prefixes in the browser and submit them with the click event
  • We could track previously shown x-search-id headers in the browser and submit them with the click event
  • We could track previously shown prefixes in the browser, and use some heuristic to decide if the click is worth logging
  • Probably others

Event Timeline

EBernhardson created this task.
EBernhardson updated the task description.

As discussed on IRC, I think the most promising approach is the following:

  • For each instance of the search box opening, collect the search term and search ID of each search. Call the set of all searches performed for a particular search box instance a search session.
  • When the click happens, scan through the search session and remove all searches whose term is not a prefix of the current search term. This removes typos and failed searches that the user decided to abandon.
  • Send the list of search IDs that remain for the current session as a field in the click data.

This will also allow us to distinguish incremental completion from full-text copy-paste: the latter would have only one search ID in the session.
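The steps above could be sketched roughly as follows. This is an illustration in Python rather than the browser-side code that would actually be needed, and the names `SearchSession`, `record`, `ids_for_click` and `looks_like_copy_paste` are hypothetical. It also assumes exact, case-sensitive prefix matching, which the discussion does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class SearchSession:
    """All searches performed during one opening of the search box."""

    # (search term, search ID) pairs in the order they were issued.
    searches: list[tuple[str, str]] = field(default_factory=list)

    def record(self, term: str, search_id: str) -> None:
        """Remember one autocomplete request and its search ID."""
        self.searches.append((term, search_id))

    def ids_for_click(self, final_term: str) -> list[str]:
        """Keep only searches whose term is a prefix of the clicked term.

        Assumption: exact, case-sensitive matching. This drops typos and
        abandoned searches, as described in the steps above.
        """
        return [sid for term, sid in self.searches
                if final_term.startswith(term)]


def looks_like_copy_paste(search_ids: list[str]) -> bool:
    """A single remaining search ID suggests the query was pasted whole."""
    return len(search_ids) == 1


# Usage: the user types "cz", mistypes "czx", corrects to "cze",
# then finishes "czech" and clicks a result.
session = SearchSession()
for term, sid in [("cz", "id1"), ("czx", "id2"), ("cze", "id3"), ("czech", "id4")]:
    session.record(term, sid)

click_ids = session.ids_for_click("czech")
print(click_ids, looks_like_copy_paste(click_ids))
```

The surviving IDs would be sent as a field in the click event. The mistyped "czx" search is dropped because it is not a prefix of the final term.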

We ended up successfully transforming these logs into a search model without cleaning them up, simply by filtering out high-volume users. Perhaps this is unnecessary and can be declined?

That sounds reasonable. If it seems to be causing a problem in the future, we know where this ticket is.

Per above, this didn't end up blocking anything. If needed it could be done in the future.