
Explore using user clicks data to tune Wikidata search parameters
Closed, ResolvedPublic

Description

Right now the tuning parameters we use for Wikidata search (both prefix and fulltext) are more or less invented out of thin air. I wonder if we could use some ML (or other) technology with actual user click data to tune those parameters better.

Potential targets:

  • Entity weight parameters (both the satu params and the weights of features on entities). We currently use only incoming link and sitelink counts; maybe we should use more features?
  • Relative weights of various matches - label, alias, description, other language, etc.?
  • For fulltext possibly also more advanced features that we're building with Mjolnir?
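To make the first target concrete, here is a minimal sketch of how a weighted sum of saturated features could produce an entity weight. It assumes the satu curve has the form x^a / (k^a + x^a); the feature names, weights, and k/a values are hypothetical placeholders, not the production settings. These per-feature weights and curve parameters are the kind of values that click data could help tune.

```python
def satu(x: float, k: float, a: float = 1.0) -> float:
    """Saturation curve x^a / (k^a + x^a): 0 at x=0, 0.5 at x=k, tends to 1."""
    if x <= 0:
        return 0.0
    xa = x ** a
    return xa / (k ** a + xa)


# Hypothetical tuning parameters (illustration only, not production values).
PARAMS = {
    "incoming_links": {"weight": 0.6, "k": 100.0, "a": 1.0},
    "sitelinks": {"weight": 0.4, "k": 20.0, "a": 2.0},
}


def entity_weight(features: dict, params: dict = PARAMS) -> float:
    """Weighted sum of saturated feature values; missing features count as 0."""
    return sum(
        p["weight"] * satu(features.get(name, 0.0), p["k"], p["a"])
        for name, p in params.items()
    )


if __name__ == "__main__":
    print(entity_weight({"incoming_links": 100, "sitelinks": 20}))
```

Adding another feature under this scheme would mean adding one more entry to the parameter table, which is why the number of tunable values grows quickly and hand-tuning becomes impractical.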

The first step would be to build a data pipeline that tells us which search result the user chose, especially for prefix search, which is used ~1M times a day.
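As a rough sketch of what such a pipeline would need to compute, the snippet below joins logged result lists with click events to recover the position of the chosen result. The event shapes (`SearchEvent`, `ClickEvent`, a shared `search_id`) are assumptions made for illustration; no such schema is defined in this task.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class SearchEvent:
    """Hypothetical log record: one prefix search and the results shown."""
    search_id: str
    query: str
    results: Tuple[str, ...]  # entity IDs in displayed order


@dataclass(frozen=True)
class ClickEvent:
    """Hypothetical log record: the entity the user selected for a search."""
    search_id: str
    entity_id: str


def clicked_positions(
    searches: Iterable[SearchEvent], clicks: Iterable[ClickEvent]
) -> Dict[str, Optional[int]]:
    """Map each search_id to the 0-based position of the clicked result.

    Searches with no click (or a click on an entity that was not shown)
    map to None.
    """
    clicked = {c.search_id: c.entity_id for c in clicks}
    positions: Dict[str, Optional[int]] = {}
    for s in searches:
        entity = clicked.get(s.search_id)
        positions[s.search_id] = (
            s.results.index(entity) if entity in s.results else None
        )
    return positions
```

The resulting (query, shown results, clicked position) records are the raw material for both evaluation and tuning.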

As this is an exploratory task, suggestions about what else could be done here are welcome.

Event Timeline

Smalyshev created this task.May 3 2018, 3:02 AM
Restricted Application added a project: Discovery-Search.May 3 2018, 3:02 AM
Restricted Application added a subscriber: Aklapper.
debt triaged this task as Medium priority.May 3 2018, 5:17 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
debt added a subscriber: debt.

We'll need to start some data collection first, which will take some time (maybe 3-6 months at first glance).

Lazhar added a subscriber: Lazhar.May 14 2018, 6:19 PM

What click data are you talking about?

  1. Wikipedia's clickstream dumps that could be processed to compute the best candidates for a given search
  2. Wikidata's clickstream? If so, did not know these were captured -- or do you want to perhaps add a capture system?
EBjune added a subscriber: EBjune.May 14 2018, 6:45 PM
  2. Wikidata's clickstream? If so, did not know these were captured -- or do you want to perhaps add a capture system?

@Lazhar, yes, this second option. This is specific to Wikidata and will involve exploring the possibility of doing something similar to what we are already doing on the Wikipedia side with machine learning, with a particular focus on prefix search to begin with.

Vvjjkkii renamed this task from Explore using user clicks data to tune Wikidata search parameters to qqdaaaaaaa.Jul 1 2018, 1:12 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from qqdaaaaaaa to Explore using user clicks data to tune Wikidata search parameters.Jul 1 2018, 7:48 PM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
Addshore moved this task from incoming to monitoring on the Wikidata board.Sep 17 2018, 8:18 AM

Work has started in T205111 to collect Wikidata autocomplete click data and use it to perform offline evaluation of a proposed autocomplete ranker. Being able to evaluate the relative quality of multiple rankers is an essential first step toward tuning the ranker.
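For illustration, one common way to compare rankers offline against click logs is mean reciprocal rank (MRR) of the clicked entity. This is a generic sketch, not necessarily the metric or method used in T205111; the log tuple layout and the toy rankers are hypothetical.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# A ranker takes (query, candidate entity IDs) and returns them reordered.
Ranker = Callable[[str, Sequence[str]], List[str]]

# Hypothetical log record: (query, candidates shown, entity the user clicked).
LogEntry = Tuple[str, Sequence[str], str]


def mean_reciprocal_rank(ranker: Ranker, log: Iterable[LogEntry]) -> float:
    """Average of 1/rank of the clicked entity under the given ranker.

    Entries whose clicked entity is missing from the ranked list score 0.
    """
    total = 0.0
    count = 0
    for query, candidates, clicked in log:
        ranked = ranker(query, candidates)
        count += 1
        if clicked in ranked:
            total += 1.0 / (ranked.index(clicked) + 1)
    return total / count if count else 0.0


def keep_order(query: str, candidates: Sequence[str]) -> List[str]:
    """Toy baseline: return candidates in logged order."""
    return list(candidates)


def reverse_order(query: str, candidates: Sequence[str]) -> List[str]:
    """Toy alternative: return candidates in reverse order."""
    return list(reversed(candidates))
```

Running both rankers over the same log and comparing the scores gives a relative quality signal without exposing users to an untested ranker.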

Explored! A tuned wbsearchentities profile for en will ship soon, with de, fr, and es to follow.

debt closed this task as Resolved.Jan 18 2019, 7:13 PM
debt claimed this task.