Maniphest T193701

Explore using user clicks data to tune Wikidata search parameters
Closed, Resolved, Public
Authored By

	Smalyshev
	May 3 2018, 3:02 AM

Description

Right now the tuning parameters we use for Wikidata search (both prefix and fulltext) are more or less invented out of thin air. I wonder if we could use some ML (or other) technology with actual user click data to tune those parameters better.

Potential targets:

Entity weight parameters (both the satu params and the weights of features on entities). We currently use only incoming link and sitelink counts - maybe we should use more features?
Relative weights of various matches - label, alias, description, other language, etc.?
For fulltext possibly also more advanced features that we're building with Mjolnir?
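To make the first target concrete, here is a minimal sketch of an entity weight built from saturated features. It assumes the satu function has the form x^a / (k^a + x^a). The `SatuParams` class, the feature names, and all numeric values are hypothetical and only illustrate which knobs click data could tune; they are not the production configuration.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SatuParams:
    """Tunable knobs for one feature (hypothetical container)."""
    weight: float  # relative weight of the feature in the final score
    k: float       # feature value at which the saturated score reaches 0.5
    a: float       # steepness of the curve around k


def satu(x: float, k: float, a: float) -> float:
    """Saturation function x^a / (k^a + x^a), mapping [0, inf) into [0, 1)."""
    if x <= 0:
        return 0.0
    xa = x ** a
    return xa / (k ** a + xa)


def entity_weight(features: Mapping[str, float],
                  params: Mapping[str, SatuParams]) -> float:
    """Weighted sum of saturated feature values; missing features count as 0."""
    return sum(
        p.weight * satu(features.get(name, 0.0), p.k, p.a)
        for name, p in params.items()
    )


# Made-up values for illustration only.
EXAMPLE_PARAMS = {
    "incoming_links": SatuParams(weight=0.6, k=50.0, a=2.0),
    "sitelinks": SatuParams(weight=0.4, k=20.0, a=2.0),
}

if __name__ == "__main__":
    print(entity_weight({"incoming_links": 500, "sitelinks": 3}, EXAMPLE_PARAMS))
```

Tuning would then mean searching over `weight`, `k`, and `a` per feature (and adding new features to the mapping) against an objective computed from click data.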

The first step would be to build a data pipeline that tells us which search result the user chose, especially for prefix search, which is used ~1M times a day.
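As a rough sketch of what such a pipeline has to produce, the snippet below joins search events with click events on a shared search id and records the position of the chosen result. The event shape (`action`, `search_id`, `query`, `results`, `entity_id`) is an assumption made for illustration, not the actual event schema.

```python
from typing import Any, Dict, List


def join_clicks(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Pair each search with its first click and the 1-based clicked position.

    Searches without a click are kept with clicked=None so that abandonment
    can be measured as well.
    """
    # Index search events by their id (hypothetical field names throughout).
    searches = {e["search_id"]: e for e in events if e["action"] == "search"}

    # Keep only the first click seen for each known search.
    clicked: Dict[str, str] = {}
    for e in events:
        if e["action"] == "click" and e["search_id"] in searches:
            clicked.setdefault(e["search_id"], e["entity_id"])

    rows = []
    for sid, search in searches.items():
        choice = clicked.get(sid)
        results = search["results"]
        position = results.index(choice) + 1 if choice in results else None
        rows.append({
            "search_id": sid,
            "query": search["query"],
            "clicked": choice,
            "position": position,
        })
    return rows


if __name__ == "__main__":
    sample = [
        {"action": "search", "search_id": "s1", "query": "ber",
         "results": ["Q1", "Q64", "Q3"]},
        {"action": "click", "search_id": "s1", "entity_id": "Q64"},
        {"action": "search", "search_id": "s2", "query": "par",
         "results": ["Q90"]},
    ]
    for row in join_clicks(sample):
        print(row)
```

The resulting rows (query, result list, chosen position) are the kind of dataset that later tuning or evaluation could consume.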

As this is an exploratory task, suggestions about what else could be done here are welcome.

Related Objects

Status	Assigned	Task
Resolved	debt	T193701 Explore using user clicks data to tune Wikidata search parameters
Resolved	Smalyshev	T196186 Collect click data from Wikidata prefix search into event logs
Resolved	debt	T205111 [EPIC] Transform wikidata autocomplete click logs into a useful dataset
Resolved	EBernhardson	T205348 Calculate autocomplete examination probabilities from eventlogging data
Resolved	EBernhardson	T205494 Add autocomplete evaluation via MRR to relforge
Resolved	Smalyshev	T205597 Add X-Search-Id to WikidataCompletionSearchClicks events
Declined	None	T205746 Cleanup wikidata autocomplete logs
Resolved	EBernhardson	T208917 Build pipeline to transform elastic explains into feature vectors and a tf graph
Resolved	EBernhardson	T209402 A/B testing plan for wbsearchentities, context=item
Resolved	EBernhardson	T211033 Analyze wbsearchentities AB test from nov/doc

Event Timeline

Smalyshev created this task. May 3 2018, 3:02 AM

Restricted Application added a project: Discovery-Search. May 3 2018, 3:02 AM

Restricted Application added a subscriber: Aklapper.

We'll need to start some data collection, which will take some time (maybe 3-6 months at first glance).

Smalyshev moved this task from Inbox to Scoring and result ordering on the CirrusSearch board. May 3 2018, 11:13 PM

Smalyshev added a project: Wikimedia-Hackathon-2018. May 5 2018, 6:53 AM

gabriel-wmde subscribed. May 7 2018, 3:04 PM

Multichill moved this task from Backlog to Project on the Wikimedia-Hackathon-2018 board. May 9 2018, 9:29 PM

What click data are you talking about?

Wikipedia's clickstream dumps, which could be processed to compute the best candidates for a given search?
Wikidata's clickstream? If so, I did not know these were captured -- or do you perhaps want to add a capture system?

Wikidata's clickstream? If so, I did not know these were captured -- or do you perhaps want to add a capture system?

@Lazhar, yes, the second option. This is specific to Wikidata and will involve exploring whether we can do something similar to what we already do on the Wikipedia side with machine learning, focusing on prefix search to begin with.

Smalyshev edited projects, added Epic; removed Wikimedia-Hackathon-2018. Jun 1 2018, 6:56 PM

• Vvjjkkii renamed this task from Explore using user clicks data to tune Wikidata search parameters to qqdaaaaaaa. Jul 1 2018, 1:12 AM

• Vvjjkkii raised the priority of this task from Medium to High.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description.

• Vvjjkkii removed a subscriber: Aklapper.

CommunityTechBot renamed this task from qqdaaaaaaa to Explore using user clicks data to tune Wikidata search parameters. Jul 1 2018, 7:48 PM

CommunityTechBot lowered the priority of this task from High to Medium.

CommunityTechBot updated the task description.

CommunityTechBot removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

CommunityTechBot added a subscriber: Aklapper.

debt closed subtask T196186: Collect click data from Wikidata prefix search into event logs as Resolved. Jul 31 2018, 5:54 PM

Liuxinyu970226 subscribed. Aug 14 2018, 10:21 PM

Addshore moved this task from incoming to monitoring on the Wikidata board. Sep 17 2018, 8:18 AM

Work has started in T205111 to collect wikidata autocomplete click data and use the click data to perform offline evaluation of a proposed autocomplete ranker. The ability to evaluate the relative quality of multiple rankers is an essential first step to being able to tune the ranker.
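For illustration, an offline comparison of two rankers against click data could look like the sketch below, using mean reciprocal rank (MRR, the metric named in T205494). The sample format and the toy rankers are hypothetical; this is not the relforge implementation.

```python
from typing import Callable, List, Sequence, Tuple

Ranker = Callable[[str], Sequence[str]]


def reciprocal_rank(ranking: Sequence[str], clicked: str) -> float:
    """Return 1/position of the clicked item, or 0.0 if it was not returned."""
    for position, item in enumerate(ranking, start=1):
        if item == clicked:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(samples: List[Tuple[str, str]], ranker: Ranker) -> float:
    """Average reciprocal rank over (query, clicked_entity) samples."""
    if not samples:
        return 0.0
    total = sum(reciprocal_rank(ranker(query), clicked) for query, clicked in samples)
    return total / len(samples)


if __name__ == "__main__":
    # Hypothetical click samples: the prefix typed and the entity chosen.
    samples = [("ber", "Q64"), ("par", "Q90")]

    # Two toy rankers backed by fixed lookup tables.
    table_a = {"ber": ["Q64", "Q1"], "par": ["Q2", "Q90"]}
    table_b = {"ber": ["Q1", "Q64"], "par": ["Q2", "Q90"]}

    print("ranker A:", mean_reciprocal_rank(samples, lambda q: table_a[q]))
    print("ranker B:", mean_reciprocal_rank(samples, lambda q: table_b[q]))
```

A ranker that places the clicked entity higher on average gets a higher MRR, which gives a single number for comparing candidate parameter sets before running an A/B test.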

EBernhardson moved this task from Up Next to Current work on the Discovery-Search board. Nov 13 2018, 6:30 PM

EBernhardson edited projects, added Discovery-Search (Current work); removed Discovery-Search.

EBernhardson added a subtask: T205111: [EPIC] Transform wikidata autocomplete click logs into a useful dataset. Nov 13 2018, 6:32 PM

EBernhardson added a subtask: T209402: A/B testing plan for wbsearchentities, context=item.

debt closed subtask T211033: Analyze wbsearchentities AB test from nov/doc as Resolved. Dec 10 2018, 9:09 PM

debt closed subtask T209402: A/B testing plan for wbsearchentities, context=item as Resolved. Dec 10 2018, 9:18 PM