Maniphest T205746

Cleanup wikidata autocomplete logs
Closed, Declined, Public

Assigned To

None

Authored By

	EBernhardson
	Sep 28 2018, 8:59 PM

Description

While working on the MRR evaluation I noticed a variety of clickthroughs whose queries look unlikely to have been typed by a human. Here are a few example autocomplete queries that led to clickthroughs:

Australian Citizen
Alumni Oxonienses: the Members of the University of Oxford, 1715-1886
Czech International Badminton Championships
Board of Intermediate and Secondary Education

These appear to make up only a few percent of the logs, but they break the assumption that every prefix of the submitted query is one we can usefully improve results for.
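To make the assumption concrete, here is a minimal Python sketch of what "all prefixes of the submitted query" means. The helper name `all_prefixes` is hypothetical and is not taken from the actual evaluation code.

```python
from typing import List


def all_prefixes(query: str) -> List[str]:
    """Return every non-empty prefix of `query`, shortest first.

    Hypothetical helper: it models the assumption that each prefix was a
    query the user actually typed on the way to the submitted one.
    """
    return [query[:i] for i in range(1, len(query) + 1)]


# A typed query yields prefixes the user plausibly saw results for.
print(all_prefixes("Czech"))
```

For a pasted title such as "Czech International Badminton Championships", the same expansion produces one prefix per character, none of which the user necessarily typed or saw results for.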

A few approaches could be taken:

Each displayed search result could be logged. Instead of assuming which prefixes are useful, only use prefixes that were actually displayed in the browser.
We could track previously shown prefixes in the browser and submit them with the click event.
We could track previously shown x-search-id headers in the browser and submit them with the click event.
We could track previously shown prefixes in the browser and use some heuristic to decide whether the click is worth logging.
Probably others.

Related Objects

Status	Assigned	Task
Resolved	debt	T193701 Explore using user clicks data to tune Wikidata search parameters
Resolved	debt	T205111 [EPIC] Transform wikidata autocomplete click logs into a useful dataset
Declined	None	T205746 Cleanup wikidata autocomplete logs

Event Timeline

EBernhardson triaged this task as Medium priority. Sep 28 2018, 8:59 PM

EBernhardson created this task.

EBernhardson updated the task description.

As discussed on IRC, I think the most promising approach is the following:

For each instance of the search box opening, collect the search term and search ID of every search performed. Call the set of all searches performed for a particular search box a search session.
When a click happens, scan through the search session and remove all searches whose term is not a prefix of the current search term. This removes typos and failed searches that the user decided to abandon.
Send the list of search IDs that remain for the current session as a field in the click data.

This will also allow us to distinguish incremental completion from full-text copy-paste - the latter would have only one search ID in the session.
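The session filtering described above could look roughly like the following. This is only an illustrative Python sketch: the real tracking would live in the browser, and the names `SearchSession`, `record`, and `ids_for_click` are hypothetical rather than part of any existing code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SearchSession:
    """Searches performed while one search box is open (hypothetical)."""

    # Each entry is (search term, search ID), in the order issued.
    searches: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, term: str, search_id: str) -> None:
        """Remember one search issued from this search box."""
        self.searches.append((term, search_id))

    def ids_for_click(self, clicked_term: str) -> List[str]:
        """Keep only the IDs of searches whose term is a prefix of the
        term in the box at click time; typos and abandoned searches drop out."""
        return [
            search_id
            for term, search_id in self.searches
            if term and clicked_term.startswith(term)
        ]


session = SearchSession()
for term, sid in [("b", "s1"), ("be", "s2"), ("bex", "s3"),
                  ("ber", "s4"), ("berlin", "s5")]:
    session.record(term, sid)

print(session.ids_for_click("berlin"))  # → ['s1', 's2', 's4', 's5']
```

In this sketch a copy-pasted query would show up as a session with a single search ID, which is the signal mentioned above for telling it apart from incremental completion.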

Smalyshev added a project: User-Smalyshev. Oct 3 2018, 7:26 AM

Smalyshev moved this task from Backlog to Next on the User-Smalyshev board. Oct 3 2018, 8:08 PM

Smalyshev moved this task from Next to Backlog on the User-Smalyshev board. Nov 8 2018, 11:55 PM

Restricted Application added a subscriber: Urbanecm. Nov 8 2018, 11:55 PM

Smalyshev moved this task from Backlog to Next on the User-Smalyshev board. Nov 8 2018, 11:55 PM

Smalyshev moved this task from Next to Backlog on the User-Smalyshev board. Dec 7 2018, 7:28 PM

We ended up successfully transforming these logs into a search model without cleaning them up, simply by filtering out high-volume users. Perhaps this task is unnecessary and can be declined?
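The comment does not say how the high-volume filter was implemented. As a rough Python sketch only, assuming each click event is a dict with a `user` key and that "high volume" means exceeding a chosen per-user event count (both are assumptions, as is the function name), it might look like this:

```python
from collections import Counter
from typing import Dict, Iterable, List


def drop_high_volume_users(
    events: Iterable[Dict[str, str]], max_events_per_user: int
) -> List[Dict[str, str]]:
    """Drop every event from users with more than `max_events_per_user` events.

    The `user` key and the threshold are illustrative assumptions, not the
    fields or cutoff used in the real pipeline.
    """
    events = list(events)
    counts = Counter(event["user"] for event in events)
    return [e for e in events if counts[e["user"]] <= max_events_per_user]


sample = [
    {"user": "a", "query": "Australian Citizen"},
    {"user": "a", "query": "Board of Intermediate and Secondary Education"},
    {"user": "a", "query": "Czech International Badminton Championships"},
    {"user": "b", "query": "berlin"},
]
print(drop_high_volume_users(sample, max_events_per_user=2))
```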

That sounds reasonable. If it seems to be causing a problem in the future, we know where this ticket is.

Per above, this didn't end up blocking anything. If needed it could be done in the future.

EBernhardson moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board. May 6 2019, 4:00 PM
