
[EPIC] Transform wikidata autocomplete click logs into a useful dataset
Closed, ResolvedPublic

Description

We have started to collect click logs for wikidata item autocomplete, but are not 100% sure how to use them. We will need to review the relevant autocomplete literature to see how this has been tackled before. There are two main goals for this data:

  • Perform offline evaluation of an autocomplete algorithm, to be compared against the current production ranker
  • Serve as input data to a learning algorithm: potentially the TensorFlow-based Elasticsearch query optimizer, but potentially also approaches revealed through the literature review.

Once the literature review is complete and multiple options are identified, we will build one or more of the systems. For this epic, the offline evaluation is to be fully implemented, and the input data for the learning algorithm should be generated (without necessarily implementing the full learning pipeline).

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 21 2018, 3:52 PM

This is a nice survey of the query autocomplete literature (circa 2016): https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/cai-survey-2016.pdf

The survey has good coverage of the main approaches to autocomplete and the latest papers in each area, plus a strong section on autocomplete evaluation.

Useful references:

A Two-Dimensional Click Model for Query Auto-Completion: https://www.cs.virginia.edu/~hw5x/paper/Li_fp084.pdf

  • Compares against DBN and a very simple method called "most popular completion" (MPC). MPC performs very well in their comparison (while not the best) and seems like a good baseline candidate

User Model-based Metrics for Offline Query Suggestion Evaluation: http://terrierteam.dcs.gla.ac.uk/publications/kharitonov-sigir2013.pdf

  • The traditional autocomplete metric, MRR, doesn't try to model user satisfaction. The metrics suggested here take that into account

Using Interaction Data for Improving the Offline and Online Evaluation of Search Engines: http://theses.gla.ac.uk/7750/1/2016KharitonovPhD.pdf

  • PhD thesis from the author of the user model-based metrics paper
  • Too exhaustive to read in full, but looks to be a good breakdown of building an autocomplete system

I spent most of Friday digging through "User Model-based Metrics for Offline Query Suggestion Evaluation". The two metrics provided, eSaved and pSaved, are not too dissimilar from MRR. The simpler of the two is pSaved: the metric is the sum of P(S_ij = 1) across i and j, where i is the number of letters typed and j is the position of the suggestion. P(S_ij = 1) is defined relatively simply: the user is satisfied if the correct result is provided at that position and the user examines it. The paper gives a relatively simple algorithm for looking over your interaction logs and calculating the probability of examination. eSaved is a modification of pSaved that further accounts for how many characters the user is saved from typing.
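Roughly, pSaved for a ranker under test could be computed along these lines. This is only a sketch, not the paper's exact estimator: it assumes `sessions` is a list of (typed query, selected item) pairs from our click logs, `ranker(prefix)` returns an ordered list of item ids, and `p_examine[(i, j)]` holds examination probabilities for (prefix length, position) estimated from interaction logs as the paper describes.

```
# Sketch of pSaved; names and inputs are assumptions, not the paper's exact estimator.
def p_saved(sessions, ranker, p_examine, max_positions=7):
    total = 0.0
    for typed_query, selected_item in sessions:
        satisfied = 0.0  # probability the user has been satisfied so far in this session
        for i in range(1, len(typed_query) + 1):  # i = number of characters typed
            suggestions = ranker(typed_query[:i])[:max_positions]
            if selected_item in suggestions:
                j = suggestions.index(selected_item) + 1  # 1-based position
                # P(S_ij = 1): target shown at (i, j) and examined there,
                # given the user was not already satisfied at an earlier step.
                # This only approximates the paper's sequential user model.
                satisfied += (1.0 - satisfied) * p_examine.get((i, j), 0.0)
        total += satisfied
    return total / len(sessions)
```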

The existing data we collect will allow us to generate these metrics relatively easily for a ranking algorithm under test, but it does not provide enough data to calculate the metric directly from our historical data. To calculate this for historical data we will need to augment event logging to either log each query/result set in an autocomplete session, or to submit with the final click the list of prefixes that included the clicked result and at which positions it was displayed (and not examined, or at least not selected).
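To make the second option concrete, the augmented click event might look something like the following. Field names are purely illustrative, not an actual EventLogging schema:

```
# Hypothetical shape of an augmented click event: alongside the final click we'd
# submit, for every prefix typed during the session, whether the clicked item was
# shown and at which position.
example_click_event = {
    "searchSessionId": "abc123",
    "clickedItem": "Q62",
    "typedQuery": "san franc",
    "prefixImpressions": [
        {"prefixLength": 4, "position": None},  # item not shown yet
        {"prefixLength": 5, "position": 3},     # shown at position 3, not clicked
        {"prefixLength": 9, "position": 1},     # shown at position 1, clicked
    ],
}
```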

Other metrics that I've seen used for autocomplete in my review:

  • Session abandonment (typed something, but no final item selected).
  • Zero results rate (perhaps even more important for item completion than query completion)
  • Basic stats about the distribution of the length of prefix typed / number of characters saved (a rough sketch follows at the end of this comment).

Some of these cannot be directly measured with our current event logging.

  • We only log when a user successfully finds something, so we can't calculate session abandonment or zero results rate. Zero results might be estimable from our backend logs.
  • Additionally, because we only log successes, we also don't know anything about reformulation. There are currently no plans to do anything with reformulation, so it may not be necessary yet.
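For the prefix-length / characters-saved bullet above, a minimal sketch of the basic stats. Field names are hypothetical, and it benchmarks characters saved against the label text the user actually selected (items can have several labels, which is discussed further below):

```
from collections import Counter

def prefix_stats(clicks):
    # clicks: iterable of dicts with the text the user typed and the label of
    # the result they selected (hypothetical field names)
    prefix_lengths = Counter()
    chars_saved = Counter()
    for event in clicks:
        typed = event["typedQuery"]
        selected_label = event["selectedLabel"]
        prefix_lengths[len(typed)] += 1
        # characters the user did not have to type, relative to the selected label
        chars_saved[max(len(selected_label) - len(typed), 0)] += 1
    return prefix_lengths, chars_saved
```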

Session abandonment (typed something, but no final item selected).

That would be nice to track somewhere, e.g. dashboard.

Zero results rate

Also interesting, though as we know from past experience we shouldn't panic if it is substantial since some bots use it a lot and failure is expected (e.g. when checking for strings that shouldn't be there).

Basic stats about distribution of the length of prefix typed / number of characters saved

That's an interesting metric, though since each item has several names, I wonder if we should benchmark against the name that was selected (we currently don't collect it but we probably could), even though it might not be optimal. I.e. if there are labels like "San Francisco" and "San Francisco, California", either can be selected as a match for highlighting, I assume, but if we rely on that as "characters saved" we could get wildly different results.

we can't calculate session abandonment or zero results rate

We could see searches that have zero results (we don't track them now IIRC but we could), sessions probably a bit trickier but possible.

we also don't know anything about reformulation

That would be hard, it's difficult to know whether somebody "reformulated" or just decided to search for another entity and whether these two searches were connected at all or just coincided in time. I think we should ignore reformulation for now.

EBernhardson added a comment. · Edited · Sep 24 2018, 10:15 PM

In addition to this, see T205348 for updating eventlogging to collect enough information to calculate examination probabilities.

Session abandonment (typed something, but no final item selected).

That would be nice to track somewhere, e.g. dashboard.

Zero results rate

Also interesting, though as we know from past experience we shouldn't panic if it is substantial since some bots use it a lot and failure is expected (e.g. when checking for strings that shouldn't be there).

I'm optimistic that we can do a better job of bot filtering for the autocomplete case, although doing so might be tricky. In particular, if we look at the number of result sets displayed in the browser compared to the prefix length, we should be able to differentiate "looking at results as you type" from "pasted in a full query". Bots and users pasting something into the box are probably equivalent in our case.
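A rough sketch of that "typed vs. pasted" heuristic, with hypothetical field names and a placeholder threshold: someone typing character by character generates roughly one result set per keystroke, while a paste (or a bot) produces very few result sets relative to the prefix length.

```
def looks_pasted(session, min_ratio=0.3):
    # session: dict with the number of result sets the browser rendered and the
    # final prefix length (hypothetical field names, not the current schema)
    prefix_len = session["finalPrefixLength"]
    result_sets = session["resultSetsDisplayed"]
    if prefix_len == 0:
        return False
    return (result_sets / prefix_len) < min_ratio
```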

Basic stats about distribution of the length of prefix typed / number of characters saved

That's an interesting metric, though since each item has several names, I wonder if we should benchmark against the name that was selected (we currently don't collect it but we probably could), even though it might not be optimal. I.e. if there are labels like "San Francisco" and "San Francisco, California", either can be selected as a match for highlighting, I assume, but if we rely on that as "characters saved" we could get wildly different results.

Multiple names for the same thing do complicate it a bit. From the literature I've seen so far, they look explicitly at the final query selected. Of course in that literature they also don't have multiple options, the user is auto completing a query and not an item from the database. In our case the name displayed should be chosen by the highlighter. I think we should keep it simple and calculate based on the text the user selected, but keep in mind that if tests affect highlighting, the characters-saved metrics will be less meaningful.

we can't calculate session abandonment or zero results rate

We could see searches that have zero results (we don't track them now IIRC but we could), sessions probably a bit trickier but possible.

We might also need to define session somewhere. For the Search Satisfaction schema we define a session as all the queries from the start of tracking until no queries have been issued for 10 minutes. That might be useful for calculating some per-user metrics without tracking users over time, but it's a very different session from what the autocomplete literature considers a session. In autocomplete, a session is generally described as everything that happens between entering an autocomplete box and selecting an item. With this definition of session I think we could work something up, but there will probably be a number of edge cases I'm not thinking of.

we also don't know anything about reformulation

That would be hard, it's difficult to know whether somebody "reformulated" or just decided to search for another entity and whether these two searches were connected at all or just coincided in time. I think we should ignore reformulation for now.

I agree, not worth worrying about yet.

Of course in that literature they also don't have multiple options, the user is auto completing a query and not an item from the database. In our case the name displayed should be chosen by the highlighter,

I think there's a bit of a difference here. Namely, when we talk about a QAC model, we suppose that the user intended to type query A, but QAC let them type only prefix(A) and filled in the rest. However, for item completion, the user may have never intended to type the whole name: e.g. see something like Q52754781 - "Heat-stress survival in the pre-adult stage of the life cycle in an intercontinental set of recombinant inbred lines of Drosophila melanogaster.". Nobody would ever type such a name in full (at least if we exclude bots). Of course, this is a corner case, but it may be true for non-corner cases too. Moreover, since the goal of the user is to find an item and not a particular name, it may be that they'd type different strings to find the same item.

For display, we actually return two things - the actual item label (in the current language) and whatever the highlighter finds (which may or may not be in the current language). Not sure how this fits the models.

We might also need to define session somewhere. For the Search Satisfaction schema we define a session as all the queries from the start of tracking until no queries have been issued for 10 minutes.

I think time-based won't work very well here, as in 10 minutes an active editor would likely search for dozens of items, with little relationship between them. However, we could bind into a session a sequence of searches that happens within the same open selection box (i.e. from the time completion started until the box is closed, either by means of a successful click or by abandonment). If we then consider sessions that lead to abandonment, we might get something useful. This may also help us filter out bots: bots don't use selection boxes, so they won't have any session info.
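A sketch of grouping raw events into sessions in that sense, i.e. everything from the time the completion box is opened until it is closed by a successful click or by abandonment. The event shape and action names are hypothetical, and edge cases (page unloads, multiple boxes on one page) are ignored.

```
def sessionize(events):
    """events: a time-ordered list of dicts with an 'action' key
    ('open', 'results', 'click', 'close')."""
    sessions, current = [], []
    for event in events:
        if event["action"] == "open" and current:
            # box reopened without a click: the previous session was abandoned
            sessions.append({"events": current, "abandoned": True})
            current = []
        current.append(event)
        if event["action"] in ("click", "close"):
            sessions.append({"events": current,
                             "abandoned": event["action"] != "click"})
            current = []
    if current:  # trailing events with no close: treat as abandoned
        sessions.append({"events": current, "abandoned": True})
    return sessions
```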

dcausse added a comment. · Edited · Sep 25 2018, 9:04 AM
  • Basic stats about distribution of the length of prefix typed / number of characters saved.

I suggest not giving too much importance to this metric; I feel it may tend to push long titles first, while we should favor short suggestions. I'd just look at the query length before the click (the shorter the better).
The particular problem we have to solve is that autocomplete here is not a step in the search experience but an attempt to jump to a result (not a result page).
IMO abandonment is the most important metric. To have more data, I'd look into collecting data from the entity selector (when editing items), and perhaps consider a search by QID a failure?

And something to keep in mind is that searching by prefix won't allow us to solve all cases; a small illustration is page frequency per prefix length:

[chart attached: page frequency per prefix length]

Each dot corresponds to a prefix (grouped by length). Adding more characters diminishes the number of ambiguous prefixes, but there are still quite a few highly ambiguous ones even with a prefix length of 10 (and IIRC I extracted this data from simplewiki).

Since the query sent to elastic varies hugely depending on the user language, we should keep this information alongside the query text or we will completely skew the dataset.

Another useful reference, this follows the development of autocomplete from MPC to ~2016: https://www.slideshare.net/YichenFeng1/tutorial-on-query-autocompletion

TJones added a comment. · Oct 5 2018, 7:48 PM

Late to the party, but here's my 2¢ on the excellent discussion so far.

For ZRR, if we do track it, it would be interesting to track it by query length, too. While I agree with Stas that there could be a lot of garbage, any fuzziness in the matching (not just straight prefix matching) makes short strings (fewer than 5 characters) very likely to have suggestions.

For reformulation and determining "sessions", we might be able to figure out some heuristics based on comparing sequential searches. If a query is a non-trivial prefix, substring, or non-contiguous subset of the previous query, or has a small proportional edit distance, then it looks like the user deleted or changed some letters and it's probably a reformulation. Similarly, there are probably heuristics to identify new searches: query length drops from > 5 to < 3, or a fairly large edit distance. We could probably get some sequential queries and have human raters identify reformulations and new queries, both to validate hypotheses on reasonable thresholds and to see how often reformulation or new queries (within 10 minutes) happen.
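One possible shape for that heuristic, as a sketch only: it uses difflib's similarity ratio as a stand-in for proportional edit distance, and the thresholds are placeholders to be validated against human-rated pairs of sequential queries.

```
from difflib import SequenceMatcher

def is_subsequence(a, b):
    """True if a is a (possibly non-contiguous) subsequence of b."""
    it = iter(b)
    return all(ch in it for ch in a)

def looks_like_reformulation(prev_query, query, max_distance_ratio=0.4):
    prev_q, q = prev_query.lower(), query.lower()
    if len(prev_q) >= 3 and (q.startswith(prev_q) or prev_q in q
                             or is_subsequence(prev_q, q)):
        return True  # previous query survives (possibly scattered) in the new one
    # proportional edit distance, approximated here via difflib similarity
    similarity = SequenceMatcher(None, prev_q, q).ratio()
    return (1.0 - similarity) <= max_distance_ratio
```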

Smalyshev triaged this task as Medium priority. · Nov 13 2018, 7:16 PM

Transformed! (sort of). Relevance Forge now has a utility for taking in the wikidata completion search logs and tuning the parameters of search based on those logs.

debt closed this task as Resolved. · Jan 18 2019, 7:14 PM
debt claimed this task.
debt added a subscriber: debt.

well done! :)