
Calculate autocomplete examination probabilities from eventlogging data
Closed, ResolvedPublic

Description

http://terrierteam.dcs.gla.ac.uk/publications/kharitonov-sigir2013.pdf
Algorithm 1: Learning the prefix length-independent Aⱼ and the prefix length-dependent Bᵢⱼ probabilities of examination of a query suggestion presented on position j for a prefix of length i.

The algorithm itself is a pretty simple count and ratio. It could be formulated a few ways, but here is a simple pseudocode example:

for session in sessions:
  query = session.query
  # prefixes the user typed through without selecting anything
  for i in range(1, len(query)):
    results = session.results(query[:i])
    if session.clicked in results:
      j = results.index(session.clicked)
      A[j].skipped += 1
      B[i][j].skipped += 1
  # the full prefix the selection was finally made from
  j = session.results(query).index(session.clicked)
  A[j].clicked += 1
  B[len(query)][j].clicked += 1

A = [a_j.clicked / (a_j.clicked + a_j.skipped) for a_j in A]
B = [[b_ij.clicked / (b_ij.clicked + b_ij.skipped) for b_ij in b_i] for b_i in B]
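
To make the bookkeeping concrete, here is a self-contained sketch of the same counting, assuming hypothetical session records (the results_by_prefix mapping and the toy data are illustrative, not the actual eventlogging schema). The counters are kept as sparse dicts only to keep the sketch short; the real aggregation would run over the hive extracts.

from collections import defaultdict

def examination_probabilities(sessions):
    """Count clicks vs. skips per position (A) and per (prefix length,
    position) pair (B), then turn the counts into ratios."""
    a_counts = defaultdict(lambda: [0, 0])   # j -> [clicked, skipped]
    b_counts = defaultdict(lambda: [0, 0])   # (i, j) -> [clicked, skipped]

    for session in sessions:
        query = session['query']
        clicked = session['clicked']
        results_by_prefix = session['results_by_prefix']

        # Strict prefixes the user typed through without selecting anything.
        for i in range(1, len(query)):
            results = results_by_prefix.get(query[:i], [])
            if clicked in results:
                j = results.index(clicked)
                a_counts[j][1] += 1
                b_counts[(i, j)][1] += 1

        # The prefix the selection was finally made from.
        results = results_by_prefix.get(query, [])
        if clicked in results:
            j = results.index(clicked)
            a_counts[j][0] += 1
            b_counts[(len(query), j)][0] += 1

    A = {j: c / (c + s) for j, (c, s) in a_counts.items()}
    B = {ij: c / (c + s) for ij, (c, s) in b_counts.items()}
    return A, B

# Toy usage: the desired item was shown at position 2 for prefix "ca" but not
# selected, then selected at position 0 once the full prefix "cat" was typed.
sessions = [{
    'query': 'cat',
    'clicked': 'Cat',
    'results_by_prefix': {
        'c': ['Chicago', 'Canada'],
        'ca': ['Canada', 'California', 'Cat'],
        'cat': ['Cat', 'Category', 'Catalonia'],
    },
}]
A, B = examination_probabilities(sessions)
print(A)  # {2: 0.0, 0: 1.0}
print(B)  # {(2, 2): 0.0, (3, 0): 1.0}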

Unfortunately we don't have enough data in our autocomplete logging (wikidata or search satisfaction) to calculate this, because we don't have any link from eventlogging back to the result sets that were returned. These result sets are logged, but the associated searchToken does not make it into either set of logs. Search satisfaction autocomplete logs that do have a searchToken are erroneous: they all come from Special:Search and record the search token of the full text search rather than the autocomplete. Wikidata autocomplete doesn't attempt to collect the searchToken (nor is it available in the api response).

Update eventlogging for wikidata and search satisfaction so we can track examination probabilities, and implement the calculation of these values. It's not clear where the implementation should live, but the examination probabilities will be used further for offline evaluation, so perhaps it would make sense in mjolnir, or perhaps we need a new repository for autocomplete?

Event Timeline

We need to update wikidata and search satisfaction logging to do one of the following:

  1. Record the searchToken for each result set displayed
  2. Record the list of pages for each result set displayed. This would probably necessitate recording an event for each displayed search.
  3. Keep an in-memory history of autocomplete results. When an item is selected, look back through the history over all the possible prefixes and record which prefixes displayed the finally selected item, and at what position

Option 1 has the difficulty of fitting the search token into the response. We can probably reuse the same trick used to report which autocomplete algorithm was used: attaching a custom HTTP header. This option has the downside that backend analytics will need to pull in the CirrusSearchRequestSet table at ~70GB a day. This is fine for automated tasks but makes exploration from a REPL annoyingly slow. Additionally this would mean the eventlogging can't stand on its own, which doesn't seem as desirable.

Option 2 fits best with how the search satisfaction logging currently works. If possible we will want to log both the page id and the title provided. In cirrus the page_id will reflect the final document, while the displayed title may be a redirect. This fits somewhat poorly with the eventlogging model though: we are limited in the amount of data that can be sent in a single event, and a list of 10 titles could be up to 2.5kB.

Option 3 is the most privacy-friendly, collecting only the minimum amount of information necessary. Options 1 and 2 require logging an event for each result set displayed to the user, whereas this option still waits for the user to select something and only records the positions where the desired result was displayed but not selected. On the other hand, metrics such as zero result rate and session abandonment can be calculated from the data collected by option 2, but would need to be collected separately for option 3.
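
For option 3, the client-side bookkeeping could look roughly like the sketch below (the real implementation would live in the JavaScript instrumentation; the function names and event fields here are hypothetical):

# Hypothetical per-session state: last result set displayed for each prefix.
history = {}

def on_results_displayed(prefix, results):
    # Remember the most recent result set shown for this prefix.
    history[prefix] = list(results)

def on_item_selected(selected):
    # Walk back through all recorded prefixes and report, for each prefix
    # that displayed the finally-selected item, the position it was shown at.
    observations = []
    for prefix, results in history.items():
        if selected in results:
            observations.append({
                'prefixLength': len(prefix),
                'position': results.index(selected),
            })
    history.clear()
    return observations  # a single event carrying this payload would be logged

# Toy usage:
on_results_displayed('c', ['Chicago', 'Canada'])
on_results_displayed('ca', ['Canada', 'California', 'Cat'])
on_results_displayed('cat', ['Cat', 'Category', 'Catalonia'])
print(on_item_selected('Cat'))
# [{'prefixLength': 2, 'position': 2}, {'prefixLength': 3, 'position': 0}]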

Change 462762 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/core@master] Forward X-Search-ID header to search suggest tracking

https://gerrit.wikimedia.org/r/462762

Change 462775 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Include X-Search-ID header on api requests

https://gerrit.wikimedia.org/r/462775

Change 462797 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Track backend search token for autocomplete results

https://gerrit.wikimedia.org/r/462797

Change 462775 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Include X-Search-ID header on api requests

https://gerrit.wikimedia.org/r/462775

Change 462762 merged by jenkins-bot:
[mediawiki/core@master] Forward X-Search-ID header to search suggest tracking

https://gerrit.wikimedia.org/r/462762

Change 462797 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Track backend search token for autocomplete results

https://gerrit.wikimedia.org/r/462797

We have a week's worth of autocomplete data for wikidata, so I took a stab at this. It's only in a python notebook on SWAP, but I will hopefully clean it up into something more permanent. Currently it generates daily extracts from merged cirrus + eventlogging logs and writes them to ebernhardson.wikibasecompletionclicks in hive. I've then run the click/skip counts over them and generated the following examination probabilities:

wbsearchentities_oct_4_to_14.heatmap.png (592×528 px, 40 KB)

To generate confidence intervals I sampled with replacement against the set of clicks (69k for one week) for 1k iterations.
This is with prefix length on the x axis, and result position on the sub-bars:

wbsearchentities_oct_4_to_14.by_prefix_then_pos.png (592×1 px, 13 KB)

And again, but with result position on the x axis and prefix length on the sub-bars:

wbsearchentities_oct_4_to_14.by_pos_then_prefix.png.png (592×1 px, 14 KB)

Overall the numbers seem sane. Result position is the primary influence, but there are changes depending on the prefix length. The next step will be to integrate this into a metric in relforge. I might re-run this to only aggregate by result position for the relforge metric, as the confidence intervals outside of the first position are quite large but would likely tighten up if we only aggregated across a single dimension.
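
For reference, a minimal sketch of the bootstrap described above, assuming the click/skip observations for a single (prefix length, position) cell are available as a flat list of booleans rather than the actual hive table:

import random

def bootstrap_ci(observations, iterations=1000, alpha=0.05):
    """Bootstrap a confidence interval for an examination probability.
    `observations` is a list of booleans: True if the desired result was
    clicked at this (prefix length, position), False if it was shown there
    but skipped. Illustrative only, not the notebook code."""
    estimates = []
    for _ in range(iterations):
        sample = random.choices(observations, k=len(observations))
        estimates.append(sum(sample) / len(sample))
    estimates.sort()
    return (estimates[int(alpha / 2 * iterations)],
            estimates[int((1 - alpha / 2) * iterations) - 1])

# e.g. the desired result shown 100 times at some cell, clicked 80 of those
observations = [True] * 80 + [False] * 20
print(bootstrap_ci(observations))  # roughly (0.72, 0.88)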

In the heatmap I was surprised to see better examination probabilities at pos+1 for a given prefix, but reading the paper I note (the 1f constraint):

Queries that have been seen by the user, but not selected from the suggestions list are also considered as non examined. This do not influence generality of our proposed model.

Is it right to say that "examination" here means "selected" (or clicked)?

How can we interpret the fact that at prefix 1, results at pos 6 appear about three times as likely to be examined (selected) as results at pos 5?
Could it be noise that will disappear with more data, or perhaps a particular prefix where the preferred result is always at pos 6?
Taking the confidence intervals into account this is less pronounced, so I suppose we need more data to really tell?

Sorry for catching up late.

the prefix length-dependent Bᵢⱼ probabilities of examination of a query suggestion presented on position j for a prefix of length i.

I feel like I must be missing something. It seems like the sum over positions j of Bᵢⱼ for a fixed prefix length i should be no more than 1.

This is with prefix length on the x axis, and result position on the sub-bars:

wbsearchentities_oct_4_to_14.by_prefix_then_pos.png (592×1 px, 13 KB)

Looking at prefix length 4, it looks like position 1 gets ~80% examination probability, and position 2 gets ~70%, 3 gets ~40%, 4 gets ~30%, 5 gets ~25%. Just those five largest add up to 245%. Are we averaging > 2.5 clicks per suggestion?

Bᵢⱼ

Excellent use of Unicode subscripts, BTW.

Sorry for catching up late.

the prefix length-dependent Bᵢⱼ probabilities of examination of a query suggestion presented on position j for a prefix of length i.

I feel like I must be missing something. It seems like the sum over positions j of Bᵢⱼ for a fixed prefix length i should be no more than 1.

I'm not sure why that would sum to 1; the positions are fully independent. Each element of Bᵢⱼ is the count of times the eventually-clicked result was clicked at that position, divided by the count of times it was presented in that position. There is no direct link that would imply the sum of probabilities over either dimension should add to 1. I'm having a hard time explaining why, though...

This is with prefix length on the x axis, and result position on the sub-bars:

wbsearchentities_oct_4_to_14.by_prefix_then_pos.png (592×1 px, 13 KB)

Looking at prefix length 4, it looks like position 1 gets ~80% examination probability, and position 2 gets ~70%, 3 gets ~40%, 4 gets ~30%, 5 gets ~25%. Just those five largest add up to 245%. Are we averaging > 2.5 clicks per suggestion?

Those are all independent values. For prefix length 4 it's saying that if we put the desired result in position 1, there is an 80% chance the user will select it and a 20% chance they will continue typing. Similarly at position 2 there is a 70% chance of selecting the result and a 30% chance of continuing to type. Perhaps the distinction here is that examination probability is only defined when the desired result was presented to the user. I suppose there is an examination probability of wrong items, but we don't have any way to measure whether users looked at those.

Rather, what is happening here is that we define satisfaction as "selected an item from the autocomplete list". We define examination as "if a result was clicked, then it was examined", along with "if the desired result was in the result list but not clicked, it was not examined". Using these, we calculate how likely a user is to select the correct result when it is presented to them.
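
A made-up numeric example of why the values for a single prefix length need not sum to 1: each position's probability has its own denominator (the number of times the desired result was shown at that position), so the values are conditional on different sets of impressions:

# Made-up counts for a single prefix length (not real data): how often the
# eventually-desired result was shown at each position, and how often it was
# selected when shown there.
shown   = {1: 1000, 2: 400, 3: 250}   # impressions of the desired result at position j
clicked = {1: 800,  2: 280, 3: 100}   # selections when it was shown at position j

B_i = {j: clicked[j] / shown[j] for j in shown}
print(B_i)                           # {1: 0.8, 2: 0.7, 3: 0.4}
print(round(sum(B_i.values()), 2))   # 1.9 -- each value has its own denominator,
                                     # so nothing forces the sum across positions to be 1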

Bᵢⱼ

Excellent use of Unicode subscripts, BTW.

I liked it too :)

I see (sort of)—I misunderstood what the stats were measuring. I thought it was the percentage of what was clicked given that a click occurred (in which case they should sum to 1). I'll have to digest your description and think it through again. Thanks!

In the heatmap I was surprised to see better examination probabilities at pos+1 for a given prefix, but reading the paper I note (the 1f constraint):

Queries that have been seen by the user, but not selected from the suggestions list are also considered as non examined. This do not influence generality of our proposed model.

Is it right to say that "examination" here means "selected" (or clicked)?

Yes, for the purposes of this algorithm, selected and examined are considered equivalent. Essentially the assumption is that if the desired result is presented to the user, only two things can happen:

  1. The user sees (examines) the result, and selects it
  2. The user does not see the result, and keeps typing

And the quoted section of the paper is basically saying that sometimes the user may see the result but keep typing anyway. For the purposes of this model, treating those as not-examined still gives a sane result.

How can we interpret the fact that at prefix 1, results at pos 6 appear about three times as likely to be examined (selected) as results at pos 5?
Could it be noise that will disappear with more data, or perhaps a particular prefix where the preferred result is always at pos 6?
Taking the confidence intervals into account this is less pronounced, so I suppose we need more data to really tell?

I think this has to do with the amount of noise; more data would likely clean it up a bit. It will be worth re-visiting with a month or two of data to see how things pan out.