
[Epic] Query Completion
Open, MediumPublic

Description

Motivation

The current auto-completion behavior across all wikis is a search against page titles. This is highly successful for encyclopedic content, such as finding the page about a specific topic, but is less than ideal for constructing a full text search query. We suspect auto-completion of full search queries could be a significant improvement to the search interface. The initial exploration and implementation will focus on commonswiki and multimedia search, where we expect the current autocomplete is much less successful.

Desired Functionality

The Query Completion feature recommends possible full queries to a searcher while they are typing their current query. The list of query candidates is curated from the history of search queries other users have used.

Concerns

The primary concern to address is privacy. While query completion could be implemented using a variety of methods, implementations based on submitted user queries are simpler to build and are documented to significantly outperform models built from dictionaries or content, as long as sufficient user interactions are available to feed the system. User queries are PII and should not be released directly; since presenting query completions to users is a de facto release of those queries, this must be addressed. Typical approaches involve removing the long tail of rarely issued search queries.

The second concern to address is NSFW content. Any sampling of a few dozen search queries is bound to include queries that refer to pornographic content. In initial explorations of the most popular queries starting with various letters, many of the top-5 query lists from commons contained results in this category. While we don't take a particular position on the appropriateness of the content itself, we do have prior experience with community concerns when search returns NSFW content for seemingly unrelated queries (see gedit -> genit(al), and the fact that commons search is disabled in the enwiki sidebar). It is unclear what the appropriate way to address this is, but some consultation outside the Search Platform team will likely be required.

Proposal

As an initial exploration/implementation, we should try to keep things simple.

Assumptions

There is enough historical query data after chopping the long tail to provide useful query suggestions.

Candidate generation

An offline process run in analytics networks to process query logs and emit scored query completion candidates.

After a review of some literature there are some relatively simple scoring methods we can apply that look at query frequency over time and predict the future frequency of a query. We can prototype with a very simple historical count, and then revisit the scoring with a time-series-based prediction that can account for seasonality and rate of change (Holt-Winters?). Completions can be sorted by expected future frequency, with the most popular completions presented to the user. More advanced scoring methods are possible that take into account personalization and/or ML, but we are rejecting them for now as too complex.
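As a sketch of the time-series direction (not a committed design), a minimal additive Holt-Winters forecast over per-period query counts might look like the following; the smoothing parameters are illustrative, not tuned:

```python
def holt_winters_forecast(series, season_len, alpha=0.3, beta=0.05,
                          gamma=0.1, horizon=1):
    """Additive Holt-Winters: level + trend + seasonal components.

    series: historical per-period counts for one query (e.g. weekly counts);
        needs at least two full seasons of data to initialise.
    season_len: periods per season (e.g. 52 for weekly data, yearly cycle).
    Returns the predicted count `horizon` periods past the end of the series.
    """
    # Initialise level, trend, and seasonal components from the first seasons.
    level = sum(series[:season_len]) / season_len
    trend = (sum(series[season_len:2 * season_len])
             - sum(series[:season_len])) / season_len ** 2
    seasonal = [series[i] - level for i in range(season_len)]

    for t in range(season_len, len(series)):
        last_level = level
        s = seasonal[t % season_len]
        level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (series[t] - level) + (1 - gamma) * s

    t = len(series)
    return level + horizon * trend + seasonal[(t + horizon - 1) % season_len]
```

The offline job would run this per query and sort candidates by the forecast rather than by raw historical count.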

The initial prototype will chop off the long tail of queries to address the privacy concerns. We will likely need to decide on a method for evaluating the effectiveness of various thresholds.
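A minimal sketch of the long-tail cutoff, assuming we count distinct users per query (both the threshold value and the counting method are open questions):

```python
from collections import Counter

def frequent_queries(query_log, min_users=10):
    """Keep only queries issued by at least `min_users` distinct users.

    Counting distinct users (rather than raw submissions) prevents a single
    user from pushing their own query past the privacy threshold.
    query_log: iterable of (user_id, query) pairs.
    Returns a Counter mapping surviving queries to their distinct-user count.
    """
    users_per_query = {}
    for user_id, query in query_log:
        users_per_query.setdefault(query, set()).add(user_id)
    return Counter({q: len(u) for q, u in users_per_query.items()
                    if len(u) >= min_users})
```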

For NSFW queries we can source a blacklist of words and flag queries containing those words. Alternative methods of flagging NSFW queries involve looking at the ratio of SFW to NSFW results returned by the search. Unfortunately we don't have that classification available today, so a simple word blacklist will have to do. Ideally, while evaluating query completion, we should be able to return either filtered or unfiltered query completions.
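A sketch of the word-blacklist flagging, using token-level rather than substring matching to avoid false positives of the gedit -> genit(al) variety:

```python
import re

def flag_nsfw(query, blacklist):
    """Flag a query if any whole token matches the word blacklist.

    Token-level matching avoids the classic substring pitfall where an
    innocent query is flagged merely because it contains a banned word
    as a substring.
    """
    tokens = re.findall(r"\w+", query.lower())
    return any(token in blacklist for token in tokens)
```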

Index structure

The structure of what we store in elasticsearch and how it gets there.

In light of keeping things simple and focusing on commonswiki, a single completion index containing only commonswiki completions will be built and maintained. Data can be moved from analytics to elasticsearch using our existing swift pipeline. This should be able to reuse the index lifecycle maintenance written for glent in the mjolnir-kafka-bulk-daemon. If the work seems promising we can consider how to handle many wikis. An alternate option to consider would be building per-language completion indices, but it's unclear if the query language is usefully shared between sites.

For the index mapping, the elasticsearch completion suggester seems like a good fit for this use case. While the completion suggester is an in-memory data structure, for the limited use case of commonswiki we can likely ignore resource usage for now; when considering how to expand this to all wikis we will need to take a closer look. It is currently unclear whether the completion suggester should be used relatively bare, as CirrusSearch does today, or whether we should carry forward additional context information (perhaps supporting filtering of NSFW-flagged queries?).
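For illustration, a completion-suggester mapping with a category context for NSFW filtering might look like the following; the field and context names are hypothetical, not a settled schema:

```python
# Hypothetical index mapping: a completion field plus a category context,
# so filtered and unfiltered completions can be served from one index.
completion_mapping = {
    "mappings": {
        "properties": {
            "suggest": {
                "type": "completion",
                "contexts": [
                    {"name": "nsfw", "type": "category"},
                ],
            },
            "score": {"type": "float"},  # predicted future frequency
        }
    }
}

def completion_request(prefix, include_nsfw=False, size=10):
    """Build an elasticsearch _search body using the completion suggester."""
    contexts = {"nsfw": ["false", "true"] if include_nsfw else ["false"]}
    return {
        "suggest": {
            "query-completion": {
                "prefix": prefix,
                "completion": {
                    "field": "suggest",
                    "size": size,
                    "contexts": contexts,
                },
            }
        }
    }
```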

If time permits it would be interesting to apply the same process but sourcing queries by language instead of by wiki. On one side this offers a seemingly graceful solution to the multi-lingual nature of commonswiki, but on the other side the completions will be less specialized to commonswiki and media search, as such they may prove to be less useful.

Runtime support

The related code necessary to query the new completion candidates at the right time and return results to users.

As an autocomplete API, the user experience is heavily tied to request latency. While mediawiki provides reasonable latency, there will always be significant overhead compared to the ~11ms p50 we see from the current completion search backend. It's plausible we could integrate with restbase, but spreading search code across even more locations seems undesirable. The mediawiki api is then the sensible place for query completion runtime support to live.

If we wish to integrate with the mediawiki api, it's not clear where or how it should be offered. Attaching this as a profile of prefixsearch would be wrong, as it is not a prefix search against wiki titles. The sensible alternative is a new top-level api action that provides query completions. Alternatively, there are new REST APIs in mediawiki; perhaps somewhere in there would be sensible. As an initial exploration into query completion, and with the goal of avoiding complexity, we can probably avoid the REST API and its newness, instead providing an experimental api action.

There is a remaining question about UI: do we keep a hard split between completion types, or do we blend titles and queries into a single display? Again, in trying to keep things simple, no blending will occur on the backend. We could potentially issue multiple api requests and do a naive blend in the frontend for experimental reasons.
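The naive frontend blend could be as simple as interleaving the two ranked lists; a sketch, assuming both apis return ordered lists of completion strings:

```python
from itertools import zip_longest

def naive_blend(title_completions, query_completions, size=10):
    """Interleave two ranked completion lists, dropping duplicates.

    A frontend-only blend for experiments: alternate between sources,
    preserving each list's internal order, until `size` items are collected.
    """
    blended, seen = [], set()
    for pair in zip_longest(title_completions, query_completions):
        for item in pair:
            if item is not None and item not in seen:
                seen.add(item)
                blended.append(item)
            if len(blended) == size:
                return blended
    return blended
```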

Analysis

User interface event logging and event analysis for autocomplete and search usage.

After some literature review the following metrics seem to be used. It's not clear yet which we should be concerned with, but I've included a few here for consideration.

  • MRR
    • Mean Reciprocal Rank looks at the mean rank of relevant results on a per-query basis.
    • Variants include MRR weighted by the number of candidates available. The intuition here is that shorter prefixes with more candidates are harder to get right: more weight should be given to good suggestions for ca than for california golden state warrio.
  • pSaved / eSaved.
    • "pSaved is defined as the probability of using a query suggestion while submitting a query. eSaved equates to the normalized amount of keypresses a user can avoid due to the deployed query suggestion mechanism." - A Survey of Query
    • We've previously optimized for this metric on wikidata, and used the actual number of keypresses in an AB test to evaluate how well it worked.
  • Success Rate at top K (SR@K)
    • "denotes the average ratio of the actual query that can be found in the top K query completion candidates over the test data. This metric is widely used for tasks whose ground truth consists of only one instance, such as query completion"
  • Minimal Keystrokes (MKS)
    • Scores "indicate the minimal number of keypresses that the user has to type before a query is submitted by clicking a query completion [Duan and Hsu, 2011]."
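For concreteness, the first two metrics above could be computed along these lines (a sketch, not committed instrumentation):

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over a set of queries.

    first_relevant_ranks: for each query, the 1-based rank of the first
    relevant suggestion, or None if no relevant suggestion was shown.
    """
    rr = [1.0 / rank if rank else 0.0 for rank in first_relevant_ranks]
    return sum(rr) / len(rr)

def success_rate_at_k(candidates_per_query, actual_queries, k):
    """SR@K: fraction of test queries found in the top-K candidates."""
    hits = sum(
        1 for cands, actual in zip(candidates_per_query, actual_queries)
        if actual in cands[:k]
    )
    return hits / len(actual_queries)
```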
Future Directions
  • Completion diversity - Essentially filtering semantically similar queries to provide users with more unique options
  • Personalization - There are approaches that work within the current search context, not requiring long-term user profiles, that could be investigated
  • ML approaches are also possible, but generally seem to involve longer term user profiles
  • Entity detection - On wikis many searches include entities, this is likely why existing title prefix search works so well. Some form of entity detection may be useful, but there is limited research into this field.
  • A Query->Historical count index has use cases outside simply query completion. For full-text search knowing the query popularity can give important information about if this is a head or a tail query before performing the full-text search. This information can inform additional options such as performing spelling corrections.

Event Timeline

<moved into task description>

Nice write up! A few ideas and comments below.

look at query frequency over time

A random thought for a later iteration: Wikipedia is an unusual case because most titles are nouns or proper nouns, and we could try to map titles+popularity data to query+frequency to supplement it. Off the top of my head, we might be able to find a decent correlation between titles and queries to get a mapping of popularity data to frequency data (e.g., 10K views of Mr. Rogers <=> 2K queries for Mr. Rogers). While very popular titles wouldn't add much to high frequency queries, it might be something to mine for the long tail, since titles don't contain PII. Other options include mining Wikidata in a similar fashion.

account for seasonality

I suspect there are differences at the day-by-day level, with weekdays and weekends having different likelihoods for certain topics. Will we be able to have enough data to handle month-by-month seasonality? I worry that if we only process the last 90 days of data we might make seasonal suggestions "too late"—like recommending Christmas-related topics in January.

For NSFW queries we can source a blacklist of words

We might also be able to mine (shallow?) categories or Wikidata items to generate potential lists. The English Wikipedia category "Pornography Terminology" (which maps to some other languages) and the Wikidata items identified as "paraphilias" might be sources. We might also be able to look for relatively unambiguous translations or mappings in Wiktionary or Wikidata. ORES may also have some data for languages they've looked at.

Would it make sense to consider making this something the community could maintain? We could populate a candidate list from various sources, and then let the communities approve items and maintain their own list. This would make it easier to handle new NSFW results as they arise—"go here and propose adding it to the list"—and make it easier to accommodate languages where a good list is hard to source.

do we keep a hard split between completion types or do we blend titles and queries into a single display?

I think you earlier proposed having title completion in the Go box (upper corner search) and this more general query completion in the "big" search box on the Special:Search page. That still seems like a proposal worth considering, especially for the first iteration.

pSaved / eSaved

I like these because we've used them before for Wikidata (as you noted) and they seemed to work well. Definitely seems like it would be interesting to include these because we could compare improvement to the Wikidata work as a baseline.

A Survey of Query

Looks like the citation got truncated. A Survey of Query Auto Completion in Information Retrieval, perhaps?

I assume that normalization and grouping of queries will be something along the lines of what we currently do for the click model data, since that seems to work well.

Looking forward to seeing more!

After some initial working through the data, the initial plan of a word blacklist isn't going to work out. I manually reviewed the top 10 completions for all single letters, built a short (~40 word) blacklist, and then collected the set of two-letter completions. Reviewing a sample of a few hundred of these completions, we still have an NSFW rate of ~5.5%. Manual review could continue, but to be honest it's a bit soul crushing and I am not interested in maintaining such a banned word list going forward.

The standard practice in image search is to classify the images, so we are going to have to do exactly that. We will need to collect search results from the production api's, fetch the appropriate thumbnails, and pass them through an NSFW detection model. Ideally this should be done in a way that doesn't re-classify the same images every week, but our dataset is small enough that perhaps it wouldn't be a big deal.
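A sketch of what that classification step might look like, with a cache so the same thumbnails aren't re-scored every week; classify_fn stands in for whatever NSFW detection model we end up adopting:

```python
def classify_results(image_urls, classify_fn, cache):
    """Classify thumbnails, skipping images scored in a previous run.

    classify_fn: placeholder for the eventual NSFW model; takes a thumbnail
        URL and returns an NSFW probability in [0, 1].
    cache: persistent mapping of URL -> previously computed probability.
    """
    scores = {}
    for url in image_urls:
        if url not in cache:
            cache[url] = classify_fn(url)
        scores[url] = cache[url]
    return scores

def query_nsfw_ratio(scores, threshold=0.8):
    """Fraction of a query's results whose NSFW probability exceeds threshold."""
    if not scores:
        return 0.0
    flagged = sum(1 for s in scores.values() if s > threshold)
    return flagged / len(scores)
```

The per-query ratio could then replace the word blacklist as the flagging signal.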

Gehel renamed this task from Query Completion to [Epic] Query Completion.Oct 6 2020, 2:23 PM

I know the data on "successful" queries can be hazy, but does it make sense to restrict the completed queries to only candidates that have resulted in a successful search -- in the chatbot world, this typically means not including completion candidates that resulted in a Not Understood classification. In other words, it doesn't actually help the user if we help them complete their query to something that has no or irrelevant results. Ideally we could remove all the query candidates that are dead ends, since they're ultimately not helping the user anyway.

I don't think we have a clear way right now of quantifying which query candidates resulted in "good" search results, but I think it should be something we want to push more on in future improvements to this functionality. Note that this could potentially also be a way of weeding out some subset of NSFW query candidates, especially if they aren't returning anything useful in the first place.

Generally we've referred to these as SAT(isfied) or DSAT. Within full text search we previously defined SAT as a query that results in a clickthrough and a specified amount of dwell time (10 or 20 seconds, iirc) on the found page. All queries that are not SAT are thus DSAT. It's not clear if dwell time would be a valid interaction metric for images; there is minimal text for the user to spend their time reading. Additionally, we never worried about how strictly accurate this was, as what we were looking for was directional changes in the metric due to an AB test. If we are using the metric for throwing things out we might want it to be much more accurate.
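As a sketch, the SAT definition above (clickthrough plus dwell time) could be expressed like this; the event schema is hypothetical:

```python
def is_sat(session_events, min_dwell_seconds=10):
    """Label a search session SAT if any clickthrough had enough dwell time.

    session_events: list of event dicts with an "action" key and, for
    clicks, a "dwell" key giving seconds spent on the clicked page.
    """
    return any(
        e["action"] == "click" and e.get("dwell", 0) >= min_dwell_seconds
        for e in session_events
    )
```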

If we wanted to do a filter, the easiest interaction that we already measure is a simple clickthrough without dwell time. This might give a lower bound on acceptable queries, although there is also the concept of "good abandonment", which holds that some users can be satisfied by the SERP without interacting with any individual results. Good abandonment is particularly prevalent in image grids, though it also occurs with behaviour such as Google presenting answer snippets at the top of a SERP.

We would likely need to run some analysis and see what effects additional filters have. I suspect that queries with enough repeats to escape the privacy cutoff will also generally have clickthroughs, but it could change the counts.

MPhamWMF triaged this task as Medium priority.Mar 9 2022, 8:42 PM