Change Details

===== Motivation =====

The current auto-completion behavior across all wikis is a search against page titles. This is highly successful for encyclopedic content, such as finding the page about a specific topic, but is less than ideal for constructing a full text search query. We suspect auto-completion of search queries could be a significant improvement to the search interface. The initial exploration and implementation will focus on commonswiki and multimedia search, where we expect the current autocomplete is not nearly as successful.

===== Desired Functionality =====

TODO

===== Concerns =====

The primary concern to address is privacy. While query completion could be implemented using a variety of methods, implementations based on submitted user queries are simpler to build and are documented to significantly outperform models built from dictionaries or content, as long as sufficient user interactions are available to feed the system. User queries are PII and should not be directly released. Since providing the query completions to users is a de-facto release, we need to deal with this. Typical approaches involve removing the long tail of search queries.

The second concern to address is NSFW content. Any sampling of a few dozen search queries is bound to have queries that refer to pornographic content. In initial explorations of the most popular queries starting with various letters, many of the top 5 query lists from commons have results that fall in this category. While we don't take a particular position on the appropriateness of content, we do have prior experience with concerns from communities when search returns NSFW content for seemingly unrelated queries (see `gedit` -> `genit(al)`, and commons search being disabled in the enwiki sidebar). It is unclear what the appropriate way to address this is, but some consultation outside search platform will likely be required.

==== Proposal ====

As an initial exploration/implementation, we should try to keep things simple.

===== Assumptions =====

There is enough historical query data after chopping the long tail to provide useful query suggestions.

===== Candidate generation =====

An offline process run in analytics networks to process query logs and emit scored query completion candidates.

After a review of some literature, there are some relatively simple scoring methods we can apply that look at query frequency over time and predict the future frequency of a query. We can prototype with a very simple historical count, and then revisit the scoring with a time series based prediction that can account for seasonality and rate of change (Holt-Winters?). Completions can be sorted by the expected future frequency, with the most popular completions presented to the user. There are more advanced scoring methods possible which take into account personalization and/or ML, but we are rejecting them for now as too complex.

The initial prototype will chop off the long tail of queries to address the privacy concerns. We will likely need to decide on a method for evaluating the effectiveness of various thresholds.

For NSFW queries we can source a blacklist of words and flag queries containing those words. Alternative methods of flagging NSFW queries involve looking at the ratio of SFW to NSFW results returned by the search. Unfortunately we don't have that classification available today, so a simple word blacklist will have to do. Ideally, while evaluating query completion, we should have the ability to return filtered or un-filtered query completions.

===== Index structure =====

The structure of what we store in elasticsearch and how it gets there.

In light of keeping things simple and focusing on commonswiki, a single completion index containing only commonswiki completions will be built and maintained. Data can be moved from analytics to elasticsearch using our existing swift pipeline.

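As a minimal sketch of the candidate generation prototype described above (historical count, long-tail cutoff, word blacklist), the following Python may help make the steps concrete. Everything named here is an assumption for illustration: the `(user_id, query)` row shape, the `normalize` helper, the `Candidate` record, and in particular the choice of "distinct users per query" as the long-tail threshold, which the proposal leaves open.

```python
from collections import Counter, defaultdict
from typing import Iterable, NamedTuple


class Candidate(NamedTuple):
    query: str
    score: int   # historical count, standing in for predicted future frequency
    nsfw: bool   # True when any blacklisted word appears in the query


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries merge."""
    return " ".join(query.lower().split())


def generate_candidates(
    log: Iterable[tuple[str, str]],
    min_users: int,
    blacklist: frozenset[str],
) -> list[Candidate]:
    """Turn (user_id, query) rows into scored, flagged completion candidates.

    Hypothetical cutoff: a query is kept only if at least `min_users`
    distinct users submitted it. Other long-tail cutoffs are possible.
    """
    counts = Counter()
    users = defaultdict(set)
    for user_id, raw in log:
        query = normalize(raw)
        if not query:
            continue
        counts[query] += 1
        users[query].add(user_id)

    candidates = [
        # Flag, rather than drop, queries containing a blacklisted word so
        # that filtered and un-filtered completions can both be served.
        Candidate(q, n, any(word in blacklist for word in q.split()))
        for q, n in counts.items()
        if len(users[q]) >= min_users  # chop off the long tail
    ]
    # Most frequent first; ties broken alphabetically for stable output.
    candidates.sort(key=lambda c: (-c.score, c.query))
    return candidates


if __name__ == "__main__":
    demo_log = [
        ("u1", "Cat"), ("u2", "cat"), ("u3", "cat  "),
        ("u1", "sunset"), ("u2", "sunset"),
        ("u1", "my private query"),
    ]
    for candidate in generate_candidates(demo_log, min_users=2, blacklist=frozenset()):
        print(candidate)
```

Keeping the NSFW flag on the record instead of filtering at generation time leaves the filtered versus un-filtered decision to query time. Swapping the historical count for a time series prediction would only change how `score` is computed.
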
The completion index should be able to reuse the index lifecycle maintenance written for glent in the mjolnir-kafka-bulk-daemon. If the work seems promising we can consider how to handle many wikis. An alternate option to consider would be building per-language completion indices, but it's unclear if the query language is usefully shared between sites.

For the index mapping, the elasticsearch completion suggester seems like a good fit for this use case. While the completion suggester is an in-memory data structure, for the limited use case of commonswiki we can likely ignore resource usage for now. When considering how to expand this to all wikis we will need to take a closer look. It is not currently clear if the completion suggester should be used relatively bare, as CirrusSearch does today, or if we should carry forward additional context information (perhaps supporting filtering of NSFW-flagged queries?).

If time permits, it would be interesting to apply the same process but sourcing queries by language instead of by wiki. On one hand this offers a seemingly graceful solution to the multi-lingual nature of commonswiki, but on the other hand the completions will be less specialized to commonswiki and media search, and as such may prove to be less useful.

===== Runtime support =====

The related code necessary to query the new completion candidates at the right time and return results to users.

As an autocomplete API, the user experience is heavily tied to the latency of the requests. While mediawiki provides reasonable latency, there will always be significant overhead compared to the ~11ms p50 we see from the current completion search backend. It's plausible we could integrate with restbase, but spreading search code around to even more locations seems undesirable. The mediawiki api is then the sensible place for query completion runtime support to live.

If we wish to integrate with the mediawiki api, it's not clear where or how it should be offered.
Attaching this as a profile of `prefixsearch` would be wrong, as it is not a prefix search against wiki titles. The sensible alternative is a new top-level api action that provides query completions. Alternatively, there are new REST APIs in mediawiki; perhaps somewhere in there would be sensible. As an initial exploration into query completion, and with the goal of avoiding complexity, we can probably avoid the REST api and its newness, instead providing an experimental api action.

There is a remaining question about UI: do we keep a hard split between completion types, or do we blend titles and queries into a single display? Again, in trying to keep things simple, no blending will occur on the backend. We could potentially issue multiple api requests and do a naive blend in the frontend for experimental reasons.

===== Analysis =====

User interface event logging and event analysis for autocomplete and search usage.

After some literature review the following metrics seem to be used. It's not clear yet which we should be concerned with, but I've included a few here for consideration.

* MRR
** Mean Reciprocal Rank averages, across queries, the reciprocal of the rank at which the relevant result appears.
** Variants include MRR weighted by the number of candidates available. The intuition here is that shorter prefixes with more candidates are harder to get right. More weight should be given to good suggestions for `ca` than for `california golden state warrio`.
* pSaved / eSaved
** "pSaved is defined as the probability of using a query suggestion while submitting a query. eSaved equates to the normalized amount of keypresses a user can avoid due to the deployed query suggestion mechanism." - A Survey of Query
** We've previously optimized for this metric on wikidata, and used the actual number of keypresses in an AB test to evaluate how well it worked.
* Success Rate at top K (SR@K)
** "denotes the average ratio of the actual query that can be found in the top K query completion candidates over the test data. This metric is widely used for tasks whose ground truth consists of only one instance, such as query completion"
* Minimal Keystrokes (MKS)
** Scores "indicate the minimal number of keypresses that the user has to type before a query is submitted by clicking a query completion [Duan and Hsu, 2011]."

===== Future Directions =====

* Completion diversity - Essentially filtering semantically similar queries to provide users with more unique options.
* Personalization - There are approaches that work within the current search context, not requiring long-term user profiles, that could be investigated.
* ML approaches are also possible, but generally seem to involve longer term user profiles.
* Entity detection - On wikis many searches include entities; this is likely why existing title prefix search works so well. Some form of entity detection may be useful, but there is limited research into this field.
* A Query->Historical count index has use cases beyond query completion. For full-text search, knowing the query popularity can indicate whether this is a head or a tail query before performing the full-text search. This information can inform additional options such as performing spelling corrections.
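
Returning to the metrics listed under Analysis, the following Python is a rough sketch of how MRR, SR@K, and a keystroke count could be computed offline. It assumes each test sample is a `(ranked candidates, submitted query)` pair. The function names and the toy `prefix_suggester` are hypothetical, and `min_chars_until_shown` is a simplified reading of MKS that counts only typed characters, not the exact formulation from the cited literature.

```python
from typing import Callable, Iterable, Sequence

Sample = tuple[Sequence[str], str]          # (ranked candidates, submitted query)
Suggester = Callable[[str], Sequence[str]]  # prefix -> ranked candidates


def reciprocal_rank(candidates: Sequence[str], target: str) -> float:
    """Return 1/rank of the target in the candidate list, or 0.0 if absent."""
    for rank, candidate in enumerate(candidates, start=1):
        if candidate == target:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(samples: Iterable[Sample]) -> float:
    """Average the reciprocal rank over all samples."""
    samples = list(samples)
    if not samples:
        return 0.0
    return sum(reciprocal_rank(c, t) for c, t in samples) / len(samples)


def success_rate_at_k(samples: Iterable[Sample], k: int) -> float:
    """Fraction of samples whose submitted query is in the top k candidates."""
    samples = list(samples)
    if not samples:
        return 0.0
    return sum(1 for c, t in samples if t in c[:k]) / len(samples)


def min_chars_until_shown(suggest: Suggester, target: str, k: int) -> int:
    """Characters typed before the target first appears in the top k.

    Simplified stand-in for MKS: falls back to len(target) when the
    target is never suggested, i.e. the user types the whole query.
    """
    for typed in range(1, len(target) + 1):
        if target in suggest(target[:typed])[:k]:
            return typed
    return len(target)


def prefix_suggester(ranked_queries: Sequence[str]) -> Suggester:
    """Toy suggester: keep the global ranking, filter by prefix."""
    def suggest(prefix: str) -> list[str]:
        return [q for q in ranked_queries if q.startswith(prefix)]
    return suggest


if __name__ == "__main__":
    suggest = prefix_suggester(["cat", "car", "california", "castle"])
    samples = [(suggest("ca"), "cat"), (suggest("ca"), "california")]
    print(mean_reciprocal_rank(samples))
    print(success_rate_at_k(samples, k=2))
    print(min_chars_until_shown(suggest, "california", k=1))
```

The weighted MRR variant and pSaved / eSaved would need per-prefix candidate counts and real interaction logs respectively, so they are left out of this sketch.
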