Description
Since we started tokenizing search terms to broaden searches, exact matches often don't appear in the first batch of results (or don't appear at all). This happens because tokenized stop words are used as part of the search query.
Steps to reproduce:
- The call to wikilambdasearch_labels with the search term "index match list" returns "index of match in list" as the first result (see in api sandbox)
- The call to wikilambdasearch_labels with the search term "index of match in list" does not return the function with the exact match (see in api sandbox)
Observed behavior:
Stop words bring back too many results, and not all of them are passed through the match rate algorithm (as it's costly), so an exact match can end up among the results that are never scored and fall out of the returned list.
We should make sure to always search for the whole search string rather than depending only on tokens.
Similarly, a search using tokens should be able to exclude stop words and only return results that match relevant tokens. However, there is no way to exclude stop words from a multilingual search without using external tools, so we can compromise by excluding short words. By adding the whole search string to the search conditions, we make sure that intentionally short strings are not excluded, and we treat token search as additional content, used only when tokens are long enough to be considered (e.g. longer than 2 characters).
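The proposal above could be sketched roughly as follows. This is only an illustration in Python, not the WikiLambda code: the function name `build_search_terms` and the constant `MIN_TOKEN_LENGTH` are hypothetical, and the threshold of 3 simply encodes "longer than 2 characters" from the description.

```python
# Hypothetical sketch: the names below are illustrative, not part of WikiLambda.
MIN_TOKEN_LENGTH = 3  # assumption: tokens must be longer than 2 characters


def build_search_terms(search_term: str, min_token_length: int = MIN_TOKEN_LENGTH) -> list:
    """Return the strings to search for: the whole term first, then long-enough tokens."""
    # Collapse repeated whitespace so the whole-string condition is stable.
    whole = " ".join(search_term.split())
    if not whole:
        return []

    # The whole string is always included, so intentionally short
    # searches (e.g. "if") are never dropped.
    terms = [whole]

    # Tokens are treated as additional conditions only; short ones are
    # skipped as a language-agnostic stand-in for stop word removal.
    for token in whole.split(" "):
        if len(token) >= min_token_length and token not in terms:
            terms.append(token)
    return terms
```

With this sketch, "index of match in list" would produce the whole string plus the tokens "index", "match" and "list", while "of" and "in" are left out. A search for just "if" would still produce "if", because the whole string is always kept.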
Completion checklist
- Before closing this task, go through each item of the checklist available here: https://www.mediawiki.org/wiki/Abstract_Wikipedia_team/Definition_of_Done#Front-end_Task/Bug_Completion_Checklist