When calling the Wikilambda label search API with multi-token search terms (with stopwords), the exact match is often not returned.
Closed, Resolved | Public | BUG REPORT

Description

Since we started tokenizing search terms to broaden searches, exact matches often do not appear in the first batch of results, or do not appear at all. This happens because tokenized stop words are included in the search query.

Steps to reproduce:

  1. The call to wikilambdasearch_labels with the searchterm "index match list" returns "index of match in list" as first result (see in api sandbox)
  2. The call to wikilambdasearch_labels with the searchterm "index of match in list" does not return the function with the exact match (see in api sandbox)

Observed behavior:

Stop words bring back too many results, and not all of them are passed through the match rate algorithm (as it's costly), so an exact match can end up among the results that are never scored.

We should make sure to always search for the whole substring rather than only depending on tokens.

Similarly, a token-based search should be able to exclude stop words and return only results that match relevant tokens. However, we cannot exclude stop words from a multilingual search without external tools, so we can compromise by excluding short words instead. By adding the whole substring to the search conditions, we make sure that intentionally short strings are not excluded, and we treat token search as additional content, used only when tokens are long enough to be considered (e.g. longer than 2 characters).
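
As a rough illustration of the proposal above (not the actual WikiLambda code, which is PHP), the list of search conditions could be built roughly like this. The function name `build_search_terms` and the constant `MIN_TOKEN_LENGTH` are hypothetical, and the threshold of 3 is an assumption taken from the "longer than 2 characters" example:

```python
# Hypothetical sketch: names and threshold are assumptions, not the extension's API.
MIN_TOKEN_LENGTH = 3  # "longer than 2 characters"


def build_search_terms(search_term: str, min_token_length: int = MIN_TOKEN_LENGTH) -> list[str]:
    """Return the whole search string first, then any tokens long enough to be considered."""
    # Collapse repeated whitespace so the whole-string condition is stable.
    whole = " ".join(search_term.split())
    if not whole:
        return []

    # The whole untokenized string is always searched, even if it is short.
    terms = [whole]

    # Tokens are only additional conditions, and short ones (likely stop words) are skipped.
    for token in whole.split(" "):
        if len(token) >= min_token_length and token not in terms:
            terms.append(token)
    return terms


if __name__ == "__main__":
    print(build_search_terms("index of match in list"))
    print(build_search_terms("of"))
```

With the search term from the steps to reproduce, this would search for the full string "index of match in list" plus the tokens "index", "match" and "list", while "of" and "in" are dropped as tokens. A search for just "of" is still performed, because the whole string is always kept.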


Event Timeline

Change #1276666 had a related patch set uploaded (by Genoveva Galarza; author: Genoveva Galarza):

[mediawiki/extensions/WikiLambda@master] Include whole untokenized substring plus heavier tokens in label search

https://gerrit.wikimedia.org/r/1276666

Change #1276666 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] Include whole untokenized substring plus heavier tokens in label search

https://gerrit.wikimedia.org/r/1276666