
[EPIC] Tune query result scoring in Simple Search
Open, Needs Triage, Public

Description

Epic task to track work on relevancy improvements for the Simple Search Wikidata search endpoint.

Context: a new Simple Search Wikidata endpoint was added to WikibaseCirrusSearch. The actual use cases for this endpoint are still quite unclear. It was built using a scoring strategy similar to the one used by completion searches, but this type of scoring technique might not be ideal for this endpoint.

Below are points copy/pasted from a conversation regarding possible improvements/ideas to explore:

  • The scoring strategy comes from the prefix search profile. This API is now completely different, so I doubt it's fit for your purpose. I wonder, for instance, if the reason you're prioritizing phrase search is because you had scoring/precision issues. While phrase search is nice, it should not become something that users rely on too often to work around precision issues.
  • If you found the need to run phrase queries to solve ranking issues, be aware that we sometimes run a phrase query as part of the rescore window. See SearchContext::setPhraseRescore(); it's run as part of the rescore window because phrase searches can be costly. This could be something to look into if you often find, despite tuning the scoring query, that phrasal matches can help ranking.
  • Constant score queries are used. These are fine for non-tokenized fields, but for .plain fields I'm not too sure; there might be a big information loss here. You could see how we run fulltext searches and perhaps rethink the scoring strategy (see \Wikibase\Search\Elastic\EntityFullTextQueryBuilder).
  • Use the score explanation to understand how the scores are computed: by appending &cirrusDumpResult&cirrusExplain=pretty you should get the various score components, e.g. https://www.wikidata.org/w/index.php?search=sweet+potato&title=Special%3ASearch&profile=default&fulltext=1&cirrusDumpResult&cirrusExplain=pretty
  • When adapting the scoring query, make sure that the rescore query is still meaningful: there's a delicate balance to keep between the "rescore" function and the scoring query. E.g. you might not want the rescore to take precedence and be the main driver of the ranking, as this would simply mean that you sort by incoming_links & sitelinks count. The opposite imbalance is also possible: you still want the rescore function to help tie-break good matches.
  • Know the behavior of the various indexed fields and use them wisely. You may solve simple scoring issues by pulling in the right fields and using the right Elasticsearch query (see the index mapping and index analysis config).
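
For the score explanation point above, a small helper can make it easier to compare explanations across several search terms. This is only a sketch: the function name explain_url is made up for illustration, and the parameters are simply the ones from the example URL in the list, so other profiles or endpoints may need different values.

```python
from urllib.parse import urlencode

# Base URL taken from the example in the task description.
WIKIDATA_INDEX = "https://www.wikidata.org/w/index.php"


def explain_url(search_term, base=WIKIDATA_INDEX):
    """Build a Special:Search URL that dumps the raw result with score explanations.

    explain_url is a hypothetical helper name; only the query parameters
    come from the task description.
    """
    query = urlencode(
        {
            "search": search_term,
            "title": "Special:Search",
            "profile": "default",
            "fulltext": "1",
        }
    )
    # The two debug parameters are appended verbatim, as in the description.
    return f"{base}?{query}&cirrusDumpResult&cirrusExplain=pretty"


if __name__ == "__main__":
    for term in ("sweet potato", "douglas adams"):
        print(explain_url(term))
```

Opening the printed URLs in a browser should show the per-component score breakdown for each hit, which is a starting point for checking whether the scoring query or the rescore dominates.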


Event Timeline

I found some edge-case Items that can't currently be searched for with Simple Item Search. I assume this is because the search term gets tokenized to an empty string. The filter query might need to be modified so that it does not only search tokenized fields.

After testing the endpoints on production, we realised the results we were getting were as expected (i.e., good/reliable). We've now moved this to our tech backlog and will revisit it after we've collected more data on how phrase matching is used.

We'd spoken about this in our call, and the engineers from both teams had also chatted among themselves. The consensus was that it made the most sense for the WMF Search team to take this ticket on, as they're the experts.


@Ifrahkhanyaree_WMDE, we have a lot of people out on PTO today, and none of the few who are here are familiar with the call. Who did you talk to?

@WMDE-leszek, it would also be nice to know who the original WMF Search Engineer you spoke to is, so we can get more context from them. We can't find a concrete question or problem here. There's not enough context from the conversation before the advice listed in the description.

@Ifrahkhanyaree_WMDE please let us know how important this is from your perspective; the Search Platform Team might require some product guidance to better understand what the use cases of this endpoint are.
In the meantime I'm going to re-word this ticket as an Epic; actual work should probably be filed as actionable subtasks.

I found some edge-case Items that can't currently be searched for with Simple Item Search. I assume this is because the search term gets tokenized to an empty string. The filter query might need to be modified so that it does not only search tokenized fields.

Such search strings might require targeting the nearmatch fields with a dedicated filter.
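
To illustrate the failure mode and the suggested direction, here is a rough sketch. Everything in it is an assumption made for illustration: toy_tokens is a crude stand-in for the real index analyzer, build_filter is a hypothetical helper, and the field names labels_all.plain and labels_all.near_match are placeholders that should be checked against the actual index mapping. It is not how WikibaseCirrusSearch builds its filter today.

```python
import re


def toy_tokens(text):
    """Crude stand-in for an index analyzer: lowercase, keep only word characters.

    The real analysis chain is defined in the index analysis config and
    will behave differently; this only mimics punctuation being dropped.
    """
    return re.findall(r"\w+", text.lower())


def build_filter(term,
                 tokenized_field="labels_all.plain",        # assumed field name
                 near_match_field="labels_all.near_match"):  # assumed field name
    """Return an Elasticsearch-style filter that ORs a tokenized match
    with a dedicated clause on a near-match field."""
    return {
        "bool": {
            "should": [
                {"match": {tokenized_field: term}},
                {"match": {near_match_field: term}},
            ],
            "minimum_should_match": 1,
        }
    }


if __name__ == "__main__":
    for term in ("Sweet potato", "!!!"):
        # A term with no tokens cannot match the tokenized clause at all,
        # so only the near-match clause can let it through the filter.
        print(repr(term), "tokens:", toy_tokens(term))
        print(build_filter(term))
```

With a purely tokenized filter, a term like "!!!" produces no tokens and matches nothing; the extra clause gives such strings a path through the filter, assuming the near-match field keeps punctuation.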

dcausse renamed this task from Tune query result scoring in Simple Search to [EPIC] Tune query result scoring in Simple Search. Aug 18 2025, 4:14 PM
dcausse added a project: Epic.
dcausse updated the task description.
dcausse moved this task from needs triage to Wikibase Search on the Discovery-Search board.