Epic task to track work on relevance improvements for the Wikidata Simple Search endpoint.
Context: a new Simple Search Wikidata endpoint was added to WikibaseCirrusSearch. The actual use cases for this endpoint are still unclear. It was built with a scoring strategy similar to the one used by completion searches, but this type of scoring technique might not be ideal for this endpoint.
Below are points copied from a conversation regarding possible improvements/ideas to explore:
- The scoring strategy comes from the prefix search profile, but this API is now completely different, so I doubt it's fit for your purpose. I wonder, for instance, whether you're prioritizing phrase search because you had scoring/precision issues. While phrase search is nice, it should not become something users rely on too often to work around precision issues.
- If you found the need to run phrase queries to solve ranking issues, be aware that we sometimes run a phrase query as part of the rescore window; see SearchContext::setPhraseRescore(). It runs in the rescore window because phrase searches can be costly. This could be worth looking into if you often find, despite tuning the scoring query, that phrasal matches help ranking.
- Constant score queries are used. These are fine for non-tokenized fields, but for .plain fields I'm not too sure; there might be a big information loss here. You could look at how we run fulltext searches and perhaps rethink the scoring strategy (see \Wikibase\Search\Elastic\EntityFullTextQueryBuilder).
- Use the score explanation to understand how the scores are computed: by appending &cirrusDumpResult&cirrusExplain=pretty to a search URL you should get the various score components, e.g. https://www.wikidata.org/w/index.php?search=sweet+potato&title=Special%3ASearch&profile=default&fulltext=1&cirrusDumpResult&cirrusExplain=pretty
- When adapting the scoring query, make sure that the rescore query is still meaningful: there's a delicate balance to keep between the "rescore" function and the scoring query. For example, you might not want the rescore to take precedence and become the main driver of the ranking, as this would simply mean that you sort by incoming_links & sitelinks count. The opposite failure is also possible: you still want the rescore function to help break ties between good matches.
- Know the behavior of the various indexed fields and use them wisely; you may solve simple scoring issues by picking the right fields and using the right Elasticsearch query (see the index mapping and index analysis config).
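
To make the debugging and rescore-balance points above more concrete, here is a rough Python sketch; it is illustrative only and not the WikibaseCirrusSearch implementation. `explain_url` rebuilds the example explain URL quoted above for an arbitrary query. `combined_score` and `rank` are a toy model of how a rescore weight can shift ranking, assuming the additive combination that Elasticsearch's rescore uses by default (`score_mode: total`); the actual rescore profile may combine scores differently. All function names, document names, and score values are hypothetical.

```python
from urllib.parse import urlencode


def explain_url(query, base="https://www.wikidata.org/w/index.php"):
    """Build a Special:Search URL that dumps results with score explanations."""
    params = {
        "search": query,
        "title": "Special:Search",
        "profile": "default",
        "fulltext": "1",
    }
    # The debug flags are appended verbatim, as in the example URL above.
    return f"{base}?{urlencode(params)}&cirrusDumpResult&cirrusExplain=pretty"


def combined_score(query_score, rescore_score,
                   query_weight=1.0, rescore_query_weight=1.0):
    """Toy model: weighted sum of the main query score and the rescore score."""
    return query_weight * query_score + rescore_query_weight * rescore_score


def rank(docs, rescore_query_weight):
    """Return document names ordered by combined score, best first.

    docs maps a name to (text_score, popularity_score); both are made up.
    """
    scored = [
        (name, combined_score(text, pop, rescore_query_weight=rescore_query_weight))
        for name, (text, pop) in docs.items()
    ]
    return [name for name, _ in sorted(scored, key=lambda s: s[1], reverse=True)]


# Hypothetical documents: (text match score, popularity score).
DOCS = {
    "exact_match_obscure": (2.0, 0.1),
    "exact_match_popular": (2.0, 0.5),
    "weak_match_very_popular": (1.0, 0.9),
}

if __name__ == "__main__":
    print(explain_url("sweet potato"))
    # Small weight: popularity only breaks the tie between the two exact matches.
    print(rank(DOCS, rescore_query_weight=0.5))
    # Large weight: popularity dominates and the weak match wins.
    print(rank(DOCS, rescore_query_weight=5.0))
```

With a small rescore weight, popularity only separates the two equally good text matches; with a large one, the weak but popular match moves to the top, which is the "sort by incoming_links & sitelinks count" failure described above.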