Cirrus search does not prioritise root pages above their subpages
Closed, Resolved · Public

Description

Hi.
I tried to find the page "ויקיפדיה:הכה את המומחה" (our Reference desk) using the hewiki search box, and started typing "ויקיפדיה:הכה את המ", waiting for autocomplete. I got a list of the desk's archive subpages (<name>/archive/1 and so on), but not the master page (<name>). I believe it should be prioritised above its subpages.

Details

	Subject	Repo	Branch	Lines +/-
	Add setting for prefix search rescore profile	mediawiki/extensions/CirrusSearch	master	+10 -1

Related Objects

Mentioned In: T368894: Cirrus search does not prioritise master pages on their subpages
Mentioned Here: T159861: Add an is_subpage field to elasticsearch documents and use as a scoring feature

Event Timeline

IKhitron created this task. Jan 31 2017, 6:54 PM

Restricted Application added projects: Discovery-ARCHIVED, Discovery-Search. Jan 31 2017, 6:54 PM

Restricted Application added a subscriber: Aklapper.

This problem does not exist if you get an exact title match, but it does exist if you're typing the page out.

E.g., an English example: searching for "WP:Village pump (techni" gives a bunch of subpages, but not "WP:Village pump (technical)".

This really shouldn't be happening.

• Deskana moved this task from needs triage to Up Next on the Discovery-Search board. Feb 2 2017, 11:12 PM

@dcausse Your thoughts on this one would be appreciated!

This looks to be primarily a problem with non-content namespaces, which fall back to prefix search rather than the completion suggester.

Did a couple of spot checks on the 'Village pump (technic' query. The top 10 results have, perhaps unsurprisingly, very similar scores, ranging from 125.7292 down to 125.42696. The desired page is down at 123.32696, around position 40. The desired page has plenty more incoming links, but I have to change the weight drastically for it to win: from a weight of 13 up to 45. I could also change the pow on incoming links from n^0.7 to n^0.5, which allows for a weight of 25. This doesn't seem like a great solution.

I think we might want some feature that adjusts the score based on title length (bm25 already takes this into account, but perhaps not aggressively enough?), or perhaps a more direct is_subpage de-boost.
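As a rough illustration of the "more direct is_subpage de-boost" idea, here is a minimal Python sketch. It is not CirrusSearch code: the `is_subpage` helper, the `SUBPAGE_DEBOOST` multiplier of 0.9, and the archive titles are assumptions made up for the example; only the three scores are the ones quoted in the spot check above.

```python
from typing import List, Tuple

# Hypothetical multiplier applied to subpages; an illustrative value, not a tuned one.
SUBPAGE_DEBOOST = 0.9


def is_subpage(title: str) -> bool:
    """Treat any title containing '/' after the namespace prefix as a subpage.

    This is a simplification: it ignores whether subpages are actually
    enabled for the namespace in question.
    """
    page = title.split(":", 1)[-1]
    return "/" in page


def rescore(results: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Multiply subpage scores by the de-boost and re-sort by descending score."""
    rescored = [
        (title, score * SUBPAGE_DEBOOST if is_subpage(title) else score)
        for title, score in results
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Scores taken from the spot check; the archive titles are illustrative.
results = [
    ("Wikipedia:Village pump (technical)/Archive 1", 125.7292),
    ("Wikipedia:Village pump (technical)/Archive 2", 125.42696),
    ("Wikipedia:Village pump (technical)", 123.32696),
]

ranked = rescore(results)
for title, score in ranked:
    print(f"{score:10.5f}  {title}")
```

Because the scores are so close together, even a mild multiplier reorders the list, which is also why such a signal would need careful tuning before being used for real.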

I think one problem is that I changed the rescore profile to a weighted sum when we activated bm25; unfortunately, this also switched prefix search to the new profile. As Erik noticed, the boost values are not properly tuned for a weighted sum with prefix queries.
We could perhaps switch back to the product in the meantime? We would need to add a new config var telling Cirrus which profile to use for prefix searches, and set it to "classic":

https://en.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=WP:Village%20pump%20(techni&cirrusRescoreProfile=classic

or:

https://he.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=%D7%95%D7%99%D7%A7%D7%99%D7%A4%D7%93%D7%99%D7%94:%D7%94%D7%9B%D7%94%20%D7%90%D7%AA%20%D7%94%D7%9E&cirrusRescoreProfile=classic

It seems to fix the problem for these particular cases.
I'd like to investigate the solutions suggested by Erik, especially including the title size and recording subpage information (is_subpage or perhaps subpage_depth), but this would require considerable work to tune these new signals.
In the meantime I'd suggest implementing a quick fix by adding a new wgCirrusSearchPrefixSearchRescoreProfile var and setting it to "classic" by default.
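For anyone who wants to reproduce the comparison from a script, the two test URLs above can be built roughly as follows. This is only a sketch: `opensearch_url` is a hypothetical helper name, and the parameters are simply the ones visible in the URLs (`action=opensearch`, `format`, `formatversion`, `search`, `cirrusRescoreProfile`). The encoding differs slightly from the pasted links (`+` instead of `%20` for spaces), but it decodes to the same query.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def opensearch_url(host: str, search: str, rescore_profile: str = "classic") -> str:
    """Build an opensearch API URL that overrides the rescore profile.

    Hypothetical helper; it only mirrors the query parameters seen in the
    URLs quoted in this task.
    """
    params = {
        "action": "opensearch",
        "format": "json",
        "formatversion": "2",
        "search": search,
        "cirrusRescoreProfile": rescore_profile,
    }
    # urlencode percent-encodes non-ASCII text as UTF-8 by default.
    return f"https://{host}/w/api.php?{urlencode(params)}"


en_url = opensearch_url("en.wikipedia.org", "WP:Village pump (techni")
he_url = opensearch_url("he.wikipedia.org", "ויקיפדיה:הכה את המ")

# Decode the query strings again to confirm the search terms survive encoding.
en_query = parse_qs(urlsplit(en_url).query)
he_query = parse_qs(urlsplit(he_url).query)

print(en_url)
print(he_url)
```

Fetching the same URL with and without the `cirrusRescoreProfile` parameter gives a side-by-side view of the two profiles for a given prefix.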

• Deskana moved this task from Up Next to Current work on the Discovery-Search board. Feb 6 2017, 6:27 PM

• Deskana edited projects, added Discovery-Search (Current work); removed Discovery-Search.

Change 341437 had a related patch set uploaded (by ebernhardson):
[mediawiki/extensions/CirrusSearch] Add setting for prefix search rescore profile

https://gerrit.wikimedia.org/r/341437

gerritbot added a project: Patch-For-Review. Mar 6 2017, 9:48 PM

The above patch is the stop-gap solution, reverting prefix search to the old 'classic' rescore profile. I pulled the patch over to sistersearch and tested it against the indices we have in relforge, and it looks to do as expected. Anyone can test enwiki at http://sistersearch.wmflabs.org/wiki/Main_Page. I also imported the Hebrew data to relforge, available at http://he-wp-prod-relforge.wmflabs.org/. This also works as expected.

EBernhardson claimed this task. Mar 6 2017, 10:53 PM

EBernhardson moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Indeed, @EBernhardson, but pay attention: The Special:Search page did not find it at all.

Full-text search generally performs poorly on partial words. The token המ does not occur in the current version of the non-archive page (as of 1-17-2017, the dump I imported), so the page is not found. The full-text query that might find the token would be ויקיפדיה:הכה את המ* (this looks to display wrong, at least in my browser, due to mixed ltr and rtl content; the * should be at the end of the המ token). This does find the page with the matching title, although it is around position 43. Again, this would probably need an explicit is_subpage feature rather than hoping that the combination of length normalization and incoming_links pushes the result up far enough. The pages are all so similar that the scores also end up very similar: 130.48738 for the desired page at position 43 and 136.34993 for the archive page that gets the top result. Within the content namespaces we would rely on page popularity (roughly page view counts) to differentiate, but those features are only available for content pages. That leaves mostly the count of incoming links to differentiate with, but the difference is too small. The archive pages seem to have their incoming link counts somewhat inflated because all the archive pages link to each other. This is fine, but it prevents the incoming_links field from being used to push the more popular page to the top.

I see, thank you, for the explanation and for the implementation, @EBernhardson.

dcausse moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Mar 7 2017, 6:10 PM

I created T159861 for adding a more explicit is_subpage signal to use for adjusting full text and prefix search scoring.

Change 341437 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch] Add setting for prefix search rescore profile

https://gerrit.wikimedia.org/r/341437

ReleaseTaggerBot added a project: MW-1.29-release (WMF-deploy-2017-03-14_(1.29.0-wmf.16)). Mar 9 2017, 7:00 PM

• Deskana closed this task as Resolved. Mar 22 2017, 11:05 AM

I think the task should be reopened, there were a couple of cases in the last few days.

Maintenance_bot removed a project: Patch-For-Review. Mon, Jul 1, 12:30 AM

This task covers a problem which happened 7 years ago. If there is some similar problem nowadays, feel free to file a bug report with clear steps. Thanks.

Done.

MGChecker renamed this task from "Cirrus search does not prioritise master pages on their subpages" to "Cirrus search does not prioritise root pages above their subpages". Mon, Jul 1, 11:08 PM
