
Cirrus search does not prioritise master pages on their subpages
Closed, Resolved · Public

Description

Hi.
I tried to find the page "ויקיפדיה:הכה את המומחה" (our Reference desk) in the hewiki search box, and started typing "ויקיפדיה:הכה את המ", waiting for autocomplete. I get a list of the desk's archive subpages (<name>/archive/1 and so on), but not the master page (<name>). I believe the master page should be prioritised over its subpages.

Event Timeline

IKhitron created this task. Jan 31 2017, 6:54 PM
Restricted Application added projects: Discovery, Discovery-Search. Jan 31 2017, 6:54 PM
Restricted Application added a subscriber: Aklapper.
Deskana triaged this task as Normal priority. Feb 2 2017, 11:12 PM
Deskana added a subscriber: Deskana.

This problem does not exist if you get an exact title match, but it does exist if you're typing the page out.

For example, on English Wikipedia, searching for "WP:Village pump (techni" returns a bunch of subpages, but not "WP:Village pump (technical)" itself.

This really shouldn't be happening.

@dcausse Your thoughts on this one would be appreciated!

This looks to be primarily a problem with non-content namespaces, which fall back to prefix search rather than the completion suggester.

EBernhardson added a comment. Edited · Feb 3 2017, 1:36 AM

Did a couple of spot checks on the 'Village pump (technic' query. The top 10 results have, perhaps unsurprisingly, very similar scores, ranging from 125.7292 down to 125.42696. The desired page is down at 123.32696, around position 40. It has plenty more incoming links, but I have to change the weight on incoming links drastically for it to win: from 13 up to 45. Alternatively, changing the power applied to incoming links from n^0.7 to n^0.5 allows for a weight of 25. Neither seems like a great solution.

I think we might want some feature that adjusts the score based on title length (bm25 already takes this into account, but perhaps not aggressively enough?), or perhaps a more direct is_subpage de-boost.
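A direct subpage de-boost of the kind suggested here could be sketched as follows. This is only an illustration, not CirrusSearch code: the function names, the 0.9 penalty factor, and the assumption that every "/" in a title marks a subpage level are all hypothetical (in MediaWiki, subpages are only enabled for some namespaces). The archive title is made up; the two scores are the top score and the desired-page score quoted above.

```python
def subpage_depth(title: str) -> int:
    """Hypothetical signal: number of "/" separators in the title."""
    return title.count("/")


def is_subpage(title: str) -> bool:
    """Hypothetical boolean form of the same signal."""
    return subpage_depth(title) > 0


def deboost(score: float, title: str, factor: float = 0.9) -> float:
    """Multiply the score by `factor` once per subpage level (assumed scheme)."""
    return score * factor ** subpage_depth(title)


# (title, score before the de-boost); the archive title is illustrative only.
candidates = [
    ("Wikipedia:Village pump (technical)/Archive 1", 125.7292),
    ("Wikipedia:Village pump (technical)", 123.32696),
]

# Re-rank by the de-boosted score, highest first.
ranked = sorted(candidates, key=lambda c: deboost(c[1], c[0]), reverse=True)
```

With these assumed numbers the master page moves ahead of the archive page, because the scores are so close that even a mild penalty outweighs the gap. A real signal would still need tuning against actual queries.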

I think one problem is that I changed the rescore profile to a weighted sum when we activated BM25; unfortunately this also switched prefix search to the new profile. As Erik noticed, the boost values are not properly tuned for a weighted sum with prefix queries.
We could perhaps switch back to the product in the meantime? We would need to add a new config var telling Cirrus which profile to use for prefix searches, and set it to "classic":

or:

It seems to fix the problem for these particular cases.
I'd like to investigate the solutions suggested by Erik, especially including the title size and perhaps recording subpage information (is_subpage or perhaps subpage_depth), but tuning these new signals would require considerable work.
In the meantime I'd suggest implementing a quick fix by adding a new wgCirrusSearchPrefixSearchRescoreProfile var and setting it to "classic" by default.
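To illustrate why a weighted sum and a product can rank near-identical results differently, here is a rough sketch. It does not reproduce either CirrusSearch profile: the thread only says "weighted sum" and "product", so the saturation formula, the constant k=30, the logarithmic boost, the page names, the text scores, and the link counts are all assumptions. Only the weight of 13 and the exponent 0.7 come from the comments above.

```python
import math


def satu(n: float, k: float = 30.0, a: float = 0.7) -> float:
    """Assumed saturating boost in [0, 1): n^a / (k^a + n^a)."""
    return n**a / (k**a + n**a)


def weighted_sum_score(text_score: float, incoming_links: int,
                       weight: float = 13.0) -> float:
    """Additive rescore: the boost can add at most `weight` points."""
    return text_score + weight * satu(incoming_links)


def product_score(text_score: float, incoming_links: int) -> float:
    """Multiplicative rescore: the boost scales the whole text score."""
    return text_score * math.log10(incoming_links + 2)


# (title, text score, incoming links); all values are made up.
pages = [
    ("Desk", 123.3, 5000),
    ("Desk/archive/1", 125.7, 300),
]

by_sum = max(pages, key=lambda p: weighted_sum_score(p[1], p[2]))[0]
by_product = max(pages, key=lambda p: product_score(p[1], p[2]))[0]
```

Under these assumptions the additive boost saturates, so many more incoming links buy only a point or two and the archive page keeps its small text-score lead. The multiplicative boost scales the whole score and lets the heavily linked master page win.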

Change 341437 had a related patch set uploaded (by ebernhardson):
[mediawiki/extensions/CirrusSearch] Add setting for prefix search rescore profile

https://gerrit.wikimedia.org/r/341437

The above patch is the stop-gap solution, reverting prefix search to the old 'classic' rescore profile. I pulled the patch over to sistersearch and tested it against the indices we have in relforge, and it appears to work as expected. Anyone can test enwiki at http://sistersearch.wmflabs.org/wiki/Main_Page. I also imported the Hebrew data to relforge, available at http://he-wp-prod-relforge.wmflabs.org/. This also works as expected.

IKhitron added a comment. Edited · Mar 6 2017, 11:09 PM

Indeed, @EBernhardson, but pay attention: The Special:Search page did not find it at all.

Full-text search generally performs poorly on partial words. The token המ does not appear in the current version of the non-archive page (as of 1-17-2017, the dump I imported), so the page is not found. The full-text query that might find the token would be ויקיפדיה:הכה את המ* (this looks to display wrong, at least in my browser, due to mixed LTR and RTL content; the * should be at the end of the המ token). This does find the page with the matching title, although it is around position 43.

Again, this would probably need an explicit is_subpage feature rather than hoping that the combination of length normalization and incoming_links pushes the result up far enough. The pages are all so similar that their scores also end up very similar: 130.48738 for the desired page at position 43 and 136.34993 for the archive page that is the top result.

Within the content namespaces we would rely on page popularity (roughly page view counts) to differentiate, but those features are only available for content pages. That leaves mostly the count of incoming links to differentiate with, but the difference is too small. The archive pages' incoming link counts seem to be somewhat inflated by all the archive pages linking to each other. This is fine in itself, but it prevents the incoming_links field from being used to push the more popular page to the top.
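The partial-word behaviour described above can be sketched in a few lines. This is a toy model, not how CirrusSearch or its analyzers actually work: the tokenizer and the matching function are hypothetical, and the English "Village pump" title from earlier in the thread is used instead of the Hebrew one to avoid mixed-direction text.

```python
import re


def tokenize(text: str) -> list:
    """Toy analyzer: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def term_matches(query_token: str, doc_tokens: list) -> bool:
    """Whole-token match; a trailing "*" turns it into a prefix match."""
    if query_token.endswith("*"):
        prefix = query_token[:-1]
        return any(tok.startswith(prefix) for tok in doc_tokens)
    return query_token in doc_tokens


doc_tokens = tokenize("Wikipedia:Village pump (technical)")

exact = term_matches("techni", doc_tokens)      # partial word, no wildcard
wildcard = term_matches("techni*", doc_tokens)  # same token with trailing *
```

In this model the partial word only matches once the trailing `*` asks for a prefix match, which mirrors why the plain query misses the page while the wildcard query finds it.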

IKhitron added a comment. Edited · Mar 7 2017, 12:16 AM

I see. Thank you for the explanation and for the implementation, @EBernhardson.

I created T159861 for adding a more explicit is_subpage signal to use for adjusting full text and prefix search scoring.

Change 341437 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch] Add setting for prefix search rescore profile

https://gerrit.wikimedia.org/r/341437

Deskana closed this task as Resolved. Mar 22 2017, 11:05 AM