Exact title search for page in extra content namespace does not return that page in the first 500 results
Open, NormalPublic

Description

There is a page on the Spanish Wikipedia called Anexo:Ciudades de la India por población (in English, "Appendix:Cities in India by population"). It is in namespace 104, but that's included in $wgNamespacesToBeSearchedDefault.

However, if I search for the page's exact title, minus the namespace, the page doesn't appear anywhere in the first 500 results (excluding the Wikidata search result at the bottom of the page, which does find it). The same is true even if I add the namespace, separated from the title by a space. This happens whether I'm logged in or logged out.

Restricted Application added a project: Discovery. · Dec 5 2016, 7:19 PM
Restricted Application added a subscriber: Aklapper.
neilpquinn renamed this task from Exact title search on the Spanish Wikipedia does not return page in the first 500 results to Exact title search for page in extra content namespace does not return that page in the first 500 results. · Dec 5 2016, 7:20 PM
neilpquinn updated the task description. · Dec 5 2016, 7:41 PM

It looks like this page is getting a great score for full-text matching, but the score is being discounted by 80% because the page is in namespace 104. The configuration right now is that even though namespaces 0, 100 and 104 are content namespaces, 100 and 104 get an 80% discount to their score. This feature makes sense for pushing talk pages and such down, but I'm not sure it's the right thing to do here. Will need to ponder.
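To make the effect concrete, here is a minimal sketch of how such a discount could push a strong match down the list. It assumes the discount is applied as a simple multiplier on the text-match score; the weights table, the function names, the placeholder page titles and all score values are invented for illustration and are not taken from the actual search configuration.

```python
# Hypothetical per-namespace weights: 1.0 for the main namespace and 0.2
# (an 80% discount) for the other content namespaces. Assumed values.
NAMESPACE_WEIGHTS = {0: 1.0, 100: 0.2, 104: 0.2}


def weighted_score(text_score, namespace, weights=NAMESPACE_WEIGHTS):
    """Scale a raw text-match score by its namespace weight (default 1.0)."""
    return text_score * weights.get(namespace, 1.0)


def rank(hits, weights=NAMESPACE_WEIGHTS):
    """Sort (title, namespace, text_score) hits by weighted score, best first."""
    return sorted(
        hits,
        key=lambda hit: weighted_score(hit[2], hit[1], weights),
        reverse=True,
    )


# Made-up raw scores: the exact-title match scores far higher than the rest.
hits = [
    ("Anexo:Ciudades de la India por población", 104, 40.0),
    ("Main-namespace page A", 0, 12.0),
    ("Main-namespace page B", 0, 9.0),
]

ranked = rank(hits)
for title, namespace, score in ranked:
    print(title, namespace, weighted_score(score, namespace))
```

With these made-up numbers the namespace 104 page ends up last, even though its raw score is more than three times that of any other hit.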

neilpquinn added a comment. · Edited · Dec 5 2016, 8:03 PM

It looks like this page is getting a great score for full-text matching, but the score is being discounted by 80% because the page is in namespace 104. The configuration right now is that even though namespaces 0, 100 and 104 are content namespaces, 100 and 104 get an 80% discount to their score. This feature makes sense for pushing talk pages and such down, but I'm not sure it's the right thing to do here. Will need to ponder.

Hmm, interesting! To be honest, I don't think it even makes sense for pushing down talk pages. As far as I know, the only time a namespace other than 0 will be searched is when someone has made an explicit decision that it should be (whether that's the community deciding to add it to $wgNamespacesToBeSearchedDefault or the user checking the box). The only way a talk page is going to be searched is if someone wants to search it, in which case I don't see a reason to discount its results.

A very common usage is to search all namespaces, in which case weighting on the namespaces can be pretty important for returning relevant results.

We probably need to reevaluate the namespace boosts, since the scoring formula changed completely with BM25. I don't really know how to tune them, but given the example described here it seems that we are now way too aggressive...
Sadly, we mainly work with English Wikipedia, which has only a single content namespace.
I can set up an eswiki index in relforge. I can start with this example as a base and make sure this page is in the top 3, but I would need someone to either give more examples or evaluate new boost values.
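As a rough sketch of the kind of check described above, the snippet below tries a few candidate weights for the extra content namespaces and keeps those that put the expected page in the top 3. It does not use relforge or any real index: the helper names, the candidate weights, the placeholder titles and the raw scores are all hypothetical, and the discount is again assumed to be a plain multiplier.

```python
def rank_of(target, hits, weights):
    """Return the 1-based position of `target` after applying namespace weights."""
    ordered = sorted(
        hits,
        key=lambda hit: hit[2] * weights.get(hit[1], 1.0),
        reverse=True,
    )
    return [title for title, _, _ in ordered].index(target) + 1


def weights_meeting_goal(target, hits, candidates, top_k=3):
    """Keep the candidate weights that place `target` within the top `top_k`."""
    return [
        w for w in candidates
        if rank_of(target, hits, {0: 1.0, 100: w, 104: w}) <= top_k
    ]


TARGET = "Anexo:Ciudades de la India por población"

# Made-up raw text-match scores as (title, namespace, score).
hits = [
    (TARGET, 104, 40.0),
    ("Main-namespace page A", 0, 12.0),
    ("Main-namespace page B", 0, 11.0),
    ("Main-namespace page C", 0, 9.0),
    ("Main-namespace page D", 0, 7.0),
]

passing = weights_meeting_goal(TARGET, hits, [0.2, 0.25, 0.5, 1.0])
print(passing)
```

A single query like this only shows whether one page recovers. Judging a new boost value properly would still need the additional examples asked for above.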

Deskana moved this task from Needs triage to Later on the Discovery-Search board. · Dec 8 2016, 11:04 PM
Deskana triaged this task as Normal priority.
Deskana added a subscriber: Deskana.

It would be good to look at this at some point.