Exact title search for page in extra content namespace does not return that page in the first 500 results
Closed, Resolved · Public

Description

There is a page on the Spanish Wikipedia called Anexo:Ciudades de la India por población (in English, "Appendix:Cities in India by population"). It is in namespace 104, but that's included in $wgNamespacesToBeSearchedDefault.

However, if I search for the page's exact title, minus the namespace, the page doesn't appear anywhere in the first 500 results (excluding the Wikidata search result at the bottom of the page, which does find it). The same is true even if I add the namespace, separated from the title by a space. This happens whether I'm logged in or logged out.

Event Timeline

Restricted Application added a subscriber: Aklapper.
nshahquinn renamed this task from "Exact title search on the Spanish Wikipedia does not return page in the first 500 results" to "Exact title search for page in extra content namespace does not return that page in the first 500 results". Dec 5 2016, 7:20 PM

It looks like this page is getting a great score for full-text matching, but that score is being discounted by 80% because the page is in namespace 104. The current configuration is that even though namespaces 0, 100 and 104 are all content namespaces, 100 and 104 get an 80% discount to their score. This feature makes sense for pushing talk pages and such down, but I'm not sure it's the right thing to do here. Will need to ponder.
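To make the effect of that discount concrete, here is a minimal sketch. It assumes the discount is a simple per-namespace multiplier applied to the raw relevance score, which may not match how CirrusSearch actually combines scores. The function names and all scores are hypothetical; only the 80% discount for namespaces 100 and 104 comes from the comment above.

```python
# Hypothetical per-namespace multipliers: 1.0 for the main namespace,
# 0.2 (an 80% discount) for namespaces 100 and 104, as described above.
NAMESPACE_WEIGHTS = {0: 1.0, 100: 0.2, 104: 0.2}


def weighted_score(raw_score, namespace, weights=NAMESPACE_WEIGHTS):
    """Scale a raw relevance score by its namespace multiplier."""
    return raw_score * weights.get(namespace, 1.0)


def rank(hits, weights=NAMESPACE_WEIGHTS):
    """Sort (title, namespace, raw_score) hits by weighted score, best first."""
    return sorted(
        hits,
        key=lambda hit: weighted_score(hit[2], hit[1], weights),
        reverse=True,
    )


# Made-up raw scores: the exact title match scores far higher than the
# partial matches before any namespace weighting is applied.
hits = [
    ("Anexo:Ciudades de la India por población", 104, 40.0),
    ("India", 0, 12.0),
    ("Bombay", 0, 9.0),
]

ranked = rank(hits)
for title, namespace, raw in ranked:
    print(title, namespace, weighted_score(raw, namespace))
```

With these illustrative numbers, the exact title match drops below both main-namespace pages once the multiplier is applied. Passing equal weights for all three namespaces puts it back on top, which is the behavior the task description expects.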

Hmm, interesting! To be honest, I don't think it even makes sense for pushing down talk pages. As far as I know, the only time a namespace other than 0 will be searched is when someone has made an explicit decision that it should be (whether that's the community deciding to add it to $wgNamespacesToBeSearchedDefault or the user checking the box). The only way a talk page is going to be searched is if someone wants to search it, in which case I don't see a reason to discount its results.

A very common usage is to search all namespaces, in which case weighting on the namespaces can be pretty important for returning relevant results.

We probably need to reevaluate namespace boosts, since the scoring formula changed completely with BM25. I don't really know how to tune that, but given the example described here it seems that we are now way too aggressive...
Sadly, we mainly work with English Wikipedia, which has only a single content namespace.
I can set up an eswiki index in relforge. I can start with this example as a base and make sure this page is in the top 3, but I would need someone to either give more examples or evaluate new boost values.

Deskana moved this task from needs triage to search-icebox on the Discovery-Search board.
Deskana subscribed.

It would be good to look at this at some point.

The near match weight has been tweaked in T257922, which might affect (or even fix) this ticket as well. @TJones / @dcausse: any idea if the status of this ticket should be updated now that T257922 is closed?

TJones claimed this task.

This specific problem should be fixed now, since we increased the near match weight globally in T257922, and the example given works.

@neilpquinn wrote:
The same is true even if I add the namespace, separated from the title by a space.

Adding Anexo without the colon means it is treated as a search term rather than a namespace, so that still doesn't give the desired result, but that is to be expected. Trying to recognize namespaces or keywords without their colons and then doing something clever and correct is outside the scope of this ticket, which is limited to exact title matches.
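The difference between the two query forms can be sketched as follows. This is an illustration only, not CirrusSearch's actual query parser: `parse_query` and `NAMESPACE_IDS` are hypothetical names, and the only mapping taken from this ticket is Anexo to namespace 104.

```python
# Hypothetical name-to-ID map; only the Anexo -> 104 entry comes from the ticket.
NAMESPACE_IDS = {"anexo": 104}


def parse_query(query, namespace_ids=NAMESPACE_IDS):
    """Return (namespace_id or None, search_terms).

    A prefix counts as a namespace only when it is followed by a colon
    and matches a known namespace name; otherwise the whole query is
    treated as ordinary search terms.
    """
    prefix, sep, rest = query.partition(":")
    namespace = namespace_ids.get(prefix.strip().lower())
    if sep and namespace is not None:
        return namespace, rest.strip()
    return None, query.strip()


with_colon = parse_query("Anexo:Ciudades de la India por población")
without_colon = parse_query("Anexo Ciudades de la India por población")
```

In this sketch the colon form yields namespace 104 plus the bare title, while the space-separated form leaves "Anexo" in the search terms, so it can no longer be an exact title match.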

We probably need to reevaluate namespace boosts, since the scoring formula changed completely with BM25. I don't really know how to tune that, but given the example described here it seems that we are now way too aggressive...

I agree, though it is also outside the scope of this ticket. It would be challenging to gather data on queries intending to find (or allowing) results outside the main namespaces to use for optimizing ranking. Unfortunately, it's not on our radar right now.