Adjust CirrusSearchNamespaceWeights for Commons
Closed, Resolved | Public | 3 Estimated Story Points

Description

On Commons, NS_MAIN is used for gallery pages, which are user-curated collections of images: essentially hand-curated search results, which are very often not well maintained.

Categories tend to be maintained better than galleries, but searching tends to rank galleries higher in the results than categories (probably because gallery pages are likely to contain more matching text).

We have a community wish to change this: https://meta.wikimedia.org/wiki/Community_Wishlist/Wishes/When_searching_Commons,_under_%22Categories_and_Pages%22_show_the_category_for_the_search_term

A possible way to do this would be to change the weights for galleries in CirrusSearchNamespaceWeights, e.g.

'wgCirrusSearchNamespaceWeights' => [
	'commonswiki' => [
		0 => 0.9, // NS_MAIN (galleries)
	],
],

Event Timeline

pfischer set the point value for this task to 2. (Aug 11 2025, 3:57 PM)
pfischer changed the point value for this task from 2 to 3.

Change #1182186 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Reduce galleries weight in search on commons

https://gerrit.wikimedia.org/r/1182186

To get an idea of what an appropriate weight would be, I ran some stats against an hour of incoming requests. Note that the first search result is considered position 1. Also note that this does not re-run the queries; it applies custom weights to the scores and re-sorts the results that were originally returned. The true mean will likely be larger than presented here, as galleries are pushed down and new results come into the result list.

Filters:

  • Requests issued to Special:Search on commonswiki
  • Search results must have both gallery and category results
  • Request must not have provided a custom limit parameter
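
| gallery weight | mean gallery pos | mean category pos |
| --- | --- | --- |
| 1.0 | 6.86 | 10.44 |
| 0.9 | 8.37 | 10.33 |
| 0.8 | 9.56 | 10.27 |
| 0.7 | 10.06 | 10.18 |
| 0.6 | 10.37 | 10.07 |
| 0.5 | 10.65 | 9.98 |
| 0.4 | 10.97 | 9.90 |
| 0.3 | 11.42 | 9.86 |
| 0.2 | 13.70 | 9.48 |
| 0.1 | 18.32 | 9.21 |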

With the goal of not preferring galleries over categories, a weight of 0.6 seems reasonable to start with: it is the highest weight at which the mean gallery position falls below the mean category position. Categories also have a very low weight; we could consider increasing it if this change is found to be insufficient.
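
For reference, the re-weighting described above (scale the logged gallery scores, re-sort, then measure positions) can be sketched roughly as follows. This is only an illustration, not the script that produced the table: the function names are hypothetical, the sample scores are made up, and it assumes that "position" means the 1-based rank of the best-ranked gallery or category in each request and that the logged gallery scores were computed with a weight of 1.0.

```python
from statistics import mean

NS_MAIN = 0       # galleries on Commons
NS_CATEGORY = 14


def rerank(results, gallery_weight):
    """Scale gallery scores by gallery_weight and re-sort, best score first.

    results: list of (namespace, score) pairs from one logged request.
    Assumes the logged gallery scores were computed with a weight of 1.0.
    """
    rescored = [
        (ns, score * gallery_weight if ns == NS_MAIN else score)
        for ns, score in results
    ]
    return sorted(rescored, key=lambda r: r[1], reverse=True)


def first_position(results, namespace):
    """1-based position of the best-ranked result in the namespace, or None."""
    for pos, (ns, _score) in enumerate(results, start=1):
        if ns == namespace:
            return pos
    return None


def mean_positions(requests, gallery_weight):
    """Mean first-gallery and first-category position over all requests.

    Every request must contain both a gallery and a category result,
    matching the filter used for the stats.
    """
    reranked = [rerank(r, gallery_weight) for r in requests]
    return (
        mean(first_position(r, NS_MAIN) for r in reranked),
        mean(first_position(r, NS_CATEGORY) for r in reranked),
    )


# Made-up scores for two requests; not taken from the real logs.
requests = [
    [(NS_MAIN, 30.0), (6, 25.0), (NS_CATEGORY, 20.0)],
    [(NS_MAIN, 12.0), (NS_CATEGORY, 10.0), (6, 5.0)],
]
for weight in (1.0, 0.6):
    print(weight, mean_positions(requests, weight))
```

Because only results already present in the logged page can be re-sorted, a sketch like this shares the limitation noted above: it cannot account for new results that would enter the list once galleries are pushed down.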

Change #1182186 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Reduce galleries weight in search on commons

https://gerrit.wikimedia.org/r/1182186

Mentioned in SAL (#wikimedia-operations) [2025-09-18T07:06:00Z] <dcausse@deploy1003> Started scap sync-world: Backport for [[gerrit:1182186|cirrus: Reduce galleries weight in search on commons (T401590)]]

Mentioned in SAL (#wikimedia-operations) [2025-09-18T07:12:10Z] <dcausse@deploy1003> dcausse, ebernhardson: Backport for [[gerrit:1182186|cirrus: Reduce galleries weight in search on commons (T401590)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-09-18T07:22:21Z] <dcausse@deploy1003> Finished scap sync-world: Backport for [[gerrit:1182186|cirrus: Reduce galleries weight in search on commons (T401590)]] (duration: 16m 20s)

Deployed the change and tested a few queries. I could definitely see some impact (before the change, the queries below did not return results from the category namespace in the top 10):

  • dog breeds -> Category:Dog Breeds is now in the top-3
  • university calcutta -> Category:University of Calcutta

Unfortunately, the example from the wishlist, Roses, still does not return Category:Roses anywhere near the top 10. Unsure what to do next. Options:

  • Continuing to lower NS_MAIN boost
  • Boost NS_MAIN & NS_CATEGORY to 1 so that they're "equal".

You know this already @dcausse, but just wanted to have it in the ticket comments for reference - Category:Roses is just an empty page with the template {{category redirect|Rosa}}, so perhaps that's why it's not appearing in the results?

It's definitely not helping but if you remove NS_MAIN it's the first result: https://commons.wikimedia.org/w/index.php?search=roses&title=Special%3ASearch&profile=advanced&fulltext=1&ns12=1&ns14=1&ns100=1&ns106=1
Generally I agree that if the search term matches the page name almost perfectly, it should be in the top 3. This is different for the other example with camel -> Category:Camels, which is imo harder to get right.

Using the pageid filter we can get an explain that contains only the top three results and the target category:

https://commons.wikimedia.org/w/index.php?search=roses+pageid%3A11071939%7C26885175%7C10972966%7C1506267&title=Special%3ASearch&profile=advanced&fulltext=1&ns0=1&ns12=1&ns14=1&ns100=1&ns106=1&cirrusDumpResult&cirrusExplain=pretty

Roses
	
18.003006 product of:
  360.06012 Sum of the following:
    343.50336 Dismax (take winner of below)
      343.50336 all_near_match:roses
      257.62753 all_near_match.asciifolding:roses
    9.873508 title:rose
    3.8860097 title.plain:roses
    2.7972603 suggest:roses
  0.05 Minimum Of:
    0.05 function score, score mode [multiply]
      0.25 function score, product of:
        1 match filter: template:template:negative boosted template
        0.25 f( -- constant weight -- ) = 0.25
      0.2 function score, product of:
        1 match filter: namespace:{14 100 106}
        0.2 f( -- constant weight -- ) = 0.2
    3.4028235E+38 maxBoost

I think the reason this scores so poorly is that the category redirect template applies template:negative boosted template. Without that de-boost it would have scored ~72, putting it well ahead of the next highest result with ~31.
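
The arithmetic behind the ~72 figure can be checked with a small sketch. The numbers are copied from the explain output above; the helper function is hypothetical and only mirrors the structure shown there (text score multiplied by the product of the function-score factors, capped at maxBoost).

```python
# Values copied from the explain output above.
TEXT_SCORE = 360.06012        # "Sum of the following"
TEMPLATE_DEBOOST = 0.25       # negative boosted template
NAMESPACE_WEIGHT = 0.2        # namespace:{14 100 106}
MAX_BOOST = 3.4028235e38      # maxBoost


def final_score(text_score, factors, max_boost=MAX_BOOST):
    """Multiply the function-score factors, cap at max_boost, scale the text score."""
    boost = 1.0
    for factor in factors:
        boost *= factor
    return text_score * min(boost, max_boost)


# Score as reported: both the template de-boost and the namespace weight apply.
with_deboost = final_score(TEXT_SCORE, [TEMPLATE_DEBOOST, NAMESPACE_WEIGHT])

# Score if only the namespace weight applied.
without_deboost = final_score(TEXT_SCORE, [NAMESPACE_WEIGHT])

print(round(with_deboost, 3), round(without_deboost, 2))
```

The first value reproduces the reported 18.003006 (360.06012 x 0.05); the second is 360.06012 x 0.2, which is where the ~72 estimate comes from.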

I guess the open question is, should an exact title match somehow go around deboosts?

Intuitively - yes, I think so. Is there any way to confirm that with data?

We have a tool that allows us to run large batches of queries under multiple search profiles and then compare stats about the result sets, things like queries with new results in top 3, queries that see no change in results, queries that reshuffle the top few results, etc. It generates a report that additionally includes query/result samples of different classes of change. We haven't used it in several years though, so I imagine it will take at least a day or two of work to get it up and running again with whatever has changed in those years. If we want to evaluate a significant increase in the weight of the all_near_match fields that is probably the direction to go.
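
To make the kind of comparison concrete, here is a minimal sketch of classifying per-query changes between two profiles. It is not the tool mentioned above, whose interface is not described in this ticket; the function names, the change classes, and the page ids are all hypothetical.

```python
from collections import Counter


def classify(baseline, candidate, k=3):
    """Label how the top-k results for one query changed between two profiles."""
    base_top, cand_top = baseline[:k], candidate[:k]
    if base_top == cand_top:
        return "unchanged"
    if set(base_top) == set(cand_top):
        return "reshuffled"
    return "new_in_top_k"


def summarize(runs, k=3):
    """Count change classes; runs maps query -> (baseline ids, candidate ids)."""
    return Counter(classify(b, c, k) for b, c in runs.values())


# Hypothetical page ids, for illustration only.
runs = {
    "roses": ([1, 2, 3, 4], [9, 1, 2, 3]),
    "dog breeds": ([5, 6, 7], [6, 5, 7]),
    "camel": ([8, 10, 11], [8, 10, 11]),
}
print(summarize(runs))
```

A real evaluation would run a much larger query sample and also keep example queries for each class, as the report described above does.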

Ok, grand ... ought I create a separate ticket for that?

Yeah, let's create a separate ticket, as it will likely involve a few days' work.

Is the related wish resolved?

I believe so, but as with many things, the users requesting the change would have a better idea than we do of whether their actual goals were met. The change was AB tested, and analysis of the AB test concluded: "The test treatment showed either improvement or no-change across all metrics. This change should be rolled out to all users of commonswiki." As such, we deployed the configuration change on Dec 1.

Full summary and link to AB test report can be found at: https://phabricator.wikimedia.org/T408154#11421087