Increase the near match weight
Closed, Resolved · Public

Assigned To

Authored By

	dcausse
	Feb 19 2020, 4:53 PM

Description

Seen in T196165.
Since we switched to BM25 and dropped the coord factor, for queries with a lot of words other results may score higher than a perfect full title match.
The near match weight is currently configured at 2; we should bump it to something around 10 so that this can't happen.
The config var is CirrusSearchNearMatchWeight.
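
A toy illustration of the problem, not CirrusSearch's actual query or real BM25 output: assuming per-term scores simply add up once there is no coord factor, a document that matches many query words can outscore the exact title match unless the near match clause carries enough weight. The additive model, the function names, and every number below are hypothetical.

```python
# Toy model only: the additive scoring and all numbers are assumptions,
# chosen to illustrate the effect, not taken from CirrusSearch.

def total_score(term_scores, near_match_score, near_match_weight):
    """Sum the per-term scores and add the weighted near match clause."""
    return sum(term_scores) + near_match_weight * near_match_score

# Hypothetical 8-word query.
# Document A is the exact title match but has little other matching text.
# Document B is not a title match, yet every query word matches strongly.
doc_a_terms = [1.0] * 8
doc_b_terms = [3.5] * 8
doc_a_near, doc_b_near = 3.0, 0.0

def exact_match_wins(weight):
    """True if the exact title match (A) outscores document B."""
    a = total_score(doc_a_terms, doc_a_near, weight)
    b = total_score(doc_b_terms, doc_b_near, weight)
    return a > b

print(exact_match_wins(2))   # False: 14.0 vs 28.0
print(exact_match_wins(10))  # True: 38.0 vs 28.0
```

The longer the query, the larger the summed term scores can get, which is why a fixed weight of 2 is easier to overwhelm than a weight of 10.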

Details

	Subject	Repo	Branch	Lines +/-
	cirrus: Increase commonswiki near match weight	operations/mediawiki-config	master	+5 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		EBernhardson	T245642 Increase the near match weight
		Resolved		EBernhardson	T247363 Update relforge_relevance package for python 3.x

Event Timeline

dcausse created this task. Feb 19 2020, 4:53 PM

Restricted Application added a project: Discovery-Search. Feb 19 2020, 4:53 PM

Restricted Application added a subscriber: Aklapper.

TJones mentioned this in T196165: Commons image: when pasting the exact title, get the correct file first in the suggester. Feb 19 2020, 7:43 PM

Some test cases from T196165 are below. These were searched for on Feb 13, 2020 on the Commons Special:Search page without the File: namespace prefix. The prefix on each line is the rank at which the exact title match appeared in the results. API search links are also provided for some.

1st: Benouville Churchyard -9.JPG
1st: Rue des grands carmes 5 -9.jpg
1st: Gmunden Kammerhofgasse 3 Arkadenhof -9173.jpg
1st: Dug-out Zonnebeke -12.jpg
2nd: Bourges - avenue du 95e-de-Ligne - Portail Saint-Ursin -991.jpg
5th: Lannes (Lot-et-Garonne) - Église Sainte-Marie - Vitraux -9.JPG
6th: Steyr Michaelerkirche Bürgerspital -9659.jpg
10th: Barcelonnette - Villa du Parc du Mercantour -984.jpg
34th: Eglise Notre Dame de l'Assomption.JPG

Ideally, all of these should be in 1st place after the change, but if they make it into the top 10 they would be caught by the P18 search change made in T196165. (Eglise Notre Dame de l'Assomption.JPG has so many exact matches and near matches that I wouldn't be shocked if it didn't make it into the top 10.)
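
A minimal sketch of how the ranks above could be re-checked through the MediaWiki search API rather than by hand. The helper names, the User-Agent string, and the choice to search only the File: namespace are assumptions for illustration; the original checks were done on Special:Search, whose default namespaces may differ.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"

def search_titles(query, namespaces="6", limit=50):
    """Return result titles in ranked order (namespace 6 is File:)."""
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srnamespace": namespaces,  # assumption: File: namespace only
        "srlimit": limit,
        "format": "json",
    })
    request = urllib.request.Request(
        f"{API_URL}?{params}",
        headers={"User-Agent": "near-match-rank-check/0.1 (example script)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [hit["title"] for hit in data["query"]["search"]]

def rank_of(expected_title, titles):
    """1-based rank of expected_title in titles, or None if it is absent."""
    for position, title in enumerate(titles, start=1):
        if title == expected_title:
            return position
    return None

def in_top(expected_title, titles, n=10):
    """True if expected_title appears within the first n titles."""
    rank = rank_of(expected_title, titles)
    return rank is not None and rank <= n
```

For example, `rank_of("File:Benouville Churchyard -9.JPG", search_titles("Benouville Churchyard -9.JPG"))` would report where the exact match currently lands; results change over time, so the ranks listed above are only a snapshot.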

EBernhardson triaged this task as Medium priority. Mar 3 2020, 6:44 PM

EBernhardson moved this task from needs triage to Current work on the Discovery-Search board.

EBernhardson edited projects, added Discovery-Search (Current work); removed Discovery-Search.

EBernhardson moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board. Mar 4 2020, 9:16 PM

For the test set above, a weight of 10 seems to be sufficient. 5 of the above queries are "fixed" at that weight. At a weight of 5 we only fix 4, and at 15 the only changes are past the top 5 results. I ran a commonswiki query sample with a near match weight of 10:

Query Count: 14503
Zero Results Rate: 0.6%
Poorly Performing Percentage: 6.8%
Top 1 Unsorted Results Differ: 4.2%
Top 3 Sorted Results Differ: 10.1%
Top 3 Unsorted Results Differ: 7.6%
Top 5 Sorted Results Differ: 15.9%
Top 5 Unsorted Results Differ: 9.5%
Top 20 Sorted Results Differ: 49.4%
Top 20 Unsorted Results Differ: 15.7%
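
One plausible reading of the "Sorted" and "Unsorted" rows, sketched below: "sorted" counts a query as changed when the ordered top-N lists differ, while "unsorted" counts it only when the set of top-N results differs. This is an assumption based on the metric names, not relforge's actual code, and the function name and sample data are hypothetical.

```python
def top_n_differ_rates(baseline, candidate, n):
    """Return (sorted_rate, unsorted_rate) for the top-n results.

    baseline and candidate map each query to its ranked list of result ids.
    """
    if not baseline:
        raise ValueError("baseline must contain at least one query")
    sorted_diff = 0
    unsorted_diff = 0
    for query, base_results in baseline.items():
        base_top = base_results[:n]
        cand_top = candidate.get(query, [])[:n]
        if base_top != cand_top:            # order-sensitive comparison
            sorted_diff += 1
        if set(base_top) != set(cand_top):  # membership-only comparison
            unsorted_diff += 1
    total = len(baseline)
    return sorted_diff / total, unsorted_diff / total

# Hypothetical sample: q2 is only reordered, q3 gains a new result.
baseline = {"q1": [1, 2, 3], "q2": [4, 5, 6], "q3": [7, 8, 9], "q4": [1, 2, 3]}
candidate = {"q1": [1, 2, 3], "q2": [5, 4, 6], "q3": [7, 8, 10], "q4": [1, 2, 3]}

print(top_n_differ_rates(baseline, candidate, 3))  # (0.5, 0.25)
```

Under this reading the sorted rate can never be lower than the unsorted rate, which matches the report: the gap at top 20 (49.4% vs 15.7%) would mean most changes are reorderings rather than new results.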

Looking over the changes in top 1/3/5, the main new items being surfaced are exact category matches. There are also a variety of other minor reorderings, but on review they don't seem particularly material. Unfortunately our data labeling process doesn't work on commonswiki (or at least, hasn't been verified and likely needs updates). We also don't have any click-based metrics like MAP, MRR, etc. in relforge_engine_score. We should perhaps consider working up integration for RRE instead of using relforge_engine_score, but that's out of scope for this ticket.

Bumping the weight doesn't seem particularly dangerous. We don't know if bringing the full category matches higher up is useful. In theory it should be, but categories are incomplete and putting them up high may encourage users to not look at the search results. We don't have anything that says which is better, so let's run with it on commonswiki for now; we can evaluate making it a CirrusSearch default sometime in the future.

EBernhardson claimed this task. Mar 17 2020, 5:48 PM

Change 580394 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] cirrus: Increase commonswiki near match weight

https://gerrit.wikimedia.org/r/580394

gerritbot added a project: Patch-For-Review. Mar 17 2020, 5:51 PM

EBernhardson moved this task from Ready for Dev -- SWE to Needs review on the Discovery-Search (Current work) board. Mar 17 2020, 5:52 PM

Sounds good!

"We don't know if bringing the full category matches higher up is useful. In theory it should be, but categories are incomplete and putting them up high may encourage users to not look at the search results."

As a user of Commons as it currently is, I like category matches. My search goal is often to find a likely-looking image and then look through its categories for more, similar images.

I know that data is not the plural of anecdote, but I'm hopeful that it will help, and that at least it won't hurt.

TJones closed subtask T247363: Update relforge_relevance package for python 3.x as Resolved. Mar 18 2020, 2:46 PM

Change 580394 merged by jenkins-bot:
[operations/mediawiki-config@master] cirrus: Increase commonswiki near match weight

https://gerrit.wikimedia.org/r/580394

Mentioned in SAL (#wikimedia-operations) [2020-04-06T11:17:32Z] <awight@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:580394|cirrus: Increase commonswiki near match weight (T245642)]] (duration: 00m 59s)

Maintenance_bot removed a project: Patch-For-Review. Apr 6 2020, 12:11 PM

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Apr 13 2020, 11:09 PM

Gehel closed this task as Resolved. Apr 20 2020, 5:30 PM

dcausse mentioned this in T257922: Search does not return exact title match. Jul 14 2020, 1:19 PM
