Increase the near match weight
Closed, Resolved · Public

Description

Seen in T196165.
Since we switched to BM25 and dropped the coord factor, queries with a lot of words may score higher than a perfect full title match.
The near match weight is currently configured at 2; we should bump it to something around 10 so that this can't happen.
Config var is CirrusSearchNearMatchWeight.
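
To illustrate the problem described above, here is a minimal toy sketch. It assumes a purely additive score (a sum of per-term scores plus a flat near-match bonus), which is a simplification and not CirrusSearch's actual scoring. The function name `total_score` and all the numbers are hypothetical.

```python
def total_score(term_scores, is_near_match, near_match_weight):
    """Toy additive score: per-term scores plus a flat near-match bonus."""
    bonus = near_match_weight if is_near_match else 0.0
    return sum(term_scores) + bonus


# Hypothetical per-term scores for a six-word query (illustrative numbers only).
exact_title_doc = [1.0] * 6  # short page whose title equals the query
wordy_doc = [1.6] * 6        # long page that repeats every query word

for weight in (2, 10):
    exact = total_score(exact_title_doc, True, weight)
    wordy = total_score(wordy_doc, False, weight)
    winner = "exact title" if exact > wordy else "wordy page"
    print(f"weight={weight}: exact={exact:.1f} wordy={wordy:.1f} -> {winner}")
```

With these made-up numbers the wordy page wins at a weight of 2 (8.0 vs 9.6) and the exact title wins at a weight of 10 (16.0 vs 9.6). Without a coord factor to penalise partial matches, the per-term sum grows with query length while the near-match bonus stays fixed.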

Event Timeline

Restricted Application added a subscriber: Aklapper.

Some test cases from T196165 are below. These were searched for on Feb 13, 2020 on the Commons Special:Search page without the File: namespace prefix. API search links are also provided for some.

Ideally, all of these should be in 1st place after the change, but if they make it into the top 10 they would be caught by the P18 search change made in T196165. (Eglise Notre Dame de l'Assomption.JPG has so many exact matches and near matches that I wouldn't be shocked if it didn't make it into the top 10.)

EBernhardson moved this task from needs triage to Current work on the Discovery-Search board.

For the test set above, a weight of 10 seems to be sufficient. 5 of the above queries are "fixed" at that weight. At 5 we only fix 4, and at 15 the only changes are past the top 5 results. Ran a commonswiki query sample with a near match weight of 10:

Query Count: 14503
Zero Results Rate: 0.6%
Poorly Performing Percentage: 6.8%
Top 1 Unsorted Results Differ: 4.2%
Top 3 Sorted Results Differ: 10.1%
Top 3 Unsorted Results Differ: 7.6%
Top 5 Sorted Results Differ: 15.9%
Top 5 Unsorted Results Differ: 9.5%
Top 20 Sorted Results Differ: 49.4%
Top 20 Unsorted Results Differ: 15.7%
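
As a rough sketch of how the "Sorted" and "Unsorted" numbers above could be computed, assume "sorted" means the top N lists differ in content or order, and "unsorted" means they differ as sets. That reading is an assumption, not a confirmed description of the tool's internals. The function names are hypothetical.

```python
def top_n_differs(baseline, candidate, n, ordered):
    """Return True if the top-n results differ between two result lists."""
    a, b = baseline[:n], candidate[:n]
    if ordered:
        return a != b          # order matters: any reordering counts
    return set(a) != set(b)    # order ignored: only membership changes count


def percent_differing(pairs, n, ordered):
    """Percentage of (baseline, candidate) pairs whose top-n results differ."""
    if not pairs:
        return 0.0
    changed = sum(top_n_differs(b, c, n, ordered) for b, c in pairs)
    return 100.0 * changed / len(pairs)


# Hypothetical result lists for four queries: (weight 2, weight 10).
pairs = [
    (["a", "b", "c"], ["b", "a", "c"]),  # reordered only
    (["a", "b", "c"], ["a", "b", "d"]),  # one new result
    (["a", "b", "c"], ["a", "b", "c"]),  # unchanged
    (["x", "y"], ["x", "y"]),            # unchanged
]
print(percent_differing(pairs, 3, ordered=True))
print(percent_differing(pairs, 3, ordered=False))
```

Under this reading the sorted figure can never be lower than the unsorted one, and a large gap between them (as in the top 20 numbers) points to reordering rather than new results entering the list.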

Looking over the changes in top 1/3/5, the main new things being brought up are exact category matches. There are also a variety of other minor reorderings, but in a review they don't seem particularly material. Unfortunately our data labeling process doesn't work on commonswiki (or at least, it hasn't been verified and likely needs updates). We also don't have any click-based metrics like MAP, MRR, etc. in relforge_engine_score. We should perhaps consider working up integration for RRE instead of using relforge_engine_score, but that's out of scope for this ticket.
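
For reference, a click-based metric such as MRR could be sketched as below. This uses the standard definition (the mean of 1/rank of the first clicked result, counting 0 when nothing was clicked). The session format and function names are hypothetical and not taken from relforge_engine_score or RRE.

```python
def reciprocal_rank(results, clicked):
    """1/rank of the first clicked result, or 0.0 if none was clicked."""
    for rank, page in enumerate(results, start=1):
        if page in clicked:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(sessions):
    """Average reciprocal rank over (results, clicked_pages) sessions."""
    if not sessions:
        return 0.0
    return sum(reciprocal_rank(r, c) for r, c in sessions) / len(sessions)


# Hypothetical sessions: ranked results plus the set of clicked pages.
sessions = [
    (["a", "b", "c"], {"a"}),  # click on rank 1
    (["a", "b", "c"], {"c"}),  # click on rank 3
    (["a", "b"], set()),       # no click
]
print(mean_reciprocal_rank(sessions))
```

A metric like this needs click or label data per query, which is exactly what is missing for commonswiki here.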

Bumping the weight doesn't seem particularly dangerous. We don't know if bringing the full category matches higher up is useful. In theory it should be, but categories are incomplete and putting them up high may encourage users to not look at the search results. We don't have anything that says which is better, so let's run with it on commonswiki for now; we can evaluate making it a CirrusSearch default sometime in the future.

Change 580394 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] cirrus: Increase commonswiki near match weight

https://gerrit.wikimedia.org/r/580394

Sounds good!

> We don't know if bringing the full category matches higher up is useful. In theory it should be, but categories are incomplete and putting them up high may encourage users to not look at the search results.

As a user of commons as it currently is, I like category matches. My search goal often is to find a likely-looking image and then look through its categories for more, similar images.

I know that data is not the plural of anecdote, but I'm hopeful that it will help, and that at least it won't hurt.

Change 580394 merged by jenkins-bot:
[operations/mediawiki-config@master] cirrus: Increase commonswiki near match weight

https://gerrit.wikimedia.org/r/580394

Mentioned in SAL (#wikimedia-operations) [2020-04-06T11:17:32Z] <awight@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:580394|cirrus: Increase commonswiki near match weight (T245642)]] (duration: 00m 59s)