
Strike a decent balance between fulltext matches & statement matches
Closed, Resolved (Public)

Description

fulltext matches & statement matches should both influence "how relevant" a file is for a search term, and we should try to find a good balance where having matching depicts statements significantly impacts the score, without overpowering the fulltext scores.

Short example:

  • a file with depicts:cat is probably a better match than a file with only a mention of cat somewhere in the description ("Cat" could be the photographer's first name)
  • a file with multiple mentions of cat all over the place (title, description, caption, ...) but not depicts:cat, is probably more relevant still than one that only has depicts:cat

Things that make this hard:

  • there is no consistency in scores across searches: it all depends on the frequency of search terms within the documents individually and as a whole
  • there is no consistency in full text scores: a search term consisting of multiple words will lead to bigger scores
  • there is no consistency between full text & statement scores: while full text scores will grow with more terms, a statement is always just 1 term
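To make the imbalance above concrete, here is a toy illustration (not CirrusSearch's actual scoring; the per-term numbers are invented) of how a multi-word fulltext query accumulates score per matching term, while a statement match is always worth a single term:

```python
# Toy illustration: fulltext scores grow with the number of matching
# query terms, while a statement match is always a single term.
# The per-term scores below are invented for the example.

def fulltext_score(per_term_scores):
    # Lucene-style bool queries sum the scores of matching clauses.
    return sum(per_term_scores)

# A 3-word query ("golden gate bridge"), each word matching:
multi_word = fulltext_score([2.0, 1.5, 2.5])   # 6.0

# A 1-word query ("cat"):
single_word = fulltext_score([2.0])            # 2.0

# A depicts statement is one term, so its raw contribution stays flat
# regardless of how many words the query has:
statement = 2.5
```

The longer the query, the further the fulltext score can run away from the statement score, even when per-term relevance is comparable.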

I believe we need to figure out a way to normalize full text scores & statement scores to a similar baseline (though we can't use either score as a baseline, because documents may exist where only fulltext matches, or only statements match).
We can weight specific fields/queries relative to one another, but that doesn't help much until we contain their range (which varies based on search term input, which is beyond our control).
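One possible way to bring both onto a similar baseline (a sketch only, not necessarily what any patch here implements; the constant `k` is an invented tuning knob) is a saturation function that maps any non-negative raw score into [0, 1), so fulltext and statement scores share a bounded range before weighting:

```python
def saturate(raw_score, k=1.0):
    """Map a raw score in [0, inf) into [0, 1).

    k is the raw score at which the output reaches 0.5; it would have
    to be tuned separately for fulltext and statement scores.
    """
    return raw_score / (raw_score + k)

# Wildly different raw ranges end up comparable after saturation:
modest_statement = saturate(0.8)
huge_fulltext = saturate(40.0)   # bounded below 1.0, no matter how big
```

The trade-off is exactly the flattening discussed later in this ticket: large raw differences get compressed near the top of the range.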

More reading: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/619985/3#message-1992242de1d2c0ef862b47cb74aa2b4e0e9d0ff3

Event Timeline

Change 621518 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Normalize statements & fulltext scores relative-ish to eachother

https://gerrit.wikimedia.org/r/621518

Change 626357 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Turn fulltext signals back into bool query instead of dis_max

https://gerrit.wikimedia.org/r/626357
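For context on what the change above means: in Elasticsearch, a `bool` query's matching `should` clauses sum their scores, while `dis_max` keeps only the best clause's score (plus `tie_breaker` times each of the others). A minimal sketch of the two combination rules (the clause scores are invented):

```python
def bool_should_score(clause_scores):
    # bool/should: every matching clause contributes additively.
    return sum(clause_scores)

def dis_max_score(clause_scores, tie_breaker=0.0):
    # dis_max: take the best clause, add tie_breaker * each other clause.
    best = max(clause_scores)
    rest = sum(clause_scores) - best
    return best + tie_breaker * rest

scores = [3.0, 2.0, 1.0]  # e.g. title, description, caption matches
summed = bool_should_score(scores)       # 6.0
best_only = dis_max_score(scores)        # 3.0
blended = dis_max_score(scores, 0.5)     # 4.5
```

Going back to `bool` means a file matching in several fields accumulates score from all of them, instead of being represented by its single best field.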

Change 626358 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Don't phrase rescore media search queries

https://gerrit.wikimedia.org/r/626358
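For background on the patch above: an Elasticsearch phrase rescore re-runs a phrase query against the top N hits and combines both scores (with the default `score_mode: total`, the weighted scores are simply added). A rough sketch of that combination, which the patch disables for media search (the weights and scores here are illustrative):

```python
def rescored(original, phrase, query_weight=1.0, rescore_query_weight=1.0):
    # score_mode "total": weighted original score + weighted phrase score.
    return query_weight * original + rescore_query_weight * phrase

# A doc whose text happens to contain the exact phrase gets a large bump:
with_phrase = rescored(4.0, 6.0)   # 10.0
# A doc without the phrase keeps only its original score (phrase = 0):
without_phrase = rescored(4.0, 0.0)   # 4.0
```

For media search that bump mostly rewarded incidental exact-phrase occurrences in free-form text, which is why it was dropped.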

Change 626357 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Turn fulltext signals back into bool query instead of dis_max

https://gerrit.wikimedia.org/r/626357

Change 626358 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Don't phrase rescore media search queries

https://gerrit.wikimedia.org/r/626358

Here's another idea - multiply the statement_keywords boost by the number of words in the Q-item

Note: not the number of words, since non-Latin languages tokenize differently and certain words (stopwords) are omitted; we need the tokens from elastic.
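To illustrate the distinction (the stopword list below is a tiny invented stand-in for what an Elasticsearch analyzer actually applies, and CJK text would diverge even further, since whitespace splitting doesn't segment it at all):

```python
STOPWORDS = {"the", "a", "an", "of", "on", "in"}  # invented subset

def naive_word_count(text):
    # What a PHP-side whitespace split would see.
    return len(text.split())

def analyzed_token_count(text):
    # Roughly what elastic's analyzer yields: lowercased, stopwords dropped.
    return len([t for t in text.lower().split() if t not in STOPWORDS])

phrase = "The Golden Gate Bridge"
naive = naive_word_count(phrase)        # 4
analyzed = analyzed_token_count(phrase) # 3: "the" is stripped
```

Since the boost multiplier depends on this count, PHP and elastic disagreeing on it means the correction is applied inconsistently.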

Change 628080 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Increase statements boost based on amount of tokens

https://gerrit.wikimedia.org/r/628080

Change 628301 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Replace naive PHP-based token length statement score correction

https://gerrit.wikimedia.org/r/628301

Quick status update: we have 3 patches that will help the situation.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/621518
This one solves the imbalance by flattening the scores, making sure one cannot score drastically higher than the other.
The downside is that it loses detail in the scores: a result with a poor fulltext score + a mediocre statement score is even more likely to beat a result with a terrific fulltext score (but no statement).
Statement scores are pretty flat already, though (the same statement will have the same score for every item it's applied to), so it's not a massive problem, but it does complicate things.
This is not the preferred fix IMO, but might still be worth considering if we don't manage to tweak scores to compensate for significant outliers.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/628080
This one solves the imbalance by multiplying the statement score by a factor based on the number of tokens in the search term.
It'll essentially boost statement scores more the longer the search term becomes, because that's the effect length already has on fulltext scores (multiple tokens each count towards the score).
Downsides of this implementation are:
#1. the number of tokens is counted in PHP, which differs from what actually happens in elastic (where stopwords are stripped, etc.)
#2. the impact of each additional term on the score is approximate: it's based on an average over many popular search terms, but some terms deviate significantly from that average
(this might end up not being much of an issue in practice, though: the longer the search term becomes, the less likely it is to find matching statements, and when it does, it's probably a highly relevant one)
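The multiplier can be sketched like this (the growth factor below is an invented placeholder for the empirically fitted average described above):

```python
PER_TOKEN_GROWTH = 1.5  # invented; the real factor was fitted on popular search terms

def boosted_statement_score(statement_score, token_count):
    # Scale the flat statement score up as the search term gains tokens,
    # mirroring how fulltext scores naturally grow with more terms.
    return statement_score * (1 + PER_TOKEN_GROWTH * (token_count - 1))

single = boosted_statement_score(2.0, 1)  # 2.0: one token, unchanged
triple = boosted_statement_score(2.0, 3)  # 8.0: three tokens, scaled up
```

With the scaling in place, a matching statement stays competitive with fulltext even for long queries, instead of being a fixed-size contribution.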

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/628301
Essentially the same solution as above, except that it's done in elastic rather than PHP, so we drop downside #1.

Change 628080 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Increase statements boost based on amount of tokens

https://gerrit.wikimedia.org/r/628080

Change 621518 abandoned by Matthias Mullie:
[mediawiki/extensions/WikibaseMediaInfo@master] Normalize statements & fulltext scores relative-ish to eachother

Reason:
Abandoning in favor of alternative approaches (where statement scores are boosted relative to amount of terms) - can always get back to this if needed, but likely won't be

https://gerrit.wikimedia.org/r/621518

Change 628301 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Replace naive PHP-based token length statement score correction

https://gerrit.wikimedia.org/r/628301

@matthiasmullie looks like all patches are merged now, can this be moved from Code Review? Or does the abandoned patch need to be replaced still?

I believe that https://gerrit.wikimedia.org/r/628301 has not yet been deployed, and I'd like to make sure it works :)

Urgh - I confused "move out of code review" with "close the ticket."
Of course it should be moved :)

Change 639569 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Adjust normalization

https://gerrit.wikimedia.org/r/639569

Change 639569 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Adjust normalization

https://gerrit.wikimedia.org/r/639569

Checked in wmf.18.

Notes:
(1) "pika" - returns File:Encuentro Latinoamericano de Escritores "Redes de Nuevos Escritores, entrelanzando culturas"9.png where the only match for "pika" is a part of a user name. @matthiasmullie - your patch was supposed to eliminate such matches?

(2)

  • "Golden Gate bridge" seems to return only good matches
  • "pika ballet" does not return results, which is correct, I suppose.

Also, it'd be great if there were some specific test cases for verifying improvement of search matches. If there are any ideas/advice, I'd be happy to add them to my test cases above.

Sadly, "pika" being picked up when it's part of a username (that's part of the content somehow) is still to be expected. Given that much of the content is free-form, we can't really distinguish between useful information & irrelevant stuff in there, so it'll continue to be found.
The patches in this ticket were mostly about normalizing the scores for input so that, at the very least, we are better able to tune how much specific fields contribute to the score: really long search terms with multiple words had a tendency to rack up massive scores, while statements never could (so text matches would always overpower statement matches, even when the text was barely relevant).

Just checked and the code is in place and doing what it's supposed to do!
But it's only a part of improving the results - next part is figuring out how relevant each field is, and tweaking how much it contributes to the score.

As for that last question - sadly, we don't know any quick way to test how much things are improving. It's hard because:

  • we (humans) evaluate the image, but the search engine can only evaluate the metadata, and a good image might not have good metadata or vice versa (which makes it hard to build a good set of "this image is objectively better than that image" when the available data might not reflect that)
  • it's easy for specific examples to be skewed because of very specific reasons, and improving the situation for 1 term can end up making things worse for another

Checking specific searches is very useful for spotting "mistakes", but not so much for evaluating slight overall increases in search relevance; for that we need a lot of data.
So yeah, sadly, I don't have any ideas/advice here...

Change 647008 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Fix normalization factor

https://gerrit.wikimedia.org/r/647008

Change 647008 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Fix normalization factor

https://gerrit.wikimedia.org/r/647008