Adapt search ranking for mul language code
Closed, Resolved · Public · 5 Estimated Story Points

Description

As an editor, I want important entities ranked higher in my search even if their number of labels and aliases is reduced by the use of the mul language code.

Problem:
We are rolling out the new mul language code. For details see T285156. This will help significantly reduce the number of labels and aliases in Wikidata and by extension reduce the amount of data in the Wikidata Query Service.

One side effect of removing a lot of labels from an Item is that it drops in the search ranking compared to Items that have more labels.

We do want to encourage more use of the mul language code, so we should probably adapt the ranking to take into account that important Items might now have only a few labels.

Screenshots:

image.png (530×344 px, 52 KB)

A search for Douglas Adams after many labels have been removed from Q42.

Acceptance criteria:

  • Search result ranking is not made significantly worse by collapsing a lot of labels of an Item into one with the mul language code.

Open questions:

  • Should we generally decrease the importance of the number of labels for ranking or do something special for mul?

Event Timeline

Restricted Application added a subscriber: Aklapper.

Should we generally decrease the importance of the number of labels for ranking or do something special for mul?

I wonder if it would make sense to pretend that items with a mul label also have a label in every other language, at least for ranking purposes. (We probably don’t want to pretend this for haslabel: search purposes… or do we?)

But this might also introduce a different unwanted bias in the search results. mul labels are more suited for some types of items than others (see also the help page) – if we were to score all items with mul labels as if they had hundreds of labels, that might bias the search results in favor of e.g. people and against e.g. creative works.
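To make the concern above concrete, here is a minimal Python sketch of the "pretend mul covers every language" idea. The function name, the `known_languages` list, and the example items are all hypothetical and are not taken from WikibaseCirrusSearch; the sketch only shows how such a rule could inflate the label count of items that happen to suit mul.

```python
def effective_label_count(labels, known_languages, mul_counts_for_all=True):
    """Return the label count a ranking signal might see for one item.

    labels: dict mapping language code -> label text.
    known_languages: all language codes the site supports (assumed input).
    """
    if mul_counts_for_all and "mul" in labels:
        # Hypothetical rule: a mul label counts as a label in every language.
        return len(known_languages)
    # Otherwise count only the explicit per-language labels.
    return sum(1 for code in labels if code != "mul")


# Illustrative data only: 300 made-up language codes.
known_languages = [f"lang{i}" for i in range(300)]
person = {"mul": "Some Person"}                                # name, suits mul
creative_work = {f"lang{i}": f"Title {i}" for i in range(40)}  # translated titles

person_count = effective_label_count(person, known_languages)
work_count = effective_label_count(creative_work, known_languages)
```

Under this assumed rule the person item is scored as if it had 300 labels while the creative work keeps its 40, which is the kind of bias described above.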

Should we generally decrease the importance of the number of labels for ranking or do something special for mul?

I'm not sure how much of a correlation there really is between number of labels and whether it's what someone is looking for. I would expect things with lots of labels to generally either be things with lots of sitelinks (because bots add missing labels when there's a sitelink) or things where bots can easily copy labels from one language to lots of others. For the former, you can look at the number of sitelinks directly. The latter only reflects how automatable the labels are.

I imagine the most important things would be how well the search term matches the label or an alias in the current language (or one of the fallback languages, including mul), how much usage it has (sitelinks, backlinks, entity usage - the more often it's already been used, the more likely it is that someone will want to find/use it again), and whether it makes sense for the current context (when using P407 (language of work or name) I expect "pt" to find Portuguese, when using P17 (country) I expect "pt" to find Portugal, and when selecting a unit I expect "pt" to find things like pint and point).

Also, the search results have been awful for a long time when the UI is set to British English (T334563). mul will probably have the same issues.

dr0ptp4kt set the point value for this task to 5. · Aug 5 2024, 3:45 PM

It looks like some Items that only have a mul label are ranked so low that they're not showing up in search at all.

e.g. The search term 'Casey Szilvia' doesn't show the Item although it exists: Casey Szilvia (Q128347219)

The mul labels and descriptions (can we have mul descriptions?) are currently not indexed, which explains to some degree why search is behaving poorly on these items. We'll index those and see how it performs; tuning search might come as a separate step.
If I'm not mistaken, mul is considered a fallback for all languages, so it should always be queried.

mul only applies to labels and aliases (not descriptions) and serves as the default label and aliases for any language that does not have its own. This will replace the repetition of labels and aliases across languages for names etc., which means it should always be queried.
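The fallback behaviour described above can be sketched roughly as follows. This is an illustration in Python, not the Wikibase implementation; `resolve_label`, its parameters, and the sample labels are hypothetical.

```python
def resolve_label(labels, language, fallback_chain=()):
    """Pick the label to show or match for `language`.

    labels: dict mapping language code -> label text.
    fallback_chain: other language codes to try before mul (assumed input).
    Returns (language_code, label), or None if nothing matches.
    """
    # Try the requested language, then its fallbacks, and always mul last.
    for code in (language, *fallback_chain, "mul"):
        if code in labels:
            return code, labels[code]
    return None


# Illustrative item: one explicit label plus a mul default.
labels = {"mul": "Douglas Adams", "xx": "Localized Name"}

german = resolve_label(labels, "de")     # no "de" label, falls back to mul
localized = resolve_label(labels, "xx")  # explicit label wins over mul
```

Because mul sits at the end of every chain in this sketch, an item with only a mul label is still found for any request language, which is why the field has to be indexed and queried.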

Change #1060429 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Add support for mul

https://gerrit.wikimedia.org/r/1060429

Change #1060430 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] search: index stems for mul labels

https://gerrit.wikimedia.org/r/1060430

Change #1060433 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] search: use the stem field when search mul labels

https://gerrit.wikimedia.org/r/1060433

The procedure should be:

Regarding fallbacks, WikibaseCirrusSearch relies on \Wikibase\Lib\TermLanguageFallbackChain::getFetchLanguageCodes. The order in which these languages are returned is quite important, as the weight attributed to a match is inversely proportional to the position of its language in this array.
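As a rough illustration of why the order matters, the Python sketch below assigns each language code a weight of `base_weight / position`. This reads "inversely proportional" literally and is an assumption; the exact formula and configuration names used by WikibaseCirrusSearch may differ, and `fallback_weights` and the example chain are hypothetical.

```python
def fallback_weights(language_codes, base_weight=1.0):
    """Map each language code to a weight that shrinks with its position.

    language_codes: ordered list, most preferred language first.
    Assumed formula: weight = base_weight / (1-based position).
    """
    return {
        code: base_weight / position
        for position, code in enumerate(language_codes, start=1)
    }


# Example chain for a British English UI, with mul as the last fallback.
chain = ["en-gb", "en", "mul"]
weights = fallback_weights(chain)
```

With this chain, a match on the en-gb label scores highest and a match on the mul label lowest, so where mul is placed in the returned array directly affects how mul-only items rank.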

Change #1060449 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] search: use mul fallback for manually-tuned search profiles

https://gerrit.wikimedia.org/r/1060449

Change #1060430 merged by jenkins-bot:

[operations/mediawiki-config@master] search: index stems for mul labels

https://gerrit.wikimedia.org/r/1060430

Mentioned in SAL (#wikimedia-operations) [2024-08-08T07:02:30Z] <dcausse@deploy1003> Started scap sync-world: Backport for [[gerrit:1060430|search: index stems for mul labels (T371401)]]

Mentioned in SAL (#wikimedia-operations) [2024-08-08T07:04:41Z] <dcausse@deploy1003> dcausse: Backport for [[gerrit:1060430|search: index stems for mul labels (T371401)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-08-08T07:11:34Z] <dcausse@deploy1003> Finished scap: Backport for [[gerrit:1060430|search: index stems for mul labels (T371401)]] (duration: 09m 03s)

Mentioned in SAL (#wikimedia-operations) [2024-08-08T07:32:03Z] <dcausse> T371401: reindexing testwikidatawiki to index mul labels

Mentioned in SAL (#wikimedia-operations) [2024-08-08T08:30:04Z] <dcausse> T371401: reindexing wikidatawiki@eqiad to index mul labels

Mentioned in SAL (#wikimedia-operations) [2024-08-08T12:22:31Z] <dcausse> T371401: reindexing wikidatawiki@codfw to index mul labels

Current status:

https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1060433 appears not to be needed in the end; I don't see where we use these manually tuned profiles.

After some review it turns out the code that used the language-tuned profiles was lost as part of splitting WikibaseCirrusSearch out of the Wikibase repo. All the related machinery still exists and it would be pretty easy to add it back in now, but I wonder if we should be doing testing of some sort to verify those profiles are better than the defaults we've been using. They were at the time, but user behaviour can change in the 5 years that have passed.

Agreed, we should definitely re-add the logic that allows switching the profile based on the context language.

Change #1070331 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseCirrusSearch@master] Re-introduce per-language profiles

https://gerrit.wikimedia.org/r/1070331

Change #1070331 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Re-introduce per-language profiles

https://gerrit.wikimedia.org/r/1070331

Change #1060449 merged by jenkins-bot:

[operations/mediawiki-config@master] search: use mul fallback for fine-tuned search profiles

https://gerrit.wikimedia.org/r/1060449

Mentioned in SAL (#wikimedia-operations) [2024-09-05T13:24:50Z] <hashar@deploy1003> Started scap sync-world: Backport for [[gerrit:1060449|search: use mul fallback for fine-tuned search profiles (T371401)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-05T13:26:53Z] <hashar@deploy1003> hashar, dcausse: Backport for [[gerrit:1060449|search: use mul fallback for fine-tuned search profiles (T371401)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-05T13:31:19Z] <hashar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1060449|search: use mul fallback for fine-tuned search profiles (T371401)]] (duration: 06m 28s)

Change #1060433 merged by jenkins-bot:

[operations/mediawiki-config@master] search: use the stem field when searching mul labels

https://gerrit.wikimedia.org/r/1060433

Mentioned in SAL (#wikimedia-operations) [2024-09-10T07:03:15Z] <dcausse@deploy1003> Started scap sync-world: Backport for [[gerrit:1060433|search: use the stem field when searching mul labels (T371401)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-10T07:10:40Z] <dcausse@deploy1003> dcausse: Backport for [[gerrit:1060433|search: use the stem field when searching mul labels (T371401)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

I think all patches have been merged and most of them deployed, except one that should get deployed via the train tomorrow for group1 (re-enable fine-tuning per language).

Mentioned in SAL (#wikimedia-operations) [2024-09-10T07:20:38Z] <dcausse@deploy1003> Finished scap sync-world: Backport for [[gerrit:1060433|search: use the stem field when searching mul labels (T371401)]] (duration: 17m 22s)

Change #1060429 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add tests for mul

https://gerrit.wikimedia.org/r/1060429