
Search on commons for a language without stemming misses the `description` field
Closed, Resolved | Public | BUG REPORT

Description

If I search for "dog" on commons here is the query I get

Notice that one of the match queries is on description.en with boost=0.019

If I search for "chó" (the Vietnamese for dog) on commons, here is the query I get

Notice that there is no match query on description.vi, because we don't have stemming for Vietnamese. description.vi.plain is present in the query, but its boost is set to zero.

This means that search will be less accurate for languages that have no stemming (e.g. Vietnamese, Cebuano, Bengali), because the description field contributes nothing to the score.

Proposed fix:

  • if the boost for a language-aware non-stemming field is zero AND there is no stemmed version of the field, then set its boost to the equivalent value for the stemming field
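The rule above could be sketched roughly as follows. This is only an illustration in Python, not the extension's actual code: the function name `fill_plain_boosts`, the `stemmed_reference` mapping and the dict-based representation of boosts are all assumptions made for the example.

```python
def fill_plain_boosts(boosts, stemmed_reference):
    """Apply the proposed fallback to a mapping of field name -> boost.

    boosts: e.g. {"description.vi.plain": 0.0} (hypothetical layout)
    stemmed_reference: boost used for the stemmed field of a base field,
        e.g. {"description": 0.019} (hypothetical layout)
    Returns a new dict; the input is not modified.
    """
    adjusted = dict(boosts)
    for field, boost in boosts.items():
        # Only zero-boost ".plain" (non-stemming) fields are candidates.
        if not field.endswith(".plain") or boost != 0:
            continue
        stemmed_field = field[: -len(".plain")]  # e.g. "description.vi"
        # If a stemmed version of the field exists, leave the boost alone.
        if stemmed_field in boosts:
            continue
        base = stemmed_field.split(".", 1)[0]  # e.g. "description"
        if base in stemmed_reference:
            adjusted[field] = stemmed_reference[base]
    return adjusted


example_boosts = {
    "description.en": 0.019,
    "description.en.plain": 0.0,
    "description.vi.plain": 0.0,  # no "description.vi": no Vietnamese stemming
}
example_result = fill_plain_boosts(example_boosts, {"description": 0.019})
```

With the example input, only description.vi.plain is changed: description.en.plain keeps its zero boost because the stemmed description.en field exists.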

Event Timeline

blocked by https://phabricator.wikimedia.org/T280368

The best way to do this might actually be to re-do the logistic regression using only the non-stemmed fields ... needs some discussion

@matthiasmullie

Or ... better still - use a dismax of a field and its plain version when creating the query
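As a rough sketch of the dismax idea, the query for one field could wrap the stemmed and plain variants in a single Elasticsearch `dis_max` clause that carries the boost, so the score comes from whichever variant exists and matches best. The helper name `dismax_field_query` and its parameters are hypothetical, and this is not necessarily how the merged patch builds the query.

```python
def dismax_field_query(term, base_field, lang, boost, has_stemmed_field):
    """Build a dis_max clause over a field and its .plain version.

    The returned dict follows the Elasticsearch query DSL; everything
    else (names, parameters) is assumed for illustration.
    """
    queries = []
    if has_stemmed_field:
        # Stemmed field, e.g. "description.en"
        queries.append({"match": {f"{base_field}.{lang}": {"query": term}}})
    # Plain (non-stemming) field, e.g. "description.vi.plain"
    queries.append({"match": {f"{base_field}.{lang}.plain": {"query": term}}})
    # The boost sits on the dis_max clause, so it applies even when
    # only the plain field is available for the language.
    return {"dis_max": {"queries": queries, "boost": boost}}


vi_query = dismax_field_query("chó", "description", "vi", 0.019, has_stemmed_field=False)
en_query = dismax_field_query("dog", "description", "en", 0.019, has_stemmed_field=True)
```

Putting the boost on the wrapper means a language without stemming no longer depends on the plain field having its own non-zero boost.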

Change 710060 had a related patch set uploaded (by Cparle; author: Cparle):

[mediawiki/extensions/WikibaseMediaInfo@master] Deal with boosts on stemmed fields for non-stemmed languages

https://gerrit.wikimedia.org/r/710060

Change 710060 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@master] Deal with boosts on stemmed fields for non-stemmed languages

https://gerrit.wikimedia.org/r/710060

Etonkovidova subscribed.

Checked on commons wmf.1
The search for "chó":

  • the boost values match in all corresponding fields for search for dog and for chó
  • descriptions.vi.plain is present with boost 0.01932
  • descriptions.en is present with boost 0.015649

The top search results seem to be unaffected:

  • commons wmf.23: Screen Shot 2021-09-16 at 5.23.12 PM.png
  • commons wmf.1: Screen Shot 2021-09-23 at 11.46.58 AM.png