
If " Category:<search keyword>+s " exists, make it appear first in search results
Open, Medium, Public, Feature

Description

Feature summary
try https://commons.wikimedia.org/wiki/Special:Search or https://commons.wikimedia.org/wiki/Special:MediaSearch , both have the same problem.

try searching " typewriter " or " nurse ", etc.

for Special:Search , untick file namespace and/or gallery.
for MediaSearch , click "Categories and Pages".

you will have a hard time finding Category:Typewriters or Category:Nurses . many other cats appear before them.

Solution
whatever the keyword is, " Category:<plural form> " is usually where files about that keyword are categorised. the plural form is often the -s form.

i guess "search results sorted by relevance" are sorted by some kind of weight? if " Category:<search keyword>+s " exists, increase that category's weight so it appears as close to the top as possible.
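
A minimal sketch of the proposed bump, in Python. Everything here is hypothetical: the (title, score) pairs, the scores themselves, and the `boost` factor are invented for illustration, and "exists" is approximated by "is present in the result list". This is not how the search backend actually computes relevance.

```python
def bump_plural_category(results, keyword, boost=2.0):
    """Re-rank (title, score) pairs, multiplying the score of
    'Category:<keyword>s' by `boost` if that title is among the results."""
    target = f"Category:{keyword.strip()}s".casefold()
    rescored = [
        (title, score * boost if title.casefold() == target else score)
        for title, score in results
    ]
    # sorted() is stable, so titles with equal scores keep their original order.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores, made up for this example only.
results = [
    ("Category:Royal typewriters", 10.0),
    ("Category:IBM typewriters", 9.5),
    ("Category:Typewriters", 6.0),
]
reranked = bump_plural_category(results, "typewriter")
```

With these made-up numbers the plural category moves ahead of the brand categories, and a query with no matching "Category:<keyword>s" leaves the order untouched.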

Benefits
users want to go to the category. give them that. don't waste users' time.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Note that in both cases you get 384 results—they are the same results. Ranking is the problem. We prefer exact matches, so, in the typewriter(s) case, there are lots of individual typewriter brands/models that are ranked higher than the plural category. Other categories address this by having redirects from the singular to the plural, so Category:Dog is a soft redirect to Category:Dogs. (On the other hand, Category:Cat and Category:Cats are both disambiguation pages, but also have images on them anyway.)

The +s rule is also much too simple. It doesn't work for plenty of words—most words that end in s, z, sh, ch, or y, names, and most words in many languages that are not English. It can also generate (not even plural) false positives like bas (as in the name, or bas-relief) → bass, or Fungu (an island) → fungus. Also, on Commons, we shouldn't just solve this problem for English.
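
To make the failure modes concrete, here is a small Python sketch of the +s rule. The `existing_categories` set is a hypothetical stand-in for "does this category exist?" and was not checked against Commons.

```python
def naive_plural(word):
    """The '+s' rule from the task description: just append 's'."""
    return word + "s"


# Hypothetical stand-in for a category-existence check.
existing_categories = {"Typewriters", "Nurses", "Bass", "Fungus", "Churches", "Cities"}


def plus_s_category(word):
    """Return the '+s' category title if it exists in the stand-in set, else None."""
    candidate = naive_plural(word).capitalize()
    return candidate if candidate in existing_categories else None


hits = {word: plus_s_category(word) for word in ["typewriter", "bas", "Fungu", "church", "city"]}
```

In this sketch "typewriter" finds its plural, "bas" and "Fungu" land on unrelated categories (false positives), and "church" and "city" find nothing because their plurals are not formed with a bare -s.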

Off the top of my head, some possible better approaches might be:

  • Some sort of lightweight alias system so Category:Typewriters can match Category:Typewriter, Category:Machine à écrire, Category:Schreibmaschine, etc. without a billion redirects. Look at Category:Felis_silvestris_catus (the real category for cats, the animals)—there are a ton of "vernacular names" there, and they should ideally all match "Category:<x>" by being specified on the page, as long as there isn't another existing category with that name.
    • MediaSearch does match for simple nouns like this because it makes a round trip to Wikidata, so importing automatically from Wikidata when possible could also help.
  • Some sort of boost on categories or maybe titles in general (and possibly not just on Commons) when the entire query matches the entire category/title name.
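
A rough Python sketch of how the alias idea and the whole-query match could fit together. The alias table, the function names, and the normalization (case folding plus whitespace collapsing) are all assumptions made for illustration; nothing here reflects an existing MediaWiki feature.

```python
def normalize(text):
    """Case-fold and collapse whitespace so 'SCHREIBMASCHINE ' matches 'Schreibmaschine'."""
    return " ".join(text.casefold().split())


# Hypothetical alias table: names a category page could declare for itself.
ALIASES = {
    "Category:Typewriters": ["Typewriter", "Machine à écrire", "Schreibmaschine"],
}


def build_alias_index(aliases):
    """Map each normalized name (the category's own name plus its aliases) to the category."""
    index = {}
    for category, names in aliases.items():
        own_name = category.split(":", 1)[1]
        for name in [own_name, *names]:
            # setdefault keeps the first category that claims a name; a real system
            # would also have to refuse aliases that collide with existing categories.
            index.setdefault(normalize(name), category)
    return index


def whole_query_match(query, index):
    """Return a category only when the entire query equals an entire name or alias."""
    return index.get(normalize(query))


index = build_alias_index(ALIASES)
```

A result returned by `whole_query_match` would be the candidate for a ranking boost; partial matches such as "royal typewriters" deliberately return nothing.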

Note: I'm thinking mostly about Special:Search here; I haven't thought about MediaSearch as carefully, but some of the same issues apply, I'm sure.

so, in the typewriter(s) case, there are lots of individual typewriter brands/models that are ranked higher than the plural category

Category:Royal typewriters
Category:IBM typewriters
Category:Index typewriters
Category:Japanese typewriters
Category:Typewriters in the Musée des Arts et Métiers
Category:Blickensderfer typewriters
Category:Chinese typewriters
Category:Hermes typewriters
Category:Sholes & Glidden typewriters
Category:Erika typewriters
Category:Mignon typewriters

all these cats with the plural form appear before Category:Typewriters . none of them is a better "exact match" than Category:Typewriters.

commons mostly doesn't deal with non-English plural forms.

bumping the weight/rank is not the same as sending it to the top. i imagine a bit more weight given to "Category:Typewriters" is enough to send it over all other "xx typewriters". as long as it appears close to the top, it's a good enough solution. right now it's on the second page, so the user has to page through, change the search keyword to the plural, or do some other manoeuvres to get to the right category.

on the other hand, category redirects should not appear in search results. T317586

if you try "nurse", it's even worse. https://commons.wikimedia.org/w/index.php?title=Special:Search&limit=100&offset=0&ns14=1&search=nurse

Category:Nurses is the 102nd result.

Category:Euphemia Steele Innes (category 20th-century nurses)
Category:Laura Cobb (category Nurses from the United States)
Category:Sister Dora (category Nurses from England)
...

many of these cats appear only because their wikitext contains the word "nurses" (in a category name), yet they rank above Category:Nurses. is there any justification for this poor algorithm?

also, my idea only bumps 1 page.

if my idea is implemented, the worst case is 1 false positive in the first 20 results, for a minority of queries.

in the current situation, for the majority of these queries the plural category does not appear in the first 20 results at all.

try testing these words https://www.talkenglish.com/vocabulary/top-1500-nouns.aspx , and see where "Category:<noun>s" appears for those that have an -s plural.
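
The test suggested above could be scripted roughly like this. It is only a sketch: `search_url` builds a standard MediaWiki `list=search` request restricted to namespace 14 (Category), and `plural_category_rank` works on whatever ordered list of titles comes back. No request is actually sent here, and the sample titles are made up.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def search_url(noun, limit=50):
    """Build a search API URL for `noun`, limited to the Category namespace."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": noun,
        "srnamespace": 14,  # Category namespace, same as ns14=1 in Special:Search
        "srlimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def plural_category_rank(noun, titles):
    """Return the 1-based position of 'Category:<noun>s' in `titles`, or None."""
    target = f"Category:{noun}s".casefold()
    for position, title in enumerate(titles, start=1):
        if title.casefold() == target:
            return position
    return None


# Made-up result order, for illustration only.
sample_titles = ["Category:Sister Dora", "Category:Laura Cobb", "Category:Nurses"]
rank = plural_category_rank("nurse", sample_titles)
```

Running this over a noun list and counting how often the rank is None or greater than 20 would give a number for how common the problem is.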

another idea would be,

afaik, for MediaSearch, given a <keyword>, it not only searches that keyword verbatim, but also the labels and aliases in all languages of wikidata items that match <keyword>. i'm not sure if that's the same for Special:Search.

since you're already doing that, you can bump the commons sitelinks or https://www.wikidata.org/wiki/Property:P373 (Commons category) of the matched wikidata items.
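
A small Python sketch of that lookup, assuming the matched items are available as entity dicts in the usual Wikidata JSON layout (`claims` keyed by property, `sitelinks` keyed by site). The sample entity below is hand-written for illustration and is not a real item.

```python
def commons_categories(entity):
    """Collect Commons category titles from one Wikidata entity dict,
    using P373 values and a commonswiki sitelink that points at a category."""
    found = []
    for claim in entity.get("claims", {}).get("P373", []):
        # Claims without a value (novalue/somevalue) have no datavalue; skip them.
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, str):
            found.append(f"Category:{value}")
    sitelink = entity.get("sitelinks", {}).get("commonswiki", {}).get("title", "")
    if sitelink.startswith("Category:"):
        found.append(sitelink)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


# Hand-written sample entity, not fetched from Wikidata.
sample_entity = {
    "claims": {
        "P373": [{"mainsnak": {"datavalue": {"value": "Typewriters", "type": "string"}}}]
    },
    "sitelinks": {"commonswiki": {"site": "commonswiki", "title": "Category:Typewriters"}},
}
to_bump = commons_categories(sample_entity)
```

The titles returned this way would be the ones to give extra weight in the category results.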

MPhamWMF triaged this task as Medium priority. Apr 10 2023, 3:29 PM
MPhamWMF moved this task from needs triage to Feature Requests on the Discovery-Search board.

the algorithm probably has some serious problems.

https://commons.wikimedia.org/w/index.php?search=intitle:%22Republic+of+China+Military+Academy%22

as i'm writing, an exact match "Category:Republic of China Military Academy" appears as the last result.