
Search stemming for SDC depicts item searches
Open, Needs Triage, Public, Feature

Description

Note: Technically, this could apply to all item-type inputs for claims in Wikibase, but I suspect there is a good reason why it is not implemented universally. However, I think that in the specific Commons user story described below, this is a very valid need.

Feature summary (what you would like to be able to do and where):

When entering search terms in the SDC "depicts" box on Wikimedia Commons, I should be able to enter different word forms of the same concept and still see the most relevant results.

Even if results may vary, I should never be wholly unable to select the correct "depicts" value simply because I am searching on a plural form of a noun, for example.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):

  1. Enter map in the input box for "depicts" in Commons, and view the list of the suggested item values.
  2. Enter maps and view its corresponding list of the suggested item values.
  3. Observe that (1) there is no overlap in the set of suggestions, and that (2) the likely desired value "map" (Q4006) is suggested only for the map input, while users who are searching the plural noun maps will not be able to add this "depicts" statement at all.

Here are screenshots demonstrating the two searches:

Screen Shot 2022-04-26 at 1.56.53 PM.png (1×1 px, 197 KB)
Screen Shot 2022-04-26 at 2.03.37 PM.png (1×1 px, 228 KB)

Benefits (why should this be implemented?):

As a Commons user, I am being asked to use the "depicts" statements essentially as social tagging. I may not be familiar with SDC as a feature, how Wikidata items work, or concepts related to controlled vocabulary. Indeed, SuggestedTags, UploadWizard, and other interface elements suggest that unfamiliar users are being driven to engage with "depicts" this way. As a result, there are many reasons I might enter maps as above while expecting to be able to find "map" (Q4006), such as:

  • I am probably more familiar with Wikimedia Commons categories, which are typically plural nouns (e.g. Category:Maps).
  • I might expect that entering a plural noun (or another inflection, such as an adjective or gerund word form) will return the same relevant suggestions as the singular form, or lemma, since this is how I have been conditioned by general search.
  • I might be describing an image in which there are multiple instances of the depicted entity, and so I default to searching for the plural form, not being familiar with how entities work in Wikidata.
  • I might also assume this is the preferred format, since, anecdotally, many folksonomies encourage users to tag with plural nouns as a best practice (e.g., see common Flickr tags).
  • Finally, since using plural forms for countable nouns is the standard approach in controlled vocabularies for subject headers (such as in library catalogs), I might even overthink and assume this is how Wikidata items would be named. (Or alternatively, maybe I have source metadata with plural subject headings I am copying and pasting into the search.)

Aside from preventing a user from finding the item value they are looking for, this might also lead to bad data, where a user searches on a word form like maps, sees a result that feels "close enough", and decides to go with it. In the example that inspired this request, even I, as a very experienced user, was almost tricked into selecting "cartography" (Q42515), even though this item is about the study and not the object, because I thought that might be how Wikidata models the concept. The only reason the maps search gives such a close (but erroneous) match is that the string maps does not occur in the label or description for Q4006 (and Wikidata items do not, as a rule, list all possible inflected word forms as aliases), while maps does show up in the description for Q42515. If stemming were applied to the maps search, I could have selected the correct result even if both "map" and "cartography" were shown, because I could easily choose the right one when seeing them together.
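To make the mismatch concrete, here is a toy sketch of the behavior described above. It is not how SDC or CirrusSearch actually matches terms: the in-memory index, the description strings, and the `naive_stem` helper are all illustrative assumptions, and a real implementation would use a proper language analyzer.

```python
import re

# Toy records. The description strings are illustrative placeholders,
# not the actual Wikidata descriptions for these items.
ITEMS = {
    "Q4006": {"label": "map", "description": "visual representation of an area"},
    "Q42515": {"label": "cartography", "description": "study and practice of making maps"},
}


def naive_stem(token: str) -> str:
    """Very rough plural stripping, standing in for a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def tokens(item: dict) -> list[str]:
    """Lowercased word tokens from an item's label and description."""
    return re.findall(r"[a-z]+", f"{item['label']} {item['description']}".lower())


def search(term: str, stem: bool = False) -> list[str]:
    """Return item IDs whose label or description contains the term as a token.

    With stem=True, both the query and the indexed tokens are stemmed first.
    """
    norm = naive_stem if stem else (lambda t: t)
    wanted = norm(term.lower())
    return sorted(
        qid for qid, item in ITEMS.items()
        if wanted in {norm(t) for t in tokens(item)}
    )
```

In this toy model, `search("map")` and `search("maps")` return disjoint results, while `search("maps", stem=True)` returns both items, so the user can pick the right one when seeing them together.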

Event Timeline

As an aside, the results in the example match the results given with a wbsearchentities request on Wikidata:

I don't know if this is the method SDC is using on the backend, but if this suggestion is implemented, it would also be nice as a general approach if there were a parameter in wbsearchentities to let you apply search stemming in the API search, even if this is disabled by default.
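For anyone who wants to reproduce the comparison against the API, here is a minimal sketch using only the Python standard library. It assumes the public Wikidata endpoint and the standard `wbsearchentities` parameters (`search`, `language`, `type`, `limit`, `format`); the User-Agent string is a placeholder, and the helper names are my own. No stemming parameter is shown because, as far as I know, none exists yet.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_url(term: str, language: str = "en", limit: int = 10) -> str:
    """Build a wbsearchentities request URL for an item search."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def fetch_ids(term: str) -> list[str]:
    """Fetch the item IDs suggested for a search term (needs network access)."""
    request = urllib.request.Request(
        build_url(term),
        # Placeholder User-Agent; replace with real contact details.
        headers={"User-Agent": "stemming-comparison-sketch/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [hit["id"] for hit in payload.get("search", [])]


def overlap(ids_a: list[str], ids_b: list[str]) -> set[str]:
    """Item IDs that appear in both suggestion lists."""
    return set(ids_a) & set(ids_b)
```

Calling `overlap(fetch_ids("map"), fetch_ids("maps"))` shows which suggestions, if any, the two word forms share. Live results may change over time.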