ULS search results don't make sense
Closed, Resolved | Public | 1 Estimated Story Points | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

Screenshot:

Bildschirmfoto_2024-03-01_11-13-39.png (366×847 px, 45 KB)

Many of the results, including the entire first column, don't match at all. "de" does not seem to appear anywhere in the language code or the language name (either in the interface language or in the native name). Some of them don't contain any "d"s or "e"s at all, e.g. the top result tzm, "Central Atlas Tamazight", displayed as "ⵜⴰⵎⴰⵣⵉⵖⵜ", has no "d".

The best matches (those where the language code or language name starts with "de") are nowhere near the top. de-at and de-ch are not visible at all without scrolling.

What should have happened instead?:

The search results should make sense. The user should be able to understand why something is considered a match. The best matches should be at the top.

I would expect something like:

  • de, "German", "Deutsch" (exact match for language code, language name starts with the search string)
  • de-formal, "German (formal address)", "Deutsch (Sie-Form)" (exact match for the first part of the language code, language name starts with the search string)
  • de-at, "Austrian German", "Österreichisches Deutsch" (exact match for the first part of the language code, a word in the language name starts with the search string)
  • de-ch, "Swiss High German", "Schweizer Hochdeutsch" (exact match for the first part of the language code, language name contains the search string in the middle of a word)
  • pdc, "Pennsylvania German", "Deitsch" (language name starts with the search string)
  • cbk-zam, "Chavacano", "Chavacano de Zamboanga" (a word in the language name matches the search string)
  • nds, "Low German", "Nedersaksisch" (language name contains the search string in the middle of a word)

The other 41 results have no apparent connection to what I searched for.
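The ordering proposed above can be sketched as a small ranking function. This is only an illustration of the expectation in this report, not the algorithm ULS uses; the `Language` class, the rank tiers, and the tie-break by language code are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    code: str
    name: str     # name in the interface language
    autonym: str  # native name


def match_rank(query, lang):
    """Return a rank (lower is better), or None if nothing matches."""
    q = query.casefold()
    code = lang.code.casefold()
    names = [lang.name.casefold(), lang.autonym.casefold()]

    if code == q:
        return 0  # exact language code
    if code.split("-")[0] == q:
        return 1  # exact first part of the language code
    if any(n.startswith(q) for n in names):
        return 2  # a name starts with the search string
    if any(w.startswith(q) for n in names for w in n.split()):
        return 3  # a word inside a name starts with the search string
    if q in code or any(q in n for n in names):
        return 4  # search string appears in the middle of a word
    return None


def search(query, languages):
    """Return matching language codes, best matches first."""
    hits = [(match_rank(query, lang), lang.code) for lang in languages]
    return [code for rank, code in sorted(h for h in hits if h[0] is not None)]


# Sample data taken from the list in this report.
LANGUAGES = [
    Language("tzm", "Central Atlas Tamazight", "ⵜⴰⵎⴰⵣⵉⵖⵜ"),
    Language("nds", "Low German", "Nedersaksisch"),
    Language("cbk-zam", "Chavacano", "Chavacano de Zamboanga"),
    Language("pdc", "Pennsylvania German", "Deitsch"),
    Language("de-ch", "Swiss High German", "Schweizer Hochdeutsch"),
    Language("de-at", "Austrian German", "Österreichisches Deutsch"),
    Language("de-formal", "German (formal address)", "Deutsch (Sie-Form)"),
    Language("de", "German", "Deutsch"),
]

print(search("de", LANGUAGES))
```

With this sample data the call prints `['de', 'de-at', 'de-ch', 'de-formal', 'pdc', 'cbk-zam', 'nds']`, and tzm is not returned at all. The three de-* variants share a rank and are ordered by code here; ordering them by how well the name matches, as in the list above, would need a finer tie-break.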

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline

It's correct according to the current algorithm. It's just not very useful ;)

The likely reason for that is that "de" appears in names of many languages in many languages, for example "amazighe de l’Atlas central" in French. See https://www.unicode.org/cldr/charts/44/by_type/locale_display_names.languages__a-d_.html#346e71a8ed4de6f8

Perhaps we could make a special case for "de", given that these are the first letters of a language that people very frequently search for, and that they appear in many names of unrelated languages in a way that makes the results not great. We already made some simplistic customizations and shortcuts in the search index, such as adding a "Castilian" alias (this name doesn't appear anywhere in the CLDR, but lots of people use it to search for Spanish), so making a special case for "de" sounds possible, too.

You can see the matches in the API request: https://meta.wikimedia.org/w/api.php?action=languagesearch&format=jsonfm&formatversion=2&search=de
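For scripted checks, the same request can be built programmatically. The sketch below reuses the endpoint and parameters from the URL above, except that it asks for `format=json` (the machine-readable variant) instead of `jsonfm`. The helper names are made up for this example, and the structure of the response is not described in this task, so the fetch helper returns the parsed JSON as-is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://meta.wikimedia.org/w/api.php"


def languagesearch_url(query, endpoint=API_ENDPOINT):
    """Build a languagesearch API URL for the given search string."""
    params = {
        "action": "languagesearch",
        "format": "json",
        "formatversion": "2",
        "search": query,
    }
    return f"{endpoint}?{urlencode(params)}"


def fetch_matches(query):
    """Fetch and parse the API response (needs network access)."""
    with urlopen(languagesearch_url(query)) as response:
        return json.load(response)


print(languagesearch_url("de"))
```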

I'd suggest that we avoid creating substring matches in the database for "de".

Change 1009466 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/UniversalLanguageSelector@master] Language name search: don't match on short infix terms

https://gerrit.wikimedia.org/r/1009466

Change #1009466 merged by jenkins-bot:

[mediawiki/extensions/UniversalLanguageSelector@master] Language name search: don't match on short infix terms

https://gerrit.wikimedia.org/r/1009466
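One possible reading of "don't match on short infix terms" is sketched below. This is not the code from the patch: the function names and the length threshold of 3 are assumptions for illustration. The idea is that a name stays searchable by its beginning, while words in the middle of a name are only indexed if they are long enough.

```python
MIN_INFIX_LENGTH = 3  # hypothetical threshold, not taken from the patch


def index_terms(name, min_infix_length=MIN_INFIX_LENGTH):
    """Return the terms under which a language name can be found by prefix."""
    folded = name.casefold()
    terms = {folded}  # the whole name is always searchable from its start
    for word in folded.split()[1:]:
        # Skip short words inside the name, such as "de" or "of".
        if len(word) >= min_infix_length:
            terms.add(word)
    return terms


def matches(query, name, min_infix_length=MIN_INFIX_LENGTH):
    """True if the query is a prefix of any indexed term of the name."""
    q = query.casefold()
    return any(t.startswith(q) for t in index_terms(name, min_infix_length))


print(matches("de", "amazighe de l’Atlas central"))  # short infix term skipped
print(matches("de", "Deutsch"))                      # prefix of the name
```

Under these assumptions "de" no longer finds "amazighe de l’Atlas central" but still finds "Deutsch". A side effect is that "Chavacano de Zamboanga" would also stop matching "de".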

> It's correct according to the current algorithm. It's just not very useful ;)
>
> The likely reason for that is that "de" appears in names of many languages in many languages, for example "amazighe de l’Atlas central" in French. See https://www.unicode.org/cldr/charts/44/by_type/locale_display_names.languages__a-d_.html#346e71a8ed4de6f8

It's searching all other languages too? That seems bizarre. The interface isn't in those languages, nor is the browser, they're not fallbacks for the current language or the target language either, they're not the autonyms, and those names aren't even being shown.

I could maybe understand searching all languages if less likely matches were further down and the matched text were shown, but right now it apparently thinks I'm more likely to be searching for the Catalan name of Minnan with the interface in English and the browser in German, than I would be searching for German by its language code or its autonym.

The link isn't working for me any more, but when I originally looked at it, it was a slight improvement (there were fewer matches), but the matches still didn't make much sense overall.

> The link isn't working for me any more, but when I originally looked at it, it was a slight improvement (there were fewer matches), but the matches still didn't make much sense overall.

Fixed. There were 86 matches before, and after Niklas' patch there are 55 matches.

> It's searching all other languages too? That seems bizarre. The interface isn't in those languages, nor is the browser, they're not fallbacks for the current language or the target language either, they're not the autonyms, and those names aren't even being shown.

With the language selection we wanted to provide flexibility when searching for your language. For example, a user landing on a Wikipedia article in a language they do not speak (e.g., following a link on social media or in search results) may want to check if the content is available in a language they know. As we explore the next iteration of the language selector (T287860), we can consider polishing the approach, but some level of cross-language search may still be needed.

There is potential for further improvement (for example, filtering out even more articles and other common terms that match in the middle of language names, potentially automatically with a threshold), but that would require more effort for diminishing benefits.
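The automatic threshold mentioned above could, for instance, be based on how often a word occurs in the middle of language names. The following is a rough sketch of that idea only; the function name and the `max_share` cutoff are hypothetical, and nothing like this is confirmed to exist in ULS.

```python
from collections import Counter


def frequent_infix_terms(names, max_share=0.05):
    """Return words that occur mid-name in more than max_share of all names."""
    counts = Counter()
    for name in names:
        words = name.casefold().split()
        # Count each word once per name, ignoring the first word.
        counts.update(set(words[1:]))
    limit = max_share * len(names)
    return {term for term, count in counts.items() if count > limit}


sample = [
    "amazighe de l’Atlas central",
    "alemán de Suiza",
    "Chavacano de Zamboanga",
    "Deutsch",
]
print(frequent_infix_terms(sample, max_share=0.5))
```

In this tiny sample only "de" exceeds the cutoff, so it would be the only term excluded from infix matching. A real cutoff would have to be tuned against the full set of CLDR names.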