
ULS search completion for Indonesian vary on number of letters typed
Closed, Resolved, Public

Description

Steps to reproduce:

  1. Go to https://en.wikipedia.org/wiki/Novi_Sad
    1. Observed: language link to Bahasa Indonesia appears in the sidebar.
  2. Enable Compact Language Links
  3. Go to https://en.wikipedia.org/wiki/Novi_Sad
  4. Click "66 more" (or other count) in the "other languages" interlanguage interwiki list of the sidebar.
  5. Type "i" in the search bar.
    1. Expected and observed: Links to Interlingue, Italiano and Bahasa Indonesia appear.
  6. Continue typing "in" in the search bar.
    1. Expected: Links to Interlingue and Bahasa Indonesia appear.
    2. Observed: Only the link to Interlingue appears; suggested language in the text box (for tab-completion) is Interlingue.
  7. Continue typing "ind" in the search bar.
    1. Expected: Link to Bahasa Indonesia should appear; suggested language in the text box should be Bahasa Indonesia.
    2. Observed: The link to Bahasa Indonesia does reappear; however, the suggested language in the text box is "indoneyzcha".

Summary:

  • Weird behavior 1: Indonesian appears when only "i" is typed in, disappears when "in" is typed in, reappears when "ind" is typed in.
  • Weird behavior 2: Name of Indonesian is not suggested in Indonesian or English (project language) but in Uzbek.

Note: the problem also exists with Basa Jawa, but NOT with Bahasa Melayu.

ULS-i.png (385×497 px, 30 KB)

ULS-in.png (379×499 px, 27 KB)

ULS-ind.png (389×490 px, 28 KB)

Event Timeline

I haven't checked, but I think this is probably because the ULS switches to the backend search API at three letters. The backend API searches in all languages etc., while the initial search only does simple prefix matching, I believe.

For 7B there should already be a bug report but I can't find it.
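
The prefix-matching hypothesis above can be illustrated with a small sketch. This is not ULS code: the data, the function name prefix_search, and in particular the idea that language codes are matched alongside names are assumptions made for illustration only. Under those assumptions the sketch reproduces steps 5 and 6: "i" finds Bahasa Indonesia through its code "id", while "in" matches neither the code nor the start of the name.

```python
# Toy data: (language code, autonym). Hypothetical, not ULS's real structures.
LANGUAGES = [
    ("ie", "Interlingue"),
    ("it", "Italiano"),
    ("id", "Bahasa Indonesia"),
    ("jv", "Basa Jawa"),
]


def prefix_search(query, languages=LANGUAGES):
    """Return autonyms whose code or full name starts with the query.

    Assumption: the client-side search tests the code and the whole name
    against a start-of-string prefix, and nothing else.
    """
    q = query.casefold()
    return [
        name
        for code, name in languages
        if code.startswith(q) or name.casefold().startswith(q)
    ]


print(prefix_search("i"))
print(prefix_search("in"))
print(prefix_search("ind"))
```

With "ind" the local prefix match finds nothing, which is the point where, per the hypothesis, a backend search would take over and return names from other languages.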

Nikerabbit removed a project: ULS-CompactLinks.

I removed ULS-CompactLinks because this applies to all ULS instances as far as I know. It also does not prevent language selection; it is just inconsistent and annoying behavior.

Nemo_bis renamed this task from Weird Compact Language Links behavior with Indonesian to ULS search completion for Indonesian vary on number of letters typed. Apr 7 2016, 8:28 AM
Nemo_bis added a subscriber: Nemo_bis.

I guess this could be solved if all languages with multi-word names had the individual words as aliases, so that they are matched by prefix. I guess Bahasa Melayu already works this way.

Also, the suggested language should probably be shown in the project language, if it matches.
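
The alias idea above could look roughly like the following sketch. The helper names word_aliases and alias_search are hypothetical and the name list is toy data; this only illustrates the proposal and is not how ULS stores or searches its language data.

```python
def word_aliases(name):
    """Return the lower-cased full name plus, for multi-word names, each word."""
    full = name.casefold()
    words = full.split()
    return [full] + (words if len(words) > 1 else [])


def alias_search(query, names):
    """Return names for which any alias starts with the query."""
    q = query.casefold()
    return [n for n in names if any(a.startswith(q) for a in word_aliases(n))]


NAMES = ["Interlingue", "Italiano", "Bahasa Indonesia", "Basa Jawa"]

print(alias_search("in", NAMES))
print(alias_search("jawa", NAMES))
```

With word aliases in place, "in" matches both Interlingue and Bahasa Indonesia, which is the result expected in step 6 of the description.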

It seems that "indoneyzcha" was added to langnames.ser in 2014 (7e86d2b63d7cb2bbbdde33216eb3cda57ca50ceb); there is no such string in UniversalLanguageSelector/lib/jquery.uls/src/jquery.uls.data.js, but there is one in the CLDR extension, in CldrNames/CldrNamesUz.php. I can't tell whether the data is correct.

On what to search vs. what to output, cf. T59133#598586.

I tested this for some other languages with multi-word names, with article https://en.wikipedia.org/wiki/Indonesia

  • Bahasa Banjar - works well, completion is given as "banjar dili"
  • Basa Banyumasan - can't find it at all; search for "ngapak" (local name) doesn't work either
  • Bahasa Indonesia - works as described
  • Basa Jawa - works like Indonesian; however, it is also listed twice, under Asia and Pacific (Jawa is not in the Pacific)
  • Bahasa Melayu - works well
  • Baso Minangkabau - works well, completion is given as "minangkabaučina"
  • Basa Sunda - works well, completion is given as "sundancha"
  • Fiji Hindi - works well, completion is given as "hindcha"
  • La .lojban. - works well

Have you already checked whether these language codes are present in CLDR and whether the language-territory information is correct?
http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html

7B is not a big deal; see T158816.

7A is a bug, though not very high priority, because "ind", etc. can be found. I nevertheless listed it at T178996.

I also listed Banyumasan at T178996.

BTW, @Nikerabbit and I found a satisfactory solution for this at Wikimania 2016: in a certain regular expression, instead of searching for the substring at the beginning of a language name, simply search for it at a word boundary. Not sure why it is not implemented.
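
A minimal sketch of the difference between the two anchors, assuming the search is a case-insensitive regular expression built from the typed text. The function names are hypothetical and the actual ULS expression is not shown in this task, so this only illustrates the idea of replacing a start-of-string anchor with a word boundary.

```python
import re


def matches_at_start(query, name):
    """True if the name begins with the query (start-of-string anchor)."""
    return re.search("^" + re.escape(query), name, re.IGNORECASE) is not None


def matches_at_word_boundary(query, name):
    """True if the query begins any word of the name (word-boundary anchor)."""
    return re.search(r"\b" + re.escape(query), name, re.IGNORECASE) is not None


for name in ["Interlingue", "Bahasa Indonesia", "La .lojban."]:
    print(name, matches_at_start("in", name), matches_at_word_boundary("in", name))
```

One caveat for a real implementation: Python's \b is Unicode-aware for text patterns, but in other regular expression engines \b may only recognise ASCII word characters, so names in non-Latin scripts could need a different boundary test.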

I am fairly confident that this task is mostly fixed due to my recent improvements to the search.

I tested this for some other languages with multi-word names, with article https://en.wikipedia.org/wiki/Indonesia

  • Bahasa Banjar - works well, completion is given as "banjar dili"
  • Basa Banyumasan - can't find it at all; search for "ngapak" (local name) doesn't work either

It is now found.

  • Bahasa Indonesia - works as described
  • Basa Jawa - works like Indonesian; however, it is also listed twice, under Asia and Pacific (Jawa is not in the Pacific)

If you want this changed, please file it separately.

  • Bahasa Melayu - works well
  • Baso Minangkabau - works well, completion is given as "minangkabaučina"
  • Basa Sunda - works well, completion is given as "sundancha"
  • Fiji Hindi - works well, completion is given as "hindcha"
  • La .lojban. - works well

With English as the interface language, the autocompletions are now much better. I didn't test in other languages.

Thank you, everything seems to work well :)

However, there are some regressions. If you search for "bahasa" you also get Church Slavonic (?). Also, now the search for "српски" or "srpski" doesn't work at all. Should I open a new task?

Checked in wmf.12 - yes, the reported issue seems to be fixed. Regarding

If you search for "bahasa" you also get Church Slavonic (?). Also, now the search for "српски" or "srpski" doesn't work at all.

Typing "српски" or "srpski" works until the very last letter, "и" or "i", is typed. Strangely, "Serbian" does not work either.

@Nikerabbit, @Nikola_Smolenski
There might be quite a few discrepancies or missing search suggestions, for example 'abk' for 'Abkhazian'. Maybe a ticket describing it as a general problem could be filed?

Closing this ticket as Resolved since the described issue is fixed.

Typing "српски" or "srpski" works until the very last letter, "и" or "i", is typed. Strangely, "Serbian" does not work either.

It won't; it only shows Serbo-Croatian. I have now filed T183051 about this.