
Add more hard-to-find languages to the ULS search box
Closed, Resolved, Public

Description

Gerrit patch 386158 made it easier to search for a few languages in the ULS search box. There are still some cases that are hard to find there, however. This task will list these languages with explanations. This will mostly be based on analyzing the "no-search-results" event in event logging. The description may get updated every now and then based on new data.

(Meta-comment about tagging projects and people: 1. This is a general ULS problem, but tagging Compact Links, because it is one of the most visible areas at the moment. 2. I added some people who may be interested in this issue, or who may have some input. If you are not interested, please unsubscribe and accept my apologies for the spam.)

Transliterated and alternate autonyms

  • hay, hayeren -> Armenian (hy). "Hayeren" is the Latin transliteration of the autonym of the Armenian language. (This is similar to "Kartuli", the transliterated autonym for Georgian, which was already added, and should be easy to fix.) Gerrit patch
  • qartuli -> Georgian (ka). Like Kartuli. Gerrit patch
  • nihongo -> Japanese (ja). This is the Latin transliteration of the Japanese autonym. Gerrit patch
  • castellano -> Spanish (es). This is a common variant name for Spanish, and it doesn't appear in the data. (The closest thing that we do have is 'castelán' => 'es-es', but this is not 'es', so it cannot actually be found.) Gerrit patch
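
The aliases requested above could be modeled as a small lookup table that is consulted alongside the regular language names. The following is a minimal Python sketch, not the actual ULS implementation: the `ALIASES` table, its structure, and the `search_aliases` function are hypothetical names introduced for illustration, and prefix matching is an assumption about how the search box behaves.

```python
# Hypothetical alias table: search string -> language code.
# The entries mirror the list above; the structure is an assumption,
# not the actual ULS data format.
ALIASES = {
    "hayeren": "hy",
    "kartuli": "ka",
    "qartuli": "ka",
    "nihongo": "ja",
    "castellano": "es",
}


def search_aliases(query: str) -> list[str]:
    """Return the codes of languages that have an alias starting with the query."""
    q = query.strip().casefold()
    if not q:
        return []
    # A dict keeps insertion order while dropping duplicate codes.
    codes = {code: None for alias, code in ALIASES.items() if alias.startswith(q)}
    return list(codes)
```

With such a table, a partial query like "hay" would resolve to hy, and both "kartuli" and "qartuli" would resolve to ka.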

Languages with script variants and redirects

  • каз -> Kazakh (kk). It works correctly with "kaz" (from English) and with "қаз" (the correct Kazakh spelling), but not with "каз", which is either the Russian spelling or the incorrect Kazakh spelling. The letter "қ" is the appropriate letter to use in the Kazakh alphabet, but perhaps some people have a hard time typing it, and type "к" instead. This one is a bit strange, because "kazakh", "казахский", and "қазақ" all appear in the data.
  • az, azer, azerba -> Azerbaijani (az). This one is also strange, because "azerbaijani" appears in the data. Occasionally this finds South Azerbaijani (azb), but this is not enough, because these languages are similar in speech, but completely different in writing. This requires some debugging. Searching by "azərb" does work. This is the correct spelling in the Azerbaijani language itself, but it's possible that some people cannot type the letter "ə".
  • аз, азер -> Azerbaijani (az). This is similar to the above, but with Cyrillic.
  • srpski -> Serbian (sr). Currently it either finds Serbo-Croatian (sh), which is a separate Wikipedia, or doesn't find anything at all. It must find Serbian (sr). Possibly related to T121747. (Related pull request)
  • punjabi -> Punjabi Western, Punjabi Eastern (both pnb and pa/pa-guru). Gerrit patch 386158 made it possible to find both pnb and pa in the languagesearch API, but pa still doesn't appear in the frontend, probably because it's a redirect.
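
Two ideas from the list above can be sketched in code: folding letters that some people cannot type ("қ", "ə") before comparing, and following a redirect to the code that is actually displayed. This is a rough Python illustration under stated assumptions, not the logic used by jquery.uls: `FOLD`, `AUTONYMS`, `REDIRECTS`, `search_autonyms`, and `resolve` are hypothetical names, and the direction of the pa redirect (pa to pa-guru) is inferred from the "pa/pa-guru" wording above.

```python
# Hypothetical folding table: hard-to-type letter -> easier look-alike.
FOLD = str.maketrans({"қ": "к", "ə": "e"})

# Illustrative autonym data and redirect table (assumptions, not ULS data).
AUTONYMS = {"kk": "қазақша", "az": "azərbaycanca"}
REDIRECTS = {"pa": "pa-guru"}


def fold(text: str) -> str:
    """Lowercase the text and replace hard-to-type letters."""
    return text.casefold().translate(FOLD)


def search_autonyms(query: str) -> list[str]:
    """Prefix-match the folded query against the folded autonyms."""
    q = fold(query.strip())
    if not q:
        return []
    return [code for code, name in AUTONYMS.items() if fold(name).startswith(q)]


def resolve(code: str) -> str:
    """Follow redirects to the final code, guarding against cycles."""
    seen = set()
    while code in REDIRECTS and code not in seen:
        seen.add(code)
        code = REDIRECTS[code]
    return code
```

Because both the query and the stored names are folded, "каз" and "қаз" would find the same language, as would "azerb" and "azərb".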

Other

  • English -> Simple English (simple; in the future, en-simple). English (en) is found, but Simple English (simple, en-simple) must also appear in the results when searching for English.
  • Banyumasan, Ngapak -> Banyumasan (map-bms). This language does not have a standard language code, so it doesn't appear in our data. It also doesn't appear as a missing language in the event logging about search, but it was mentioned at T132021 by @Nikola_Smolenski as a language that cannot be found. Perhaps it should also appear in the results when people search for Javanese (jv), because according to Wikipedia it's a dialect of Javanese, but this must be checked.
  • in -> Indonesian (id). Not high priority, because "indo" etc. does find it, but would be nice to fix. (See T132021.)
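
The request that a search for English also return Simple English could be expressed as a post-processing step over the search results. A minimal Python sketch, assuming a hypothetical `RELATED` table and `expand_results` function; neither exists in ULS under these names as far as this task describes.

```python
# Hypothetical table: language code -> extra codes to show with it.
# Only the en -> simple pair is taken from the list above; a jv -> map-bms
# entry is deliberately left out because the task says it must be checked.
RELATED = {"en": ["simple"]}


def expand_results(codes: list[str]) -> list[str]:
    """Append related languages after each result, without duplicates."""
    expanded: list[str] = []
    for code in codes:
        for candidate in [code, *RELATED.get(code, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

Keeping the related codes in a separate table means the main name data does not need to change when such a pairing is added or removed.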

Follow-up work is described at T186781.

Event Timeline

Amire80 triaged this task as High priority. (Oct 29 2017, 8:59 AM)
Amire80 moved this task from Backlog to Missing languages on the ULS-CompactLinks board.
Amire80 updated the task description.
Amire80 added a subscriber: Nikola_Smolenski.

Change 404799 had a related patch set uploaded (by Nikerabbit; owner: Vagrant Default User):
[mediawiki/extensions/UniversalLanguageSelector@master] Add aliases for Georgian, Armenian, Spanish, and Japanese

https://gerrit.wikimedia.org/r/404799

Change 404799 merged by jenkins-bot:
[mediawiki/extensions/UniversalLanguageSelector@master] Add aliases for Georgian, Armenian, Spanish, and Japanese

https://gerrit.wikimedia.org/r/404799

https://github.com/wikimedia/jquery.uls/pull/275 seems to fix some of the issues in the "redirects" section.

I have tried using the same logic as that pull request does, and all languages under "Languages with script variants and redirects" are found. The pull request needs to be adapted to the latest code changes, though.

The GitHub PR fixed the languages in the "Languages with script variants and redirects" section. I leave testing and checking the boxes in the description to @Amire80.

Problems like T142971 are still present if both the sr and sr-cyrl language codes are found by the search and are present in the list of available languages, as on mediawiki.org.

Change 409064 had a related patch set uploaded (by Petar.petkovic; owner: Petar.petkovic):
[mediawiki/extensions/UniversalLanguageSelector@master] Update jquery.uls to 4cb4fe2

https://gerrit.wikimedia.org/r/409064

Change 409064 merged by jenkins-bot:
[mediawiki/extensions/UniversalLanguageSelector@master] Update jquery.uls to 4cb4fe2

https://gerrit.wikimedia.org/r/409064

All the most common cases listed in the current task description are fixed. I tested this in production on the Catalan Wikipedia. Tomorrow this is supposed to go out to all the other Wikipedias.

Enormous thanks for fixing this.

Hoping for no regressions... :)