Change Details

[[ https://gerrit.wikimedia.org/r/#/c/386158/ | Gerrit patch 386158 ]] made it easier to search for a few languages in the ULS search box. Some languages are still hard to find there, however. This task lists these languages with explanations, mostly based on analyzing the "no-search-results" event in event logging. The description may be updated every now and then based on new data.

(Meta-comment about tagging projects and people:
1. This is a general ULS problem, but I am tagging Compact Links, because it is one of the most visible areas at the moment.
2. I added some people who may be interested in this issue, or who may have some input. If you are not interested, please unsubscribe and accept my apologies for the spam.)

**Transliterated and alternate autonyms**

[ ] **hay, hayeren -> Armenian** (hy). "Hayeren" is the Latin transliteration of the autonym of the Armenian language. (This is similar to "Kartuli", the transliterated autonym for Georgian, which was already added, and should be easy to fix.)
[ ] **qartuli -> Georgian** (ka). Like "Kartuli", this is a Latin transliteration of the Georgian autonym.
[ ] **nihongo -> Japanese** (ja). This is the Latin transliteration of the Japanese autonym.
[ ] **castellano -> Spanish** (es). This is a common variant name for Spanish, and it doesn't appear in the data. (The closest thing that we do have is `'castelán' => 'es-es'`, but this is not 'es', so it cannot actually be found.)

**Languages with script variants and redirects**

[ ] **каз -> Kazakh** (kk). It works correctly with "kaz" (from English) and with "қаз" (the correct Kazakh spelling), but not with "каз", which is either the Russian spelling or the incorrect Kazakh spelling. The letter "қ" is the appropriate letter to use in the Kazakh alphabet, but perhaps some people have a hard time typing it, and type "к" instead. This one is a bit strange, because "kazakh", "казахский", and "қазақ" all appear in the data.
[ ] **az, azer, azerba -> Azerbaijani** (az). This one is also strange, because "azerbaijani" appears in the data. Occasionally this finds South Azerbaijani (azb), but this is not enough, because these languages are similar in speech, but completely different in writing. This requires some debugging. Searching by "azərb" does work. This is the correct spelling in the Azerbaijani language itself, but it's possible that some people cannot type the letter "ə".
[ ] **аз, азер -> Azerbaijani** (az). This is similar to the above, but with Cyrillic.
[ ] **srpski -> Serbian** (sr). Currently it either finds Serbo-Croatian (sh), which is a separate Wikipedia, or doesn't find anything at all. It must find Serbian (sr). Possibly related to T121747. ([[ https://github.com/wikimedia/jquery.uls/pull/275 | Related pull request ]])
[ ] **punjabi -> Punjabi Western, Punjabi Eastern** (both pnb and pa/pa-guru). [[ https://gerrit.wikimedia.org/r/#/c/386158/ | Gerrit patch 386158 ]] made it possible to find both pnb and pa in the languagesearch API, but pa still doesn't appear in the frontend, probably because it's a redirect.

**Special issues for Chinese and Japanese**

[ ] **zhong -> Chinese** (all variants). This is among the most frequent search failures. "Zhongwen" is the standard Latin pinyin transliteration for the name of the Chinese language, so it should be findable.
[ ] **繁體 -> Traditional Chinese** (zh-hant). This does appear in the data, but perhaps we can optimize this in interlanguage links and take people directly to the Chinese Wikipedia in the traditional variant. (Although generalizing this for sites other than Wikipedia can be challenging.)
[ ] **简体, 简体中文 -> Simplified Chinese** (zh-hans). Similar to "繁體 -> Traditional Chinese" above, but for Simplified Chinese.
[ ] **にほ -> Japanese** (ja). This is the spelling of the Japanese autonym in Hiragana, which is a variant Japanese writing system. It appears surprisingly frequently in failed searches, so it should be supported. It may happen because of a race condition between a Hiragana-based IME and ULS's search algorithm, or for other reasons.
[ ] **ㄓㄨ -> Chinese (?)**. This is Bopomofo, an auxiliary writing system, on which some Chinese input methods are based. It's unclear what people are trying to find when they search for it, however.
[ ] **汉语 -> Chinese (?)**. This refers to the //spoken// Chinese language. It's unclear what people are searching for with this string. It could be [[ https://en.wikipedia.org/wiki/Pinyin | the Pinyin transliteration system ]], the name of which begins with the same characters (汉语拼音方案), but that's not really a language in the sense that is usually used in ULS.

**Other**

[v] **English -> Simple English** (simple; in the future, en-simple). English (en) is found, but Simple English (simple, en-simple) must also appear in the results when searching for English.
[ ] **tiêng -> Vietnamese** (vi) //or maybe something else//. The word "tiếng" means "language" in Vietnamese, so many language names in Vietnamese begin with this word. "tiêng" is a misspelling, but event logging data shows that it's very common, so our algorithm should treat it accordingly. Searching for "tiếng" finds Vietnamese, and searching for "tiêng" should do the same. We should treat differences in Latin diacritics the same way we treat other simple spelling errors.
[v] **Banyumasan, Ngapak -> Banyumasan** (map-bms). This language does not have a standard language code, so it doesn't appear in our data. It also doesn't appear as a missing language in the event logging about search, but it was mentioned at T132021 by @Nikola_Smolenski as a language that cannot be found. Perhaps it should also appear in the results when people search for Javanese (jv), because according to Wikipedia it's a dialect of Javanese, but this must be checked.
[v] **in -> Indonesian** (id). Not high priority, because "indo" etc. does find it, but it would be nice to fix. (See T132021.)

**Unresolved issues**

- What is "jian"? It appears frequently in failed searches, but I don't know which language this is. (Could be [[ https://en.wikipedia.org/wiki/Jian%27ou_dialect | Jian'ou ]], but not certain.)

[[ https://gerrit.wikimedia.org/r/#/c/386158/ | Gerrit patch 386158 ]] made it easier to search for a few languages in the ULS search box. Some languages are still hard to find there, however. This task lists these languages with explanations, mostly based on analyzing the "no-search-results" event in event logging. The description may be updated every now and then based on new data.

(Meta-comment about tagging projects and people:
1. This is a general ULS problem, but I am tagging Compact Links, because it is one of the most visible areas at the moment.
2. I added some people who may be interested in this issue, or who may have some input. If you are not interested, please unsubscribe and accept my apologies for the spam.)

**Transliterated and alternate autonyms**

[ ] **hay, hayeren -> Armenian** (hy). "Hayeren" is the Latin transliteration of the autonym of the Armenian language. (This is similar to "Kartuli", the transliterated autonym for Georgian, which was already added, and should be easy to fix.) //[[ https://gerrit.wikimedia.org/r/#/c/404799/ | Gerrit patch ]]//
[ ] **qartuli -> Georgian** (ka). Like "Kartuli", this is a Latin transliteration of the Georgian autonym. //[[ https://gerrit.wikimedia.org/r/#/c/404799/ | Gerrit patch ]]//
[ ] **nihongo -> Japanese** (ja). This is the Latin transliteration of the Japanese autonym. //[[ https://gerrit.wikimedia.org/r/#/c/404799/ | Gerrit patch ]]//
[ ] **castellano -> Spanish** (es). This is a common variant name for Spanish, and it doesn't appear in the data. (The closest thing that we do have is `'castelán' => 'es-es'`, but this is not 'es', so it cannot actually be found.) //[[ https://gerrit.wikimedia.org/r/#/c/404799/ | Gerrit patch ]]//

**Languages with script variants and redirects**

[ ] **каз -> Kazakh** (kk). It works correctly with "kaz" (from English) and with "қаз" (the correct Kazakh spelling), but not with "каз", which is either the Russian spelling or the incorrect Kazakh spelling. The letter "қ" is the appropriate letter to use in the Kazakh alphabet, but perhaps some people have a hard time typing it, and type "к" instead. This one is a bit strange, because "kazakh", "казахский", and "қазақ" all appear in the data.
[ ] **az, azer, azerba -> Azerbaijani** (az). This one is also strange, because "azerbaijani" appears in the data. Occasionally this finds South Azerbaijani (azb), but this is not enough, because these languages are similar in speech, but completely different in writing. This requires some debugging. Searching by "azərb" does work. This is the correct spelling in the Azerbaijani language itself, but it's possible that some people cannot type the letter "ə".
[ ] **аз, азер -> Azerbaijani** (az). This is similar to the above, but with Cyrillic.
[ ] **srpski -> Serbian** (sr). Currently it either finds Serbo-Croatian (sh), which is a separate Wikipedia, or doesn't find anything at all. It must find Serbian (sr). Possibly related to T121747. ([[ https://github.com/wikimedia/jquery.uls/pull/275 | Related pull request ]])
[ ] **punjabi -> Punjabi Western, Punjabi Eastern** (both pnb and pa/pa-guru). [[ https://gerrit.wikimedia.org/r/#/c/386158/ | Gerrit patch 386158 ]] made it possible to find both pnb and pa in the languagesearch API, but pa still doesn't appear in the frontend, probably because it's a redirect.

**Special issues for Chinese and Japanese**

[ ] **zhong -> Chinese** (all variants). This is among the most frequent search failures. "Zhongwen" is the standard Latin pinyin transliteration for the name of the Chinese language, so it should be findable.
[ ] **繁體 -> Traditional Chinese** (zh-hant). This does appear in the data, but perhaps we can optimize this in interlanguage links and take people directly to the Chinese Wikipedia in the traditional variant. (Although generalizing this for sites other than Wikipedia can be challenging.)
[ ] **简体, 简体中文 -> Simplified Chinese** (zh-hans). Similar to "繁體 -> Traditional Chinese" above, but for Simplified Chinese.
[ ] **にほ -> Japanese** (ja). This is the spelling of the Japanese autonym in Hiragana, which is a variant Japanese writing system. It appears surprisingly frequently in failed searches, so it should be supported. It may happen because of a race condition between a Hiragana-based IME and ULS's search algorithm, or for other reasons.
[ ] **ㄓㄨ -> Chinese (?)**. This is Bopomofo, an auxiliary writing system, on which some Chinese input methods are based. It's unclear what people are trying to find when they search for it, however.
[ ] **汉语 -> Chinese (?)**. This refers to the //spoken// Chinese language. It's unclear what people are searching for with this string. It could be [[ https://en.wikipedia.org/wiki/Pinyin | the Pinyin transliteration system ]], the name of which begins with the same characters (汉语拼音方案), but that's not really a language in the sense that is usually used in ULS.

**Other**

[v] **English -> Simple English** (simple; in the future, en-simple). English (en) is found, but Simple English (simple, en-simple) must also appear in the results when searching for English.
[ ] **tiêng -> Vietnamese** (vi) //or maybe something else//. The word "tiếng" means "language" in Vietnamese, so many language names in Vietnamese begin with this word. "tiêng" is a misspelling, but event logging data shows that it's very common, so our algorithm should treat it accordingly. Searching for "tiếng" finds Vietnamese, and searching for "tiêng" should do the same. We should treat differences in Latin diacritics the same way we treat other simple spelling errors.
[v] **Banyumasan, Ngapak -> Banyumasan** (map-bms). This language does not have a standard language code, so it doesn't appear in our data. It also doesn't appear as a missing language in the event logging about search, but it was mentioned at T132021 by @Nikola_Smolenski as a language that cannot be found. Perhaps it should also appear in the results when people search for Javanese (jv), because according to Wikipedia it's a dialect of Javanese, but this must be checked.
[v] **in -> Indonesian** (id). Not high priority, because "indo" etc. does find it, but it would be nice to fix. (See T132021.)

**Unresolved issues**

- What is "jian"? It appears frequently in failed searches, but I don't know which language this is. (Could be [[ https://en.wikipedia.org/wiki/Jian%27ou_dialect | Jian'ou ]], but not certain.)
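
**Illustration: diacritic-insensitive matching**

To make the "tiêng" vs. "tiếng" point concrete, here is a minimal sketch of one possible approach: fold both the query and the stored names by stripping combining marks before comparing. This is not how ULS or the languagesearch API is implemented. The `NAMES` table, `fold_diacritics`, and `search` are hypothetical names invented for this illustration, and the sample data is only a tiny hand-picked subset.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Case-fold the text and drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample data for illustration only: search name -> language code.
NAMES = {
    "tiếng việt": "vi",
    "azərbaycanca": "az",
    "қазақша": "kk",
}


def search(query: str) -> list[str]:
    """Return the codes whose folded name starts with the folded query."""
    folded_query = fold_diacritics(query)
    return sorted(
        {
            code
            for name, code in NAMES.items()
            if fold_diacritics(name).startswith(folded_query)
        }
    )


print(search("tiêng"))  # the misspelling now matches Vietnamese
print(search("tiếng"))  # the correct spelling still matches
print(search("каз"))    # still empty: "қ" has no Unicode decomposition to "к"
print(search("azer"))   # still empty: "ə" has no Unicode decomposition to "e"
```

Note that this kind of folding only helps with letters that Unicode decomposes into a base letter plus combining marks. The Kazakh "қ" and the Azerbaijani "ə" are separate letters, so the "каз" and "azer" cases above would need explicit equivalence rules or extra entries in the data.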