
Include language variant conversion and transliteration
Closed, Resolved, Public

Description

If you go to http://wikidata-test-repo.wikimedia.de/wiki/Data:Q2?uselang=sr you can see the label "Хелијум". If you go to http://wikidata-test-repo.wikimedia.de/wiki/Data:Q2?uselang=sr-el, I would expect to see the label transliterated to the Latin alphabet as "Helijum", but there is no label at all. The MediaWiki interface is transliterated correctly.

Interestingly, if you try to enter the label, you get "Database query error" instead of "Unrecognized value for parameter 'language': sr-el" that might be expected.

In my opinion, the correct way to handle this is to make it possible to enter labels in language variants and to display those labels if they exist; if they do not exist, they should be generated by automatic conversion to the variant from the "main" label.


Version: unspecified
Severity: enhancement

Details

Reference
bz37461

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 12:25 AM
bzimport set Reference to bz37461.

Perhaps this should be expanded a bit: on Wikidata, some languages might be viewed as variants of each other. For example, simple and en could be viewed as variants of each other (here, of course, no conversion is needed). Probably arz could also be viewed as a variant of ar (but not the reverse), and there may be other cases.

The common pattern is: if the label is set in the language, display the label; if not, try to find the label in the "parent" variant and, if applicable, convert it.
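The lookup pattern described above can be sketched roughly as follows. This is only an illustration: the `PARENTS` table, the function names, and the choice of "sr" as the Cyrillic parent of "sr-el" are assumptions made for the example, not how Wikibase actually implements it.

```python
from typing import Callable, Dict, Optional, Tuple

# Serbian Cyrillic -> Latin letter table (lowercase only; case is handled below).
_SR_CYRL_TO_LATN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def sr_cyrillic_to_latin(text: str) -> str:
    """Transliterate letter by letter; unknown characters pass through."""
    out = []
    for ch in text:
        lower = ch.lower()
        latin = _SR_CYRL_TO_LATN.get(lower)
        if latin is None:
            out.append(ch)
        elif ch != lower:
            out.append(latin.capitalize())  # uppercase input letter
        else:
            out.append(latin)
    return "".join(out)


# Hypothetical table: variant -> (parent language, optional converter).
PARENTS: Dict[str, Tuple[str, Optional[Callable[[str], str]]]] = {
    "sr-el": ("sr", sr_cyrillic_to_latin),
    "simple": ("en", None),  # no conversion needed
    "arz": ("ar", None),     # one-way: ar does not fall back to arz
}


def resolve_label(labels: Dict[str, str], lang: str) -> Optional[str]:
    """Return the label for lang, falling back to the parent variant."""
    if lang in labels:
        return labels[lang]
    parent = PARENTS.get(lang)
    if parent is None:
        return None
    parent_lang, convert = parent
    base = labels.get(parent_lang)
    if base is None:
        return None
    return convert(base) if convert else base
```

With only an "sr" label stored, `resolve_label({"sr": "Хелијум"}, "sr-el")` yields the transliterated form, while an explicitly stored "sr-el" label would take precedence.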

I could not reproduce this

"Interestingly, if you try to enter the label, you get "Database query error"
instead of "Unrecognized value for parameter 'language': sr-el" that might be
expected."

Did you mean adding a sitelink? That would violate database constraints, as it would imply two sitelinks to a single site.

(In reply to comment #3)

I could not reproduce this

"Interestingly, if you try to enter the label, you get "Database query error"
instead of "Unrecognized value for parameter 'language': sr-el" that might be
expected."

Did you mean adding a sitelink? That would violate database
constraints, as it would imply two sitelinks to a single site.

No, I meant the label. It didn't work yesterday, but now it works.

I see it is now possible to save a label/description in a language variant :) Just display/conversion from the "parent" variant remains to be done.

Canonical conversion is done; variant forms are not stored.

Transliteration of variants when there is an existing label should be possible. It could be a little harder to transliterate to a canonical form before saving and then transliterate back on view.

When a user sets uselang, or chooses a user language in preferences that is a language variant, a default variant can be set. If the user chooses to write in another variant for some reason, this must be detected, possibly by counting characters from each of a small set of possible character sets; the variant with the most hits wins. As long as no winner can be found (some characters might be used in both sets), the default language set in global or user preferences is used.
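A minimal sketch of the character-counting idea, assuming Serbian with the two variant codes "sr-ec" (Cyrillic) and "sr-el" (Latin). The function names are made up for the example, and classifying letters by their Unicode character name is just one convenient way to do the counting.

```python
import unicodedata
from typing import Dict


def count_scripts(text: str) -> Dict[str, int]:
    """Count letters per script, based on the Unicode character name."""
    counts = {"Cyrl": 0, "Latn": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # digits, spaces and punctuation belong to both sets
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["Cyrl"] += 1
        elif name.startswith("LATIN"):
            counts["Latn"] += 1
    return counts


def detect_variant(text: str, default: str) -> str:
    """Pick the variant with the most hits; fall back to default on a tie."""
    counts = count_scripts(text)
    if counts["Cyrl"] > counts["Latn"]:
        return "sr-ec"
    if counts["Latn"] > counts["Cyrl"]:
        return "sr-el"
    return default  # no winner: use the global or user preference
```

Text with no letters at all, or with equal counts, produces no winner, so the caller's default is returned unchanged.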

Autocomplete in sitelinks uses the action=opensearch module, which does not support variants. If variant conversion is used, a second call to action=query must be made for sitelinks, either to the site itself or to the repo.
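To make the two-call idea concrete, here is a sketch that only builds the request URLs and sends nothing. The endpoint is an example, the helper names are hypothetical, and using the `converttitles` parameter of action=query for the second call is my assumption about what that call would look like; the comment above does not specify it.

```python
from typing import List
from urllib.parse import urlencode

# Example endpoint only; any MediaWiki api.php URL would do.
API = "https://sr.wikipedia.org/w/api.php"


def opensearch_url(api: str, prefix: str, limit: int = 10) -> str:
    """First call: prefix search for autocomplete (no variant handling)."""
    params = {
        "action": "opensearch",
        "search": prefix,
        "limit": limit,
        "format": "json",
    }
    return api + "?" + urlencode(params)


def convert_titles_url(api: str, titles: List[str]) -> str:
    """Second call: ask action=query to resolve titles across variants."""
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "converttitles": 1,  # assumption: variant-aware title lookup
        "format": "json",
    }
    return api + "?" + urlencode(params)
```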

Variants are used for Gan Chinese (gan, http://en.wikipedia.org/wiki/Gan_Chinese_language), Inuktitut (iu, http://en.wikipedia.org/wiki/Inuktitut_language), Kazakh (kk, http://en.wikipedia.org/wiki/Kazakh_language), Kurdish (ku, http://en.wikipedia.org/wiki/Kurdish_language), Shilha (shi, http://en.wikipedia.org/wiki/Shilha_language), Serbian (sr, http://en.wikipedia.org/wiki/Serbian_language), Tajik (tg, http://en.wikipedia.org/wiki/Tajik_language), Chinese (zh, http://en.wikipedia.org/wiki/Chinese_language)

Writing in a language variant could require loading a language-specific variant detector, and only after this detector is loaded can the variant written by the user be identified. That could, for example, mean that sitelinks are first loaded in the wrong variant. Strings for fields like label and description only need variant detection on save, and will be posted to the API with a variant hint. This hint is only used if detection of the variant is inconclusive. It should probably be possible to override the automatic detection and enforce a specific variant.
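The precedence between override, detection, and hint could look roughly like this. The function and parameter names are hypothetical, and "inconclusive" is modelled simply as the detector returning `None`.

```python
from typing import Optional


def choose_variant(
    detected: Optional[str],
    hint: str,
    forced: Optional[str] = None,
) -> str:
    """Decide which variant a saved string is stored under.

    forced:   explicit user override, always wins when given
    detected: result of automatic detection, None if inconclusive
    hint:     variant hint posted along with the string
    """
    if forced is not None:
        return forced
    if detected is not None:
        return detected
    return hint  # only used when detection found no winner
```

For example, `choose_variant(None, "sr-el")` falls back to the hint, whereas `choose_variant("sr-ec", "sr-el")` trusts the detector.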

As of right now on the demo, variant form is stored, but conversion is not done.

I see no reason to transliterate to a canonical form on save. Rather, it should be possible to save in a variant form. If a variant form exists, it is displayed, and if not, it is transliterated from the canonical form (or perhaps another variant if there is no canonical form).

In general, it is not possible to rely on counting characters, because some variant forms might be identical. For example, in Serbian, abbreviations are often left in the Latin alphabet, so something like "SCADA" is correct "Cyrillic" text. Perhaps a warning could be issued, but nothing more than that. Besides, this character check is not done for languages that have no defined variants (for example, it is possible to save Japanese text in Romaji).

Just a note to say that Liangent has applied to GSoC with a proposal related to this report. Good luck!

https://www.mediawiki.org/wiki/User:Liangent/wb-lang

GSoC "soft pencils down" date was yesterday and all coding must stop on 23 September. Has this project been completed?

If you have open tasks or bugs left, one possibility is to list them at https://www.mediawiki.org/wiki/Google_Code-In and volunteer yourself as mentor.

We have heard from Google and from free software projects participating in Code-in that students in this program have done great work finishing and polishing GSoC projects, often mentored by the former GSoC student. The key is being able to split the pending work into small tasks.

More information in the wiki page. If you have questions you can ask there or you can contact me directly.

*** This bug has been marked as a duplicate of bug 36430 ***