
[Story] Support multiple scripts for one language
Open, Medium, Public

Description

Languages like Serbian and Chinese officially use multiple scripts. Wikipedia handles this with on-the-fly transliteration. This functionality is also needed in Wikidata. The complexity is different... it may mean that a choice is made to express both all the time. It makes sense to automatically transliterate the other string when a particular script is used by the user.
Thanks,

GerardM

Both the labels and the descriptions, in whatever script, have to be available at the same time.

Event Timeline

GerardM assigned this task to Lydia_Pintscher.
GerardM raised the priority of this task from to Needs Triage.
GerardM updated the task description.
GerardM added a project: Wikidata.
GerardM added subscribers: GerardM, Amire80.
daniel subscribed.

Automatic transliteration of labels is already applied as part of the language fallback mechanism, if MediaWiki supports transliteration for the respective language variants. So, if you have your user language set to sr-el, you should see the latinized version of any labels entered in sr-ec or "plain" sr. Try uselang=sr-el, you will see a lot of "serbian (transliterated)" markers in the statements.

Note that this currently applies only for labels of items and properties used as parts of statements (and it's broken for qualifiers at the moment, see T88275). It does not apply to the editable labels (and descriptions and aliases) at the top of entity pages, since it is unclear how language fallback should interact with editing.
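The fallback behaviour described above can be sketched in a few lines of Python. This is an illustration, not Wikibase code: the fallback chain, the names `resolve_label` and `cyrillic_to_latin`, and the per-letter table are assumptions made for the example, and only the Cyrillic-to-Latin direction is covered.

```python
from typing import Dict, Optional, Tuple

# Assumed fallback chain for a user whose language is sr-el (hypothetical data).
FALLBACK_CHAIN = {"sr-el": ["sr-el", "sr-ec", "sr"]}

# Codes whose labels are assumed to be stored in Cyrillic (plain "sr" is
# treated like sr-ec here).
CYRILLIC_CODES = {"sr-ec", "sr"}

# Serbian Cyrillic to Latin, lowercase letters only.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text: str) -> str:
    """Transliterate letter by letter; unknown characters pass through."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)
        elif ch.isupper():
            out.append(lat.capitalize())  # e.g. "Љ" becomes "Lj"
        else:
            out.append(lat)
    return "".join(out)


def resolve_label(labels: Dict[str, str], user_lang: str) -> Optional[Tuple[str, str]]:
    """Return (display text, source language code), or None if nothing matches."""
    for code in FALLBACK_CHAIN.get(user_lang, [user_lang]):
        if code not in labels:
            continue
        text = labels[code]
        if user_lang == "sr-el" and code in CYRILLIC_CODES:
            return cyrillic_to_latin(text), code
        return text, code
    return None
```

The returned source code is what would let a UI show a "transliterated" marker next to a label that was not entered in the user's own variant.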

I'm closing this as invalid, since the general functionality is already there.

If what you want is language fallback for the "main" label & description of an entity page, please file a ticket explicitly for that, it's mainly an issue of UI design. Perhaps we could just drop editing in the "top bar", and rely on the "term box" for editing? That would allow us to apply language fallback there.

Hoi,
That is not the preferred solution. It is relevant to have BOTH scripts
available. Having only one is not good enough.
Thanks,

GerardM
GerardM updated the task description.
GerardM set Security to None.

@GerardM having them available where, exactly? And how many? There's like six variants of Chinese, IIRC. Do you want to display them all, always?

Also, do you want them only if they got transliterated automatically, or also if given explicitly?

It kind of sounds like something that should be done in the term box. Applying fallback to the "main" label and description makes sense to me if we can figure out the editing issue. Showing transliterations as suggestions in the term box would also be nice, but doing that without clutter and confusion is a challenge. Maybe someone could make a gadget for prototyping this?

daniel renamed this task from Support multiple scripts for one language to Show available transliterations for entity labels in the top area of entity pages. May 2 2015, 1:54 PM
GerardM renamed this task from Show available transliterations for entity labels in the top area of entity pages to Support multiple scripts for one language. May 2 2015, 2:34 PM

@GerardM You changed back the wording of the request to say "Support multiple scripts for one language". If taken literally, that means changing what a "language" is to MediaWiki. I do not think this is realistic.

thiemowmde renamed this task from Support multiple scripts for one language to [Story] Support multiple scripts for one language. Aug 24 2015, 9:21 AM
thiemowmde triaged this task as Medium priority.

This is very important for the "monolingual text" datatype. We do not enter text in a language. We always enter text in a script. For properties where we make the datatype "monolingual text" it is because we want to know the data in that language. Where there are alternative versions in multiple scripts, it is a requirement that we note which script has been used for the value entered. This is not negotiable. The "language" tag for a "monolingual text" value is meaningless if it doesn't specify which script version of the language that value is in. In many cases it will be more important to specify the script than the language (since there may well be multiple languages using the same script version).

If this means "redefining what a language is" then we need to get on that right now.

In MediaWiki, representations of the same language in different scripts are called "language variants". Don't take that term too seriously, it's just a name for codes that say "this is Serbian written in Cyrillic", etc. MediaWiki supports automatic transliteration between some such variants. We already use that kind of transliteration for display, and we allow the script to be specified for input, e.g. for monolingual text (if we don't, that's a simple bug). I can however imagine that it doesn't work perfectly when searching, or for some other use cases.

So, we have some support for this. I'm not sure what this ticket is actually asking for. I also don't understand why Gerard thinks the current system of "variants" is not sufficient.

If there is something related to different scripts that you think is needed, but cannot be done currently, please write it down as a story, step by step.

Do we need a separate phabricator item for "multilingual text language tag should support multiple scripts for one language (separate language tag for each script)"?

@Filceolaire: what's the problem with treating them as different languages? To the software it makes no difference. There are several codes, and text associated with each of them. Whether you call them different languages, or the same language in a different script, is just a matter of giving good names for the codes. Am I missing something?

Hoi,
They are known. It is just that the software presents them in a way that is
controlled programmatically. The purpose here is to show a language in a
particular script ALL the time.
Thanks,

GerardM

@GerardM can you rephrase that? Perhaps give an example? I have no idea what you mean.

Hoi,
It is simple. When a text is in Serbian, it can be in the Latin or the
Cyrillic script. When a text is only to be shown in one language, it should
also be exclusively in one script.
Thanks,

GerardM

@GerardM That's already the case. We have separate language codes for sr-ec and sr-el. Content given for these codes is handled separately.

We should apply automatic transliteration between them for display, but I haven't checked whether this is currently working correctly.

Hoi,
We should NOT apply automatic transliteration between them for display.
THAT IS THE POINT. So we do not have this.
Thanks,

GerardM

@GerardM So if a label for Q123 is defined for sr-ec but not for sr-el, and your user language is sr-el, you do not want to see the automatically transliterated label when Q123 is mentioned? Why not? Would you rather see just the ID? How is that useful?

Note that transliteration is never applied in places where labels or descriptions are editable, e.g. in the "term box" at the top.

Or do you mean that transliteration should not be applied for monolingual text? I agree, and I don't think we do.

In Wikibase, we should not have text "in Serbian". It should be either sr-el or sr-ec. MediaWiki however defines a plain "sr" code (and treats it essentially as an alias for "sr-ec"). We should probably not allow this to be used for input on Wikibase, and instead force users to pick either sr-el or sr-ec.

Is this what you want? In that case, I suggest changing the title of this ticket to "Disallow use of 'sr' language code" - the problem is not that we don't support different scripts, quite to the contrary: we support an extra language code that may cause confusion. Actually, I think we have discussed this before, there may already be a ticket for it.

Hoi,
For monolingual text we should never transliterate.

When we enter data in Wikidata on any other type of field, it should be
Serbian and the transliteration should be automatic.
Thanks,

GerardM

When we enter data in Wikidata on any other type of field, it should be
Serbian and the transliteration should be automatic.

For the transliteration to work, the wiki has to know which Serbian (script) you used when entering it. So you need to specify whether you are using sr-el or sr-ec for your input.

We could try to auto-detect it, but that's a whole different can of worms, and rather scary for data input. Stas is experimenting with detecting input language for search, though.

Hoi,
Detecting a script is easy... It is within a certain block of characters
and that makes it a particular script. Not hard at all. This is script
detection not language detection.
Thanks,

Gerard
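The block-based script detection suggested above might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the function name `detect_serbian_variant` is hypothetical, the check relies on the prefix of each character's Unicode name rather than on a real script database, and it refuses to answer (returns `None`) for mixed or non-Serbian input instead of guessing.

```python
import unicodedata
from typing import Optional


def detect_serbian_variant(text: str) -> Optional[str]:
    """Guess sr-ec or sr-el from the letters in text; None if unclear."""
    seen = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, spaces and punctuation carry no script information
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            seen.add("sr-ec")
        elif name.startswith("LATIN"):
            seen.add("sr-el")
        else:
            seen.add("other")
    if len(seen) == 1 and "other" not in seen:
        return seen.pop()
    return None  # no letters, mixed scripts, or a script outside this sketch
```

Labels that mix scripts, such as a Cyrillic name followed by a Latin abbreviation, fall into the `None` case, which is where an explicit choice of sr-el or sr-ec would still be needed.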

Detecting the script is about 10x harder and 10x more error prone than just picking sr-el or sr-ec explicitly, instead of picking just sr. You have to pick a language anyway.

Disallowing plain "sr" for input is an easy task, and should hopefully be uncontroversial.
Auto-detecting languages or variants is a heuristic with the potential to introduce errors, has no clear scope (which languages can be detected), and needs quite a bit of scaffolding (we need to design a place for the detection logic to live).

Auto-detecting is nice, but I'd consider it a long term feature request.
In contrast, I consider dropping support for plain "sr" a relatively simple bug fix. The only issue may be that the community is actually (ab)using this issue as a "feature" in some way. There is an annoying tendency among users to rely on bugs, and complain when you fix them...

Hoi,
When you compare a choice made and having to detect a script I am shocked
that it is only 10 times more difficult. We do not need to store in both
Latin script and Cyrillic. We should use the MediaWiki convention and store
in Cyrillic. That is an architectural difference with any other language.
WE SHOULD NEVER STORE BOTH.

That is not a feature request.
Thanks,

GerardM

@GerardM are you saying there are no people, places, books, or movies that have Cyrillic and Roman variants of their name that diverge from the standard transliteration rules? And we should hard-code this assumption into software? Over half of our content has proper nouns as labels, and that percentage is likely to grow. I don't think this is a good idea. We already support transliteration, there is no need to enter both unless they diverge from the transliteration rules. But if they do, you *should* be able to enter them in Roman and Cyrillic separately.

In any case, the technical solution is the same in both cases: we need a blacklist for input languages. We should either disallow sr, or disallow sr-el and sr-ec. So I added T102533: [Bug] Disallow (or resolve) dummy language codes. as a blocker.
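A blacklist for input languages, as proposed above, could be as small as the following Python sketch. The table contents and the name `check_input_language` are hypothetical; which codes end up disallowed (plain sr, or sr-el and sr-ec) is exactly the open question in this thread, and the sketch arbitrarily shows the first option.

```python
from typing import Dict, Tuple

# Assumed blacklist: ambiguous codes mapped to the explicit variants
# a user would be asked to pick instead (hypothetical data).
AMBIGUOUS_INPUT_CODES: Dict[str, Tuple[str, ...]] = {
    "sr": ("sr-ec", "sr-el"),
}


def check_input_language(code: str) -> Tuple[bool, Tuple[str, ...]]:
    """Return (allowed, suggested alternatives) for an input language code."""
    if code in AMBIGUOUS_INPUT_CODES:
        return False, AMBIGUOUS_INPUT_CODES[code]
    return True, ()
```

Disallowing the variants instead would only require a different table, which is why the same mechanism serves either outcome.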

Hoi,
It is what we do in Wikipedia. It is the standard.

Do you really want to duplicate all labels for a language on this
assumption? Or are you saying that sr is the standard and the others are
when something is different in the same way as for American English,
British English etc?
Thanks,

GerardM

There are FOUR separate use cases which need to be treated differently:

  1. Sitelinks. These are determined by the wikis we have. If we have a multiscript or even a multilingual wiki then the sitelinks should follow that.
  2. Labels, Descriptions and Aliases. We have agreed in principle that we should be able to have labels etc. in any language and in the different script versions of those languages. Here the program needs to be able to tell what language each label is in and - for multi script languages - what script. This needs to be combined with fallback languages, automatic transcription, automatically generated descriptions based on claims etc. In practice this does not extend to all languages. A language needs to have been recorded in some way and be in use to some extent.
  3. Monolingual Text datatype. This is used to record actual text - official names, inscriptions, quotations etc. It is important to record exactly what language and script the text is in. The whole point of this datatype is to record the original exactly so it must never be automatically transcribed or translated (though there may be a transcription or translation in a separate qualifier claim using a 'translation' or 'transcription' property with some sort of 'multilingual text' datatype). It must be possible to tag this with its language and script even if that language or script is so rare that this inscription is the only existing example known or if the language and script was invented for one movie.
  4. Multilingual Text datatype. This is used for comments, instructions, translations, transcriptions and should display the user's preferred language. It is a lot more similar to the Labels and Descriptions than to the Monolingual Text datatype.

@Filceolaire as far as I know, all of this already works. The only problem is that we allow "sr" in addition to "sr-el" and "sr-ec", so there are three choices instead of two.

@GerardM: there is no need to duplicate anything, fallback will take care of it. You enter sr-el or sr-ec, as you like, and the other version will be shown automatically. If that doesn't work for you, please file a bug saying "transliteration between serbian variants does not work for labels" or some such.

From what Filceolaire says, it seems that it is indeed important to record explicitly whether something was entered in sr-el or sr-ec, and to also allow both to be entered explicitly and separately, if desired. From what GerardM says, it seems to me that the opposite is desired: no distinction in input or storage (everything is "sr"), and transliteration is applied on output (based on automatically detecting which script is used in the stored version).

Given that what Filceolaire wants is already what we do, and in effect leads to a similar outcome to what GerardM wants (except that you have to explicitly say whether your input is roman or cyrillic), I'm struggling to understand what concrete problem we are trying to solve.

Please help me understand by describing the issue in three sentences, answering three questions:

  1. what did you do?
  2. what happened?
  3. what did you expect to happen?

Thanks.

@daniel I agree Sitelinks mostly work as designed. We just get strange corner cases where a Wikipedia has two articles about the same topic in two different languages/dialects and they can't add sitelinks for both. I would expect there to be a way to do these sitelinks.

For Labels and descriptions there doesn't seem to be a way to start entering values in a new language or a new script for an existing language. Where is the "Add a language" pop-up?

For Monolingual text, how do we show the official name in Korean-Hangul and in Korean-Hanja? How do we record the language and script of the inscription on the 'One Ring' in Lord of the Rings? I expect to be able to use any item as a language tag or a script tag so restrictions are via bots (like number datatype) and users can add a tag if they need it.

Multilingual Text datatype doesn't work. I expect it to work like Descriptions with the users preferred language displayed plus fallbacks etc.

@daniel I agree Sitelinks mostly work as designed. We just get strange corner cases where a Wikipedia has two articles about the same topic in two different languages/dialects and they can't add sitelinks for both. I would expect there to be a way to do these sitelinks.

We assume that there is only one page about one topic in a given language (i.e. wiki). This assumption is built into MediaWiki, it's not something Wikidata imposes. Without this assumption, features like automatic language links would be a lot harder or impossible to implement, since we wouldn't have a unique connection between data items and Wikipedia pages.

For Labels and descriptions there doesn't seem to be a way to start entering values in a new language or a new script for an existing language. Where is the "Add a language" pop-up?

That is an annoying usability issue, I agree. I want that popup too. You can work around this problem by adding babel boxes for your favorite languages / variants to your user page. You will then see input fields for these languages when viewing an item.

For Monolingual text, how do we show the official name in Korean-Hangul and in Korean-Hanja? How do we record the language and script of the inscription on the 'One Ring' in Lord of the Rings? I expect to be able to use any item as a language tag or a script tag so restrictions are via bots (like number datatype) and users can add a tag if they need it.

Yes, being able to use any item as a language would be nice, but would make it very hard to implement things like language fallback or transliteration. We do have limited support for "extra" languages that can be configured via $wgExtraLanguageNames. But allowing additional languages is a completely different issue than what was requested in this ticket.

Multilingual Text datatype doesn't work. I expect it to work like Descriptions with the users preferred language displayed plus fallbacks etc.

It's not that it doesn't work - it doesn't exist. We never implemented it, since there is no clear use case. By convention, if you have something like a city motto in two or three languages, you would add a separate statement for each language. This makes sense since the different language versions of the motto may have been adopted at different times, may be defined by different sources (such as laws), etc - so they should have different qualifiers and different references.

In summary, it seems like what you want mostly works, and the biggest annoyance is the (lack of) usability of the "term box" at the top of each item page, and its magic relationship with babel boxes. You raise several valid issues, but none of them seem to fit the description at the top of this ticket.

Hoi,

Adding NEW languages or NEW scripts to an existing language is something
that has always been done after consideration by the language committee.
For languages that are known to exist in ISO-639-3 there is approval to add
them after a request.
Thanks,

GerardM

We never implemented it, since there is no clear use case.

Is there somewhere we can add use cases? At least one of the existing monolingual text properties ("image legend") should really be multilingual text (if I've understood what multilingual text is supposed to be), entity usage descriptions (if implemented as a property, which was one of the suggestions) would presumably also be best as multilingual text and the page for pending properties lists one property waiting for multilingual text ("IUPAC name", whether multilingual text is really correct there, I don't know).

@Nikki You can add use cases for multilingual text to T86517: [Story] Add a new datatype for multilingual text

Image legend/description is indeed a use case we have discussed in the context of adding structured data to image descriptions on Commons.

FWIW, detecting sr-el and sr-ec may be easy, but distinguishing zh-hk from zh-cn is *not*. The CJK character block in particular has big overlap problems, dating back to when we were worried about having only 64k characters in Unicode.
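The overlap can be seen with a short Python check. The helper name `name_prefix` is hypothetical, and the sample characters are chosen for illustration: 国 and 國 are the simplified and traditional forms of the same word, while 中 is written identically in both. All three carry the same Unicode name prefix, so the character-name or block test that separates Cyrillic from Latin says nothing about which Chinese variant a text is in.

```python
import unicodedata


def name_prefix(ch: str) -> str:
    """Unicode character name without the trailing code point suffix."""
    return unicodedata.name(ch).rsplit("-", 1)[0]


# Simplified form, traditional form, and a character shared by both.
samples = ["国", "國", "中"]
prefixes = {name_prefix(ch) for ch in samples}
```

Here `prefixes` collapses to a single entry, which is the sense in which block-based detection cannot tell the variants apart.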

Could someone edit the phab summary to more clearly indicate what the task is here? @daniel's been working on figuring it out, but I'm still in the dark after reading the whole thread.

Hoi,
What is meant by the cn and hk? It probably has nothing to do with a script
... a script code has four characters. Consequently this is not about script in
the definition.
Thanks,

Gerard