Decide on a way forward for acceptable languages for lemmas and representations
Open, Needs Triage, Public

Description

Currently, to add a spelling variant for lemmas and representations, we use a list of language codes. This list is stored separately from items, and is based on the list used by Wikimedia for the projects (Wikipedias, etc.). It is a different list from the one we use for monolingual text.

With the Lexeme, we already have people wanting to add more codes for languages or scripts (fro, cu-cyrl, etc.) and we expect this demand to increase in the near future.

Current problems:

  • People have to request each language code individually
  • We have to maintain the list by hand

Needs:

  • Let people enter the language code that they think is appropriate for lemmas and representations
  • Avoid mistakes and wrong choices
  • Have a list that is easier to maintain
  • The process to add new languages to the list should be quick and transparent

Possible solutions:

  • instead of a separate list, take the list of items that have instance of/subclass of language
  • merge the existing list with the monolingual text list
  • ... your suggestion here

Event Timeline

Restricted Application added a project: Wikidata. · May 28 2018, 8:36 AM
Restricted Application added a subscriber: Aklapper.

Tough questions.

Caveat: I'm not an expert (feel free to correct me), but I came across several language issues in my decade and a half as a Wikimedian.

My point of view is that the current system for monolingual text is good but not enough for the needs of lexemes. One (anecdotal) example: the 'fro' and 'frm' codes were approved (in a week, see T181823; sometimes it can take months) and were added a month ago, but they are still not fully integrated (you need to enter the code itself; it's not possible to just type the name of the language). For items, I understand the caution and we can wait to be sure, but for lexemes this seems too slow to me.

I like the idea of « list of items that have instance of/subclass of language », but that's not enough as it doesn't take the script into account.

In the end, I think we need to move closer to the IETF BCP 47 tag system (a worldwide standard, well documented and already used by many online dictionaries; see the BCP 47 text). We need several lists: one for languages (ISO 639), one for territories (ISO 3166), one for scripts (ISO 15924; this list is quite short) and maybe other subtags (dialects, variants and private use). This is important to reflect the lemmas precisely, like French before or after the 1990 reform, or German before/after the 1996 reform; de-CH-1996 is given as an example in BCP 47 itself.

More importantly, we need the ability to combine these lists as BCP 47 does.
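To make the idea of combining separate lists concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the four sets are tiny hand-picked subsets (not the real ISO or IANA registries), and `compose_tag` is a hypothetical helper, not an existing Wikibase or MediaWiki function. It checks each subtag against its own list and joins them in the BCP 47 order language-script-region-variant.

```python
# Tiny illustrative subsets (assumption); the real registries are far larger.
LANGUAGES = {"cu", "de", "fr", "fro", "frm", "la", "sr", "zh"}
SCRIPTS = {"Arab", "Cyrl", "Hans", "Hant", "Latn", "Runr"}
REGIONS = {"CA", "CH"}
VARIANTS = {"1996"}


def compose_tag(language, script=None, region=None, variant=None):
    """Join subtags in BCP 47 order: language-script-region-variant.

    Hypothetical helper: each subtag is validated against its own list only.
    """
    if language not in LANGUAGES:
        raise ValueError(f"unknown language subtag: {language!r}")
    parts = [language]
    optional = (
        ("script", script, SCRIPTS),
        ("region", region, REGIONS),
        ("variant", variant, VARIANTS),
    )
    for kind, value, allowed in optional:
        if value is None:
            continue  # optional subtag not supplied
        if value not in allowed:
            raise ValueError(f"unknown {kind} subtag: {value!r}")
        parts.append(value)
    return "-".join(parts)


print(compose_tag("la", script="Runr"))                # la-Runr
print(compose_tag("de", region="CH", variant="1996"))  # de-CH-1996
```

A sketch like this only covers the format side (each subtag is known and in the right position); whether a given combination makes sense for the content is a separate question.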

Some examples of why we need more flexibility and granularity:

  • the Kazakh language was written in Arab, then in Latn, then in Cyrl, and now they are going back to Latn... 4 (or 5) scripts is probably a record, but it's not unusual for a language to have at least two scripts (I'm thinking of Serbian, sr-Latn and sr-Cyrl, or Chinese, zh-Hant and zh-Hans).
  • a more extreme example: the Bornholm amulet has an inscription in Latin but written in runes. So if we want to model it correctly, we need the code "la-Runr". Runic Latin is not common and has a limited corpus, but still, a Wikidatian could want to work on it, just add this word and move on (to another strange epigraphy, like 4th-century Proto-Turkic written in Chinese characters).

My view is more a technical one (a code is valid in its format). For the social and validation part (a code is valid with regard to the content encoded), I'm not sure how best to handle it (should we leave it entirely to the community? And can it be articulated with the LangCom?)

PS: in any case, attention should be paid to capitalization: ISO 639 is all lowercase, ISO 3166 is all uppercase and ISO 15924 has only the first letter in uppercase (the same letters can exist in different codes: 'ca' for Catalan and 'CA' for Canada).
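The capitalization convention in the PS can be sketched as a small normalizer. This is a hedged illustration, not an existing API: `normalize_case` is a hypothetical name, and the function decides what each subtag is purely from its length and position (2 letters after the first subtag is treated as a region, 4 letters as a script). That is an assumption that holds for simple tags like the ones discussed here. Anything after a single-character subtag (extension or private use) is simply lowercased.

```python
def normalize_case(tag):
    """Apply the usual casing: language lowercase, script Titlecase, region UPPERCASE.

    Hypothetical helper; subtag roles are guessed from length and position only.
    """
    subtags = tag.split("-")
    result = [subtags[0].lower()]
    # After a singleton such as "x", subtags are not scripts or regions.
    after_singleton = len(subtags[0]) == 1
    for subtag in subtags[1:]:
        if after_singleton or len(subtag) == 1:
            after_singleton = True
            result.append(subtag.lower())
        elif len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.capitalize())  # script, e.g. Cyrl
        elif len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())       # region, e.g. CH
        else:
            result.append(subtag.lower())       # variants and everything else
    return "-".join(result)


print(normalize_case("cu-cyrl"))     # cu-Cyrl
print(normalize_case("DE-ch-1996"))  # de-CH-1996
```

Normalizing on input would let editors type 'cu-cyrl' or 'CU-CYRL' and still end up with a single stored form, while 'ca' (Catalan) and 'fr-CA' (Canada as a region) stay distinguishable by position.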

Pamputt added a subscriber: Pamputt. · Jun 1 2018, 1:21 PM

I strongly support Micru's proposal. Using items to identify a language, dialect or whatever has plenty of advantages:

  • it is really flexible. It is possible to use a dialect, sub-dialect, language, proto-language, and so on
  • it is unambiguous. One item uniquely identifies one language, whereas several codes may identify the same language.
  • it is already available and it will be maintained by the Wikidata community
  • it avoids conflicts over which languages are included. For example, items already exist for the Serbian, Croatian, Montenegrin and Bosnian languages. Some linguists consider them a single language. Since some people would like to contribute in only one of these "languages", using items allows that directly. If we use codes, how do we manage such a case?

Also, I think we should decouple language and script. If languages are like humans, then scripts are clothes. One language may use several scripts without modifying the language itself. So if we decide to use an item to identify the language, I think we have to use another item to specify the script used by the Lexeme.

So, instead of having unreadable and incomprehensible codes, we should use two elements: one for the language and one for the script.

As explained in my previous message, I agree we need to specify at least language and script. For the rest (country, orthography reform, ...), I think the best way to store this kind of information is to use properties on the lexeme itself. The advantage of properties is that they are really flexible, so we can decide a posteriori what kind of information we want to store in a lexeme.

Let us take the example of the "de-CH-1996" code: am I supposed to use this code only for words in Swiss German following the 1996 reform? If the same lexeme is used both in Swiss German and in "standard German", should I use this code? If a Swiss German lexeme has not been modified by the 1996 reform, should I use "de-CH-1996", "de-CH" or both? All are valid, and I think it will be really difficult to manage.

As explained in my previous message, I agree we need to specify at least language and script. For the rest (country, orthography reform, ...), I think the best way to store this kind of information is to use properties on the lexeme itself. The advantage of properties is that they are really flexible, so we can decide a posteriori what kind of information we want to store in a lexeme.

Indeed, properties are a good idea, but how would you deal with Alptraum / Albtraum?
And why not do both, the current tagging and properties? (The same goes for Lexical Category and Grammatical features; the need for both has already been discussed.)

Let us take the example of the "de-CH-1996" code: am I supposed to use this code only for words in Swiss German following the 1996 reform?

Obviously yes

If the same lexeme is used both in Swiss German and in "standard German", should I use this code?

Obviously not.

If a Swiss German lexeme has not been modified by the 1996 reform, should I use "de-CH-1996", "de-CH" or both?

de-CH (or another more precise tag). Both is impossible, as a lemma can only have one language (if there are two languages, it's not one lemma, it's two lemmata).

All are valid, and I think it will be really difficult to manage.

Yes, all codes are valid, but they don't mean the same thing.

Where is the difficulty here? More exactly: how is it less difficult than properties? (See all the wars on some WD items, on Wiktionaries or among scholars; there is no reason that this will be magically solved on Wikidata, either by codes or by properties.)

In T195740#4248545, @Micru wrote:

@Lea_Lacroix_WMDE Why do we need to use a list of language codes at all? Why not do as with units and let the user select any item, and then have it checked with constraints?

Units seem to me the worst possible example: the suggestion gives you everything, not just units (this is very confusing for new editors; we frequently get questions about that), the result is often a mess, and Wikidatians spend a lot of time building and solving the constraints. I would prefer not to waste more time (especially as there will quickly be many more lexemes than there are unit values).

I am generally in favor of Micru's proposal, and perhaps Pamputt's elaboration of it above: for one, using Wikidata items directly allows the lemma language to be displayed naturally in the user's own script/language, and there are other automatic bonuses of using items, given the structured data ethos. However, I'm a little confused about the details of how this would work. Specifically, the most commonly used lexemes would usually have the same spelling, use, etc. across all variants of a language; do we give those a more general language ("en" = Q1860, say) and only use the specific items mentioned ("en-US" = Q7976, "en-GB" = Q7979, "en-CA" = Q44676, etc.) where there really are variations? Or would it be possible to attach multiple language items to a single lexeme, to indicate it applies to several specific variants?
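The first option in this question (a general language item by default, specific items only where spellings differ) could look roughly like the Python sketch below. It is one possible reading, not a description of how Wikibase works: the `PARENT` mapping and the `spelling_for` helper are hypothetical, and the only Q-IDs used are the ones quoted in the comment above.

```python
# Hypothetical "variant -> more general language" mapping (assumption),
# using the Q-IDs quoted above: en-US, en-GB and en-CA all fall back to en.
PARENT = {"Q7976": "Q1860", "Q7979": "Q1860", "Q44676": "Q1860"}


def spelling_for(representations, language_item):
    """Return the spelling stored for language_item, or for a more general language.

    representations maps a language item ID to a spelling.
    Returns None when neither the item nor any ancestor has a spelling.
    """
    item = language_item
    while item is not None:
        if item in representations:
            return representations[item]
        item = PARENT.get(item)  # climb to the more general language, if any
    return None


water = {"Q1860": "water"}                       # same spelling in every variant
colour = {"Q7976": "color", "Q7979": "colour"}   # only variant-specific spellings

print(spelling_for(water, "Q7979"))    # water (falls back to "en")
print(spelling_for(colour, "Q7976"))   # color
print(spelling_for(colour, "Q44676"))  # None (no en-CA or general spelling stored)
```

The last call shows the gap this option leaves: a variant with no spelling of its own and no general one gets nothing, which is where attaching several language items to one spelling (the second option) would differ.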

Ltrlg added a subscriber: Ltrlg. · Jun 4 2018, 9:13 AM

Indeed, properties are a good idea, but how would you deal with Alptraum / Albtraum?

I would say Alptraum and Albtraum should be two Lexemes, not just one. The Forms of Alptraum would be "Alptraums", "Alpträume", "Alpträumen" and so on, and likewise for Albtraum with "Albtraums", "Albträume", "Albträumen" and so on.

And why not do both, the current tagging and properties? (The same goes for Lexical Category and Grammatical features; the need for both has already been discussed.)

About the other points, I already explained why I think a Q-ID is the best solution (@ArthurPSmith put it differently, but it is the same idea). And properties can help in the other cases you showed. Using both systems is only confusing, without adding any advantage.

No decision has been made so far, so the ticket is not ready to be closed yet.

Vvjjkkii renamed this task from Decide on a way forward for acceptable languages for lemmas and representations to 35baaaaaaa. · Jul 1 2018, 1:07 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot renamed this task from 35baaaaaaa to Decide on a way forward for acceptable languages for lemmas and representations.
CommunityTechBot added a subscriber: Aklapper.