Use the same list of languages for monolingual text and lexemes
Open, Needs Triage, Public

Description

As an editor, I want to use the same language codes when moving between "monolingual text statements" and "lexemes" in order to have a more predictable and smoother workflow.

As a data reuser, I want the same language codes to be used on "monolingual text statements" and "lexemes" in order to have a more consistent representation of data.

Problem:
Currently, "monolingual text statements" and "lexemes" use different lists of languages, resulting in different language codes being used for the same language.

The different language codes can result in an inconsistent representation of data and make work difficult for users who move between "monolingual text statements" and "lexemes".

This can also cause confusion and frustration for editors when they can enter data for one but not the other because the same language code is not supported in both places.

Example:
As the text in any "monolingual text statement" could also be covered by lexemes, the same language codes should be used for both "monolingual text statements" and "lexemes".
Merging the lists for "monolingual text statements" and "lexemes" so that they use the same language codes could make for a better user experience for both editors and reusers.

Acceptance criteria:

  • The lists for "monolingual text statements" and "lexemes" are merged so that they use the same language codes

Notes
List of Lists of Languages

Original ticket

Currently, monolingual text statements and lexemes have separate lists of additional languages.

There are multiple monolingual text properties designed for use on lexemes. Therefore all lexeme language codes should be usable for monolingual text statements.

By definition, monolingual text statements include text. If we can represent something as a monolingual text statement, it contains content which could have lexemes. Therefore all monolingual text language codes should be usable for lexemes.

Advantages of combining the two lists:

  • More consistent data representation - right now we have to use one language code in some situations and another in others.
  • More predictable for users - users don't expect language codes to sometimes work and sometimes not work.
  • Easier to maintain - there would be fewer lists of languages to update.

Potential issues:

  • What about special language codes which aren't for a particular language? Monolingual text allows und, mis, mul and zxx. We already have mis for lexemes but what about the others?

This would be one way to solve T320887

Event Timeline

I think there's an argument for allowing und, particularly for etymology. Sometimes a word is given but it's not clear which language is actually intended.

I don't think mul or zxx make sense on lexemes. Anything that isn't specific to a language is more conceptual and that sort of stuff belongs on items. However, the advantages of merging the two lists are big enough that I think it should be done even if it means allowing those two for lexemes.

jhsoby subscribed.

+1, I strongly support this – having looked at most of the new language codes that have been added through the years, I have yet to come across a language that makes sense for one but not the other (with the possible exception of the special ones Nikki mentions, but that should be easy enough to solve – just make WikibaseLexeme.mediawiki-services.php's $additionalLanguages equal WikibaseContentLanguages.php's getDefaultMonolingualTextLanguages() and then unset the special ones).
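The approach suggested above (reuse the monolingual text list for lexemes, then drop the special codes) could look roughly like the following. This is only an illustrative Python sketch, not the actual PHP in Wikibase or WikibaseLexeme: the sample list, the function name `lexeme_languages`, and the choice of which special codes to exclude are all assumptions made for the example.

```python
# Hypothetical excerpt; the real monolingual text list is much longer.
MONOLINGUAL_TEXT_LANGUAGES = ["en", "de", "nb", "und", "mis", "mul", "zxx"]

# Assumption for illustration: mis stays because lexemes already accept it;
# whether und, mul and zxx should be excluded is still under discussion.
SPECIAL_CODES_TO_UNSET = {"und", "mul", "zxx"}


def lexeme_languages(monolingual, excluded):
    """Return the monolingual text codes minus the excluded ones, order preserved."""
    return [code for code in monolingual if code not in excluded]


if __name__ == "__main__":
    print(lexeme_languages(MONOLINGUAL_TEXT_LANGUAGES, SPECIAL_CODES_TO_UNSET))
```

Deriving one list from the other, rather than maintaining two lists by hand, means a newly added monolingual text code becomes available for lexemes without a second change.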

I think all additional language lists should be killed in favor of language-data.

For mul, see also https://en.wiktionary.org/wiki/Wiktionary:About_Translingual. Note that we have items for each Unicode character, but mul lexemes would still be useful for the following:

  • Abbreviations and codes, especially those with multiple meanings
  • Symbols and punctuation with multiple meanings

Change 974656 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/WikibaseLexeme@master] Support all monolingual text languages for Lexemes

https://gerrit.wikimedia.org/r/974656

Change 974656 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Support all monolingual text languages for Lexemes

https://gerrit.wikimedia.org/r/974656

As part of T341409 this has been (mostly) done. WikibaseLexeme, for backwards compatibility, still supports the following language codes which we don't support for monolingual text values:

bat-smg
be-x-old
de-formal
es-formal
fiu-vro
hu-formal
nl-informal
roa-rup
simple
zh-classical
zh-min-nan
zh-yue
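
The remaining difference can be pictured as a shared list plus a lexeme-only set of legacy codes. The Python sketch below is a simplified model under stated assumptions, not the WikibaseLexeme implementation: `SHARED_CODES` is a hypothetical excerpt and both function names are invented for the example; only the twelve legacy codes are taken from the list above.

```python
# Legacy codes listed above: accepted for lexemes only.
LEGACY_LEXEME_ONLY_CODES = frozenset({
    "bat-smg", "be-x-old", "de-formal", "es-formal", "fiu-vro", "hu-formal",
    "nl-informal", "roa-rup", "simple", "zh-classical", "zh-min-nan", "zh-yue",
})

# Hypothetical excerpt of the codes shared by both features.
SHARED_CODES = frozenset({"en", "de", "mis"})


def is_valid_for_monolingual_text(code):
    """Monolingual text values accept only the shared list."""
    return code in SHARED_CODES


def is_valid_for_lexeme(code):
    """Lexemes accept the shared list plus the legacy codes."""
    return code in SHARED_CODES or code in LEGACY_LEXEME_ONLY_CODES


if __name__ == "__main__":
    for code in ("en", "zh-yue", "xx-unknown"):
        print(code, is_valid_for_lexeme(code), is_valid_for_monolingual_text(code))
```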