Support for pseudo-language "mul" to indicate multilingual content
Open, Medium, Public, Feature

Description

MediaWiki should support the language code "mul" to indicate multilingual content. This could be used as the content language on multilingual wikis, to avoid using a misleading "true" language code (as meta.wikimedia.org and commons.wikimedia.org currently do: they claim to be in English, although they are not). It could also be used to indicate the language of truly multilingual pages, such as item pages on wikidata.org.

The Language object for "mul" could be derived from the Language object for English, so it would inherit English messages, English date formatting, etc. That is, sites that start using "mul" as their content language instead of "en" would keep working exactly as before, except that they would no longer lie about their content language.

Support for "mul" would also come in handy for the language fallback mechanism: "mul" could act as the default fallback for anything (instead of "en"). This is especially interesting for the fallback mechanism in Wikibase, which (unlike MediaWiki's i18n system) cannot assume that English-language messages exist.
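
The fallback idea could be sketched roughly as follows. This is an illustration in Python only, not MediaWiki or Wikibase code: the FALLBACKS table and the functions fallback_chain and resolve_label are hypothetical names, and the single "de-at" entry is just an example chain.

```python
# Hypothetical sketch: every fallback chain ends in "mul" instead of "en".
# FALLBACKS, fallback_chain and resolve_label are illustrative names,
# not part of MediaWiki or Wikibase.

FALLBACKS = {
    "de-at": ["de"],  # example entry, assumed for illustration
}

def fallback_chain(code):
    """Return the codes to try for `code`, always ending with "mul"."""
    chain = [code]
    for fallback in FALLBACKS.get(code, []):
        if fallback not in chain:
            chain.append(fallback)
    if "mul" not in chain:
        chain.append("mul")
    return chain

def resolve_label(labels, code):
    """Return (language, text) for the first code in the chain that has a label, else None."""
    for candidate in fallback_chain(code):
        if candidate in labels:
            return candidate, labels[candidate]
    return None
```

With labels such as {"mul": "Rackwitz"}, a request for "de-at" would try "de-at", then "de", and finally return the "mul" value, so no English text has to exist.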


Version: 1.21.x
Severity: enhancement
See Also:
T34189: An interlanguage link to oldwikisource is required

Details

Reference
bz41807

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:15 AM
bzimport set Reference to bz41807.
bzimport added a subscriber: Unknown Object (MLST).

Oh, I forgot to mention: "mul" is defined by ISO 639-2. It's in the standard, not an ad-hoc custom solution. "und" for "undetermined" is also defined and should perhaps be supported by MediaWiki.

(In reply to comment #1)

Oh, I forgot to mention: "mul" is defined by ISO 639-2. It's in the standard,
not an ad-hoc custom solution. "und" for "undetermined" is also defined and
should perhaps be supported by MediaWiki.

Yes, this is already acknowledged in the bug I linked (which is different because it asks only for a [fake] interlanguage prefix); however, see bug 32189 comment 25 for some implementation problems.

Using mul is not a replacement for tagging the content correctly. If there are things in multiple languages, tag each of them with the correct language code. If the problem is only that we don't know the language right now, mul is not a correct code.

The templates and commons and the page translation system are tagging the language correctly, and I think this covers a big portion of the non-English content in those wikis. Marking the rest with the code mul would actually be a huge regression, since I suppose the majority of the remaining content is actually in English.

(In reply to comment #3)

Using mul is not a replacement for tagging the content correctly.

Yes, and that's not what I'm suggesting.

Ok, here's the use case. Consider a wikidata page with labels in 20 languages. Each label will be tagged in the HTML code with the correct language and directionality attributes. However:

  • What do we put into the HTTP Content-Language header? I think "mul" would be correct there. Similarly, when supplying DC meta data about a page (as the OAI extension does), there is only one language code that can be provided, so "mul" would be the correct choice.
  • More urgently, we need a code that we can use as a general fallback - Many things (especially people and places, i.e. over 50% of the content of wikipedia and thus wikidata) have "native" versions of their name that should act as a fallback for all other languages - and that name is indeed the correct one for *multiple* languages. I doubt that there are renderings for the town "Rackwitz" in languages other than German. Having to set this string redundantly for 300 languages makes no sense to me. So, setting "Rackwitz" as the value for the "mul" code, and falling back on this, makes sense. Using "en" for this purpose would be grossly misleading, especially in cases where the "native" form is not using latin characters (say, Руза).
  • Also, what language do we announce for the entire wiki? The API, site matrix, etc. can tell you which wiki has which content language. Using "en" for multilingual wikis is annoying. I see no reason not to get rid of that lie.

  1. Do we need to use that header at all? I've never seen it used. It could be the interface language code.
  2. That seems like internal design leaking into the interface. Can't you just make it possible to designate one language to be the default in the interface? The way you store that information internally can be language code mul, but doesn't need to be.
  3. That is a valid point, but that could be implemented via a new config option $wgMultilingualWiki, which as a first stage would change the language code in the API and other places to mul.

Relevant Wikidata discussion: https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#Special_language_code_.22mul.22.2FTranslingual

I believe this bug is highly related to bug 36430, if not a duplicate of it. mul is just the top of the language fallback pyramid.

(In reply to comment #4)

Ok, here's the use case. Consider a wikidata page with labels in 20 languages.
Each label will be tagged in the HTML code with the correct language and
directionality attributes. However:

  • What do we put into the HTTP Content-Language header? I think "mul" would be correct there. Similarly, when supplying DC meta data about a page (as the OAI extension does), there is only one language code that can be provided, so "mul" would be the correct choice.

Interface language, which is also the language of the title of the page. This is a list of words in many languages, but the list itself is in one language however little linguistic content it has.

  • More urgently, we need a code that we can use as a general fallback - Many things (especially people and places, i.e. over 50% of the content of wikipedia and thus wikidata) have "native" versions of their name that should act as a fallback for all other languages - and that name is indeed the correct one for *multiple* languages. I doubt that there are renderings for the town "Rackwitz" in languages other than German. Having to set this string redundantly for 300

You lost your bet: Serbian rendering is Раквиц in Cyrillic or Rakvic in Latin. Languages that use the Latin alphabet generally just reuse the name; languages that use other alphabets generally transliterate or transcribe the name; Chinese even has to translate it.

You owe me one beer at the Bavaria pub at the destroyed church; John will tell you why is this bad for you :)

languages makes no sense to me. So, setting "Rackwitz" as the value for the
"mul" code, and falling back on this, makes sense. Using "en" for this purpose
would be grossly misleading, especially in cases where the "native" form is not
using latin characters (say, Руза).

But I fully agree with this, and I would even go one step further: use the code mul-de (multilingual content of German origin). This would allow us to in some cases automatically convert and display the names in languages that use different alphabets.
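
A rough sketch of how such a code could be used, in Python for illustration only. The "mul-<origin>" convention is the proposal made here, not an existing standard, and split_mul_code, render_name and the transliterator registry are hypothetical names. No real transliterator is implied; callers would have to supply their own.

```python
# Hypothetical sketch: read "mul-de" as "multilingual, of German origin"
# and use the origin to select a transliterator for the target language.

def split_mul_code(code):
    """Split "mul" or "mul-xx" into ("mul", origin); origin is None for plain "mul"."""
    base, _, origin = code.partition("-")
    if base != "mul":
        raise ValueError(f"not a mul code: {code!r}")
    return base, (origin or None)

def render_name(name, code, target, transliterators):
    """Convert `name` if a transliterator for (origin, target) is registered.

    `transliterators` maps (origin, target) pairs to callables.
    If no entry matches, the name is returned unchanged.
    """
    _, origin = split_mul_code(code)
    convert = transliterators.get((origin, target))
    return convert(name) if convert else name
```

A name stored under plain "mul" has no origin, so it would always be shown unchanged. A name under "mul-de" would be converted only for targets that have a registered German-origin transliterator.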

(In reply to comment #5)

  1. Do we need to use that header at all? I've never seen it used. Could be the interface language code.

It's good practice to send it. But I guess you are right - for a multilingual site, it could and perhaps should be the interface code. I think currently, MediaWiki always sends the content language.
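
As a sketch of that rule (Python for illustration only; content_language_header is a hypothetical helper, not a MediaWiki function), the header value could be chosen like this:

```python
# Hypothetical sketch: choose the value for the HTTP Content-Language header.
# On a wiki whose content language is "mul", send the interface language;
# otherwise send the content language.

def content_language_header(content_lang, interface_lang):
    """Return the language code to send as Content-Language."""
    if content_lang == "mul":
        return interface_lang
    return content_lang
```

Under this rule, a "mul" wiki viewed with a German interface would send "de", while a wiki with a single content language would keep sending that language regardless of the interface.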

  2. That seems like internal design leaking to the interface. Can't you just make it possible to designate one language to be the default in the interface? The way you store that information internally can be language code mul, but doesn't need to be.

We can't specify that globally, and asking users to specify it for every item is likely to cause a mess.

But I don't understand what you mean by "leaking into the interface." "mul" would be handled similarly to "qqx" - there's no leaking there, right?

But I do think we need to be able to create a Language object for mul. For instance, Title::getPageLanguage() and Content::getPageContentLanguage() return a Language object. I do want to return "mul" for that for Wikidata - because the page content *is* multilingual.
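
To illustrate what such an object could look like when derived from English, as the task description proposes, here is a minimal Python sketch. The Language class below is a stand-in written for this example and is not MediaWiki's Language class; the "search" message is made up for illustration.

```python
# Hypothetical sketch: a "mul" language object that reports its own code
# but inherits messages from English.

class Language:
    def __init__(self, code, messages, parent=None):
        self.code = code          # the code reported to callers, e.g. "mul"
        self.messages = messages  # messages defined directly for this language
        self.parent = parent      # language to inherit from, if any

    def get_message(self, key):
        """Look the key up here first, then along the chain of parents."""
        lang = self
        while lang is not None:
            if key in lang.messages:
                return lang.messages[key]
            lang = lang.parent
        raise KeyError(key)

english = Language("en", {"search": "Search"})
mul = Language("mul", {}, parent=english)
```

Here mul.code is "mul", so callers asking for the page language get the multilingual code, while mul.get_message("search") still resolves to the English text.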

  3. That is valid point, but that could be implemented via a new config option $wgMultilingualWiki, which at first stage would change the language code in API and other places to mul.

But to do that, I again need to be able to construct a Language object for mul, don't I?

(In reply to comment #6)

  • What do we put into the HTTP Content-Language header? I think "mul" would

Interface language, which is also the language of the title of the page.

Makes sense for Wikidata. Probably not for Wikipedia. If you browse the German language Wikipedia with an English interface, would you consider the content language to be English?

I doubt that there are renderings for the town "Rackwitz"
in languages other than German. Having to set this string redundantly for 300

You lost your bet: Serbian rendering is Раквиц in Cyrillic or Rakvic in Latin.

Duh! Wikipedia never ceases to amaze me. 19 Languages! For a town that doesn't even have a gas station or a hair dresser!

Anyway. There's quite a few places that we don't have translations or transliterations for.

You owe me one beer at the Bavaria pub at the destroyed church; John will tell
you why is this bad for you :)

I owe you a beer, but let me pick the pub :)

But I fully agree with this, and I would even go one step further: use the code
mul-de (multilingual content of German origin). This would allow us to in some
cases automatically convert and display the names in languages that use
different alphabets.

Oh, nice idea!

(In reply to comment #7)

(In reply to comment #6)

  • What do we put into the HTTP Content-Language header? I think "mul" would

Interface language, which is also the language of the title of the page.

Makes sense for Wikidata. Probably not for Wikipedia. If you browse the German
language Wikipedia with an English interface, would you consider the content
language to be English?

Of course, German.

Offtopic, for http://commons.wikimedia.org/wiki/Hauptseite I would expect it to be German as well. Perhaps a parser function that can change that?

I doubt that there are renderings for the town "Rackwitz"
in languages other than German. Having to set this string redundantly for 300

You lost your bet: Serbian rendering is Раквиц in Cyrillic or Rakvic in Latin.

Duh! Wikipedia never ceases to amaze me. 19 Languages! For a town that doesn't
even have a gas station or a hair dresser!

Anyway. There's quite a few places that we don't have translations or
transliterations for.

But in quite a few cases we can automatically convert them. A German-Serbian transliterator would probably be able to cover 90% of German cities correctly, including Rackwitz. Serbian-Macedonian or Japanese-Serbian transliterator could work perfectly.

You owe me one beer at the Bavaria pub at the destroyed church; John will tell
you why is this bad for you :)

I owe you a beer, but let me pick the pub :)

C-Base it is, then ;)

But I fully agree with this, and I would even go one step further: use the code
mul-de (multilingual content of German origin). This would allow us to in some
cases automatically convert and display the names in languages that use
different alphabets.

Oh, nice idea!

Yes, it is also required for the transliterator to work.

(In reply to comment #8)

But in quite a few cases we can automatically convert them. A German-Serbian
transliterator would probably be able to cover 90% of German cities correctly,
including Rackwitz. Serbian-Macedonian or Japanese-Serbian transliterator could
work perfectly.

Offtopic: I would also like to do this for usernames.

[replacing wikidata keyword by adding CC - see bug 56417]

I would love to see such a thing as a "multilingual wiki", like Meta or Commons or Wikidata are. But I guess there are so many things to do to specify correctly what a multilingual wiki really is that one or more RFCs should be drafted.

(The Translate extension helps a bit to improve the current situation on Meta and Commons by tagging correctly more pages.)

I don't think that toponyms or given names should be modelled with "mul" (they can have multiple original languages). But I do support introducing "mul" for taxon names and music album titles, for example.

It seems to me that we have different things mixed in here that should probably be separate: 1) support for a whole wiki to be set to "mul" 2) support for putting labels into Wikidata with language code "mul". I think 2 can be done more easily and should probably get its own task. Thoughts?

It seems to me that we have different things mixed in here that should probably be separate: 1) support for a whole wiki to be set to "mul" 2) support for putting labels into Wikidata with language code "mul". I think 2 can be done more easily and should probably get its own task. Thoughts?

Anyone willing to reply and clarify the scope of this task? Thanks.

The main intention is to have Wikidata Labels and Aliases in mul.

I would expect that Wikilambda will also lead to wiki pages that are best classified as mul, but this can be its own task when it's needed.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:24 PM