
[Task] Ensure that language tags generated in RDF output are standard language names
Closed, Resolved | Public | 1 Estimated Story Points

Description

From Markus's review:

It should be explained how exactly the language tag is obtained from the stored language code as found in JSON dumps. Unfortunately, Wikimedia language codes do not always match what the rest of the world is using, so there is some translation needed.

I don’t know if IANA and ISO agree in all cases; IANA seems to be updated regularly. Wikimedia uses similar tags, but sometimes not with the IANA/ISO meanings. The main exceptions are documented here: http://meta.wikimedia.org/wiki/Special_language_codes .

Some of these are critical (e.g., Wikimedia uses “als” to encode Alemannisch, but ISO and IANA use this code for Tosk Albanian).

Nevertheless, the Meta page on language exceptions might not always give the best choice. For example, Wikimedia’s “cbk-zam” does not exist in the registries, but BCP 47 has a mechanism for extending existing tags in this case (private-use subtags introduced by “x”): this would suggest the use of cbk-x-zam. The Meta page suggests using cbk instead, which would mean that the “zam” information is lost. This is maybe not a big problem since Wikimedia uses no other language variant of cbk, but it is a problem for things like “de-formal” and “nl-informal”. The languages could be encoded as “de-x-formal” and “nl-x-informal”. Encoding them as “de” and “nl” would make the data indistinguishable in RDF.

Wikidata Toolkit has a replacement table for Wikimedia language codes that compiles some of my insights there: https://github.com/Wikidata/Wikidata-Toolkit/blob/master/wdtk-datamodel/src/main/java/org/wikidata/wdtk/datamodel/interfaces/WikimediaLanguageCodes.java (I don’t claim that this is fully current though). A better approach would be to encode only the exceptions.
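To make the "encode only the exceptions" idea concrete, here is a minimal Python sketch. It is not the Wikidata Toolkit table or the eventual patch: the names WIKIMEDIA_EXCEPTIONS and to_bcp47 are hypothetical, the private-use entries are the ones proposed in this description, and the "als" to "gsw" entry is an assumption based on "gsw" being the ISO 639 code for Alemannic/Swiss German.

```python
# Hypothetical exception table: Wikimedia code -> BCP 47 tag.
# Only codes that deviate from the registries are listed; everything
# else passes through unchanged. Illustrative, not a complete list.
WIKIMEDIA_EXCEPTIONS = {
    "als": "gsw",                    # assumption: Alemannic, not Tosk Albanian
    "cbk-zam": "cbk-x-zam",          # private-use subtag keeps the "zam" part
    "de-formal": "de-x-formal",
    "nl-informal": "nl-x-informal",
}


def to_bcp47(wikimedia_code: str) -> str:
    """Return the tag to use in RDF for a stored Wikimedia language code."""
    key = wikimedia_code.lower()
    return WIKIMEDIA_EXCEPTIONS.get(key, wikimedia_code)


if __name__ == "__main__":
    for code in ("de", "de-formal", "als", "cbk-zam"):
        print(code, "->", to_bcp47(code))
```

With a table like this, "de" and "de-formal" stay distinguishable after translation, which is the property asked for above.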


Event Timeline

Smalyshev claimed this task.
Smalyshev raised the priority of this task from to Medium.
Smalyshev updated the task description.

Change 225518 had a related patch set uploaded (by Smalyshev):
T105430: canonicalize language codes

https://gerrit.wikimedia.org/r/225518

Converting the MediaWiki-internal language codes to BCP 47-conformant language codes is a valid task, especially for the RDF and the HTML output. This should be done in all MediaWiki projects and extensions, not only in Wikidata.

Here are some notes on some special language codes:

  • simple is primarily a separate, additional Wikipedia project in English. It is not a separate user interface language, because uselang=simple gets converted to wgUserLanguage = en. simple has wgContentLanguage = en. Wikidata currently has no content with the language code simple, and as far as I know this is unwanted. Introducing a new language code en-x-simple may be possible if there is a use case and community consensus.
  • de-formal and nl-informal are primarily user interface languages. They describe a variant of the language used in communication with the user and normally not in the content. They may be used in content for content-generated user interface texts such as usage descriptions. Switching the user interface language to these codes should not affect using or editing the content. This is requested in T51024. Changing these codes to de-x-formal and nl-x-informal may be possible if this is necessary to conform to BCP 47.

Wikidata currently does not deliver the language code of the wgContentLanguage of the projects. I described this in T59706. This is still not fixed.

@Fomafix we're not talking about user interface languages here. We're talking about the language specification in the RDF export, which should follow BCP 47 and commonly accepted language codes; otherwise third-party tools would not be able to understand which language these strings are in. Of course, for something like "Simple English" there might not be a standard code (correct me if I'm wrong), but at least it should be one that is standard-compliant and not the same as "en"; otherwise a sitelink to Simple English and a sitelink to English would not be distinguishable.

As far as I can see, Simple English is a separate wiki from English - I see "Search the 113,945 articles in the Simple English Wikipedia" on the homepage, so it's not the same articles. Thus, I think we need a separate code for it.

Changing these codes to de-x-formal and nl-x-informal may be possible if this is necessary to conform to BCP 47.

That's what I am doing in the patch. Along with several others that also need to be changed for standard compliance.

Not only the RDF export must have BCP 47-conformant language codes; the HTML lang attribute must have one as well. Simple already uses the correct language code "en" for the content (wgContentLanguage). This language code should be used for the sitelinks. In HTML and in the RDF export. Once T43723 is fixed, the language code for sitelinks to simple will be en. If you want to change this to en-x-simple then create a separate task.

If de-formal and nl-informal do not conform to BCP 47, this should be changed. But not only for the RDF export. This must be changed everywhere where a HTML attribute lang is generated.

de-formal and nl-informal are only used in the user interface. They are not wanted as separate languages for labels, descriptions, and aliases in Wikidata. Therefore these language codes should never occur in an RDF export.

When T43723 is fixed most of your patch for the RDF export is superfluous.

This language code should be used for the sitelinks. In HTML and in the RDF export

That would lead to the situation where links to Simple English wiki and to English wiki are indistinguishable. Which is not good.

If you want to change this to en-x-simple then create a separate task.

This is that task.

This must be changed everywhere where a HTML attribute lang is generated.

That has no relation to the RDF export and is thus outside the scope of this task.

When T43723 is fixed most of your patch for the RDF export is superfluous.

Once it is fixed, we can consider revisiting this code, and if the fix allows removing the special cases, they will be removed. However, since that ticket seems to have been open since 2012, I'd rather fix the RDF export now (which otherwise will be confusing for third-party users - the main audience of the export) than wait for T43723.

This language code should be used for the sitelinks. In HTML and in the RDF export

That would lead to the situation where links to Simple English wiki and to English wiki are indistinguishable. Which is not good.

The sitelinks are distinguishable by the URL (https://simple.wikipedia.org/) and by the siteid (simplewiki). Show me a place where this is not enough.

If you want to change this to en-x-simple then create a separate task.

This is that task.

en is also a standard language code. If you wish to change this code to another value, this is a separate task. Describe your use cases for a separate language code for the Simple projects in T27591. Such a change should be done consistently in all places, not only in the RDF export.

This must be changed everywhere where a HTML attribute lang is generated.

That has no relation to the RDF export and is thus outside the scope of this task.

When you change the language codes at the right position the RDF export gets automatically the correct language codes.

When T43723 is fixed most of your patch for the RDF export is superfluous.

Once it is fixed, we can consider revisiting this code, and if the fix allows removing the special cases, they will be removed. However, since that ticket seems to have been open since 2012, I'd rather fix the RDF export now (which otherwise will be confusing for third-party users - the main audience of the export) than wait for T43723.

I think it is bad programming style to make several workarounds instead of fixing the core problem.

The sitelinks are distinguishable by the URL (https://simple.wikipedia.org/)

The URL is a different triple from the data, and matching on URLs means that the client has to maintain its own database that says which wiki URL matches which language, and do pattern matching on the URL data. This does not sound like a good solution to me, either performance-wise or design-wise. It also couples two things (language and URL) which should not be coupled, as they describe two different things. Since we have triples using schema:inLanguage and data using language tags, we should ensure those have the right values instead of relying on other information to fix wrong values there.
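As a rough illustration of that point, here is a small Python sketch of a consumer that selects sitelinks by the schema:inLanguage value alone, without any URL pattern matching. The helper names are hypothetical, the N-Triples lines are hand-built for the example rather than taken from the actual export, and "en-x-simple" is only the tag proposed in this discussion.

```python
SCHEMA_IN_LANGUAGE = "http://schema.org/inLanguage"


def in_language_triple(page_url: str, language_tag: str) -> str:
    """Build one N-Triples line stating the language of a sitelink page."""
    return f'<{page_url}> <{SCHEMA_IN_LANGUAGE}> "{language_tag}" .'


def pages_in_language(triples: list[str], language_tag: str) -> list[str]:
    """Return the subjects whose schema:inLanguage value equals language_tag."""
    needle = f'<{SCHEMA_IN_LANGUAGE}> "{language_tag}" .'
    return [
        line.split(" ", 1)[0].strip("<>")
        for line in triples
        if line.endswith(needle)
    ]


if __name__ == "__main__":
    # Illustrative data only; "en-x-simple" is an assumed tag.
    triples = [
        in_language_triple("https://en.wikipedia.org/wiki/Douglas_Adams", "en"),
        in_language_triple(
            "https://simple.wikipedia.org/wiki/Douglas_Adams", "en-x-simple"
        ),
    ]
    print(pages_in_language(triples, "en"))
```

If both sitelinks carried "en", the same query would return both pages and the client would have to fall back to inspecting the URLs.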

en is also a standard language code.

True, but it does not adequately describe the data, which refers to "Simple English", not just "English". Just as "nl" would not adequately describe data that refers to "nl-informal", etc. That would be a loss of information, and we should avoid that when exporting data.

When you change the language codes at the right position the RDF export gets automatically the correct language codes.

So far I have seen no code that makes such a change, and I do not feel comfortable starting a project to refactor the whole of language handling in MediaWiki (which would be required if we just change Site::getLanguageCode()) when I just need the right language tag in RDF. If that refactoring ever happens and solves this particular problem, I would be glad to refactor this particular fix. But remaining without a fix until some undefined moment when this happens does not sound like a good way to go to me.

I think it is bad programming style to make several workarounds instead of fixing the core problem.

If somebody volunteers to fix the core problem, I think that would be excellent. If not, as has evidently been the case since 2012, I don't see what use it is to discuss a hypothetical fix that might have been instead of fixing the actual thing that needs to be fixed. Following this strategy would only lead us to discussing bigger and bigger global solutions in theory without actually getting anything done in practice.

Please do not quote and comment on parts of a sentence without the context.

I created T106367 for some special language codes that do not conform to BCP 47. Please add more codes there if you find any.

I implemented this in wfBCP47() (https://gerrit.wikimedia.org/r/226040). This would also solve the problem in the RDF export for these languages when wfBCP47() is used. Please comment on the patch. Maybe there is a better place for such a mapping.
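For readers unfamiliar with what formatting a code as BCP 47 involves beyond the exception mapping, here is a Python sketch of the casing conventions that BCP 47 (RFC 5646, section 2.1.1) recommends. It is not the code from the patch above and not MediaWiki's implementation; format_bcp47_case is a hypothetical name, and the sketch assumes the input is already a syntactically valid tag.

```python
def format_bcp47_case(tag: str) -> str:
    """Apply the casing recommended for BCP 47 tags.

    All subtags are lowercase, except that two-letter subtags become
    uppercase (regions) and four-letter subtags become title case
    (scripts) when they are not the first subtag and do not directly
    follow a singleton such as "x".
    """
    parts = tag.lower().split("-")
    formatted = []
    after_singleton = False
    for index, part in enumerate(parts):
        if index > 0 and not after_singleton and len(part) == 2:
            formatted.append(part.upper())
        elif index > 0 and not after_singleton and len(part) == 4:
            formatted.append(part.capitalize())
        else:
            formatted.append(part)
        after_singleton = len(part) == 1
    return "-".join(formatted)


if __name__ == "__main__":
    for raw in ("zh-hant-tw", "DE-X-FORMAL", "sr-latn"):
        print(raw, "->", format_bcp47_case(raw))
```

Casing is not significant when tags are compared, so this step only affects how the tags look in output; mapping the non-conforming codes themselves is a separate step.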

Nemo_bis set Security to None.
Jonas renamed this task from Ensure that language tags generated in RDF output are standard language names to [Task] Ensure that language tags generated in RDF output are standard language names. Aug 19 2015, 12:14 PM

Change 232751 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Fix all issues found while reviewing language code canonicalization

https://gerrit.wikimedia.org/r/232751

Change 225518 had a related patch set uploaded (by Smalyshev):
Canonicalize language codes

https://gerrit.wikimedia.org/r/225518

Change 225518 merged by jenkins-bot:
Canonicalize language codes

https://gerrit.wikimedia.org/r/225518

Change 232751 merged by jenkins-bot:
Fix all issues found while reviewing language code canonicalization

https://gerrit.wikimedia.org/r/232751