[[ https://docs.google.com/document/d/1aWPNBVCCtkvmW5cIy2QtinUrWnYbqm7hu3puTaKeMXk/edit# | From Markus's review ]]:
It should be explained exactly how the language tag is obtained from the stored language code as found in the JSON dumps. Unfortunately, Wikimedia language codes do not always match what the rest of the world is using, so some translation is needed.
- Clarification: Wikimedia language tags. The official structure of language tags that everybody is using is defined in BCP 47 (http://www.rfc-editor.org/rfc/bcp/bcp47.txt). I know of two places that define known tags:
- The language tag registry of IANA: http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
- ISO639-3: http://www-01.sil.org/iso639-3/codes.asp
I don’t know whether IANA and ISO agree in all cases; IANA seems to be updated regularly. Wikimedia uses similar tags, but sometimes not with the IANA/ISO meanings. The main exceptions are documented here: http://meta.wikimedia.org/wiki/Special_language_codes .
Some of these exceptions are critical (e.g., Wikimedia uses “als” to encode Alemannisch, but ISO and IANA use this code for Tosk Albanian).
Nevertheless, the Meta page on language exceptions might not always give the best choice. For example, Wikimedia’s “cbk-zam” does not exist in the registries, but BCP 47 has a mechanism for extending existing tags in such cases (private-use subtags introduced by “x”): this would suggest the use of “cbk-x-zam”. The Meta page suggests using “cbk” instead, which would mean that the “zam” information is lost. This is maybe not a big problem since Wikimedia uses no other language variant of cbk, but it is a problem for things like “de-formal” and “nl-informal”. These languages could be encoded as “de-x-formal” and “nl-x-informal”. Encoding them as “de” and “nl” would make the data indistinguishable from plain “de” and “nl” data in RDF.
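To illustrate the RDF concern, here is a minimal sketch in Python. The label texts, the two mapping dictionaries, and the helper `tags_after` are all made up for illustration; none of them are taken from the Wikidata code base or the dumps.

```python
# Hypothetical labels keyed by Wikimedia language code (example texts only).
labels = [
    ("de", "Gib dein Passwort ein"),
    ("de-formal", "Geben Sie Ihr Passwort ein"),
]

# Two candidate translations of "de-formal" into a BCP 47 tag.
lossy = {"de-formal": "de"}                 # drops the variant
private_use = {"de-formal": "de-x-formal"}  # keeps it as a private-use subtag


def tags_after(mapping):
    """Return the set of language tags the literals would carry in RDF."""
    return {mapping.get(code, code) for code, _text in labels}


# With the lossy mapping both literals end up tagged "de", so a consumer
# can no longer tell the formal text from the regular one.
lossy_tags = tags_after(lossy)

# With the private-use subtag the two literals keep distinct tags.
kept_tags = tags_after(private_use)
```

With the lossy mapping only one tag survives; with the private-use mapping both variants remain distinguishable.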
Wikidata Toolkit has a replacement table for Wikimedia language codes that compiles some of my insights there: https://github.com/Wikidata/Wikidata-Toolkit/blob/master/wdtk-datamodel/src/main/java/org/wikidata/wdtk/datamodel/interfaces/WikimediaLanguageCodes.java (I don’t claim that this is fully current though). A better approach would be to encode only the exceptions.
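An exceptions-only table could look roughly like the following Python sketch. The names `WIKIMEDIA_EXCEPTIONS` and `to_bcp47` are hypothetical, the table is deliberately incomplete, and the “gsw” entry for Alemannic is my assumption based on the IANA registry rather than something stated above; the other three entries follow the examples in the review.

```python
# Hypothetical exceptions-only table: Wikimedia code -> BCP 47 language tag.
# Codes not listed here are assumed to already be valid BCP 47 tags.
WIKIMEDIA_EXCEPTIONS = {
    "als": "gsw",                   # assumption: Alemannic, not Tosk Albanian
    "cbk-zam": "cbk-x-zam",         # keep "zam" as a private-use subtag
    "de-formal": "de-x-formal",     # keep the formal variant distinguishable
    "nl-informal": "nl-x-informal",
}


def to_bcp47(wikimedia_code: str) -> str:
    """Translate a Wikimedia language code as found in the JSON dumps.

    Only the known exceptions are rewritten; every other code is passed
    through unchanged.
    """
    return WIKIMEDIA_EXCEPTIONS.get(wikimedia_code, wikimedia_code)
```

Passing unknown codes through unchanged keeps the table short, but it also means a new Wikimedia-specific code would be emitted as-is until someone adds it to the table.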