
Wikibase uses multiple different mappings to standardise language codes
Open, Needs Triage, Public

Description

MediaWiki has a mapping for language codes in includes/language/LanguageCode.php. Wikibase has its own mapping in repo/config/Wikibase.default.php.

Some are the same:

| Code | MediaWiki and Wikibase |
| --- | --- |
| de-formal | de-x-formal |
| es-formal | es-x-formal |
| hu-formal | hu-x-formal |
| map-bms | jv-x-bms |
| nl-informal | nl-x-informal |
| simple | en-simple |

Some are different:

| Code | MediaWiki | Wikibase |
| --- | --- | --- |
| cbk-zam | cbk | cbk-x-zam |
| crh | crh (not changed) | crh-Latn |
| nrm | nrf | fr-x-nrm |
| roa-tara | nap-x-tara | it-x-tara |
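
The disagreement between the two mappings can be checked mechanically. The following is a minimal Python sketch, not the actual PHP code: the two dictionaries only transcribe the rows listed in this task (they are not the full mappings from LanguageCode.php or Wikibase.default.php), and the names `MEDIAWIKI_MAP`, `WIKIBASE_MAP` and `conflicting_codes` are hypothetical. It also assumes that a code missing from a mapping is left unchanged.

```python
# Rows shared by both mappings, transcribed from this task (not the full lists).
SHARED = {
    "de-formal": "de-x-formal",
    "es-formal": "es-x-formal",
    "hu-formal": "hu-x-formal",
    "map-bms": "jv-x-bms",
    "nl-informal": "nl-x-informal",
    "simple": "en-simple",
}

# Hypothetical stand-ins for the two real mappings.
# "crh" is deliberately absent from MEDIAWIKI_MAP: the task says MediaWiki
# does not change it.
MEDIAWIKI_MAP = {**SHARED, "cbk-zam": "cbk", "nrm": "nrf", "roa-tara": "nap-x-tara"}
WIKIBASE_MAP = {
    **SHARED,
    "cbk-zam": "cbk-x-zam",
    "crh": "crh-Latn",
    "nrm": "fr-x-nrm",
    "roa-tara": "it-x-tara",
}


def conflicting_codes(first, second):
    """Return {code: (first_result, second_result)} where the mappings disagree.

    Assumption: a code missing from a mapping is passed through unchanged.
    """
    conflicts = {}
    for code in sorted(set(first) | set(second)):
        a = first.get(code, code)
        b = second.get(code, code)
        if a != b:
            conflicts[code] = (a, b)
    return conflicts


if __name__ == "__main__":
    for code, (mw, wb) in conflicting_codes(MEDIAWIKI_MAP, WIKIBASE_MAP).items():
        print(f"{code}: MediaWiki={mw} Wikibase={wb}")
```

A check along these lines could be run against the real lists to find any further codes that the two mappings standardise differently.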

As far as I can tell, the Wikibase mapping is only used for sitelinks in RDF. Elsewhere in RDF, language codes are not converted at all (the ticket for that is T243428). When displaying entities, the HTML lang attributes use the MediaWiki mapping. This results in the same language code being standardised in different ways.

For example: on https://www.wikidata.org/wiki/Q5296, the roa-tara.wikipedia.org sitelink has lang="nap-x-tara" and hreflang="nap-x-tara" in the HTML, and on https://roa-tara.wikipedia.org/ the <html> element has lang="nap-x-tara". The RDF, however, has schema:inLanguage "it-x-tara" and schema:name "Pagene Prengepále"@it-x-tara.

Both describe the same text/page, and HTML and RDF use the same standard for language codes (BCP 47), so the language code should be the same in both places.

The function which uses Wikibase's mapping (in repo/includes/Rdf/RdfVocabulary.php) already uses LanguageCode::bcp47 (which uses MediaWiki's mapping), so perhaps Wikibase doesn't need its own mapping at all. If it needs to be possible to customise the mapping, it would probably make more sense for the MediaWiki list to be customisable.
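
As a rough illustration of the "one customisable list" idea, here is a minimal Python sketch. It is not how MediaWiki or Wikibase work today: `BASE_MAP` is a hypothetical stand-in holding only a few rows from this task, `standardise` and its `overrides` parameter are invented for the example, and any other normalisation that LanguageCode::bcp47 may perform is left out.

```python
from typing import Mapping, Optional

# Hypothetical stand-in for a single shared mapping; only a few rows from
# this task, not the real MediaWiki list.
BASE_MAP = {
    "cbk-zam": "cbk",
    "nrm": "nrf",
    "roa-tara": "nap-x-tara",
    "simple": "en-simple",
}


def standardise(code: str, overrides: Optional[Mapping[str, str]] = None) -> str:
    """Standardise a language code using one shared table.

    Site configuration may pass `overrides` to customise individual entries;
    codes found in neither table are returned unchanged (an assumption made
    for this sketch).
    """
    if overrides and code in overrides:
        return overrides[code]
    return BASE_MAP.get(code, code)


if __name__ == "__main__":
    print(standardise("roa-tara"))
    print(standardise("roa-tara", {"roa-tara": "it-x-tara"}))
```

With a design like this, the HTML output and the RDF output would both call the same function, so a customised entry would change both places at once rather than only one of them.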