Currently, the language_code field in canonical_data.wikis provides a Wikimedia-specific code which almost always matches a standard ISO 639 two- or three-letter code. @CMyrick-WMF has cataloged the divergences in this spreadsheet.
The dataset should add a new field which provides only an ISO 639 language code (language_iso_code?).
We can base this on @CMyrick-WMF's language.ipynb in the incubator-data-exploration repo.
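As a rough sketch of what the new field could look like, the Python below maps a Wikimedia language code to an ISO 639 code using a small override table and passes every other code through unchanged. The function name to_iso_639 and the table name are hypothetical, and the three overrides shown are only illustrative; the actual list should come from the cataloged divergences, not from this sketch.

```python
# Illustrative overrides only (an assumption for this sketch); the real
# mapping should be taken from the cataloged divergences.
WIKIMEDIA_TO_ISO_639 = {
    "simple": "en",    # Simple English subdomain
    "bh": "bho",       # subdomain "bh" hosts Bhojpuri content
    "cbk-zam": "cbk",  # subdomain is not a valid ISO 639 code
}


def to_iso_639(language_code: str) -> str:
    """Return an ISO 639 code for a Wikimedia language code.

    Codes without an override are assumed to already be ISO 639 codes
    and are returned unchanged; nothing is validated here.
    """
    return WIKIMEDIA_TO_ISO_639.get(language_code, language_code)


if __name__ == "__main__":
    for code in ["simple", "bh", "cbk-zam", "fr", "sgs"]:
        print(code, "->", to_iso_639(code))
```

Because unknown codes pass through untouched, a new non-standard code would silently end up in the new field, so it would probably be worth checking the output against an ISO 639 list as a separate step.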
Understanding the "Wikimedia language code"
The weird thing is that it's not really clear where language_code comes from or what exactly it represents.
Looking at meta:Special language codes, among "subdomains that do not match their lang attribute", we sometimes have the subdomain (e.g. simple, bh) and sometimes the lang attribute (e.g. lzh, rup). Among "subdomains that do not conform to a valid ISO 639 language code", again, sometimes we have the subdomain (e.g. cbk-zam) and sometimes the appropriate ISO 639 code (e.g. sgs).
We pull language_code from Meta-Wiki's sites table (which I believe is related to the SiteMatrix extension), but where does that come from?
It's particularly weird, since at some point in early September 2023, 13 of these codes changed from their previous non-standard values to the standard ISO 639 codes, and I can't figure out what change caused it.
The addWiki maintenance script calls the populateSitesTable script, but that populates the new wiki's table from Meta-Wiki, leaving unanswered the question of how new sites are added to those tables.
Files that seem to be involved:
- langlist in operations/mediawiki-config
- LanguageCode.php in mediawiki/core, which converts between "internal" and BCP-47 codes, with special provision for deprecated and non-standard codes
- SiteMatrix.php in mediawiki/extensions/SiteMatrix, which loads codes from langlist and converts them with the code in LanguageCode.php
- wgLanguageCode in wmf-config/InitialiseSettings.php in operations/mediawiki-config, which defines language codes for some non-straightforward wikis, but relies on "$lang" as the default, and I'm not sure what that refers to
- siteFromDB in includes/config/SiteConfiguration.php in mediawiki/core, which I think is what extracts $lang from the wiki ID/database code in straightforward cases.
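To make the straightforward case concrete, here is a minimal Python sketch of what siteFromDB appears to do: strip a known project suffix from the database name and turn underscores into hyphens to get $lang. This is an approximation, not a port of the PHP. The suffix list is an assumption made for illustration (the real one is configured in operations/mediawiki-config), and the sketch leaves out any per-wiki overrides and suffix aliases.

```python
from typing import Optional, Sequence, Tuple

# Assumed suffix list for illustration only; the production list is
# configured elsewhere and is longer.
SUFFIXES = ("wiktionary", "wikibooks", "wikiquote", "wiki")


def site_from_db(
    wiki_id: str, suffixes: Sequence[str] = SUFFIXES
) -> Tuple[Optional[str], Optional[str]]:
    """Split a database name like 'frwiki' into (site suffix, language code)."""
    # Database names use underscores where language codes use hyphens.
    language_code = wiki_id.replace("_", "-")
    for suffix in suffixes:
        if wiki_id.endswith(suffix):
            # Drop the suffix; what remains is treated as the language code.
            return suffix, language_code[: -len(suffix)]
    return None, None


if __name__ == "__main__":
    print(site_from_db("frwiki"))                # ('wiki', 'fr')
    print(site_from_db("zh_min_nanwiktionary"))  # ('wiktionary', 'zh-min-nan')
```

If this reading is right, the default $lang is just the database name minus its suffix, which would explain why it follows the subdomain and not the lang attribute unless wgLanguageCode overrides it.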