
Provide ISO 639 language codes in canonical wiki dataset
Closed, Declined · Public

Description

Currently, the language_code field in canonical_data.wikis provides a Wikimedia-specific code which almost always matches up with a standard ISO 639 two or three letter code. @CMyrick-WMF has cataloged the divergences in this spreadsheet.

The dataset should add a new field which provides only an ISO 639 language code (language_iso_code?).

We can base this on @CMyrick-WMF's language.ipynb in the incubator-data-exploration repo.

Understanding the "Wikimedia language code"

The weird thing is that it's not really clear where language_code comes from or what exactly it represents.

Looking at meta:Special language codes, among "subdomains that do not match their lang attribute", we sometimes have the subdomain (e.g. simple, bh) and sometimes the lang attribute (e.g. lzh, rup). Among "subdomains that do not conform to a valid ISO 639 language code", again, sometimes we have the subdomain (e.g. cbk-zam) and sometimes the appropriate ISO 639 code (e.g. sgs).

We pull language_code from Meta-Wiki's sites table (which I believe is related to the SiteMatrix extension), but where does that come from?

It's particularly weird, since at some point in early September 2023, 13 of these codes changed from their previous non-standard values to the standard ISO 639 codes, and I can't figure out what change caused it.

The addWiki maintenance script calls the populateSitesTable script, but that populates the new wiki's table from Meta-Wiki, leaving unanswered the question of how new sites are added to those tables.

Files that seem to be involved:

  • langlist in operations/mediawiki-config
  • LanguageCode.php in mediawiki/core, which converts between "internal" and BCP-47 codes, with special provision for deprecated and non-standard codes
  • SiteMatrix.php in mediawiki/extensions/SiteMatrix, which loads codes from langlist and converts them with the code in LanguageCode.php
  • wgLanguageCode in wmf-config/InitialiseSettings.php in operations/mediawiki-config, which defines language codes for some non-straightforward wikis, but relies on "$lang" as the default, and I'm not sure what that refers to
  • siteFromDB in includes/config/SiteConfiguration.php in mediawiki/core, which I think is what extracts $lang from the wiki ID/database code in straightforward cases.

Event Timeline

@CMyrick-WMF, in the wiki dataset, which ISO code should we provide, alpha-2 or alpha-3? The language dataset should have both, but for the wiki dataset we should just pick one to serve as the "foreign key" to the language dataset.

I would guess that the alpha-3 is the best option, as alpha-2 seems like it won't cover everything. Theoretically, we could have one field that contains the alpha-2 if it exists and the alpha-3 otherwise, but that seems messy and confusing.

Also, if you haven't already, we'll need to think about codes for multi-lingual wikis. "mul" seems the obvious choice for the incubators, but what about Wikifunctions, Commons, and Wikidata, where the content is fairly trans-lingual but the community activity is probably overwhelmingly English?

> in the wiki dataset, which ISO code should we provide, alpha-2 or alpha-3? The language dataset should have both, but for the wiki dataset we should just pick one to serve as the "foreign key" to the language dataset.
> I would guess that the alpha-3 is the best option, as alpha-2 seems like it won't cover everything.

Yes, I agree that the alpha-3 code makes the most sense for a foreign key, since not all languages have an alpha-2 and all languages should have an alpha-3.
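A hand-picked sample makes the coverage gap concrete (the entries below are real ISO assignments, but the structure is purely illustrative): every language has an ISO 639-3 (alpha-3) code, while only a subset have an ISO 639-1 (alpha-2) code.

```python
# Illustrative sample, not exhaustive: alpha-3 is defined for every
# language here, alpha-2 is not.
samples = {
    "English":  {"alpha_2": "en", "alpha_3": "eng"},
    "Cebuano":  {"alpha_2": None, "alpha_3": "ceb"},  # no ISO 639-1 code
    "Cherokee": {"alpha_2": None, "alpha_3": "chr"},  # no ISO 639-1 code
}

# Every sampled language has an alpha-3 code...
assert all(v["alpha_3"] for v in samples.values())
# ...but not every one has an alpha-2, which is why alpha-3 is the
# safer foreign key.
assert any(v["alpha_2"] is None for v in samples.values())
```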

> Theoretically, we could have one field that contains the alpha-2 if it exists and the alpha-3 otherwise, but that seems messy and confusing.

^ Relatedly, it could be helpful to have a "Wikimedia language code" which -- for the most part, but of course with edge cases -- maps to the alpha-2 if it exists and alpha-3 otherwise. This could potentially also serve as the foreign key. But I realize, too, that this Wikimedia language code is technically already available in the TSV via domain_name or database_code with some REGEX.

> Also, if you haven't already, we'll need to think about codes for multi-lingual wikis. "mul" seems the obvious choice for the incubators, but what about Wikifunctions, Commons, and Wikidata, where the content is fairly trans-lingual but the community activity is probably overwhelmingly English?

Yes, I agree "mul" makes the most sense, per Wikipedia and SIL.

@CMyrick-WMF thank you! That's very helpful.

> ^ Relatedly, it could be helpful to have a "Wikimedia language code" which -- for the most part, but of course with edge cases -- maps to the alpha-2 if it exists and alpha-3 otherwise. This could potentially also serve as the foreign key. But I realize, too, that this Wikimedia language code is technically already available in the TSV via domain_name or database_code with some regex.

I definitely agree! I also agree the database code isn't enough since one of the points of this is to save folks from having to mess around with regex to get this basic information. But the great thing is we actually already have this! 😁 It's the language_code field (description in the readme).

Ideally it would be named wikimedia_language_code instead, but I think we're stuck with the existing name, at least for now!

> We actually already have this! 😁 It's the language_code field

OH! 🤦🏻‍♀️ Thanks for reminding me.

nshahquinn-wmf raised the priority of this task from Low to Medium. · May 1 2025, 1:46 AM
nshahquinn-wmf moved this task from Backlog to FY24-25 H2 on the Movement-Insights board.
CMyrick-WMF lowered the priority of this task from Medium to Low.
CMyrick-WMF changed the task status from Open to In Progress. · Aug 5 2025, 3:01 PM

@CMyrick-WMF I just realized you've actually been working on T392951: Create a first version of the canonical language dataset! We've decided not to do this task (add ISO 639 codes to the wiki dataset), since we're just going to use the existing Wikimedia language code as the foreign key to the new language dataset.

I'll move the status over to the other task and decline this one.