Add language codes sr-cyrl and sr-latn on Wikidata
Open, Medium, Public

Assigned To

None

Authored By

	Nikki
	Oct 27 2022, 6:11 PM

Description

sr-cyrl and sr-latn should be added as language codes in Wikidata for labels, monolingual text and lexemes.

The existing codes sr-ec and sr-el are Wikimedia inventions and there is work being done to eventually switch everything to using the correct codes sr-cyrl and sr-latn (T125073, T117845).

Making sr-cyrl and sr-latn available in Wikidata now would be a good idea because:

It would allow us to start migrating the data already. There is a lot, so it will take some time.
It would resolve the issue described in T262269 where it's not possible for people to add Serbian (Latin script) or Serbian (Cyrillic script) to their termbox, because the data uses sr-el and sr-ec but the language codes in the Babel box are normalised to sr-latn and sr-cyrl.
It would be a proper way to resolve the inconsistency in which language codes are being used for lexemes, with some people using sr-el and sr-ec and others using sr-x-Q2839566 and sr-x-Q829464.
The RDF output would use valid language codes even before T243428 is fixed.
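
To illustrate the termbox mismatch described above, here is a minimal sketch of how the two legacy codes map to the standard ones. It is not how Wikibase or the Babel extension implement normalisation; the names `LEGACY_TO_STANDARD`, `normalise_language_code` and `termbox_matches` are hypothetical and exist only for this example.

```python
# Legacy Wikimedia-specific codes and the standard replacements named in this task.
LEGACY_TO_STANDARD = {
    "sr-ec": "sr-cyrl",
    "sr-el": "sr-latn",
}


def normalise_language_code(code: str) -> str:
    """Map a legacy code to its standard replacement; other codes are only lower-cased."""
    lowered = code.strip().lower()
    return LEGACY_TO_STANDARD.get(lowered, lowered)


def termbox_matches(babel_code: str, data_code: str) -> bool:
    """Compare a Babel code with a code used in the data after normalising both sides."""
    return normalise_language_code(babel_code) == normalise_language_code(data_code)


if __name__ == "__main__":
    # A plain string comparison fails, which is the situation described in T262269.
    print("sr-latn" == "sr-el")
    # Comparing after normalisation succeeds.
    print(termbox_matches("sr-latn", "sr-el"))
```

Once the data itself uses sr-cyrl and sr-latn, no such mapping would be needed on the comparison side.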

Data that needs migrating:

sr-ec labels/aliases: 750,000
sr-el labels/aliases: 1.5 million
sr-ec descriptions: 24.6 million
sr-el descriptions: 23.3 million
Wikidata monolingual text statements, qualifiers and references: unknown, it's not possible to search for them
Lexemes: A handful of lemmas, glosses and forms
sr-el captions on Commons: 6500
sr-ec captions on Commons: 3000
SDC monolingual text statements, qualifiers and references: unknown, it's not possible to search for them
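
For a rough sense of scale, the known counts above can be added up as follows. This is a back-of-the-envelope sketch only: the counts are the approximate figures from the list, the items marked "unknown" and the lexeme data are left out, and the edit rate and terms-per-edit parameters of `days_needed` are hypothetical inputs, not real bot limits.

```python
import math

# Approximate counts taken from the list above (unknown categories omitted).
KNOWN_COUNTS = {
    "sr-ec labels/aliases": 750_000,
    "sr-el labels/aliases": 1_500_000,
    "sr-ec descriptions": 24_600_000,
    "sr-el descriptions": 23_300_000,
    "sr-el captions on Commons": 6_500,
    "sr-ec captions on Commons": 3_000,
}


def total_terms(counts: dict[str, int]) -> int:
    """Sum all known term counts."""
    return sum(counts.values())


def days_needed(terms: int, edits_per_minute: float, terms_per_edit: int = 1) -> float:
    """Estimate the days required at a given (hypothetical) sustained edit rate."""
    edits = math.ceil(terms / terms_per_edit)
    return edits / edits_per_minute / (60 * 24)


if __name__ == "__main__":
    total = total_terms(KNOWN_COUNTS)
    print(f"Known terms to migrate: {total:,}")
    # 60 edits per minute and 4 terms per edit are illustrative assumptions.
    print(f"Estimated days: {days_needed(total, edits_per_minute=60, terms_per_edit=4):.1f}")
```

A single edit can change several terms on the same entity, so the number of edits is likely lower than the number of terms.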

Related Objects

Mentioned Here:
T51024: [Task] Removal of de-formal from allowed language for labels
T284808: Add a configuration variable that allows disabling language codes for labels, descriptions, and aliases
T320887: Language codes that are explicitly not allowed for monolingual text should also not be allowed for lexemes
T117845: Rename the language codes sr-ec and sr-el to the BCP 47 conform codes sr-Cyrl and sr-Latn
T125073: [Story] Replace bad, but currently necessary language codes
T243428: Use standard language codes in RDF output
T262269: Make Serbian (sr-el) language available for terms (labels/descriptions/aliases)

Event Timeline

Nikki created this task. Oct 27 2022, 6:11 PM

Restricted Application added a subscriber: Aklapper. Oct 27 2022, 6:11 PM

This seems like a well-reasoned proposal to me, and given the amount of data that needs migrating, I think it would be good to do this as soon as possible.
Are there any special considerations that need to be made? Would it be possible to change the language codes automatically somehow (e.g. via a maintenance script), or should it be done by the community (e.g. by bot)?

LangCom has no objections to using standard language codes, of course. :-)

Good to go from my side as well. Let's do it.
I fear the data migration needs to be done by the community via a bot.
As for how the migration happens: Should we add the new ones, migrate all the data, and then remove the old ones?

Lydia_Pintscher triaged this task as Medium priority. Mar 21 2023, 8:58 AM

Winston_Sung moved this task from Backlog to Wikidata (termbox languages) on the Language codes board. Apr 19 2023, 5:14 PM

In T321852#8713041, @Lydia_Pintscher wrote:

> Good to go from my side as well. Let's do it.
> I fear the data migration needs to be done by the community via a bot.

That's fine. We have plenty of bots (and non-bots) mass editing labels already.

> As for how the migration happens: Should we add the new ones, migrate all the data, and then remove the old ones?

We still don't have a way to disable language codes (see T51024, T284808, T320887, etc) but otherwise yes.
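
The "add the new codes, migrate the data, remove the old ones" order could be sketched per entity roughly like this. It is a minimal illustration, not an existing bot: `plan_term_migration` is a hypothetical helper, and `terms` is a simplified mapping of language code to text rather than the real Wikibase entity JSON. Conflicts, where the old and new code hold different text, are only reported so that a human can decide.

```python
# Legacy codes and their replacements, as named in this task.
LEGACY_TO_STANDARD = {"sr-ec": "sr-cyrl", "sr-el": "sr-latn"}


def plan_term_migration(terms: dict[str, str]) -> dict:
    """Plan the migration of one entity's labels (or descriptions).

    `terms` maps a language code to its text. Nothing is modified here;
    the function only returns what a bot would add, remove, or flag.
    """
    plan = {"add": {}, "remove": [], "conflicts": []}
    for old, new in LEGACY_TO_STANDARD.items():
        if old not in terms:
            continue
        if new not in terms:
            # The new code is empty: copy the text over, then drop the old code.
            plan["add"][new] = terms[old]
            plan["remove"].append(old)
        elif terms[new] == terms[old]:
            # Already migrated: only the old code is left to remove.
            plan["remove"].append(old)
        else:
            # Different text under the old and new code: leave for manual review.
            plan["conflicts"].append((old, new))
    return plan


if __name__ == "__main__":
    labels = {"sr-ec": "Београд", "sr-el": "Beograd", "sr-latn": "Beograd", "en": "Belgrade"}
    print(plan_term_migration(labels))
```

Because the old codes cannot yet be disabled, anything removed this way could still be re-added under sr-ec or sr-el later, so a migration like this may need repeated passes.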
