
Babel language codes should be normalised to lower case when used in categories
Open, Normal, Public

Description

On enwiki the Babel categories are in the format of "Category:User xx", where xx is always in lower case. However, if a user enters codes with a different capitalisation in their #babel invocation, the page is categorised with that different capitalisation.

For example, on enwiki, the code {{#babel: En}} will add the page to the category "Category:User En", when it should be "Category:User en".
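As an illustration of the requested behaviour, here is a minimal sketch in Python (not the extension's PHP). The function name `babel_category` and the `"User "` prefix parameter are hypothetical, chosen only to mirror the enwiki category format described above.

```python
def babel_category(code: str, prefix: str = "User ") -> str:
    """Build a Babel category title from a user-supplied language code.

    Hypothetical helper: the prefix mirrors enwiki's "Category:User xx" format.
    """
    # Trim stray whitespace and fold to lower case, so that "En", "EN"
    # and "en" all resolve to the same category.
    return f"Category:{prefix}{code.strip().lower()}"


print(babel_category("En"))  # Category:User en
print(babel_category("ZH"))  # Category:User zh
```

With this rule, {{#babel: En}} and {{#babel: en}} would land in the same category, so no duplicate would be auto-created.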

This also has the consequence that [[User:Babel AutoCreate]] creates duplicate categories for each different code capitalisation that someone has used. I see it has created both [[:Category:User Zh]] and [[:Category:User zh]], for instance. (I have blocked the Babel AutoCreate account on enwiki until we can find a way to fix this.)

I assume that lower case is the preferred code format for other wikis, and that seemed to be the case when I spot-checked a few of them. However, if any wikis use a different system, code capitalisation might need to be made configurable, rather than lower case being hard-coded.

Details

Reference
bz61993

Event Timeline

bzimport raised the priority of this task to Normal. Nov 22 2014, 3:00 AM
bzimport set Reference to bz61993.
bzimport added a subscriber: Unknown Object (MLST).
Nikki added a subscriber: Nikki. Jul 7 2015, 9:47 AM
MarcoAurelio raised the priority of this task from Normal to Needs Triage. Aug 13 2015, 6:30 PM
MarcoAurelio added a subscriber: MarcoAurelio.

I've just blocked the extension at Meta for doing this too. The "bot" is creating a lot of empty categories, categories for languages that don't exist, that no user added to their page, or duplicate categories (e. g.: Category:User es [correct], Category: User Es [bad], Category: User ES [bad]).

(Retriaging: old bug imported from BZ that needs proper assessment of severity.)

Restricted Application added a subscriber: Aklapper. Aug 13 2015, 6:30 PM
Restricted Application added a subscriber: StudiesWorld. Nov 29 2015, 2:06 PM

@MarcoAurelio @MrStradivarius Would it be reasonable to implement a fix where the extension always uses lowercase language codes?

Dereckson triaged this task as High priority. Dec 31 2015, 3:57 PM

Priority set to high, per T112868.

Change 289604 had a related patch set uploaded (by Ricordisamoa):
Normalise language codes to lower case when used in categories

https://gerrit.wikimedia.org/r/289604

Change 289604 merged by jenkins-bot:
Normalise language codes to lower case when used in categories

https://gerrit.wikimedia.org/r/289604

Nikerabbit closed this task as Resolved. Aug 16 2016, 9:01 AM
Nikerabbit removed a project: Patch-For-Review.
Nikerabbit updated the task description.
Nikki reopened this task as Open. Aug 20 2016, 10:13 AM

I'm reopening this because the "fix" appears to have broken it even more. Parts of the code are now always capitalised when they shouldn't be: most wikis I've seen use lowercase, but the extension is now force-capitalising countries and scripts, so the existing categories are empty and people's user pages point to non-existent categories (e.g. https://www.wikidata.org/wiki/User:Addshore). See the most recent comment on T112868 too.

As far as I can tell, the preferred format is all lowercase, including countries and scripts, with the exception for the letter "N" to indicate native speaker level (or alternatives in other languages like "M" in German):

This query attempts to list codes with capitalised countries or scripts
This one attempts to list codes using lowercased ones.

While there are probably a few missing from those queries because they use a different syntax or aren't linked to Wikidata or the Wikidata item isn't marked as a user language category, it's quite clear that lowercase is predominant. I don't know how many of the capitalised ones are the preferred style for that wiki but force lowercasing the country and script would cause much less disruption than force capitalising them.

Please consider disabling this "bot" until it can be fixed. At en.wikiquote it is not only creating these spurious categories (miscapitalized and redundant to correctly capitalized ones), but it is REcreating them when they have been deleted.

Please consider disabling this "bot" until it can be fixed.

I see that this was already requested at T132296, but it was closed as "Declined" for some reason. Does this mean that we must continue blocking the pseudo-account at individual wikis one by one to limit the damage?

Nikki added a comment. Oct 15 2016, 1:22 PM

@Ningauble: You could perhaps request a temporary global lock at https://meta.wikimedia.org/wiki/Steward_requests/Global - that should have the same effect as stopping the bot by preventing the bot from editing.

Another problem with the current behaviour: It is turning "roa-tara" (a special non-standard code - https://meta.wikimedia.org/wiki/Special_language_codes) into "roa-Tara" as if "tara" were a script (e.g. https://it.wikinews.org/wiki/Categoria:Utenti_roa-Tara)

I don't think global locks affect system accounts. Maybe they do in this case.

I have blocked the bot account at en.wikiquote because this is still happening. Other wikis will have to fend for themselves.

Please notify wikis where it is blocked (https://meta.wikimedia.org/wiki/Special:CentralAuth/Babel_AutoCreate) at their Administrators' Noticeboards (or equivalent) when (if) this is fixed (really fixed).

This comment was removed by Liuxinyu970226.
TTO added a subscriber: TTO. Jan 1 2017, 2:55 PM

The Babel extension normalises language codes according to the internet standard BCP-47 (https://en.wikipedia.org/wiki/IETF_language_tag). Languages like pt-BR and zh-Hans are capitalised as such. Obviously roa-tara falls through the cracks, but it's a bit of a naughty, norm-defying language code, so it might need to be special-cased in the Babel code.
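To make the casing convention concrete, here is a rough Python sketch of the BCP 47 recommended capitalisation. It is a simplification and not the extension's actual code: the function name `bcp47_case` is hypothetical, and the sketch ignores the standard's special handling of subtags that follow a singleton (extensions and private use).

```python
def bcp47_case(code: str) -> str:
    """Apply the conventional BCP 47 casing to a hyphen-separated tag.

    Simplified rule (assumption: no singleton/extension subtags):
      - the first subtag (language) is lower case
      - later two-letter subtags (regions) are upper case
      - later four-letter subtags (scripts) are title case
      - everything else is lower case
    """
    first, *rest = code.split("-")
    parts = [first.lower()]
    for sub in rest:
        if len(sub) == 2:
            parts.append(sub.upper())
        elif len(sub) == 4:
            parts.append(sub.capitalize())
        else:
            parts.append(sub.lower())
    return "-".join(parts)


print(bcp47_case("pt-br"))     # pt-BR
print(bcp47_case("zh-hans"))   # zh-Hans
print(bcp47_case("roa-tara"))  # roa-Tara
```

The last line shows why roa-tara falls through the cracks: "tara" has four letters, so a purely length-based rule treats it as a script subtag.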

I think the expectation that Babel should output lowercase category names is misguided. The fact that MediaWiki internally uses that style doesn't make it right. Language tags on HTML pages, the list in your MediaWiki preferences, and just about anywhere else you care to look, all use the BCP-47 style.

There's no doubt that Babel AutoCreate has made a complete mess of the babel category system on various wikis over the years... It would be useful to have a script that goes through and moves categories like "User pt-br" to the proper name, along with T62162 and any other tasks desired by the wiki community.
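A first step for such a script could be finding category titles that differ only in capitalisation. The following Python sketch assumes the titles have already been fetched into a list; the function name `group_case_variants` is hypothetical, and nothing here talks to a wiki or moves any pages.

```python
from collections import defaultdict


def group_case_variants(titles):
    """Group titles that are identical apart from letter case.

    Returns a dict mapping the case-folded title to the list of variants,
    keeping only groups with more than one member (i.e. real duplicates).
    """
    groups = defaultdict(list)
    for title in titles:
        groups[title.casefold()].append(title)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


duplicates = group_case_variants(["User es", "User Es", "User ES", "User fr"])
print(duplicates)  # {'user es': ['User es', 'User Es', 'User ES']}
```

Which variant in each group is the "proper" name would still be a per-wiki decision.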

Nikki added a comment. Jan 1 2017, 3:36 PM

Those language tags are not case sensitive, so "pt-br" is still perfectly valid, and all lowercase is what Wikimedia projects have used for years (including in pre-Babel templates, which are still widely used). Trying to force a new style on everyone is really disruptive: wikis already have a well-established system of categories, which has now been completely messed up by Babel suddenly switching to a different capitalisation. It's no wonder Babel AutoCreate keeps being indefinitely blocked, and there are even people who want it turned off or globally blocked because it's so disruptive.

I have now blocked the bot account on svwiki. Any progress?

cscott added a subscriber: cscott. Oct 17 2018, 2:30 AM

In https://gerrit.wikimedia.org/r/446766 I introduced BabelLanguageCodes::getCategoryCode(), which maps MediaWiki-internal language codes to appropriate category names. The current algorithm is to use the (lowercased) MediaWiki-internal code if it doesn't contain a hyphen (e.g. en, simple, de), and otherwise to use the properly capitalized BCP 47 code (zh-Hans, etc.). This matched previous expectations as canonized in the extension's PHPUnit tests. If we wanted some other behavior for category codes, it ought to be straightforward to patch getCategoryCode() for whatever is desired.
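The rule described above can be paraphrased in Python roughly as follows. This is only a sketch of the described behaviour, not the PHP in getCategoryCode(): the name `category_code` is hypothetical, and the BCP 47 casing is approximated by subtag length instead of going through MediaWiki's own code mapping.

```python
def category_code(internal_code: str) -> str:
    """Sketch of the described rule for choosing a category code.

    - No hyphen: use the lower-cased internal code (en, simple, de).
    - Hyphen: use BCP 47 style casing (zh-Hans, pt-BR), approximated here
      by subtag length (assumption, not MediaWiki's real mapping).
    """
    if "-" not in internal_code:
        return internal_code.lower()

    first, *rest = internal_code.split("-")
    parts = [first.lower()]
    for sub in rest:
        if len(sub) == 2:
            parts.append(sub.upper())       # region, e.g. BR
        elif len(sub) == 4:
            parts.append(sub.capitalize())  # script, e.g. Hans
        else:
            parts.append(sub.lower())
    return "-".join(parts)


for code in ["en", "Simple", "zh-hans", "pt-br"]:
    print(code, "->", category_code(code))
```

Switching to the all-lowercase style requested earlier in this task would, in this sketch, amount to returning `internal_code.lower()` in both branches.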

Aklapper lowered the priority of this task from High to Normal. May 23 2019, 6:06 PM