
Lithuanian Category Collation: Articles starting with y grouped together with articles starting with i, but those are two different letters
Open, Needs Triage, Public

Description

Steps to Reproduce

Example: category "Social networks" https://lt.wikipedia.org/wiki/Kategorija:Socialiniai_tinklai has an article "Youtube" listed under I (capital i) while it should be under Y.

Event Timeline

Nomad renamed this task from Articles starting with y are being grouped together with articles starting with i but those are two different letters to Articles starting with y are being grouped together with articles starting with i on Lithuanian wikipedia, but those are two different letters. Apr 11 2020, 8:57 AM
Aklapper renamed this task from Articles starting with y are being grouped together with articles starting with i on Lithuanian wikipedia, but those are two different letters to Lithuanian Category Collation: Articles starting with y grouped together with articles starting with i, but those are two different letters. Apr 11 2020, 9:14 AM
Aklapper added a project: MediaWiki-Categories.

Hi @Nomad, thanks for taking the time to report this and welcome to Wikimedia Phabricator!

Quick summary / some background:
Looking at wgCategoryCollation in https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wmf-config/InitialiseSettings.php, ltwiki is configured as 'ltwiki' => 'uca-lt', // T123627. Both T123627 and https://en.wikipedia.org/wiki/Lithuanian_orthography#Alphabet give the letter order as Ii Įį Yy Jj.
uca-lt is defined in https://github.com/unicode-org/cldr/blob/master/common/collation/lt.xml

It's been three months already. Is this such a complicated bug?

@Nomad: Feel free to write a software patch (in the operations/mediawiki-config repository, see last comment) if you'd like this to happen faster. Thanks!
See https://www.mediawiki.org/wiki/Bug_management/Development_prioritization#Why_has_nobody_fixed_this_issue_yet%3F for general info.

If it's a bug, then the bug is in the ICU sort order, not in MediaWiki's first-letter identification.

> $collator = new Collator('lt');
> print $collator->compare('YouTube', 'Ixia');
-1
> print $collator->compare('YouTube', 'Instagram');
1
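
Both results are consistent with a collation in which y is treated as a variant of i at the primary level, so that "YouTube" compares roughly like "Ioutube". Here is a minimal sketch of that idea, written in Python with only the standard library so it runs without ICU. The `primary_key` helper is hypothetical and only approximates the behaviour seen above; it is not the actual uca-lt tailoring.

```python
def primary_key(title: str) -> str:
    """Hypothetical stand-in for a primary-strength 'lt' sort key.

    Assumption: y/Y is folded into i, which matches the two compare()
    results above but ignores every other detail of the real collation.
    """
    return title.casefold().replace("y", "i")


titles = ["YouTube", "Ixia", "Instagram"]

# Sort by the approximated key: "instagram" < "ioutube" < "ixia".
ordered = sorted(titles, key=primary_key)
print(ordered)
```

With this key, "YouTube" lands between "Instagram" and "Ixia", the same order the two Collator calls imply.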

So a sorted list of those three words would be: Instagram, YouTube, Ixia. If we had separate headings for I and Y then the sections would be split, with duplicate headings for letter I:

== I ==
* Instagram

== Y ==
* YouTube

== I ==
* Ixia

That's why I and Y need to be combined into a single section.
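
To illustrate the point, here is a small sketch in Python (standard library only, not MediaWiki's actual code) that builds section headings by walking an already-sorted list and starting a new section whenever the heading changes. The function names are made up for this example, and the sort order is taken as given from the comparison results above.

```python
from itertools import groupby

# Order produced by the 'lt' collator, as reported above.
sorted_titles = ["Instagram", "YouTube", "Ixia"]


def separate_heading(title: str) -> str:
    """Hypothetical rule: every first letter gets its own heading."""
    return title[0].upper()


def merged_heading(title: str) -> str:
    """Hypothetical rule: Y is filed under I, like the sort order does."""
    first = title[0].upper()
    return "I" if first == "Y" else first


def sections(titles, heading_of):
    # groupby only merges *consecutive* items with the same heading,
    # which mirrors how a category page emits headings in list order.
    return [(h, list(group)) for h, group in groupby(titles, key=heading_of)]


separate = sections(sorted_titles, separate_heading)
merged = sections(sorted_titles, merged_heading)

for heading, members in separate:
    print(heading, members)
print("---")
for heading, members in merged:
    print(heading, members)
```

With separate headings the I section appears twice, once before and once after Y. With the merged rule all three titles fall under a single I heading.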