
Category sortkey generation of German letter ß changed
Open, Needs Triage | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • This is a software-internal problem; see the text and links below.

The generation of category sortkeys for articles containing the German letter 'ß' changed sometime between the database timestamps 2022-09-22T15:55:53 and 2022-09-29T17:59:41.

As shown below, wgCategoryCollation on dewiktionary is set to "uppercase".

Historically there existed no uppercase form of the letter 'ß'.
Digression (not part of this report): But times are changing
https://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F and https://en.wikipedia.org/wiki/%C3%9F#Development_of_a_capital_form

Because no uppercase form of 'ß' existed, the sortkey was generated leaving the 'ß' unchanged. See the database report of table categorylinks below.
From database timestamp 2022-09-29T17:59:41 onward, however, all sortkeys were generated with 'ß' converted to 'SS'. So this change must have been part of a release installed before that time.

The Quarry database query on table categorylinks reports the last entry with the old sortkey form and the first entry with the new sortkey form:

| cl_from | cl_to | cl_sortkey | cl_timestamp | cl_sortkey_prefix | cl_collation | cl_type |
| --- | --- | --- | --- | --- | --- | --- |
| 1294232 | Rückläufige_Wörterliste_(Deutsch) | EGELFPNEßARTS STRAßENPFLEGE | 2022-09-22T15:55:53 | egelfpneßartS | uppercase | page |
| 1295614 | Rückläufige_Wörterliste_(Deutsch) | NEMMASUZ SSEIL LIESS ZUSAMMEN | 2022-09-29T17:59:41 | nemmasuz ßeil | uppercase | page |
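The difference between the two rows can be reproduced with a minimal Python sketch. It is not the MediaWiki or PHP code; the function names `simple_upper` and `full_upper` are hypothetical. It assumes the old behaviour corresponds to a one-to-one ("simple") case mapping that leaves 'ß' untouched, and the new behaviour to a full case mapping that expands 'ß' to 'SS'. The simple mapping is only approximated here by skipping characters whose uppercase form is longer than one character.

```python
def simple_upper(text: str) -> str:
    """Approximate a one-to-one case mapping: characters whose uppercase
    form would expand to several characters (such as 'ß') stay unchanged."""
    return "".join(c.upper() if len(c.upper()) == 1 else c for c in text)


def full_upper(text: str) -> str:
    """Full case mapping as implemented by Python's str.upper():
    'ß' expands to 'SS'."""
    return text.upper()


# Sortkey prefixes taken from the two categorylinks rows in this report.
old_prefix = "egelfpneßartS"
new_prefix = "nemmasuz ßeil"

print(simple_upper(old_prefix))  # old form: the 'ß' is kept
print(full_upper(new_prefix))    # new form: the 'ß' becomes 'SS'
```

Under these assumptions, the first call yields the prefix part of the old cl_sortkey and the second yields the prefix part of the new one.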

Why is this a problem? We now have sortkeys of the old form and sortkeys of the new form side by side, so the order of category entries is no longer comprehensible.
Another problem is the way MediaWiki displays first-letter headers on category pages: entries are grouped by the first letter of the article title, not by the first letter of the sortkey.

After the sortkeys of the article https://de.wiktionary.org/w/index.php?title=%C3%9F were regenerated, its sortkey is now 'SS'. It should be listed after the article 'Sri-Lankerin' and before 'SS-Brigadeführer' in the category "Kategorie:Substantiv_(Deutsch)", but it is listed at the end of the category page under the header 'ß'. See https://de.wiktionary.org/w/index.php?title=Kategorie:Substantiv_(Deutsch)&pagefrom=Sri-Lanker. People browsing to the next page will be confused.

But if you start category browsing at article 'SS', everything seems to look correct: https://de.wiktionary.org/w/index.php?title=Kategorie:Substantiv_(Deutsch)&pagefrom=SS. Browsing one page back, the last entry on the page is 'Sri-Lankerin', as expected. But this only looks correct to those who know about the internal sorting algorithm.
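The mismatch between position and header can be illustrated with a simplified model in Python. It assumes that entries are ordered by the bytes of their stored sortkey and that the header is taken from the first character of the article title, as described above. The function name `category_listing` is hypothetical, and the real category page rendering may differ in detail.

```python
# (article title, stored sortkey) pairs from the example in this report.
entries = [
    ("SS-Brigadeführer", "SS-BRIGADEFÜHRER"),
    ("ß", "SS"),
    ("Sri-Lankerin", "SRI-LANKERIN"),
]


def category_listing(pairs):
    """Order entries by the UTF-8 bytes of the sortkey (assumption) and
    attach a header derived from the first character of the title."""
    ordered = sorted(pairs, key=lambda pair: pair[1].encode("utf-8"))
    return [(title[0], title) for title, _sortkey in ordered]


for header, title in category_listing(entries):
    print(header, title)
```

In this model the article 'ß' is sorted between 'Sri-Lankerin' and 'SS-Brigadeführer' because of its new sortkey 'SS', yet it still carries the header 'ß', so the header no longer matches the position in the list.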

I currently don't know whether this change in sortkey generation is a bug or a feature, intended or unintended. But in the form in which it is now implemented, it is not acceptable.

Event Timeline

With the new PHP version there are some changes in the case mapping, including for ß (see T319432).

Not sure if there should be visible effects right now or not.

wgCategoryCollation is set to "uppercase", so for ß the new uppercase SS could be in use.
Sortkeys are stored in the database, and a stored sortkey is only regenerated when its page is edited, so the sort order will not be stable until everything has been migrated.
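A small Python model may make this clearer. It is only a sketch under stated assumptions: the dictionary `stored` stands in for the categorylinks table, `current_sortkey` stands in for the new key generation (full uppercase mapping), and the example words and their key states are hypothetical. 'Maß' still has an old-form key, while 'Maßband' is assumed to have been edited recently and already has a new-form key.

```python
def current_sortkey(title: str) -> str:
    """Stand-in for the new key generation: 'ß' becomes 'SS'."""
    return title.upper()


def category_order(store: dict) -> list:
    """Titles ordered by the UTF-8 bytes of their stored sortkey."""
    return sorted(store, key=lambda title: store[title].encode("utf-8"))


def migrate(store: dict) -> dict:
    """Recompute every stored key, as a full migration would."""
    return {title: current_sortkey(title) for title in store}


# Hypothetical table state: one old-form key, the rest new or unaffected.
stored = {
    "Maß": "MAß",           # not edited since the change, old form
    "Maßband": "MASSBAND",  # edited after the change, new form
    "Masse": "MASSE",
    "Mast": "MAST",
}

print(category_order(stored))           # 'Maß' lands after 'Mast'
print(category_order(migrate(stored)))  # 'Maß' directly precedes 'Maßband'
```

With mixed keys, 'Maß' and 'Maßband' end up far apart; after recomputing all keys they are adjacent again.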

Of course there are visible effects. Without them we would not have noticed this change in sortkey generation.

So how can we request a migration of all sortkeys? We had the same situation with other letters when PHP was updated from PHP 6 to PHP 7. We opened bug report T251698, but nothing happened.

A migration of sortkeys should always be done whenever software that influences the generation of the keys changes.

What we can do is request a configuration change to set wgCategoryCollation to "uca-default".
Then the script "maintenance/updateCollation.php" has to be run.
After this is done, we can request that wgCategoryCollation be set back to "uppercase".
Then the script "maintenance/updateCollation.php" has to be run again.
After that all our sortkeys should be up to date.