
Category sortkey generation for the German letter ß changed
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • This is a software-internal problem; see the text and links below

The generation of category sortkeys for articles containing the German letter 'ß' changed at some point between the database timestamps 2022-09-22T15:55:53 and 2022-09-29T17:59:41.

As you can see below, wgCategoryCollation on dewiktionary is set to "uppercase".

Historically there existed no uppercase form of the letter 'ß'.
Digression (not part of this report): But times are changing
https://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F and https://en.wikipedia.org/wiki/%C3%9F#Development_of_a_capital_form

Because no uppercase form of 'ß' existed, the sortkey was generated leaving the 'ß' as it was; see the database report for the categorylinks table below.
From database timestamp 2022-09-29T17:59:41 onwards, however, sortkeys were generated with 'ß' converted to 'SS'. So this change must have been part of a release installed before that time.
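
The difference between the two behaviours can be illustrated with a minimal Python sketch. This is not MediaWiki's implementation: `old_sortkey` and `new_sortkey` are hypothetical helpers that only mimic what is described above, namely leaving 'ß' untouched versus applying full case mapping (which Python's `str.upper()` happens to do).

```python
def old_sortkey(title: str) -> str:
    """Mimic the old behaviour (assumption): uppercase everything but keep 'ß'."""
    return "ß".join(part.upper() for part in title.split("ß"))


def new_sortkey(title: str) -> str:
    """Mimic the new behaviour: full case mapping turns 'ß' into 'SS'."""
    return title.upper()


title = "Straßenpflege"
print(old_sortkey(title))  # STRAßENPFLEGE
print(new_sortkey(title))  # STRASSENPFLEGE
```

The first output matches the title part of the old sortkey in the database report; the second is what the new generation would be expected to produce for the same title.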

A Quarry database query on the categorylinks table shows the last row with the old sortkey and the first row with the new sortkey:

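+---------+-----------------------------------+-------------------------------+---------------------+-------------------+--------------+---------+
| cl_from | cl_to                             | cl_sortkey                    | cl_timestamp        | cl_sortkey_prefix | cl_collation | cl_type |
+---------+-----------------------------------+-------------------------------+---------------------+-------------------+--------------+---------+
| 1294232 | Rückläufige_Wörterliste_(Deutsch) | EGELFPNEßARTS STRAßENPFLEGE   | 2022-09-22T15:55:53 | egelfpneßartS     | uppercase    | page    |
| 1295614 | Rückläufige_Wörterliste_(Deutsch) | NEMMASUZ SSEIL LIESS ZUSAMMEN | 2022-09-29T17:59:41 | nemmasuz ßeil     | uppercase    | page    |
+---------+-----------------------------------+-------------------------------+---------------------+-------------------+--------------+---------+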

Why is this a problem? We now have sortkeys of the old form alongside sortkeys of the new form, so the order of category entries is no longer comprehensible.
Another problem is the way MediaWiki prints first-letter headers on category pages: entries are grouped by the first letter of the article title.

After modifying the sortkeys of the article https://de.wiktionary.org/w/index.php?title=%C3%9F the sortkey is now 'SS'. It should be listed after the article 'Sri-Lankerin' and before 'SS-Brigadeführer' in the category "Kategorie:Substantiv_(Deutsch)", but it is listed at the end of the category page under the header 'ß'. See https://de.wiktionary.org/w/index.php?title=Kategorie:Substantiv_(Deutsch)&pagefrom=Sri-Lanker. People browsing to the next page will be confused.

But if you start category browsing at article 'SS', everything seems to look correct: https://de.wiktionary.org/w/index.php?title=Kategorie:Substantiv_(Deutsch)&pagefrom=SS Browsing one page back, the last entry on the page is 'Sri-Lankerin', as expected. But this only looks correct to those who know about the internal sorting algorithm.

I currently don't know whether this change in sortkey generation is a bug or a feature, intended or unintended. But in its current form it is not acceptable.

Event Timeline

The new PHP version brings some changes to case mapping, including for ß (see T319432).

Not sure if there should be visible effects right now or not.

wgCategoryCollation is set to "uppercase", so the new uppercase form SS could be in use for ß.
Sortkeys are stored in the database, and a stored sortkey is only regenerated when its page is edited, so the sort order will not be stable until everything is migrated.
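
A small Python sketch of why a partial migration gives an unstable order. The titles and stored keys are hypothetical examples, and the sketch assumes the category is ordered by plain code point comparison of the stored keys.

```python
# Hypothetical stored sortkeys: "Maß" was last saved before the change,
# "Maßband" was edited afterwards, and "Masse" contains no 'ß' at all.
stored_keys = {
    "Maß": "MAß",
    "Masse": "MASSE",
    "Maßband": "MASSBAND",
}


def category_order(keys: dict[str, str]) -> list[str]:
    """Return the titles ordered by their stored sortkey (code point order)."""
    return sorted(keys, key=keys.get)


mixed = category_order(stored_keys)
migrated = category_order({title: title.upper() for title in stored_keys})

print(mixed)     # ['Maßband', 'Masse', 'Maß']
print(migrated)  # ['Maß', 'Maßband', 'Masse']
```

With mixed keys, 'Maß' ends up after 'Maßband' and 'Masse', because 'ß' has a higher code point than 'S'. Once every key is regenerated, the order is consistent again.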

Of course there are visible effects. Without them we would not have noticed this change in sortkey generation.

So how can we request a migration of all sortkeys? We had the same situation with other letters when PHP was updated from PHP 6 to PHP 7. We opened a bug report, T251698, but nothing happened.

A migration of sortkeys should always be done when software that influences the generation of the keys changes.

What we can do is request a configuration change to set wgCategoryCollation to "uca-default".
Then script "maintenance/updateCollation.php" has to be run.
After this is done, we can request to set wgCategoryCollation back to "uppercase".
Then script "maintenance/updateCollation.php" has to be run again.
After that all our sortkeys should be up to date.

German Wikipedia is using ASCII codepoints like !#,:* and more for special effects.

If we migrated to UCA, many deliberate groupings would be lost.

We do provide explicit sortkeys for good reasons, also for umlauts and ß.

German Wikipedia is using ASCII codepoints like !#,:* and more for special effects.

If we migrated to UCA, many deliberate groupings would be lost.

This is not the case, different symbols are still sorted separately with UCA collations. Example: https://pl.wikipedia.org/wiki/Kategoria:Socjologia (their order might be different though)

Switching German Wikipedia to a UCA collation was rejected a few years ago for reasons that aren't clear to me (T128806), so if someone wanted to propose that again, you'd probably need to have a new discussion about it.

Anyway, I think this task is not about German Wikipedia.

A migration of sortkeys should always be done when software that influences the generation of the keys changes.

That's true, and we have a fairly routine process for doing this on wikis using UCA collations (here's an example from 2020: T264991), but it seems that either we overlooked that the 'uppercase' collations are also affected by some software upgrades, or someone decided that the changes are rare and small enough that it's not worth performing the migration on all wikis (it's a relatively slow and resource-intensive process). It seems to me that the ß change is pretty significant, though.

What we can do is request a configuration change to set wgCategoryCollation to "uca-default".
Then script "maintenance/updateCollation.php" has to be run.
After this is done, we can request to set wgCategoryCollation back to "uppercase".
Then script "maintenance/updateCollation.php" has to be run again.
After that all our sortkeys should be up to date.

Correct, but this can be done in a simpler way – we can run maintenance/updateCollation.php --force to recompute the sortkeys without changing the configuration back-and-forth.
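
Conceptually, a forced recomputation regenerates every stored key with the current rules, whether or not the collation setting changed. The following Python sketch illustrates that idea only; it is not what maintenance/updateCollation.php actually does internally, and `current_sortkey`, `recompute_all` and the sample rows are hypothetical.

```python
def current_sortkey(title: str) -> str:
    """Stand-in for the current key generation (assumption: full uppercasing)."""
    return title.upper()


def recompute_all(stored: dict[str, str]) -> int:
    """Regenerate every stored key in place; return how many rows changed."""
    changed = 0
    for title, key in stored.items():
        fresh = current_sortkey(title)
        if fresh != key:
            stored[title] = fresh
            changed += 1
    return changed


# Hypothetical rows: only "Maß" still carries a key of the old form.
stored = {"Maß": "MAß", "Masse": "MASSE", "Maßband": "MASSBAND"}
print(recompute_all(stored))  # 1
print(stored["Maß"])          # MASS
```

Running the recomputation a second time changes nothing, which is why a single forced pass is enough and no configuration round trip is needed.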

We can follow the process at https://meta.wikimedia.org/wiki/Requesting_wiki_configuration_changes. This seems like an uncontroversial change, so I don't think you need to hold a discussion about this. I could have the maintenance script started next week.

So far it looks like this should be done for German Wiktionary and for Commons. Should we perform the migration on any other German-language or multilingual wikis?

It was rejected because the resulting order is not the sequence the community needs for thematic arrangement. UCA works for names only and does not take into account deliberately distinguished structures based on ASCII codes.

The request made above was to switch all wikis, or at least German WP, to UCA default, which would break many applications.

The proposal was evaluated, and it turned out that more than 100,000 pages would need new individual sortkeys attached, because their keys are ASCII-encoded rather than names, and nobody can devise an algorithmic procedure for this. Furthermore, many desired sequences cannot be achieved in UCA; no solutions have been found yet.

The proposal implies that all sortkeys are human language names of things only, but the sortkeys are used in German WP also for structural elements based upon deliberately chosen ASCII codes.

It might work for German Wiktionary (I am not involved there), but it will definitely break German Wikipedia's special article categories and systems outside article space.

Some rows are affected:

MariaDB [commonswiki_p]> select count(*) from categorylinks where cl_sortkey like '%ß%';
+----------+
| count(*) |
+----------+
|  5726636 |
+----------+
1 row in set (7 min 8.384 sec)

MariaDB [dewiki_p]> select count(*) from categorylinks where cl_sortkey like '%ß%';
+----------+
| count(*) |
+----------+
|   355433 |
+----------+
1 row in set (17.516 sec)

MariaDB [dewiktionary_p]> select count(*) from categorylinks where cl_sortkey like '%ß%';
+----------+
| count(*) |
+----------+
|   152100 |
+----------+
1 row in set (7.981 sec)

We changed a template (CH&LI) that is present on all German Wiktionary pages containing a ß, so that all related sortkeys were updated.

select count(*) from categorylinks where cl_sortkey like '%ß%';
Executed in 8.94 seconds as of Sat, 24 Feb 2024 10:47:04 UTC.
Resultset (0 rows)

So German Wiktionary sortkeys are ß-free now.

matmarex assigned this task to Formatierer.

Well, that's one way to fix the problem. Thanks for letting us know @Formatierer, and sorry that we didn't do it from the MediaWiki side.

Because someone asked me, one last note: I now think the problem had already been solved earlier by some repair script, but nobody mentioned it here. The sortkeys of dewiki, enwiktionary and frwiktionary are also ß-free now; those of Commons are not.

Please don't forget Commons! It's important because if I have 50 or 100 subcategories for a street name, I want to know at one glance whether there is a category for a particular house number or not.