Handling of character 'ɪ' in cl_sortkey in database table categorylinks
Closed, Declined; Public

Description

In German Wiktionary, category sortkeys are stored in uppercase in the field cl_sortkey of the database table categorylinks. Some time ago (I don't know exactly when), the MediaWiki software changed how it handles the character 'ɪ', LATIN LETTER SMALL CAPITAL I (U+026A).
Before this change, 'ɪ' was not converted to uppercase when stored in cl_sortkey. Since the change, 'ɪ' is converted to its uppercase form 'Ɪ', LATIN CAPITAL LETTER SMALL CAPITAL I (U+A7AE).

It seems that the old values, which were entered before this software change occurred, were never patched. So the database now holds a mixture of sortkey values: some contain U+026A and some contain U+A7AE.

This has confusing effects when viewing category pages, because some entries don't appear in the list:
https://de.wiktionary.org/w/index.php?title=Kategorie:Reim_(Deutsch)&pagefrom=%C6%86%C9%AA%CC%AFt
The page https://de.wiktionary.org/wiki/Reim:Deutsch:-%C9%94%C9%AA%CC%AFt%C9%99 is missing from the list.

To verify the problem in the database, run a SELECT on the table categorylinks and compare the values stored in the field cl_sortkey. Some random examples, not related to the links posted above:

cl_from  cl_to           cl_sortkey            cl_timestamp         cl_sortkey_prefix  cl_collation  cl_type
301735   Reim_(Deutsch)  ɪNDƏT DEUTSCH:-ɪNDƏT  2013-01-21T12:21:58  ɪndət              uppercase     page
925336   Reim_(Deutsch)  ꞮTVⱯ DEUTSCH:-ꞮTVⱯ    2019-04-21T12:56:30  ɪtvɐ               uppercase     page

The cl_timestamp of the most recent row whose cl_sortkey still contains an 'ɪ' might help to find the time when this behavior changed.
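A minimal sketch of that idea in Python, assuming the rows have already been exported from categorylinks as (cl_sortkey, cl_timestamp) pairs. The function name bracket_change is hypothetical, and it only looks at 'ɪ' (U+026A) versus its uppercase form (U+A7AE). It is also an assumption that cl_timestamp reflects when the sortkey was last written; if it does not, the bounds are only a rough hint.

```python
from datetime import datetime

OLD_FORM = "\u026a"  # 'ɪ', left unchanged in keys written before the change
NEW_FORM = "\ua7ae"  # 'Ɪ', found in keys written after the change


def bracket_change(rows):
    """Return (latest old-style timestamp, earliest new-style timestamp).

    rows is an iterable of (cl_sortkey, cl_timestamp) pairs with ISO 8601
    timestamps. A bound is None when no matching row exists.
    """
    rows = list(rows)  # allow generators, since we scan twice
    old = [datetime.fromisoformat(ts) for key, ts in rows if OLD_FORM in key]
    new = [datetime.fromisoformat(ts) for key, ts in rows if NEW_FORM in key]
    return (max(old) if old else None, min(new) if new else None)


# The two sample rows from the listing above.
rows = [
    ("\u026aND\u018fT DEUTSCH:-\u026aND\u018fT", "2013-01-21T12:21:58"),
    ("\ua7aeTV\u2c6f DEUTSCH:-\ua7aeTV\u2c6f", "2019-04-21T12:56:30"),
]
lower, upper = bracket_change(rows)
print(lower.isoformat(), upper.isoformat())
# prints: 2013-01-21T12:21:58 2019-04-21T12:56:30
```

With only two rows the window is very wide; feeding in every row of a category would narrow it.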

Unfortunately, I cannot say which categories are affected; it might be all of them. So I think the best fix is to regenerate all sortkeys of all categories, or just those which contain an 'ɪ', though I don't know whether this is the only character affected by the software change.
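To illustrate what "regenerate" would mean for a single row, here is a rough Python sketch. It assumes, based only on the sample rows above, that a key under the uppercase collation is the uppercased cl_sortkey_prefix, a separator, and the page title without its namespace. The functions rebuild_sortkey and needs_update, the separator value, the handling of an empty prefix, and the title "Deutsch:-ɪndət" are all assumptions for illustration, not MediaWiki's actual code.

```python
def rebuild_sortkey(prefix: str, title_text: str, sep: str = " ") -> str:
    """Rebuild a key the way the sample rows suggest (an assumption).

    sep is the whitespace shown between the two parts in the listing;
    the real stored separator may differ.
    """
    if not prefix:
        # Assumed: without a prefix the key is just the uppercased title.
        return title_text.upper()
    return f"{prefix}{sep}{title_text}".upper()


def needs_update(stored: str, prefix: str, title_text: str) -> bool:
    """True if the stored key differs from a freshly rebuilt one."""
    return stored != rebuild_sortkey(prefix, title_text)


# Old-style row 301735: the stored key still contains U+026A.
stored_old = "\u026aND\u018fT DEUTSCH:-\u026aND\u018fT"
print(needs_update(stored_old, "\u026and\u0259t", "Deutsch:-\u026and\u0259t"))

# New-style row 925336: the stored key already contains U+A7AE.
stored_new = "\ua7aeTV\u2c6f DEUTSCH:-\ua7aeTV\u2c6f"
print(needs_update(stored_new, "\u026atv\u0250", "Deutsch:-\u026atv\u0250"))
```

On a Python build whose Unicode data includes U+A7AE (Unicode 9.0 or later), the first row is flagged for an update and the second is not.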

Even more unfortunately, this would have to be done on all other wikis too, if it has not been done already.

Event Timeline

I found two other characters which are affected by this conversion to uppercase. So far we have:

Char  Unicode Character
'ɐ'   LATIN SMALL LETTER TURNED A (U+0250)
'ɡ'   LATIN SMALL LETTER SCRIPT G (U+0261), not to be confused with ASCII 'g', LATIN SMALL LETTER G (U+0067)
'ɪ'   LATIN LETTER SMALL CAPITAL I (U+026A)

These characters are mostly used in IPA notation, and the change took effect on German Wiktionary between 2019-09-07T20:02:48 and 2019-09-09T18:10:14 (database timestamps).
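The three mappings listed above can be checked with a short Python sketch, which also lists other candidates in the IPA Extensions block (U+0250 to U+02AF) that have an uppercase form. The helper names are hypothetical, and the result depends on the Unicode version bundled with the Python interpreter, which need not match the data MediaWiki used before or after the change. Treat the candidate list as a starting point, not as proof of which characters are affected.

```python
import unicodedata

# The three characters reported so far in this task.
REPORTED = ["\u0250", "\u0261", "\u026a"]


def upper_mapping(ch: str):
    """Return (code point, uppercase code point, uppercase name) for one char."""
    up = ch.upper()
    return (f"U+{ord(ch):04X}", f"U+{ord(up):04X}", unicodedata.name(up))


def ipa_candidates(start: int = 0x0250, end: int = 0x02AF):
    """Characters in the given range whose uppercase form differs from themselves."""
    return [chr(cp) for cp in range(start, end + 1) if chr(cp).upper() != chr(cp)]


print("Unicode data version:", unicodedata.unidata_version)
for ch in REPORTED:
    print(ch, *upper_mapping(ch))

# Other characters that may be worth checking in cl_sortkey.
print([f"U+{ord(c):04X}" for c in ipa_candidates() if c not in REPORTED])
```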

This might be related to https://phabricator.wikimedia.org/T219279

So will the column cl_sortkey of the database table categorylinks ever be fixed or not?

matmarex subscribed.

Sorry that no one responded to this task. I think we could handle this as part of T323868 now.