Handling of character 'ɪ' in cl_sortkey in database table categorylinks
Closed, Declined · Public

Assigned To

None

Authored By

	Formatierer
	May 3 2020, 11:51 AM

Description

In German Wiktionary, category sortkeys are stored in uppercase in the field cl_sortkey of the database table categorylinks. Some time ago (I don't know exactly when this happened) the MediaWiki software changed how it handles the character 'ɪ' (Unicode character LATIN LETTER SMALL CAPITAL I, U+026A).
Before this change, the character 'ɪ' was not converted to uppercase when stored in cl_sortkey. Since the change, 'ɪ' is converted to its uppercase form, the Unicode character LATIN CAPITAL LETTER SMALL CAPITAL I (U+A7AE).
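
To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not MediaWiki's code: the function names and the LEGACY_UNCHANGED set are hypothetical, and the "old" behaviour is only modelled by skipping 'ɪ'. The "new" behaviour relies on Python's own Unicode case tables, which are assumed to map U+026A to U+A7AE as current Unicode data does.

```python
# Hypothetical model of the behaviour described in this task, not MediaWiki code.
# The "old" conversion is modelled by leaving U+026A untouched.
LEGACY_UNCHANGED = {"\u026a"}  # ɪ LATIN LETTER SMALL CAPITAL I


def old_uppercase(text: str) -> str:
    """Uppercase every character except those the old conversion left alone."""
    return "".join(ch if ch in LEGACY_UNCHANGED else ch.upper() for ch in text)


def new_uppercase(text: str) -> str:
    """Uppercase using the full Unicode case mapping known to this Python."""
    return text.upper()


if __name__ == "__main__":
    for prefix in ("ɪndət", "ɪtvɐ"):
        print(prefix, "old:", old_uppercase(prefix), "new:", new_uppercase(prefix))
```

Two rows written under different rules therefore start with different code points even though their sortkey prefixes begin with the same letter.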

It seems that the old values, which were entered before this software change occurred, were never updated. So we now have a mixture of sortkey values in the database: some contain U+026A and some contain U+A7AE.

This has confusing effects when viewing category pages, because some entries don't appear in the list:
https://de.wiktionary.org/w/index.php?title=Kategorie:Reim_(Deutsch)&pagefrom=%C6%86%C9%AA%CC%AFt
The page https://de.wiktionary.org/wiki/Reim:Deutsch:-%C9%94%C9%AA%CC%AFt%C9%99 is missing in the list.

To verify the problem in the database, run a SELECT on the table categorylinks and compare the values stored in the field cl_sortkey. Some random examples, not related to the links posted above:

cl_from  cl_to           cl_sortkey            cl_timestamp         cl_sortkey_prefix  cl_collation  cl_type
301735   Reim_(Deutsch)  ɪNDƏT DEUTSCH:-ɪNDƏT  2013-01-21T12:21:58  ɪndət              uppercase     page
925336   Reim_(Deutsch)  ꞮTVⱯ DEUTSCH:-ꞮTVⱯ    2019-04-21T12:56:30  ɪtvɐ               uppercase     page

The cl_timestamp of the last cl_sortkey value which contains an 'ɪ' might help to find the time when this behavior changed.
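
The check suggested above can be sketched in Python on rows that have already been fetched. This is an illustration only: the Row type and both helper functions are hypothetical, the sample data are the two rows quoted in this task, and the test assumes that an up-to-date sortkey starts with the uppercased cl_sortkey_prefix according to Python's Unicode tables, which may not match MediaWiki's exactly.

```python
from typing import Iterable, NamedTuple, Optional


class Row(NamedTuple):
    """Subset of categorylinks columns needed for the check (hypothetical type)."""
    cl_from: int
    cl_sortkey: str
    cl_timestamp: str  # ISO 8601, so string order equals chronological order
    cl_sortkey_prefix: str


def is_stale(row: Row) -> bool:
    """True if the stored sortkey does not start with the prefix as uppercased today.

    Rows with an empty prefix always pass, so they are not covered by this check.
    """
    return not row.cl_sortkey.startswith(row.cl_sortkey_prefix.upper())


def last_stale_timestamp(rows: Iterable[Row]) -> Optional[str]:
    """Latest cl_timestamp among stale rows, or None if there are none."""
    stale = [r.cl_timestamp for r in rows if is_stale(r)]
    return max(stale) if stale else None


rows = [
    Row(301735, "ɪNDƏT DEUTSCH:-ɪNDƏT", "2013-01-21T12:21:58", "ɪndət"),
    Row(925336, "ꞮTVⱯ DEUTSCH:-ꞮTVⱯ", "2019-04-21T12:56:30", "ɪtvɐ"),
]

if __name__ == "__main__":
    print([r.cl_from for r in rows if is_stale(r)])
    print(last_stale_timestamp(rows))
```

Run over a full export of the table, the latest stale timestamp would give a lower bound for the date of the change.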

Unfortunately I cannot say which categories are affected; it might be all of them. So to correct this, I think it is best to regenerate all sortkeys of all categories. Alternatively, regenerate just those which contain an 'ɪ', but I don't know whether this is the only character affected by the software change.

Even more unfortunately, the same would have to be done on all other wikis, if it has not been done already.

Related Objects

Mentioned In: T319432: Migrate WMF production from PHP 7.4 to PHP 8.1
T323868: category sortkey generation of german letter ß changed
Mentioned Here: T323868: category sortkey generation of german letter ß changed
T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes

Event Timeline

Formatierer created this task. May 3 2020, 11:51 AM

Restricted Application added a subscriber: Aklapper. May 3 2020, 11:51 AM

RhinosF1 added a project: MediaWiki-Categories. May 3 2020, 11:52 AM

Restricted Application added a subscriber: RhinosF1. May 3 2020, 11:52 AM

taavi added a project: DBA. May 3 2020, 11:53 AM

Marostegui edited projects, added Wikimedia-database-issue, MediaWiki-libs-Rdbms; removed DBA. May 3 2020, 12:18 PM

Restricted Application added a project: Platform Engineering. May 3 2020, 12:18 PM

Reedy removed projects: MediaWiki-libs-Rdbms, Wikimedia-database-issue. May 3 2020, 1:51 PM

Aklapper removed a project: Platform Engineering. May 3 2020, 2:33 PM

I found two other characters which are affected by this conversion to uppercase. So far we have:

Char  Unicode Character
'ɐ'   LATIN SMALL LETTER TURNED A (U+0250)
'ɡ'   LATIN SMALL LETTER SCRIPT G (U+0261), not to be confused with ASCII 'g', LATIN SMALL LETTER G (U+0067)
'ɪ'   LATIN LETTER SMALL CAPITAL I (U+026A)

These characters are mostly used in IPA notation, and the effect appeared in German Wiktionary between 2019-09-07T20:02:48 and 2019-09-09T18:10:14 (database timestamps).
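
One way to look for further candidates is to list every character in the Unicode block "IPA Extensions" (U+0250 to U+02AF) that has an uppercase form. The sketch below does this in Python. The function name is hypothetical, and the result only reflects the Unicode data bundled with the Python version in use. It shows which characters have an uppercase mapping today, not which of them were handled differently by older MediaWiki or PHP versions.

```python
import unicodedata

# Unicode block "IPA Extensions", U+0250 through U+02AF.
IPA_EXTENSIONS = range(0x0250, 0x02B0)


def uppercase_candidates(block=IPA_EXTENSIONS):
    """Return (char, uppercase, name) for characters whose uppercase form differs."""
    found = []
    for cp in block:
        ch = chr(cp)
        up = ch.upper()
        if up != ch:
            found.append((ch, up, unicodedata.name(ch, "<unnamed>")))
    return found


if __name__ == "__main__":
    for ch, up, name in uppercase_candidates():
        print(f"U+{ord(ch):04X} {ch} -> {up}  {name}")
```

Any character in this list whose uppercase form was added to Unicode only recently is a candidate for the same mixture of old and new sortkeys.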

This might be related to https://phabricator.wikimedia.org/T219279

So will the column cl_sortkey of the database table categorylinks ever be fixed or not?

Formatierer mentioned this in T323868: category sortkey generation of german letter ß changed. Dec 3 2022, 3:52 PM

TheresNoTime removed a subscriber: RhinosF1. Dec 15 2022, 11:36 PM

Sorry that no one responded to this task. I think we could handle this as part of T323868 now.

matmarex mentioned this in T319432: Migrate WMF production from PHP 7.4 to PHP 8.1. Sep 25 2023, 7:35 PM
