
cl_sortkey_prefix crops unicode string mid character
Closed, Resolved (Public)

Description

Per the mw.org manual, cl_sortkey_prefix is supposed to be "the human readable version of cl_sortkey" (when the default sort key is not used).

For very long non-Latin sort keys (e.g. قصر البارون امبان بمصر الجديدة.jpg), the sort key (for Category:Cultural heritage monuments in Egypt with known IDs) gets cropped to fit in the table. However, the cropping does not appear to respect the string's encoding, so the cut can land in the middle of a multi-byte Unicode character. As a result, cl_sortkey_prefix cannot be decoded back to Unicode and cannot be said to be human readable.
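To illustrate the failure mode described above, here is a minimal Python sketch (not the MediaWiki code, which is PHP). The `naive_crop` helper, the 255-byte limit, and the repeated-letter test string are all assumptions made for illustration. It crops the UTF-8 bytes at a fixed length and then tries to decode the result.

```python
# Hypothetical byte limit; the real column width is an assumption here.
LIMIT = 255


def naive_crop(text: str, limit: int = LIMIT) -> bytes:
    """Crop the UTF-8 encoding of text at a fixed byte count,
    ignoring character boundaries (the behaviour reported in this task)."""
    return text.encode("utf-8")[:limit]


# Arabic letters take two bytes each in UTF-8, so an odd byte limit
# falls in the middle of a character.
key = "ق" * 200  # 400 bytes when encoded
cropped = naive_crop(key)

try:
    cropped.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # The last byte is a lead byte with no continuation byte after it.
    decoded_ok = False
```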

The desired result would be for the crop to be encoding-aware and to drop the last partial character.
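The desired behaviour could be sketched as follows in Python. This is only an illustration under stated assumptions, not the fix that was applied in MediaWiki; the function name `crop_utf8` and the limits used in the usage example are hypothetical. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`, so the crop point can be moved back until it sits on a character boundary.

```python
def crop_utf8(text: str, limit: int) -> str:
    """Crop text so that its UTF-8 encoding fits in `limit` bytes
    without splitting a multi-byte character."""
    raw = text.encode("utf-8")
    if len(raw) <= limit:
        return text
    end = limit
    # raw[end] is the first byte that would be cut off. If it is a
    # continuation byte (0b10xxxxxx), the cut is mid-character, so step
    # back to the lead byte of that character and drop it entirely.
    while end > 0 and (raw[end] & 0xC0) == 0x80:
        end -= 1
    return raw[:end].decode("utf-8")


# Usage: 200 two-byte letters cropped to an odd limit lose the partial one.
prefix = crop_utf8("ق" * 200, 255)
```

Here `prefix` holds 127 whole letters (254 bytes) rather than 255 bytes ending in a dangling lead byte. For valid input, `raw[:limit].decode("utf-8", errors="ignore")` should give the same result more tersely, at the cost of silently hiding any other invalid bytes.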

Event Timeline

This arose from T200325.

This has potentially been touched on in T155529, which at least suggests that the mw.org manual is incorrect in promising that cl_sortkey_prefix is human readable.

Change 449280 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/core@master] Use multibyte-aware truncation to avoid invalid UTF-8 in cl_sortkey_prefix

https://gerrit.wikimedia.org/r/449280

matmarex triaged this task as Low priority.
matmarex edited projects, added MediaWiki-Categories; removed MediaWiki-General.

Change 449280 merged by jenkins-bot:
[mediawiki/core@master] Use multibyte-aware truncation to avoid invalid UTF-8 in cl_sortkey_prefix

https://gerrit.wikimedia.org/r/449280

Note that this also happens for [[https://www.mediawiki.org/wiki/Manual:Categorylinks_table#cl_sortkey|cl_sortkey]], but I didn't raise that originally since the manual mentions that it may or may not be readable by a human.