Consider changing UCA category sort key format to handle sort key prefixes correctly
Open, Needs Triage, Public

Description

Currently, when the user provides a sort key, whether in the category link or using DEFAULTSORT, it is used by MediaWiki as a sort key prefix. Specifically, both the prefix and a separating line feed are prepended to the page's unnamespaced title, in Title::getCategorySortKey():

if ( $prefix !== '' ) {
	# Separate with a line feed, so the unprefixed part is only used as
	# a tiebreaker when two pages have the exact same prefix.
	# In UCA, tab is the only character that can sort above LF
	# so we strip both of them from the original prefix.
	$prefix = strtr( $prefix, "\n\t", '  ' );
	return "$prefix\n$unprefixed";
}

While at first glance it may seem appropriate to merge the components of the actual sort key like this -- after all, the ICU User Guide recommends merging sort key components (e.g. for a combined last/first name sort) using U+FFFE -- it may not be appropriate in this case, because the joined string is then collated as a whole.

By default, the sort keys generated by ICU consist of three levels of collation, starting with base character distinctions, followed by accent distinctions, then by case/variant distinctions. Hence, the combined sort key is roughly in this format:

prefix1, title1, prefix2, title2, prefix3, title3

Compare this to the comment though. The "unprefixed part" is supposed to be "only used as a tiebreaker when two pages have the exact same prefix." So I would expect a format roughly like this:

prefix1, prefix2, prefix3, title1, title2, title3
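To make the difference concrete, here is a minimal sketch, assuming PHP's intl extension and the root collation. It does not use MediaWiki's IcuCollation, and the helper name and sample strings are made up for illustration. It builds keys the way the current code does (join first, collate afterwards) for two pages whose prefixes differ only by accents:

```php
<?php
// Illustrative only: requires the intl extension; uses the root locale.
$collator = new Collator( 'root' );

// Hypothetical helper mirroring the current behaviour: the prefix and the
// unprefixed title are joined with a line feed and collated as one string.
function currentStyleKey( Collator $collator, string $prefix, string $unprefixed ): string {
	$prefix = strtr( $prefix, "\n\t", '  ' );
	return $collator->getSortKey( "$prefix\n$unprefixed" );
}

$pageA = currentStyleKey( $collator, 'résumé', 'A' );
$pageB = currentStyleKey( $collator, 'resume', 'B' );

// The prefixes on their own: 'resume' collates before 'résumé', because they
// differ only at the accent level.
$prefixOrder = $collator->compare( 'resume', 'résumé' );

// The joined keys: base-character weights of the whole string (including the
// title) are compared before any accent weights.
$joinedOrder = strcmp( $pageA, $pageB );

var_dump( $prefixOrder < 0, $joinedOrder < 0 );
```

If both values are true, the page with prefix "résumé" is listed before the page with prefix "resume", even though "resume" alone collates first. The titles were not used merely as a tiebreaker: their base-character difference (A versus B) outranked the accent difference between the prefixes.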

That is, the prefixing would happen after IcuCollation::getSortKey(), not beforehand. One way to implement this would be for the function to split the provided string into its line-feed-separated components, process each component using Collator::getSortKey(), and then combine the results using a null byte as a separator. (ICU's sort keys do not contain any null byte except as a terminator, as mentioned in T137642: IcuCollation sort keys depend on PHP/HHVM version, and importantly, null bytes sort below all other byte values.)
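The split-then-collate approach could be sketched roughly as follows. This is only an illustration, assuming PHP's intl extension and the root collation; componentwiseSortKey() is a hypothetical helper, not an existing MediaWiki function, and the sample strings are made up.

```php
<?php
// Illustrative only: requires the intl extension; not MediaWiki's actual code.

// Hypothetical helper: collate each LF-separated component on its own,
// then join the binary sort keys with a null byte.
function componentwiseSortKey( Collator $collator, string $sortKey ): string {
	$keys = [];
	foreach ( explode( "\n", $sortKey ) as $component ) {
		$key = $collator->getSortKey( $component );
		if ( $key === false ) {
			throw new RuntimeException( 'Collator::getSortKey() failed' );
		}
		// The returned key may end with ICU's null terminator (see T137642).
		// Strip it so the only null bytes left are the separators added below.
		$keys[] = rtrim( $key, "\0" );
	}
	return implode( "\0", $keys );
}

$collator = new Collator( 'root' );
$pageA = componentwiseSortKey( $collator, "résumé\nA" );
$pageB = componentwiseSortKey( $collator, "resume\nB" );

// Binary comparison, as a database would do on the stored sort key.
var_dump( strcmp( $pageB, $pageA ) < 0 );
```

With this layout, all three collation levels of the prefix are compared before any part of the title, so the page with prefix "resume" should sort before the page with prefix "résumé" regardless of their titles. The title key only matters when the prefix keys are byte-for-byte identical.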

The changes could be implemented under a new set of collation IDs (possibly using new subclasses), which would decouple the migration to the new format from the deployment of new MediaWiki branches. This decoupling is probably very important: WMF needs to be able to roll back to the previous MediaWiki branch if necessary, and should be able to do a staged migration to the new sort key format.