
Garbage characters in category first-letter heading for numbers 2 and 3 (due to collation handling?)
Closed, Resolved (Public)

Description

When pages in a category use numbers as sortkeys, the first-letter headings for the numbers 2 and 3 are printed as garbage characters on the category page.

See for example: https://www.wikidex.net/wiki/Categor%C3%ADa:Pok%C3%A9mon_por_generaci%C3%B3n

I'm able to reproduce it on a fresh installation with $wgCategoryCollation = 'uca-default';

To reproduce, create some pages and add each to the same category, with a number as the sortkey. For example (each line on a different page):

[[Category:Example|1]]
[[Category:Example|2]]
[[Category:Example|3]]

Now visit the Category:Example page and you'll see something like this:

[Screenshot: imagen.png (458×430 px, 18 KB)]

Version: MediaWiki 1.30 (but it was also happening in MediaWiki 1.29)

Event Timeline

I've traced this as far as the fetchFirstLetterData method of IcuCollation.php:

		/* Sort the letters.
		 *
		 * It's impossible to have the precompiled data file properly sorted,
		 * because the sort order changes depending on ICU version. If the
		 * array is not properly sorted, the binary search will return random
		 * results.
		 *
		 * We also take this opportunity to remove primary collisions.
		 */
		$letterMap = [];
		foreach ( $letters as $letter ) {
			$key = $this->getPrimarySortKey( $letter );
			if ( isset( $letterMap[$key] ) ) {
				// Primary collision
				// Keep whichever one sorts first in the main collator
				if ( $this->mainCollator->compare( $letter, $letterMap[$key] ) < 0 ) {
					$letterMap[$key] = $letter;
				}
			} else {
				$letterMap[$key] = $letter;
			}
		}
		ksort( $letterMap, SORT_STRING );

There's a primary collision for the letters 2 and 3. I added a debug line there and, surprisingly, those are the only collisions. Output (the character is in brackets, its hex representation from bin2hex in parentheses):

  • Collision: $letter: [2] (32), $key: [�] (18), existing letter in $letterMap[$key]: [𒑖] (f0929196)
  • Collision: $letter: [3] (33), $key: [�] (1a), existing letter in $letterMap[$key]: [𒑗] (f0929197)

The if clause inside the primary-collision branch isn't satisfied, so the existing entry is kept. If I force the code to enter it and overwrite the existing $letterMap[$key], the numbers appear as expected.

Maybe the problem is the collision itself, which shouldn't happen for those numbers?

CC'ing Tim per rMWeaeea84b44dc58c6c9e2353462fd38759eecc613

Aklapper renamed this task from Garbage characters in category first-letter heading for numbers 2 and 3 to Garbage characters in category first-letter heading for numbers 2 and 3 (due to collation handling?). May 2 2018, 10:06 AM
Aklapper added a subscriber: Bdijkstra.
Aklapper added a subscriber: matmarex.
This comment was removed by Bdijkstra.

I didn't have time to investigate earlier, even though this is fascinating.

From a quick investigation, ICU considers all "numeric" characters in all scripts to sort exactly the same as the Arabic digit, so e.g. 1, ۱, 𒐕, 𐒡, ꘡, and nearly a hundred other characters are all sorted in the same position as 1. You can try this on https://ssl.icu-project.org/icu-bin/collation.html. So why is this problem not occurring for all digits, rather than just 2/𒑖 and 3/𒑗?
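One rough way to see which characters are involved, without ICU, is to look at the Unicode numeric value of each character. This Python sketch uses the standard `unicodedata` module; it only inspects character properties and is not how ICU computes sort keys, so treat the grouping as an approximation of what the collation demo shows. The helper name `group_by_numeric_value` and the sample list are made up for illustration.

```python
import unicodedata
from collections import defaultdict


def group_by_numeric_value(chars):
    """Group characters by their Unicode numeric value (None if they have none)."""
    groups = defaultdict(list)
    for ch in chars:
        value = unicodedata.numeric(ch, None)
        groups[value].append(ch)
    return dict(groups)


# "\u06f1" and "\u06f2" are EXTENDED ARABIC-INDIC DIGIT ONE and TWO,
# "\ua621" is VAI DIGIT ONE; "a" has no numeric value.
samples = ["1", "\u06f1", "\ua621", "2", "\u06f2", "a"]

for value, chars in group_by_numeric_value(samples).items():
    names = [unicodedata.name(c, "?") for c in chars]
    print(value, names)
```

Characters that land in the same group are the ones that would be expected to share a sort position with the corresponding ASCII digit.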

It turns out that for all digits other than 2 and 3, there are no other "heading candidates" in the data we use. It looks like the script that generated it filtered them out. So why are these two special?

When 𒑖 and 𒑗 were originally introduced in Unicode 5.0, they did not have a numeric value defined, and so they were sorted separately from 2 and 3. This was corrected in Unicode 7.0. Compare 6.0.0 and 7.0.0 (search for the character names, CUNEIFORM NUMERIC SIGN NIGIDAMIN and CUNEIFORM NUMERIC SIGN NIGIDAESH). The heading data was originally generated from the older Unicode version, and we are now using it with a newer one.

Note again that 2/𒑖 and 3/𒑗 are considered identical, but 𒑖 and 𒑗 are chosen as headings just because of the order in which we process the data. I think we should just use the actual codepoint values as a tiebreaker, so that U+0032/U+0033 will sort before U+12456/U+12457 and will be chosen as headings.
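A minimal sketch of that tiebreaker idea, written in Python rather than PHP so that it runs without ICU. It is not the merged patch. The names `pick_headings`, `PRIMARY_KEYS`, `primary_key` and `compare` are hypothetical stand-ins, not MediaWiki or ICU APIs; the key bytes 0x18 and 0x1a come from the debug output earlier in this task, while 0x16 for the digit 1 is an assumed value used only for illustration.

```python
# Hypothetical stand-in for getPrimarySortKey(): colliding characters share a key.
PRIMARY_KEYS = {
    "1": b"\x16",           # assumed value, for illustration only
    "2": b"\x18",           # from the debug output
    "\U00012456": b"\x18",  # CUNEIFORM NUMERIC SIGN NIGIDAMIN
    "3": b"\x1a",           # from the debug output
    "\U00012457": b"\x1a",  # CUNEIFORM NUMERIC SIGN NIGIDAESH
}


def primary_key(letter):
    return PRIMARY_KEYS[letter]


def compare(a, b):
    """Stand-in for the main collator: returns -1, 0 or 1 (0 means same position)."""
    ka, kb = primary_key(a), primary_key(b)
    return (ka > kb) - (ka < kb)


def pick_headings(letters, use_codepoint_tiebreak=True):
    """Keep one heading per primary key, then return headings in key order."""
    letter_map = {}
    for letter in letters:
        key = primary_key(letter)
        if key not in letter_map:
            letter_map[key] = letter
            continue
        existing = letter_map[key]
        order = compare(letter, existing)
        # Python compares str values code point by code point.
        tie_won = use_codepoint_tiebreak and order == 0 and letter < existing
        if order < 0 or tie_won:
            letter_map[key] = letter
    return [letter_map[key] for key in sorted(letter_map)]


# The cuneiform signs are processed before the ASCII digits, as in the report.
letters = ["1", "\U00012456", "\U00012457", "2", "3"]
print(pick_headings(letters))
print(pick_headings(letters, use_codepoint_tiebreak=False))
```

With the tiebreaker enabled the headings are 1, 2 and 3; with it disabled, whichever colliding character was processed first wins, which reproduces the behaviour described in this task.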

Change 431746 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/core@master] IcuCollation: Use codepoint as tiebreaker when getting first-letters

https://gerrit.wikimedia.org/r/431746

Change 431746 merged by jenkins-bot:
[mediawiki/core@master] IcuCollation: Use codepoint as tiebreaker when getting first-letters

https://gerrit.wikimedia.org/r/431746

It would be cool to have this patch in the 1.31 LTS release.

Change 432595 had a related patch set uploaded (by Martineznovo; owner: Bartosz Dziewoński):
[mediawiki/core@REL1_31] IcuCollation: Use codepoint as tiebreaker when getting first-letters

https://gerrit.wikimedia.org/r/432595

The patch will be deployed to Wikimedia wikis this week per the usual
schedule (Tuesday-Thursday), but note that the first-letter data is
additionally cached for up to a week, so it might take a bit longer for the
issue to be fixed. I'll keep this task open until I can verify the fix.

Change 432595 merged by jenkins-bot:
[mediawiki/core@REL1_31] IcuCollation: Use codepoint as tiebreaker when getting first-letters

https://gerrit.wikimedia.org/r/432595