
Investigation: New sort order: Hyphenated words should be sorted lower than the prefix
Open, Medium, Public

Description

This is an investigation card for Danny/Johan -- have a conversation with folks on the Village pump to see if we should change anything. Let's compare the uppercase sort to the uca-collation sort. Danny's guess is that this is a lateral move. Ryan suggests removing all commas from defaultsorts (or just ignoring them).


User:GrahamHardy pointed out on the Village pump that with the new sort order, hyphenated names come before a name that just has the prefix.

For example, on Category:Living people, the sort currently looks like this:

  • Jessica Ennis-Hill
  • Delloreen Ennis-London
  • Andy Ennis
  • Bruce Ennis
  • Chris Ennis, Jr.

I'd expect it to sort the hyphenated names after the others, like this:

  • Andy Ennis
  • Bruce Ennis
  • Chris Ennis, Jr.
  • Jessica Ennis-Hill
  • Delloreen Ennis-London

I did a test on Meta to see what the old collation does, and found an interesting result:

  • Jessica Ennis Hill
  • Bruce Ennis Jones
  • Andy Ennis
  • Bruce Ennis
  • Chris Ennis, Jr.
  • Jessica Ennis-Hill
  • Delloreen Ennis-London

Ennis-Hill and Ennis-London are sorted in the place that I would expect, but I would also have expected Bruce Ennis Jones to come below Bruce Ennis.

@kaldari, let's check in about this next week? I'm not sure if anything can be done, but I'd like to understand the sort rules.

[Attached image: ennis test.jpg]

Event Timeline

Just as an aside, hyphenation behaviour is configurable in the new sort algorithm (but we are probably not going to change it unless it really bothers people, as it would require rerunning the maintenance script).

This doesn't seem quite right. "Ennis" should come before "Ennis Jones" and before "Ennis-Jones". Not to make this too circular, but quoting https://en.wikipedia.org/wiki/Alphabetical_order, "If the first letters are the same, then the second letters are compared, and so on. If a position is reached where one string has no more letters to compare while the other does, then the first (shorter) string is deemed to come first in alphabetical order." Or as my librarian in grade three taught us, "Nothing comes before something."

I hope this is an easy tweak.

Copying the discussion from the village pump:

DannyH:

I just did a test on Meta-Wiki, which still has the original sorting, and it does put Andy Ennis before Jessica Ennis-Hill, as we would expect. But -- I also discovered in that test that the original sorting does this:

  • Jessica Ennis Hill
  • Andy Ennis
  • Bruce Ennis
  • Jessica Ennis-Hill

I'm not sure what that does in the new sorting.

Redrose64:

What should be done is not to look at the article name, but the sortkey that is used. For these four cases, the sortkeys will be:

  • Ennis Hill, Jessica
  • Ennis, Andy
  • Ennis, Bruce
  • Ennis-Hill, Jessica

It is known that space collates before all other characters, and that common punctuation collates before letters. So the question is: does the hyphen-minus character collate before or after the comma?

Bawolff:

The algorithm we are using (UCA) has several config options related to punctuation. In the default config ("non-ignorable" which we are using), punctuation characters are considered to be like letters, and the order goes " " (space) then "-" then ",".

Redrose64:

This will give:

  • Ennis Hill, Jessica
  • Ennis-Hill, Jessica
  • Ennis, Andy
  • Ennis, Bruce

which answers the original question.
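The ordering described above can be reproduced with a rough sketch. This is not the real UCA: it is a toy, single-level model that only assumes what the thread states, namely that space sorts before "-", "-" before ",", and all three before letters. The names `PUNCT_RANK` and `toy_key` are made up for illustration.

```python
# Toy model of the ordering described in the thread; NOT the real UCA.
# Assumption: space < hyphen < comma < everything else.
PUNCT_RANK = {" ": 0, "-": 1, ",": 2}

def toy_key(sortkey):
    """Map each character to a rank; the three punctuation marks sort lowest."""
    return [PUNCT_RANK.get(ch, 3 + ord(ch.lower())) for ch in sortkey]

keys = [
    "Ennis, Andy",
    "Ennis, Bruce",
    "Ennis-Hill, Jessica",
    "Ennis Hill, Jessica",
]

ordered = sorted(keys, key=toy_key)
for k in ordered:
    print(k)
```

Under those assumptions the four sortkeys come out in the same order as the list above, with both "Ennis Hill" and "Ennis-Hill" ahead of "Ennis,".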

According to https://ssl.icu-project.org/icu-bin/collation.html, it should sort as follows under uca-default:

  1. Ennis
  2. Ennis Hill
  3. Ennis Jones
  4. Ennis-Hill
  5. Ennis-London
  6. Ennis, Jr.

Simplifying a bit, I think the sort order that we're aiming for is:

  • Smith, Allison
  • Smith, Bob
  • Smith Jones, Allison
  • Smith Jones, Bob
  • Smith Jones, Jr., Bob
  • Smith-Jones, Allison
  • Smith-Jones, Bob
  • Smith-Jones, Jr., Bob

which would give us:

  • Allison Smith
  • Bob Smith
  • Allison Smith Jones
  • Bob Smith Jones
  • Bob Smith Jones, Jr.
  • Allison Smith-Jones
  • Bob Smith-Jones
  • Bob Smith-Jones, Jr.

The reason it is mis-sorting them is that those articles are actually using "Ennis," in the DEFAULTSORT keys. UCA sorts hyphens before commas, so it sorts "Ennis," after "Ennis-". All that is required to fix this is to remove all the commas from the DEFAULTSORT keys.
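As a quick sanity check of that claim, here is a minimal sketch using a toy ranking (space, then hyphen, then comma, then letters). It is a simplification for illustration, not the actual UCA implementation, and `toy_key` is a hypothetical helper.

```python
# Toy model only: space < hyphen < comma < everything else. NOT the real UCA.
PUNCT_RANK = {" ": 0, "-": 1, ",": 2}

def toy_key(sortkey):
    """Rank each character so the three punctuation marks sort lowest."""
    return [PUNCT_RANK.get(ch, 3 + ord(ch.lower())) for ch in sortkey]

with_commas = [
    "Ennis, Andy",
    "Ennis, Bruce",
    "Ennis-Hill, Jessica",
    "Ennis-London, Delloreen",
]
# Same sort keys with every comma removed, e.g. "Ennis Andy".
without_commas = [k.replace(",", "") for k in with_commas]

sorted_with = sorted(with_commas, key=toy_key)
sorted_without = sorted(without_commas, key=toy_key)

print(sorted_with)     # hyphenated names first, as reported
print(sorted_without)  # hyphenated names last, as expected
```

With the commas in place, "Ennis-" wins over "Ennis,". Without them, "Ennis " (with a space) is compared against "Ennis-", and the space sorts first.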

DannyH renamed this task from New sort order: Hyphenated words should be sorted lower than the prefix to Investigation: New sort order: Hyphenated words should be sorted lower than the prefix. Sep 6 2016, 10:26 PM
DannyH updated the task description.

Possible solutions include:

  1. Get the English Wikipedia to stop using commas in DEFAULTSORT keys
  2. Make the DEFAULTSORT parser function strip out commas before creating sort keys
  3. Figure out a way to configure the UCA collation to sort commas before hyphens and regenerate all the sort keys (may not be possible)

Of those, #1 is my preferred option, followed by #2. #3 would be a last resort, IMO.

We also need to keep in mind legitimate uses of commas like https://en.wikipedia.org/wiki/3,2,1..._Frankie_Go_Boom. In that case ignoring commas would cause it to be sorted as "321" which would be incorrect.
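To make the concern concrete, here is a small sketch of what a naive comma stripper would do. `strip_commas` is a hypothetical helper, not the DEFAULTSORT parser function or anything in MediaWiki.

```python
def strip_commas(sortkey):
    """Hypothetical helper: drop every comma from a sort key."""
    return sortkey.replace(",", "")

# The intended case: a "Surname, Forename" key loses its comma.
print(strip_commas("Ennis, Andy"))               # Ennis Andy

# The problem case: commas that are part of the title get removed too.
print(strip_commas("3,2,1... Frankie Go Boom"))  # 321... Frankie Go Boom
```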

#2 seems kind of hacky to me.

"Figure out a way to configure the UCA collation to sort commas before hyphens and regenerate all the sort keys (may not be possible)"

The PHP bindings don't let us add arbitrary additional rules (which is rather sad), but en-US-u-va-posix (that is its name on the ICU demo page; MW uses a different name) does come kind of close: spaces still seem to come before commas, but commas come before hyphens. That said, I'm unsure if it would be a good idea to switch to that.
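For comparison, here is a rough sketch of what that alternative ordering would do to the sortkeys from the village pump discussion. It only models the three punctuation ranks mentioned above (space, then comma, then hyphen) and ignores everything else the posix variant changes; `POSIX_LIKE_RANK` and `toy_key` are illustrative names, not ICU or MediaWiki identifiers.

```python
# Toy model only: space < comma < hyphen < everything else.
# It does not reproduce the real en-US-u-va-posix collation.
POSIX_LIKE_RANK = {" ": 0, ",": 1, "-": 2}

def toy_key(sortkey, rank):
    """Rank each character using the given punctuation ordering."""
    return [rank.get(ch, 3 + ord(ch.lower())) for ch in sortkey]

keys = [
    "Ennis Hill, Jessica",
    "Ennis-Hill, Jessica",
    "Ennis, Andy",
    "Ennis, Bruce",
]

ordered = sorted(keys, key=lambda k: toy_key(k, POSIX_LIKE_RANK))
for k in ordered:
    print(k)
```

In this model "Ennis-Hill" moves below the plain "Ennis," entries, but "Ennis Hill" still sorts above them, which is why it only comes kind of close.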