Page MenuHomePhabricator

Fix comparison of strings containing codepoints outside of the BMP
Open, Needs TriagePublic

Description

Upstream bug is (probably) https://sourceware.org/bugzilla/show_bug.cgi?id=21302

On English Wiktionary, we've been using the Lua less-than operator to sort lists of words in templates powered by Module:columns. Today, it was reported that Gothic words do not sort correctly. This is because Module:columns uses the < operator (after processing the words), and the Lua comparison operators <, >, <=, >= do not work correctly for words containing codepoints outside of the Basic Multilingual Plane (for instance, characters in the Supplementary Multilingual Plane). They treat all codepoints outside of the BMP as equal. (The BMP seems to be compared correctly.)

The list of Gothic words was as follows:

local words = {
	"πŒ°π„π…πŒ°πŒ½πŒ³πŒΎπŒ°πŒ½",
	"πŒ°π†π…πŒ°πŒ½πŒ³πŒΎπŒ°πŒ½",
	"πŒ±πŒΉπ…πŒ°πŒ½πŒ³πŒΎπŒ°πŒ½",
	"πŒ²πŒ°π…πŒ°πŒ½πŒ³πŒΎπŒ°πŒ½",
	"πŒΉπŒ½π…πŒ°πŒ½πŒ³πŒΎπŒ°πŒ½",
	"πŒΏπƒπ…πŒ°πŒ½πŒ³πŒΎπŒ°πŒ½",
}
`

This is the correct, alphabetical, order, and it is the order that you will get if you sort the table using a function that compares the strings byte-by-byte or codepoint-by-codepoint. But calling table.sort(words) in Scribunto results in a different order. (table.sort by default uses a comparison function roughly equivalent to function (item1, item2) return item1 < item2 end.)

Now the characters in these words are in the Gothic block (U+10330-1034F), in the SMP. (They may not display correctly for everyone viewing this post.) It seems that, for words consisting of codepoints outside of the BMP, the < and > operators return false and <= and >= return true, for either ordering. The < operator is used by table.sort, and by the comparison function in Module:columns. This is, obviously, incorrect. The polarity of the result should change when the order of the operands changes.

I substantiated my suspicion that this bug affected characters outside of the BMP with roughly the following code. It constructs an array of fake words that start at consecutive codepoints and then compares each word to the next word and vice-versa. I printed the result on a module documentation page. (See a previous revision of Module:sandbox on English Wiktionary.) I also tested the other comparison operators (>, <=, >=) by making relevant modifications to the code.

local words = {}

local function make_words(words, start_cp, end_cp)
	local i = 0
	for cp = start_cp, end_cp do
		i = i + 1
		str = ''
		for cp = cp, cp + 5 do
			str = str .. mw.ustring.char(cp)
		end
		words[i] = str
	end
end
local function show()
	local output = {}
	local i = 0
	function output.add(...)
		i = i + 1
		output[i] = table.concat({...}, "\t") -- like print or mw.log
	end
	
	function show(word1, word2)
		output.add(word1, " < ", word2, ":", tostring(word1 < word2))
	end
		
	for i = 1, #words - 1 do
		local word1, word2 = words[i], words[i + 1]
		show(word1, word2)
		
		word1, word2 = word2, word1
		show(word1, word2)
	end
	
	return table.concat(output, "<br>")
end

I tested the SMP by generating a fake word list starting at codepoint U+10000 and ending at U+10010 (the first 16 codepoints of the Linear-B block) and then the BMP with the U+FE70-FE7F (Arabic Presentation Forms-B). The bug surfaced for Linear-B but not for the Arabic Presentation Forms-B.

It seems that the implementation of the comparison operators is different from vanilla Lua. Perhaps it parses the string into an array of 16-bit unsigned integers (or the equivalent thereof), which has the range 0 to 0xFFFF, and that any codepoints greater than 0xFFFF are changed to 0xFFFF, and these integers are then compared numerically in order. So, the comparison of strings that only contain codepoints in the SMP gives the same result as the comparison of 0xFFFF with 0xFFFF: as 0xFFFF < 0xFFFF is false, so are SMP_string1 < SMP_string2 and SMP_string2 < SMP_string1. Similarly with the other comparison operators, except for ==, of course. But I don't understand what is happening when strings containing a mixture of SMP and BMP characters are compared.

If I try the above code in the Lua interpreters on my computer (versions 5.3.4 and 5.1.5), the strings compare correctly. There, a simple byte-by-byte comparison is used. For the UTF-8 encoding, used by MediaWiki, byte-by-byte comparison gives the same result as codepoint-by-codepoint. If Scribunto Lua did the same, it would fix the bug. But I guess there was a reason why that solution wasn't used?


In summary, can string comparison be done codepoint-by-codepoint (or byte-by-byte, which gives the same result) for characters outside of the BMP? We can implement it in Lua, as discussed on the talk page of Module:columns, but that shouldn't be necessary. (And doing it in Lua is more memory- and time-intensive and may push some pages over the memory limit because of English Wiktionary's intensive use of Lua. But Module:columns did use a Lua-based comparison function before I changed it to use the < operator.)

As a side note, I'm curious how the comparison is implemented and why it behaves the way it does for characters outside of the BMP.

Event Timeline

String comparisons are done in Lua using the strcoll() function, which depends on the locale defined for the process. In MediaWiki we typically use either C.UTF-8 or a national variant such as en_US.UTF-8 (Wikimedia wikis use the former).

It appears that the glibc C.UTF-8 locale doesn't handle collation of code points above U+FFFF properly. At least in the case of the words provided here, it seems to consider them all equal. The upstream bug for this seems to be https://sourceware.org/bugzilla/show_bug.cgi?id=21302.

For future reference, here's a small C program that tests this bug.

#include <stdio.h>
#include <string.h>
#include <locale.h>

const char *foo[] = {
    "𐌰",
    "𐌱",
    "𐌲",
    "𐌹",
    "𐌿",
};

int main( void ) {
    int i, j;

    setlocale( LC_ALL, "" );
    printf( "LC_COLLATE = %s\n", setlocale( LC_COLLATE, NULL ) );

    for( i = 0; i < 5; i++ ) {
        for( j = i+1; j < 5; j++ ) {
            printf( "%s <=> %s => %d\n", foo[i], foo[j], strcoll( foo[i], foo[j] ) );
        }
    }

    return 0;
}
β€’ Vvjjkkii renamed this task from Fix comparison of strings containing codepoints in the Supplementary Multilingual Plane to j7daaaaaaa.Jul 1 2018, 1:14 AM
β€’ Vvjjkkii triaged this task as High priority.
β€’ Vvjjkkii updated the task description. (Show Details)
β€’ Vvjjkkii removed a subscriber: Aklapper.
Mainframe98 renamed this task from j7daaaaaaa to Fix comparison of strings containing codepoints in the Supplementary Multilingual Plane.Jul 1 2018, 7:47 AM
Mainframe98 raised the priority of this task from High to Needs Triage.
Mainframe98 updated the task description. (Show Details)
Mainframe98 added a subscriber: Aklapper.
Erutuon renamed this task from Fix comparison of strings containing codepoints in the Supplementary Multilingual Plane to Fix comparison of strings containing codepoints outside of the BMP.Jul 1 2018, 8:00 AM
Erutuon updated the task description. (Show Details)