Change Details

On English Wiktionary, we've been using the Lua less-than operator to sort lists of words in templates powered by [[https://en.wiktionary.org/wiki/Module:columns|Module:columns]]. Today, it was [[https://en.wiktionary.org/wiki/Module_talk:columns#Broken_sort_with_Gothic|reported]] that Gothic words do not sort correctly. This is because Module:columns uses the `<` operator (after processing the words), and the Lua comparison operators `<`, `>`, `<=`, `>=` do not work correctly for words containing codepoints in the Supplementary Multilingual Plane (SMP). They treat all SMP codepoints as equal. (Codepoints in the Basic Multilingual Plane, or BMP, seem to be compared correctly.)

The list of Gothic words was as follows:

```
local words = {
	"𐌰𐍄𐍅𐌰𐌽𐌳𐌾𐌰𐌽",
	"𐌰𐍆𐍅𐌰𐌽𐌳𐌾𐌰𐌽",
	"𐌱𐌹𐍅𐌰𐌽𐌳𐌾𐌰𐌽",
	"𐌲𐌰𐍅𐌰𐌽𐌳𐌾𐌰𐌽",
	"𐌹𐌽𐍅𐌰𐌽𐌳𐌾𐌰𐌽",
	"𐌿𐍃𐍅𐌰𐌽𐌳𐌾𐌰𐌽",
}
```

This is the correct alphabetical order, and it is the order you get if you sort the table using a function that compares the strings byte-by-byte or codepoint-by-codepoint. But calling `table.sort(words)` in Scribunto results in a different order. (`table.sort` by default uses a comparison function roughly equivalent to `function (item1, item2) return item1 < item2 end`.)

The characters in these words are in the Gothic block (U+10330-1034F), in the SMP. (They may not display correctly for everyone viewing this post.) It seems that, for words consisting of codepoints in the SMP, the `<` and `>` operators return `false` and `<=` and `>=` return `true`, for either ordering of the operands. The `<` operator is used by `table.sort`, and by the comparison function in Module:columns.

This is clearly incorrect: for two unequal strings, the result should flip when the order of the operands is swapped.

I substantiated my suspicion that this bug affects the SMP with roughly the following code. It constructs an array of fake words that start at consecutive codepoints and then compares each word to the next word and vice versa. I printed the result on a module documentation page.
(See a previous revision of [[https://en.wiktionary.org/w/index.php?diff=49391815|Module:sandbox]] on English Wiktionary.) I also tested the other comparison operators (`>`, `<=`, `>=`) by making the relevant modifications to the code.

```
local words = {}

-- Fill `words` with one fake six-character word per starting codepoint.
local function make_words(words, start_cp, end_cp)
	local i = 0
	for cp = start_cp, end_cp do
		i = i + 1
		local str = ''
		for char_cp = cp, cp + 5 do
			str = str .. mw.ustring.char(char_cp)
		end
		words[i] = str
	end
end

-- Compare each word with the next one, in both operand orders.
local function show()
	local output = {}
	local i = 0
	function output.add(...)
		i = i + 1
		output[i] = table.concat({...}, "\t") -- like print or mw.log
	end
	local function show_comparison(word1, word2)
		output.add(word1, " < ", word2, ":", tostring(word1 < word2))
	end
	for n = 1, #words - 1 do
		local word1, word2 = words[n], words[n + 1]
		show_comparison(word1, word2)
		show_comparison(word2, word1) -- same pair, operands swapped
	end
	return table.concat(output, "<br>")
end
```

I tested the SMP by generating a fake word list starting at codepoint U+10000 and ending at U+10010 (the first 16 codepoints of the Linear-B block) and then the BMP with U+FE70-FE7F (Arabic Presentation Forms-B). The bug surfaced for Linear-B but not for Arabic Presentation Forms-B.

It seems that the implementation of the comparison operators is different from vanilla Lua. Perhaps it converts the string into an array of 16-bit unsigned integers (or the equivalent thereof), which have the range 0 to 0xFFFF, so that any codepoint greater than 0xFFFF (which includes every codepoint in the SMP) is changed to 0xFFFF, and these integers are then compared numerically in order. If so, comparing strings that contain only SMP codepoints gives the same result as comparing 0xFFFF with 0xFFFF: as `0xFFFF < 0xFFFF` is false, so are `SMP_string1 < SMP_string2` and `SMP_string2 < SMP_string1`. The same reasoning applies to the other comparison operators, except for `==`, of course. But I don't understand what is happening when strings containing a mixture of SMP and BMP characters are compared.
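To make the 16-bit hypothesis concrete, here is a minimal sketch in plain Lua that models it: decode UTF-8 into codepoints, clamp each one to 0xFFFF, and compare numerically. The names `codepoints` and `clamped_less` are made up for illustration; this is only a model of my guess, not the actual Scribunto or Lua implementation.

```lua
-- Decode a UTF-8 string into an array of codepoints.
-- No validation: assumes the input is well-formed UTF-8.
local function codepoints(s)
	local cps, i = {}, 1
	while i <= #s do
		local b = string.byte(s, i)
		local len, cp
		if b < 0x80 then len, cp = 1, b
		elseif b < 0xE0 then len, cp = 2, b % 0x20
		elseif b < 0xF0 then len, cp = 3, b % 0x10
		else len, cp = 4, b % 0x08 end
		-- Each continuation byte contributes its low six bits.
		for j = i + 1, i + len - 1 do
			cp = cp * 64 + string.byte(s, j) % 64
		end
		cps[#cps + 1] = cp
		i = i + len
	end
	return cps
end

-- Hypothetical model: clamp every codepoint to 0xFFFF, then compare in order.
local function clamped_less(a, b)
	local cps_a, cps_b = codepoints(a), codepoints(b)
	for i = 1, math.min(#cps_a, #cps_b) do
		local x = math.min(cps_a[i], 0xFFFF)
		local y = math.min(cps_b[i], 0xFFFF)
		if x ~= y then return x < y end
	end
	return #cps_a < #cps_b
end

local gothic_a = "\240\144\140\176" -- U+10330
local gothic_b = "\240\144\140\177" -- U+10331
print(clamped_less(gothic_a, gothic_b), clamped_less(gothic_b, gothic_a))
```

Under this model, both orderings of two equal-length SMP-only strings come out as `false`, which matches what I observed, while BMP strings still compare normally. It does not tell us whether the real implementation works this way.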
If I run this test code in the Lua interpreters on my computer (versions 5.3.4 and 5.1.5), the strings compare correctly. There, a simple byte-by-byte comparison is used. For the UTF-8 encoding, used by MediaWiki, byte-by-byte comparison gives the same result as codepoint-by-codepoint comparison. If Scribunto Lua did the same, it would fix the bug. But I guess there was a reason why that solution wasn't used?

---

In summary, can string comparison be done codepoint-by-codepoint (or byte-by-byte, which gives the same result) for characters in the SMP? We can implement it in Lua, as discussed on the talk page of Module:columns, but that shouldn't be necessary. (Doing it in Lua is also more memory- and time-intensive and may push some pages over the memory limit because of English Wiktionary's intensive use of Lua. That said, Module:columns did use a Lua-based comparison function before I changed it to use the `<` operator.)

As a side note, I'm curious how the comparison is implemented and why it behaves the way it does for characters in the SMP.
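For reference, a Lua-side workaround of the kind mentioned above could look roughly like this. It is a minimal sketch, not the function Module:columns actually used; `bytewise_less` is a made-up name, and it compares raw bytes, which for well-formed UTF-8 gives codepoint order.

```lua
-- Hypothetical helper: compare two strings byte by byte, without using `<` on strings.
local function bytewise_less(a, b)
	local len_a, len_b = #a, #b
	for i = 1, math.min(len_a, len_b) do
		local byte_a, byte_b = string.byte(a, i), string.byte(b, i)
		if byte_a ~= byte_b then
			return byte_a < byte_b
		end
	end
	-- One string is a prefix of the other: the shorter one sorts first.
	return len_a < len_b
end

-- Two single-character Gothic "words", deliberately out of order.
local sample = {
	"\240\144\140\177", -- U+10331
	"\240\144\140\176", -- U+10330
}
table.sort(sample, bytewise_less)
```

After the sort, `sample[1]` is the U+10330 string. Passing the comparator explicitly to `table.sort` avoids the built-in string `<`, at the cost of doing the loop in Lua.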

//Upstream bug is (probably) https://sourceware.org/bugzilla/show_bug.cgi?id=21302 //