Page MenuHomePhabricator

mw.widgets.visibleByteLimit doesn't truncate correctly
Closed, ResolvedPublic

Description

  1. Open the VE save dialog (or another UI with an input that has a byte limit counter set to an odd number)
  2. Copy the Canadian flag emoji (🇨🇦). Note that this is really two Unicode code points, U+1F1E8 REGIONAL INDICATOR SYMBOL LETTER C and U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A. Both of these characters are 4 bytes (both in UTF-8 and UTF-16), so the whole flag is 8 bytes.
  3. Repeatedly paste it into the edit summary input until it's full

Expected behavior: the input contains 255/8 = 31 Canadian flags, and there are 255-31*8 = 7 bytes left on the counter
Actual behavior: the input contains 31 Canadian flags, then a "C" flag character, then a garbage character, and there is 1 byte left on the counter

Note that not only did the flag get truncated between the "C" and the "A" (which I could see being hard to avoid), but the "A" got truncated to a 2-byte garbage character. This garbage character is D83C, which is the first half of the UTF-16 representation of the flag "A" (D83C DDE6).

What it looks like for me in Firefox (with EmojiOne's godawfully ugly flags):

Screenshot from 2018-02-02 13-24-11.png (116×478 px, 5 KB)

And in Chrome (where I don't have flag emoji support):

Screenshot from 2018-02-02 13-22-56.png (124×482 px, 9 KB)

I don't know if VE uses the same bytelimit implementation as MW core, but I tested the byte limit introduced by https://gerrit.wikimedia.org/r/#/c/407713/1 and it behaves the same way.

Event Timeline

I know filling the summary field with flag emoji is a bit silly, but the truncation bug turning the flag 'A' into a garbage character probably affects all non-BMP codepoints

I can confirm that you don't have to have a complex grapheme cluster like a flag to break this: it also happens with single non-BMP codepoints, such as U+282E2 (𨋢).

The most severe problem is $.trimByteLength using slice( 0, -1 ) (in jquery.byteLimit.js) which will indiscriminately chop a surrogate pair in half, leaving really broken unicode text. This can be fixed using String#codePointAt or by manually checking for code units in the range 0xD800-0xDFFF.

However, as you point out, even after fixing this it will be possible to split a multi-codepoint grapheme cluster which will do weird things to Indian text, say.

Moreover, your flag example raises another issue: the truncation happens at an heuristically determined offset (essentially based on a diff). This may not be where you actually inserted text, and it may cause spurious grapheme clusters to form. Vivid example:

  1. Copy 🇨🇦🇿🇼 (C,A,Z,W regional indicator symbols, i.e. CA flag plus ZW flag) from this sentence and paste it into the edit summary input.
  2. Type 'q' until the box reaches maxLength.
  3. Copy 🇨🇳🇹🇷🇧🇼 (C,N,T,R,B,W regional indicator symbols, i.e. CN flag plus TR flag plus BW flag) from this sentence.
  4. Select the C,A,Z,W regional indicators, then paste.

Expected behaviour: Either the change is rejected outright, or the paste text is truncated (so the B,W regional indicator symbols disappear)
Actual behaviour: The diff sees N,T,R,B regional indicator symbols as the insertion, and truncates the R,B regional indicator symbols ... leaving 🇨🇳🇹🇼 (C,N,T,W regional indicator symbols, i.e. CN flag plus TW flag).

That seems problematic, especially as in real-life paste examples it could cause subtle but important changes of meaning far from your cursor.

That's amazing. What makes it even better is that while the input listens to key events, it doesn't appear to listen to the paste event, so if you paste using the right-click context menu, the truncation doesn't happen until you type into the input or blur it.

Video of this bug using Ctrl+C and Ctrl+V:

Video of this bug using right-click copy and paste:

The most severe problem is $.trimByteLength using slice( 0, -1 ) (in jquery.byteLimit.js) which will indiscriminately chop a surrogate pair in half, leaving really broken unicode text. This can be fixed using String#codePointAt or by manually checking for code units in the range 0xD800-0xDFFF.

Right, this is the real problem here.

Chopping off the "A" off "🇨🇦" when it is pasted into the input seems like a hard problem. Note that at least Chrome also behaves "incorrectly": pasting "🇨🇦" into an <input maxwidth=3> will leave just the "🇨". I think handling grapheme clusters is out of scope.

Moreover, your flag example raises another issue: the truncation happens at an heuristically determined offset (essentially based on a diff). This may not be where you actually inserted text, and it may cause spurious grapheme clusters to form.

Indeed, that is unfortunate, but it was a much easier way to implement this limit. It is much easier to wait for the browser to insert whatever text, than to handle the keyboard events "manually" and try to figure out what characters they would insert. I don't think it's feasible to do this differently.

That's amazing. What makes it even better is that while the input listens to key events, it doesn't appear to listen to the paste event, so if you paste using the right-click context menu, the truncation doesn't happen until you type into the input or blur it.

We definitely handle the 'paste' event. I just tried debugging this and it seems that the input's value is not updated until the event handler returns. I don't know if that is the expected behavior, but this deserves a new task.

Chopping off the "A" off "🇨🇦" when it is pasted into the input seems like a hard problem. Note that at least Chrome also behaves "incorrectly": pasting "🇨🇦" into an <input maxwidth=3> will leave just the "🇨". I think handling grapheme clusters is out of scope.

@Esanders says that the UnicodeJS library has methods to handle this, I did not test whether he's lying to me. Also that library is like 80k of code+data.

Yeah - it's wrapped up in ve.graphemeSafeSubstring. We currently bundle unicodeJS as one package, but it could be split up if that made sense.

Change 408326 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/core@master] jquery.byteLimit: Handle characters outside BMP (surrogate pairs) when trimming

https://gerrit.wikimedia.org/r/408326

This can be fixed using String#codePointAt or by manually checking for code units in the range 0xD800-0xDFFF.

Note that String#codePointAt is an ECMAScript 6 method, and not all browsers we support have it. So I just went with some regexps.

Screenshot from 2018-02-02 13-24-11.png (116×478 px, 5 KB)

If this is the summary, what was the edit? :)

With Bartosz's patch it now looks like this, which is an improvement:

image.png (59×82 px, 1 KB)

To avoid this completely we'd need the UnicodeJS's grapheme break.

A more realistic failure would be separating a letter form its combining accent.

If we broke up the UnicodeJS lib it would be more like 28K (9k zipped) for grapheme breaks:

 82K unicodejs.js
6.9K unicodejs.base.js
 29K unicodejs.characterclass.js
 21K unicodejs.graphemebreak.js
 26K unicodejs.wordbreak.js
----
 28K graphemebreak.and.base.js
8.8K graphemebreak.and.base.js.gz

Change 408326 merged by jenkins-bot:
[mediawiki/core@master] jquery.byteLimit: Handle characters outside BMP (surrogate pairs) when trimming

https://gerrit.wikimedia.org/r/408326

matmarex triaged this task as Medium priority.
matmarex removed projects: Patch-For-Review, OOUI.

That's amazing. What makes it even better is that while the input listens to key events, it doesn't appear to listen to the paste event, so if you paste using the right-click context menu, the truncation doesn't happen until you type into the input or blur it.

We definitely handle the 'paste' event. I just tried debugging this and it seems that the input's value is not updated until the event handler returns. I don't know if that is the expected behavior, but this deserves a new task.

Turns out it already has a task: T64319: jquery.byteLimit doesn't work when the content is pasted using the mouse.