At present, the VisualEditor treats UTF-16 code units as if they were synonymous with abstract characters. Here are two cases where this causes bugs:
- UTF-16 uses a surrogate pair to represent each Unicode character above U+FFFF. For instance, U+282E2 ('elevator' in Cantonese) is a single character represented in JavaScript as "\uD860\uDEE2". In a plain textarea this behaves like a single character from the user's point of view. In the VisualEditor, however, cursoring past it and backspacing over it each require two presses; and after cursoring once, any text typed will go into the middle of the surrogate pair, creating invalid UTF-16 (see The Unicode Standard, Version 6.2, Section 3.8, Surrogates).
- Combining accents can be used in sequences to build up abstract characters. For example, the JavaScript string "m\u0300" represents a single abstract character (m with a grave accent). In a plain textarea this behaves like a single character when cursoring, but like two characters when backspacing (so the first backspace just removes the accent). In the VisualEditor, however, cursoring requires two presses; and after cursoring once, any typed text will go between the letter and the accent, leaving an inappropriate dangling combining accent.
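Both failure modes are easy to reproduce in a JavaScript console. Here is a minimal sketch of the underlying code-unit behaviour (variable names are illustrative only):

```javascript
// Surrogate pair: U+282E2 occupies two UTF-16 code units
var elevator = '\uD860\uDEE2';
elevator.length;       // 2, though it is one abstract character
elevator.split( '' );  // [ '\uD860', '\uDEE2' ], two lone (invalid) surrogates

// Combining sequence: one abstract character built from two code points
var mGrave = 'm\u0300';
mGrave.length;         // 2
mGrave.split( '' );    // [ 'm', '\u0300' ], a dangling combining accent
```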
These kinds of issues occur because the DataModel uses Arrays with one element per UTF-16 code unit, say ['\uD860', '\uDEE2', ..., 'm', '\u0300']. My hunch is that this is slightly too low-level, and that it should instead use one element per abstract character, say ['\uD860\uDEE2', ..., 'm\u0300'].
A good start would be to abstract away calls to string.split( '' ) into a single function like this:
ve.splitCharacters = function ( value ) {
	// Don't split surrogate pairs: never split just before a low surrogate
	return value.split( /(?![\uDC00-\uDFFF])/ );
};
The rest of the codebase should call this function to perform splits, and should stop assuming that data[i] is a single code unit (i.e. that data[i].length === 1). Then we can refine splitCharacters as needed.
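For example, the function above already keeps surrogate pairs intact, and a later refinement might also keep combining diacritical marks attached. This is a sketch; the U+0300..U+036F range below is one hypothetical refinement, not existing VisualEditor code:

```javascript
var ve = {};

// Initial version: never split just before a low surrogate
ve.splitCharacters = function ( value ) {
	return value.split( /(?![\uDC00-\uDFFF])/ );
};
ve.splitCharacters( 'a\uD860\uDEE2b' ); // [ 'a', '\uD860\uDEE2', 'b' ]

// Possible refinement: also keep combining diacritical marks attached
ve.splitCharacters = function ( value ) {
	return value.split( /(?![\uDC00-\uDFFF\u0300-\u036F])/ );
};
ve.splitCharacters( 'm\u0300ot' ); // [ 'm\u0300', 'o', 't' ]
```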
Alternatively, since the overwhelming majority of characters will in fact be single code units, perhaps the DataModel structure could "encode" the exceptional multi-code-unit characters as objects, so that 'typeof data[i] === "string"' can still detect the simple cases.
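A hypothetical sketch of that encoding (toDataElements and the 'char' type are invented names, and the split regex is the same illustrative one as above):

```javascript
// Hypothetical: wrap the rare multi-code-unit characters in objects
function toDataElements( text ) {
	return text.split( /(?![\uDC00-\uDFFF\u0300-\u036F])/ ).map( function ( ch ) {
		// Common case: a single code unit stays a plain string
		return ch.length === 1 ? ch : { type: 'char', text: ch };
	} );
}

var data = toDataElements( 'a\uD860\uDEE2' );
typeof data[ 0 ]; // 'string', the simple case still works
typeof data[ 1 ]; // 'object', the surrogate pair is wrapped
```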
This sounds like a big change for a small issue, but I think it would avoid problems in the future. With a character representation, you can safely perform useful operations like splicing and truncating without having to check the surrounding context very carefully every time.
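To illustrate the truncation hazard: with a code-unit array, a naive slice can end in a lone surrogate, while a character array can only ever cut between whole characters. This sketch reuses the illustrative surrogate-pair regex from above:

```javascript
// Truncating a code-unit array can cut a surrogate pair in half
var units = 'ab\uD860\uDEE2'.split( '' );
units.slice( 0, 3 ).join( '' );  // 'ab\uD860', ends in a lone surrogate: invalid UTF-16

// Truncating a character array keeps every element a whole character
var chars = 'ab\uD860\uDEE2'.split( /(?![\uDC00-\uDFFF])/ );
chars.slice( 0, 3 ).join( '' );  // 'ab\uD860\uDEE2', still valid UTF-16
```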
Version: unspecified
Severity: normal