Page MenuHomePhabricator

ve.dm.Document#findText corrupts offsets if toLowerCase changes string length
Closed, ResolvedPublicBUG REPORT

Description

Steps to reproduce:

  1. Open VE standalone and edit just the single word İstanbul (where the first character is U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE).
  2. In the console, try ve.dm.Document#findText with the string 'stan' as a case-insensitive query:
> ve.init.target.surface.model.documentModel.findText( 'stan' )[ 0 ].start
2
  1. Note that the start offset is 2.
  2. Now instead try it with a Set as the query, instead of a string:
> ve.init.target.surface.model.documentModel.findText( new Set( [ 'stan' ] ) )[ 0 ].start

Expected response
The start offset should still be 2.

Actual response
The start offset is 3.

Cause

This is because offsets are being calculated on the lower-cased string, but lower casing has changed the string length (the single codepoint U+0130 gets lower cased to two Unicode codepoints U+0069 U+0307).

Note that a related problem actually affects string queries too. The code for case-insensitive string queries performs a comparison character-by-character, and so this will fail whenever lower casing the search term changes the string length. For example, findText( 'İstanbul'.toLowerCase() ) won't match İstanbul.

Proposed solution

Looking at this with @medelius , we noted it's not trivially easy to avoid this issue and still allow case insensitive search:

  • An offset on the case-folded string can differ from the equivalent offset on the original string (which is what we actually need).
  • We can't readily map back from the case-folded offset to the original offset.
  • We can't test character-by-character for a case-insensitive match, because case folding can be sensitive to the context of surrounding characters (it's not a pure function on individual code units).
  • It's hard to isolate substrings to test against a set (with .has) or a string (with Intl.Collator#compare), because we can't readily know how long that substring should be.

We did find a way to build a correct offset map:

function mapRange( s, range ) {
        const map = new Map();

        for ( let i = 0; i < s.length; i++ ) {
                const substr = s.slice( 0, i );
                const j = substr.toLowerCase().length;
                // Write the entry for this offset. If there's a prior entry, overwrite it
                map.set( j, i );
        }
        return { start: map.get( range.start ), end: map.get( range.end ) };
}

Naively coded, it involves calling toLowerCase n times for a string of length n. That might be acceptable performance because paragraphs shouldn't be enormous. Or we could make this more efficient, e.g. choosing not to calculate mappings for offsets we're sure we don't need.

A real implementation should likely use .toLocaleLowerCase( this.lang ). But note you would still encounter cross-language case folding, e.g. when the string 'İstanbul' appears on a wiki whose article language is English.

Event Timeline

Change #1197732 had a related patch set uploaded (by Divec; author: Divec):

[VisualEditor/VisualEditor@master] Document failure case where ve.dm.Document#findText corrupts offsets

https://gerrit.wikimedia.org/r/1197732

Change #1200430 had a related patch set uploaded (by Medelius; author: Medelius):

[VisualEditor/VisualEditor@master] ve.dm.Document: use offset map in findText

https://gerrit.wikimedia.org/r/1200430

Change #1197732 merged by jenkins-bot:

[VisualEditor/VisualEditor@master] Document failure case where ve.dm.Document#findText corrupts offsets

https://gerrit.wikimedia.org/r/1197732

Change #1200430 merged by jenkins-bot:

[VisualEditor/VisualEditor@master] ve.dm.Document: use offset map in findText

https://gerrit.wikimedia.org/r/1200430

Change #1202328 had a related patch set uploaded (by Medelius; author: Medelius):

[VisualEditor/VisualEditor@master] WIP: Create MVP of findText using intl.collator

https://gerrit.wikimedia.org/r/1202328

Change #1202330 had a related patch set uploaded (by Medelius; author: Medelius):

[mediawiki/extensions/VisualEditor@master] WIP: Create MVP of findText using intl.collator

https://gerrit.wikimedia.org/r/1202330

Change #1204385 had a related patch set uploaded (by Divec; author: Divec):

[VisualEditor/VisualEditor@master] Document failure case in ve.dm.Document#findText

https://gerrit.wikimedia.org/r/1204385

Change #1204953 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (24aa41f95)

https://gerrit.wikimedia.org/r/1204953

Change #1204953 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (c4976c6c9)

https://gerrit.wikimedia.org/r/1204953

Change #1204385 merged by jenkins-bot:

[VisualEditor/VisualEditor@master] Document failure case in ve.dm.Document#findText

https://gerrit.wikimedia.org/r/1204385

EAkinloose edited projects, added Verified; removed Editing QA.
EAkinloose subscribed.

Tested on en.beta(not reproducible) and enwiki(reproducible) with characters like 'İ', 'Ğ', 'Ş'.

Change #1212557 had a related patch set uploaded (by Esanders; author: Esanders):

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (4b7ec1130)

https://gerrit.wikimedia.org/r/1212557

Change #1212557 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (4b7ec1130)

https://gerrit.wikimedia.org/r/1212557