Steps to reproduce:
- Open VE standalone and edit just the single word İstanbul (where the first character is U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE).
- In the console, try ve.dm.Document#findText with the string 'stan' as a case-insensitive query:
> ve.init.target.surface.model.documentModel.findText( 'stan' )[ 0 ].start 2
- Note that the start offset is 2.
- Now instead try it with a Set as the query, instead of a string:
> ve.init.target.surface.model.documentModel.findText( new Set( [ 'stan' ] ) )[ 0 ].start
Expected response
The start offset should still be 2.
Actual response
The start offset is 3.
Cause
This is because offsets are being calculated on the lower-cased string, but lower casing has changed the string length (the single codepoint U+0130 gets lower cased to two Unicode codepoints U+0069 U+0307).
Note that a related problem actually affects string queries too. The code for case-insensitive string queries performs a comparison character-by-character, and so this will fail whenever lower casing the search term changes the string length. For example, findText( 'İstanbul'.toLowerCase() ) won't match İstanbul.
Proposed solution
Looking at this with @medelius , we noted it's not trivially easy to avoid this issue and still allow case insensitive search:
- An offset on the case-folded string can differ from the equivalent offset on the original string (which is what we actually need).
- We can't readily map back from the case-folded offset to the original offset.
- We can't test character-by-character for a case-insensitive match, because case folding can be sensitive to the context of surrounding characters (it's not a pure function on individual code units).
- It's hard to isolate substrings to test against a set (with .has) or a string (with Intl.Collator#compare), because we can't readily know how long that substring should be.
We did find a way to build a correct offset map:
function mapRange( s, range ) {
const map = new Map();
for ( let i = 0; i < s.length; i++ ) {
const substr = s.slice( 0, i );
const j = substr.toLowerCase().length;
// Write the entry for this offset. If there's a prior entry, overwrite it
map.set( j, i );
}
return { start: map.get( range.start ), end: map.get( range.end ) };
}Naively coded, it involves calling toLowerCase n times for a string of length n. That might be acceptable performance because paragraphs shouldn't be enormous. Or we could make this more efficient, e.g. choosing not to calculate mappings for offsets we're sure we don't need.
A real implementation should likely use .toLocaleLowerCase( this.lang ). But note you would still encounter cross-language case folding, e.g. when the string 'İstanbul' appears on a wiki whose article language is English.