Page MenuHomePhabricator

Improve TextMatchEditCheck performance
Closed, ResolvedPublic

Description

From @Esanders :

With the small set of checks currently setup on enwiki, this is already causing typing lagging on long articles, e.g.
https://en.wikipedia.org/w/index.php?title=Moon&veaction=edit&ecenable=2
Moon - Wikipedia (325 kB)

It takes about 500ms for all the checks to run on my machine, which is too high for an onDocumentChange check

I think the forEachRunOfContent is slowing us down a little bit, but it isn't the bottleneck: 700ms locally, vs 500ms if I just search the entire document. If we just avoid \n in the search result that would have the same effect (not allowing multi-paragraph matches) (edited)
But the main problem is that doing 10 million Set.has calls is slow

Possible approaches:

  • Go back to building a regex - might be hard to support locale case insensitivity - will need to do some escaping (RegExp.escape polyfill)
  • Use Aho-Corasick: https://github.com/BrunoRB/ahocorasick Seems to build character based tables, so could possible be modified for case/local case insensitivity

Testing on the same example as above (local 450k doc with 100 words):

  • Regex: ~20ms
  • Aho-Corasick: ~50ms

Event Timeline

Hmm I think building normalizedQuery is the slow part of ve.dm.Document#findText. On a 90k article, I'm seeing a 5000-element Set search with caseSensitiveString: true being twice as fast as caseSensitiveString: false.

It's also wasteful that we recalculate the matches for every run of text, each time the user changes a single run of text.

Change #1212666 had a related patch set uploaded (by Divec; author: Divec):

[VisualEditor/VisualEditor@master] WIP: Separate out text logic from ve.dm.Document#findText

https://gerrit.wikimedia.org/r/1212666

Hmm I think building normalizedQuery is the slow part of ve.dm.Document#findText. On a 90k article, I'm seeing a 5000-element Set search with caseSensitiveString: true being twice as fast as caseSensitiveString: false.

It's also wasteful that we recalculate the matches for every run of text, each time the user changes a single run of text.

I think these optimisations aren't even necessary if we switch to a concatenated regex, as that runs in 20ms on a massive article.

The above refactor introduces SetTextFinder, which avoids having to rebuild normalizedQuery on every single call to ve.dm.Document#findText. It also introduces MemoizedTextFinder, which avoids rechecking the same runs of text on each document update, making it faster even than the RegExp method. Here are some rough timings on a 90k article:

> function time( callback ) {
>         start = performance.now();
>         callback();
>         return performance.now() - start;
> }     
> doc = ve.init.target.surface.model.documentModel;
> manyWords = new Set( Object.keys( Array( 1e6 ).fill() ) );
Set(1000000) {'0', '1', '2', '3', '4', …}
> time( () => doc.findText( manyWords ) );
1059.0999999940395
> time( () => doc.findText( manyWords, { caseSensitiveString: true } ) );
45.19999999552965
> s = new ve.dm.SetTextFinder( manyWords );
> time( () => doc.findText( s ) );
38.899999998509884
> m = new ve.dm.MemoizedTextFinder( s );
> time( () => doc.findText( m ) );
46.5
> time( () => doc.findText( m ) );
5.899999998509884

I think these optimisations aren't even necessary if we switch to a concatenated regex, as that runs in 20ms on a massive article.

This is true for performance. However RegExp /i doesn't work properly with text in certain languages such as Turkish.

Change #1212666 merged by jenkins-bot:

[VisualEditor/VisualEditor@master] Separate out text logic from ve.dm.Document#findText

https://gerrit.wikimedia.org/r/1212666

Change #1214163 had a related patch set uploaded (by Medelius; author: Medelius):

[mediawiki/extensions/VisualEditor@master] TextMatch: update to use memoization for ve.dm.Document#findText

https://gerrit.wikimedia.org/r/1214163

I think these optimisations aren't even necessary if we switch to a concatenated regex, as that runs in 20ms on a massive article.

This is true for performance. However RegExp /i doesn't work properly with text in certain languages such as Turkish.

I think that could be fixed by just expanding the regex to include locale-lowercase, instead of using /i, e.g. /Match/i => /[Mm][Aa][Tt][Cc][Hh]/

Change #1212591 had a related patch set uploaded (by Esanders; author: Esanders):

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (415d0126b)

https://gerrit.wikimedia.org/r/1212591

Change #1212591 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (415d0126b)

https://gerrit.wikimedia.org/r/1212591

EAkinloose edited projects, added Verified; removed Editing QA.

Change #1214163 abandoned by DLynch:

[mediawiki/extensions/VisualEditor@master] TextMatch: update to use memoization for ve.dm.Document#findText

Reason:

Merged as part of I6c36cf45143d4f2eda1a27c597707ce007fabe5e

https://gerrit.wikimedia.org/r/1214163