
Create a heuristic for finding a plain text paragraph in VE
Closed, ResolvedPublic

Description

If we go with the approach of having the backlog of the Improve Tone Suggested Edit be generated by the ML team, then those backlog items will contain the plain text of the paragraph as produced by mwparserfromhell and mwedittypes. Given that plain text, we need to find the ve.Range corresponding to the paragraph in the Visual Editor session.

This task is about coming up with an initial heuristic for doing that.

Example:

Plain text of the first paragraph of the article about Wisconsin as used in the model:

Wisconsin ( ) is a state in the Upper Midwest and Great Lakes regions of the United States. It borders Minnesota to the west, Iowa to the southwest, Illinois to the south, Lake Michigan to the east, Michigan to the northeast, and Lake Superior to the north. With a population of about 6 million and an area of about 65,500 square miles, Wisconsin is the 20th-largest state by population and the 23rd-largest by area. It has 72 counties. The state's most populous city is Milwaukee. Its capital and second-most populous city is Madison; other urban areas include Green Bay and the Fox Cities.

and as available in VE:

Wisconsin (/wɪˈskɒnsɪn/ ⓘ wih-SKON-sin)[12] is a state in the Upper Midwest and Great Lakes regions of the United States. It borders Minnesota to the west, Iowa to the southwest, Illinois to the south, Lake Michigan to the east, Michigan to the northeast, and Lake Superior to the north. With a population of about 6 million[9] and an area of about 65,500 square miles, Wisconsin is the 20th-largest state by population and the 23rd-largest by area. It has 72 counties. The state's most populous city is Milwaukee. Its capital and second-most populous city is Madison; other urban areas include Green Bay and the Fox Cities.

Event Timeline

One straightforward approach would be to adopt the https://github.com/aceakash/string-similarity library (66 LOC total) into GrowthExperiments, calculate the similarity between the model's plain text and every paragraph in the article, and pick the paragraph with the highest score. We would expect a single paragraph to win by a wide margin. We still need to check the performance on large pages, though.
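To make the idea concrete, here is a minimal sketch of bigram-based similarity scoring (a Sørensen-Dice coefficient, the kind of measure such a library computes). It is written in Python for illustration only and is not a port of the library; the names `dice_similarity` and `find_best_paragraph` are hypothetical, and the real implementation would live in JavaScript inside GrowthExperiments.

```python
from collections import Counter


def bigrams(text):
    """Count the character bigrams of text, ignoring all whitespace."""
    compact = "".join(text.split())
    return Counter(compact[i:i + 2] for i in range(len(compact) - 1))


def dice_similarity(a, b):
    """Sørensen-Dice coefficient over character bigrams, between 0.0 and 1.0."""
    first, second = bigrams(a), bigrams(b)
    total = sum(first.values()) + sum(second.values())
    if total == 0:
        # Neither string has a bigram (fewer than two characters each).
        return 1.0 if "".join(a.split()) == "".join(b.split()) else 0.0
    overlap = sum((first & second).values())  # multiset intersection
    return 2.0 * overlap / total


def find_best_paragraph(target, paragraphs):
    """Return (index, best_score, runner_up_score) for the closest paragraph."""
    if not paragraphs:
        raise ValueError("no paragraphs to compare against")
    scores = [dice_similarity(target, p) for p in paragraphs]
    best = max(range(len(scores)), key=scores.__getitem__)
    others = scores[:best] + scores[best + 1:]
    runner_up = max(others) if others else 0.0
    return best, scores[best], runner_up
```

Returning the runner-up score alongside the best one lets the caller check the "wide margin" assumption and skip the suggestion when two paragraphs score almost equally.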

Thanks @cscott. Documenting what you pointed out in our engineering discussion: mwparserfromhtml is built on Parsoid, so its plaintext rendering should in principle be closer to something we can replicate in VE based on the Parsoid HTML input we load.

Then a "deep link" to a paragraph could consist of an article id and an xpath (or equivalent) to the element in the Parsoid HTML representation. Template changes could affect the node numbering, so it might make sense to disregard template-generated nodes (which are clearly marked as such within Parsoid HTML).

("We found a tone violation in the 7th paragraph that Parsoid outputs for this page, after stripping out everything that Parsoid flagged as being template-related" is a thing VisualEditor can easily match up.)
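A rough sketch of that numbering scheme follows, written in Python for illustration only. It assumes that template-generated nodes can be recognised by `typeof="mw:Transclusion"` or by an `about` id starting with `#mwt` (which in this sketch also drops other generated output such as references), and that the input is well-formed, as Parsoid HTML is. `ParagraphIndexer` is a hypothetical name, not an existing API.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def is_generated(attrs):
    """Assumption: these attributes mark template-generated (or similar) nodes."""
    typeof = attrs.get("typeof") or ""
    about = attrs.get("about") or ""
    return "mw:Transclusion" in typeof or about.startswith("#mwt")


class ParagraphIndexer(HTMLParser):
    """Collect the text of each <p> that is not generated, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []        # plain text per counted paragraph
        self._generated_stack = []  # one flag per open element
        self._chunks = None         # text pieces of the paragraph being read

    def _inside_generated(self):
        return any(self._generated_stack)

    def handle_starttag(self, tag, attrs):
        generated = is_generated(dict(attrs))
        if tag == "p" and not generated and not self._inside_generated():
            self._chunks = []
        if tag not in VOID_TAGS:
            self._generated_stack.append(generated)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._generated_stack:
            self._generated_stack.pop()
        if tag == "p" and self._chunks is not None and not self._inside_generated():
            self.paragraphs.append("".join(self._chunks))
            self._chunks = None

    def handle_data(self, data):
        # Keep text only while inside a counted paragraph and outside generated nodes.
        if self._chunks is not None and not self._inside_generated():
            self._chunks.append(data)
```

After feeding a page's HTML to the indexer, the 7th paragraph in this numbering would be `indexer.paragraphs[6]`. A paragraph that merely contains an inline template is still counted; only the template's own output is dropped from its text.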

Michael claimed this task.

We've converged on a two-pronged approach. For now, we use string similarity to identify the right paragraph and determine its position in the list of paragraphs in the article. Both the position and the plain text will then be passed as arguments to a method call inside a new VE session, so that the Edit Check can be shown there. The work to create that method call in an ongoing VE session is tracked in T400335.
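One way the receiving side could combine the two arguments is sketched below, in Python for illustration only. The fallback behaviour, the threshold of 0.6, and the names `ToneSuggestion` and `resolve_paragraph` are assumptions, not the agreed design, and `difflib.SequenceMatcher` stands in for whichever similarity measure is actually used.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class ToneSuggestion:
    paragraph_index: int  # position found earlier via string similarity
    plain_text: str       # plain text of the paragraph as seen by the model


def similarity(a, b):
    # autojunk=False avoids difflib's heuristics for sequences of 200+ items.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def resolve_paragraph(suggestion, paragraphs, threshold=0.6):
    """Return the index of the matching paragraph, or None if nothing matches."""
    hinted = suggestion.paragraph_index
    # Trust the position only if the text at that position still matches.
    if 0 <= hinted < len(paragraphs):
        if similarity(suggestion.plain_text, paragraphs[hinted]) >= threshold:
            return hinted
    # Otherwise search all paragraphs, e.g. after the article was edited.
    best_index, best_score = None, threshold
    for index, text in enumerate(paragraphs):
        score = similarity(suggestion.plain_text, text)
        if score >= best_score:
            best_index, best_score = index, score
    return best_index
```

Checking the hinted position first keeps the common case cheap, while the plain text guards against the position having gone stale.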

In the future, the ML team's process for generating these suggestions might be extended to include the position as well, once they have access to HTML versions of articles via the data platform.