Add a link: Parsing challenges (input and output)
Closed, Resolved (Public)

Description

This task is for discussing challenges with:

  1. parsing articles (either as wikitext or as Parsoid generated HTML) to use as source material for generating links
  2. outputting content from the tool that can then be used with VisualEditor to provide a guide to users for where these links should be added

Event Timeline

kostajh added a subscriber: Tgr.

From @Tgr:

I've been thinking about the Parsing email. AIUI we have four options:

  • Make the recommendation service return Parsoid documents (or some annotation that can be applied to Parsoid documents). As far as I understand what Djellel is doing, this would not be hard. Processing HTML is probably a bit slower and more resource-intensive than processing wikitext because the tree is deeper; OTOH it's a very standard task, so there will be native implementations, while mwparserfromhell is pure Python. There are no HTML dumps yet, but there likely will be by the time we'd actually need them. However, there are many HTML representations of the same article wikitext (due to user options, desktop vs. mobile, templates changing over time, etc.), so the recommendation would have to be applied to a slightly different document from the one it was generated for. If we can commit to always generating recommendations just-in-time, this would not be an issue; otherwise it seems problematic.
  • Make the recommendation service return some annotation for the wikitext that can be converted into annotation for the specific Parsoid document. I can't imagine any other way of doing this than offsets, and the Parsing team is not keen on committing to maintain offset mapping as a public API.
  • Make the recommendation service return something that applies to wikitext, and interject in the rendering process somehow to replace the real wikitext with a modified one that will result in the extra markup we want after it's transformed by the parser.
    • a) Do this for the old parser (i.e. render the extra markup into the normal page view, not VE). We'd have to skip the editing step entirely, provide some review dialog thingy, and apply the changes back to wikitext via the web API. How? ContentGetParserOutput is too early, ContentAlterParserOutput is too late. We could add a new hook, I suppose. We'd also have to split the parser cache. The resulting markup would show up in normal page views rather than VE, but maybe that's not bad.
    • b) Do this for Parsoid (i.e. render the extra markup into the VE DOM, not normal page views). I guess an integration point for doing that could be built into ApiVisualEditor, and that could also interact with the suggested-edits session flag somehow? We'd also have to get rid of the extra markup during save. I suppose we can use ParserPreSaveTransformComplete for that, but it seems messy.

From @cscott:

Briefly: if you could engineer your tool to use the Parsoid HTML as a basis, instead of mwparserfromhell, I think your job would be much easier. You can use Element and other Python libraries for interacting with the HTML, and RESTBase or the new dumps service for getting the HTML.

Then you can use an XPath query to identify the particular segment in the Parsoid HTML you are interested in, and I believe that could be translated to VE pretty directly, since it is working with the same DOM tree structure. Or, as you suggest, you could feed VE the Parsoid HTML with a class added to the link text. I think you could do that by stuffing the new HTML into the 'stash' mechanism as a "not-yet-saved proposed edit".
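To make the XPath idea concrete, here is a minimal sketch using only Python's standard library. It assumes the fragment is well-formed XML, which real Parsoid output may not always be (a proper HTML parser would then be needed), and the class name `suggested-link` is purely hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for a Parsoid section (not real output).
FRAGMENT = (
    "<section>"
    "<p>The dimensions of a Turkish kanun are typically 95 to 100 cm.</p>"
    "<p>The instrument features levers called <i>mandals</i>.</p>"
    "</section>"
)


def add_class(element, class_name):
    """Append class_name to the element's class attribute if not already present."""
    classes = element.get("class", "").split()
    if class_name not in classes:
        classes.append(class_name)
    element.set("class", " ".join(classes))


root = ET.fromstring(FRAGMENT)

# ElementTree supports a limited XPath subset, including positional predicates.
target = root.find("./p[2]/i")

# Mark the selected node so a front end could highlight it (hypothetical class).
add_class(target, "suggested-link")

annotated_html = ET.tostring(root, encoding="unicode")
```

The same path expression could in principle be evaluated against the DOM that VE loads, since both come from the same Parsoid tree, but that is an assumption this sketch does not verify.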

Parsoid does maintain a mapping of wikitext offsets to particular bits of DOM tree structure, but we don't export that as a public API and I think we would be very reluctant to do so. We did discuss something similar as a way to support the "new wikitext editor", where Parsoid would export the character boundaries of certain wikitext constructs to guide the wikitext source editor, but we did not end up going that way.

I don't think the Text Fragments API will do what you want. VE constructs its content-editable canvas on the fly as it loads content from Parsoid, so it won't be in the browser's DOM tree at the time the Text Fragments API would want to scroll to it. Visual Editor uses https://developer.mozilla.org/en-US/docs/Web/API/Range internally, and the DiscussionTools work does as well; that would be the way to communicate regions with VE.

From IRC discussion yesterday with @DED, it sounds like a challenge is to ignore already linked items; the algorithm in the tool would like to have plain sentences (without links) to operate on. As C. Scott mentioned it should be possible to use a library to operate on plain text from the sentence, while also retaining the ability to manipulate the sentence to add a new mw:WikiLink on the relevant word(s).
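A rough sketch of the "plain sentences without links" idea, using Python's built-in `html.parser`. Skipping every `<a>` and `<sup>` element is a simplifying assumption made here for illustration (it drops existing links and reference markers alike); it is not how mwaddlink is known to work.

```python
from html.parser import HTMLParser


class UnlinkedTextExtractor(HTMLParser):
    """Collect text that is not inside an <a> or <sup> element."""

    SKIPPED_TAGS = {"a", "sup"}  # assumption: existing links and ref markers

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that sits outside every skipped element.
        if self.skip_depth == 0:
            self.chunks.append(data)


def unlinked_text(html):
    """Return the text of an HTML fragment with linked and ref text removed."""
    parser = UnlinkedTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

A real implementation would also need to remember which DOM node each text chunk came from, so that a new mw:WikiLink can be inserted at the right place afterwards.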

Example Parsoid HTML for @MGerlach and @DED to reference

<section data-mw-section-id="0" data-parsoid="{}">

<p data-parsoid='{"dsr":[0,656,0,0]}'>
The dimensions of a Turkish kanun are typically 95 to 100<span typeof="mw:Entity" data-parsoid='{"src":"&amp;nbsp;","srcContent":" ","dsr":[57,63,null,null]}'> </span>cm (37–39") in length, 38 to 40<span typeof="mw:Entity" data-parsoid='{"src":"&amp;nbsp;","srcContent":" ","dsr":[96,102,null,null]}'> </span>cm (15–16") in width, and 4 to 6<span typeof="mw:Entity" data-parsoid='{"src":"&amp;nbsp;","srcContent":" ","dsr":[136,142,null,null]}'> </span>cm (1.5–2.3") in height.<sup about="#mwt4" class="mw-ref" id="cite_ref-1" rel="dc:references" typeof="mw:Extension/ref" data-parsoid='{"dsr":[168,418,5,6]}' data-mw='{"name":"ref","attrs":{},"body":{"id":"mw-reference-text-cite_note-1"}}'><a href="./Kanonaki#cite_note-1" style="counter-reset: mw-Ref 1;" data-parsoid="{}"><span class="mw-reflink-text" data-parsoid="{}">[1]</span></a></sup> In contrast, an Arabic qanun measures a bit larger as mentioned.<sup about="#mwt6" class="mw-ref" id="cite_ref-2" rel="dc:references" typeof="mw:Extension/ref" data-parsoid='{"dsr":[483,656,5,6]}' data-mw='{"name":"ref","attrs":{},"body":{"id":"mw-reference-text-cite_note-2"}}'><a href="./Kanonaki#cite_note-2" style="counter-reset: mw-Ref 2;" data-parsoid="{}"><span class="mw-reflink-text" data-parsoid="{}">[2]</span></a></sup>
</p>

<p data-parsoid='{"dsr":[658,991,0,0]}'>
Qanun is played on the lap while sitting or squatting, or sometimes on trestle support, by plucking the strings with two <a rel="mw:WikiLink" href="./Tortoise" title="Tortoise" data-parsoid='{"stx":"simple","a":{"href":"./Tortoise"},"sa":{"href":"tortoise"},"dsr":[779,791,2,2]}' class="new">tortoise</a>-shell picks (one for each hand) or with fingernails, and has a standard range of three and a half <a rel="mw:WikiLink" href="./Octave" title="Octave" data-parsoid='{"stx":"simple","a":{"href":"./Octave"},"sa":{"href":"octave"},"dsr":[890,901,2,3],"tail":"s"}' class="new">octaves</a> from A2 to E6 that can be extended down to F2 and up to G6 in the case of Arabic designs.
</p>

<p data-parsoid='{"dsr":[993,1606,0,0]}'>
The instrument also features special metallic levers or latches under each course called <i data-parsoid='{"dsr":[1082,1093,2,2]}'>mandals</i>. These small levers, which can be raised or lowered quickly by the performer while the instrument is being played, serve to slightly change the pitch of a particular course by altering effective string lengths.<sup about="#mwt9" class="mw-ref" id="cite_ref-Kassabian2013_3-0" rel="dc:references" typeof="mw:Extension/ref" data-parsoid='{"dsr":[1304,1606,26,6]}' data-mw='{"name":"ref","attrs":{"name":"Kassabian2013"},"body":{"id":"mw-reference-text-cite_note-Kassabian2013-3"}}'><a href="./Kanonaki#cite_note-Kassabian2013-3" style="counter-reset: mw-Ref 3;" data-parsoid="{}"><span class="mw-reflink-text" data-parsoid="{}">[3]</span></a></sup>
</p>

<div class="mw-references-wrap" typeof="mw:Extension/references" about="#mwt10" data-parsoid='{"dsr":[1606,1606,0,0]}' data-mw='{"name":"references","attrs":{},"autoGenerated":true}'>
<ol class="mw-references references" data-parsoid="{}"><li about="#cite_note-1" id="cite_note-1" data-parsoid="{}"><a href="./Kanonaki#cite_ref-1" rel="mw:referencedBy" data-parsoid="{}"><span class="mw-linkback-text" data-parsoid="{}"></span></a> <span id="mw-reference-text-cite_note-1" class="mw-reference-text" data-parsoid="{}"><a rel="mw:WikiLink" href="./Πρότυπο:Cite_web" title="Πρότυπο:Cite web" about="#mwt3" typeof="mw:Transclusion" data-parsoid='{"stx":"simple","a":{"href":"./Πρότυπο:Cite_web"},"sa":{"href":":Πρότυπο:Cite web"},"dsr":[173,411,null,null],"pi":[[{"k":"last1","named":true,"spc":["","",""," "]},{"k":"first1","named":true,"spc":["","",""," "]},{"k":"last2","named":true,"spc":["","",""," "]},{"k":"first2","named":true,"spc":["","",""," "]},{"k":"date","named":true,"spc":["",""," ",""]},{"k":"title","named":true,"spc":["","",""," "]},{"k":"publisher","named":true,"spc":["","",""," "]},{"k":"url","named":true,"spc":["","",""," "]},{"k":"accessdate","named":true}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"cite web ","href":"./Πρότυπο:Cite_web"},"params":{"last1":{"wt":"Aydoğdu"},"first1":{"wt":"Gültekin"},"last2":{"wt":"Aydoğdu"},"first2":{"wt":"Tahir"},"date":{"wt":"2018-02-11"},"title":{"wt":"&apos;Kanun&apos; hakkında"},"publisher":{"wt":"turksanatmuzigi.org: Salih Bora"},"url":{"wt":"http://www.turksanatmuzigi.org/bilgiler/calgilarimiz/kanun"},"accessdate":{"wt":""}},"i":0}}]}' class="new">Πρότυπο:Cite web</a>
</span></li><li about="#cite_note-2" id="cite_note-2" data-parsoid="{}"><a href="./Kanonaki#cite_ref-2" rel="mw:referencedBy" data-parsoid="{}"><span class="mw-linkback-text" data-parsoid="{}"></span></a> <span id="mw-reference-text-cite_note-2" class="mw-reference-text" data-parsoid="{}"><a rel="mw:WikiLink" href="./Πρότυπο:Cite_web" title="Πρότυπο:Cite web" about="#mwt5" typeof="mw:Transclusion" data-parsoid='{"stx":"simple","a":{"href":"./Πρότυπο:Cite_web"},"sa":{"href":":Πρότυπο:Cite web"},"dsr":[488,650,null,null],"pi":[[{"k":"title","named":true,"spc":["","",""," "]},{"k":"website","named":true,"spc":["","",""," "]},{"k":"url","named":true,"spc":["","",""," "]},{"k":"access-date","named":true}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"cite web ","href":"./Πρότυπο:Cite_web"},"params":{"title":{"wt":"About The Qanun"},"website":{"wt":"www.middleeasterndance.net"},"url":{"wt":"http://www.middleeasterndance.net/Music/Kanun/AboutKanun.html"},"access-date":{"wt":"2016-06-26"}},"i":0}}]}' class="new">Πρότυπο:Cite web</a></span></li><li about="#cite_note-Kassabian2013-3" id="cite_note-Kassabian2013-3" data-parsoid="{}"><a href="./Kanonaki#cite_ref-Kassabian2013_3-0" rel="mw:referencedBy" data-parsoid="{}"><span class="mw-linkback-text" data-parsoid="{}"></span></a> <span id="mw-reference-text-cite_note-Kassabian2013-3" class="mw-reference-text" data-parsoid="{}"><a rel="mw:WikiLink" href="./Πρότυπο:Cite_book" title="Πρότυπο:Cite book" about="#mwt8" typeof="mw:Transclusion" data-parsoid='{"stx":"simple","a":{"href":"./Πρότυπο:Cite_book"},"sa":{"href":":Πρότυπο:Cite book"},"dsr":[1330,1600,null,null],"pi":[[{"k":"last","named":true,"spc":["","",""," "]},{"k":"first","named":true,"spc":["","",""," "]},{"k":"year","named":true,"spc":["","",""," "]},{"k":"title","named":true,"spc":["","",""," "]},{"k":"publisher","named":true,"spc":["","",""," "]},{"k":"isbn","named":true,"spc":["","",""," "]},{"k":"pages","named":true,"spc":["","",""," 
"]},{"k":"url","named":true}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"cite book ","href":"./Πρότυπο:Cite_book"},"params":{"last":{"wt":"Kassabian"},"first":{"wt":"Anahid"},"year":{"wt":"2013"},"title":{"wt":"Ubiquitous Listening: Affect, Attention, and Distributed Subjectivity"},"publisher":{"wt":"University of California Press"},"isbn":{"wt":"978-0-520-27515-7"},"pages":{"wt":"79–"},"url":{"wt":"https://books.google.com/books?id=rOfvO7fLuNAC&amp;pg=PA79"}},"i":0}}]}' class="new">Πρότυπο:Cite book</a></span></li></ol>
</div>

</section>

And the same content in Wikitext

The dimensions of a Turkish kanun are typically 95 to 100&nbsp;cm (37–39") in length, 38 to 40&nbsp;cm (15–16") in width, and 4 to 6&nbsp;cm (1.5–2.3") in height.<ref>{{cite web |last1=Aydoğdu |first1=Gültekin |last2=Aydoğdu |first2=Tahir |date= 2018-02-11|title='Kanun' hakkında |publisher=turksanatmuzigi.org: Salih Bora |url=http://www.turksanatmuzigi.org/bilgiler/calgilarimiz/kanun |accessdate=}}
</ref> In contrast, an Arabic qanun measures a bit larger as mentioned.<ref>{{cite web |title=About The Qanun |website=www.middleeasterndance.net |url=http://www.middleeasterndance.net/Music/Kanun/AboutKanun.html |access-date=2016-06-26}}</ref>

Qanun is played on the lap while sitting or squatting, or sometimes on trestle support, by plucking the strings with two [[tortoise]]-shell picks (one for each hand) or with fingernails, and has a standard range of three and a half [[octave]]s from A2 to E6 that can be extended down to F2 and up to G6 in the case of Arabic designs.

The instrument also features special metallic levers or latches under each course called ''mandals''. These small levers, which can be raised or lowered quickly by the performer while the instrument is being played, serve to slightly change the pitch of a particular course by altering effective string lengths.<ref name="Kassabian2013">{{cite book |last=Kassabian |first=Anahid |year=2013 |title=Ubiquitous Listening: Affect, Attention, and Distributed Subjectivity |publisher=University of California Press |isbn=978-0-520-27515-7 |pages=79– |url=https://books.google.com/books?id=rOfvO7fLuNAC&pg=PA79}}</ref>

kostajh added subscribers: Esanders, ssastry.

@cscott @ssastry we're considering an implementation where, instead of attempting to compute the offsets where we want text to be linked, we just know a list of phrases that should be linked, as well as how many times that phrase has occurred in the document before the link is supposed to be added.

For example, given an article with text like this:

In contrast, an Arabic qanun measures a bit larger as mentioned.

The qanun is played on the lap while sitting or squatting, or sometimes on trestle support.

The link recommender might propose that the following phrases should be linked: "Arabic qanun", "qanun", "trestle support", with some metadata like:

[
  {
    phrase: "Arabic qanun",
    occurrence: 0,
    linkTarget: "Arabic_Qanun"
  },
  {
    phrase: "qanun",
    occurrence: 1,
    linkTarget: "Qanun"
  }
]

Our frontend tool would use VisualEditor's content editable area (see ve.ui.FindAndReplaceDialog for a rough idea of what we'd be doing) to find e.g. "qanun", and since we know that there is 1 occurrence before the phrase we are going to create a link suggestion for, it would propose "qanun" in the second paragraph and not "qanun" in "Arabic qanun" in the first paragraph.
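The occurrence-counting lookup can be sketched in a few lines. This is Python operating on a plain string rather than VisualEditor's surface, the function name is hypothetical, and counting overlapping matches is an assumption the thread does not settle.

```python
def find_occurrence(text, phrase, occurrences_before):
    """Return the offset of the match that has `occurrences_before` earlier
    matches of `phrase` in `text`, or None if there are too few matches."""
    position = -1
    for _ in range(occurrences_before + 1):
        # Resume one character past the previous hit (overlaps are counted).
        position = text.find(phrase, position + 1)
        if position == -1:
            return None
    return position


ARTICLE = (
    "In contrast, an Arabic qanun measures a bit larger as mentioned.\n\n"
    "The qanun is played on the lap while sitting or squatting, "
    "or sometimes on trestle support."
)

# One earlier "qanun" (inside "Arabic qanun") is skipped, so this offset
# points into the second paragraph.
second_qanun = find_occurrence(ARTICLE, "qanun", 1)
```

Note that the search is case-sensitive, so "Qanun" at the start of a sentence would not be counted; whether the recommender counts it is something both sides would have to agree on.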

@Tgr and @Esanders pointed out that we need to be careful about what kind of text conversion happens between wikitext and Parsoid HTML (entity decoding, Unicode normalization, more?). If the mwaddlink tool generates phrase recommendations from wikitext, do you think we will run into problems locating those phrases in VisualEditor because of that text conversion? Should we run the phrases that mwaddlink provides through the transform-wikitext-to-html API to work around any potential issues?
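To illustrate the two conversions named above, here is a small sketch that decodes HTML entities and applies Unicode NFC normalization to a wikitext phrase before comparing it with rendered text. It is only an approximation for discussion: it does not reproduce whatever Parsoid actually does, and the choice of NFC is an assumption.

```python
import html
import unicodedata


def normalize_phrase(wikitext_phrase):
    """Decode entities such as &nbsp; and compose characters (NFC) so the
    phrase is closer to the text a browser would expose."""
    decoded = html.unescape(wikitext_phrase)
    return unicodedata.normalize("NFC", decoded)


# "&nbsp;" in wikitext becomes a single U+00A0 character in the rendered text,
# so a naive search for the raw wikitext phrase would fail.
rendered = "typically 95 to 100\u00a0cm in length"
raw_phrase = "100&nbsp;cm"
found_raw = raw_phrase in rendered
found_normalized = normalize_phrase(raw_phrase) in rendered
```

This covers only the simplest cases; templates, magic words, and language-variant conversion can change the rendered text in ways that no string normalization will predict.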

kostajh claimed this task.
kostajh added a project: Add-Link.

What we've decided to do as our first pass (see meeting notes from October 14 and 15 here):

  1. mwaddlink will continue to use wikitext dumps for training its model and populating lookup dictionaries
  2. the query service will use wikitext as its input
  3. the query service will return a list of recommendations; each recommendation contains the phrase to link, the link target, the number of occurrences on the page before the one we want to link, the insertion order, context (5 characters before and 5 characters after the phrase), and a Parsoid HTML conversion of the phrase to link, to be used as a backup in case the wikitext can't be found in VE's content-editable surface.
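As a sketch of how a consumer might use such a recommendation, the snippet below models the fields from item 3 and locates a phrase by occurrence count, checking the surrounding context and falling back to a context-only search if the count has drifted. All field and function names are hypothetical, the fallback strategy is an assumption, and the Parsoid HTML backup is carried but not used here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recommendation:
    phrase: str              # phrase to link
    link_target: str         # page title the link should point to
    occurrences_before: int  # matches of the phrase preceding the one to link
    insertion_order: int
    context_before: str      # 5 characters before the phrase
    context_after: str       # 5 characters after the phrase
    phrase_html: str         # Parsoid HTML conversion of the phrase (backup)


def _offsets(text, phrase):
    """Return the start offsets of every match of phrase in text."""
    offsets = []
    start = text.find(phrase)
    while start != -1:
        offsets.append(start)
        start = text.find(phrase, start + 1)
    return offsets


def _context_matches(text, offset, rec):
    """Check that the text around the match agrees with the stored context."""
    end = offset + len(rec.phrase)
    return (text[:offset].endswith(rec.context_before)
            and text[end:].startswith(rec.context_after))


def locate(text, rec) -> Optional[int]:
    """Find the offset to link: trust the occurrence count when the context
    agrees, otherwise take the first match whose context agrees."""
    offsets = _offsets(text, rec.phrase)
    if rec.occurrences_before < len(offsets):
        expected = offsets[rec.occurrences_before]
        if _context_matches(text, expected, rec):
            return expected
    for offset in offsets:
        if _context_matches(text, offset, rec):
            return offset
    return None
```

With only five characters of context on each side, two matches can share the same context, so the context check narrows the candidates but does not guarantee a unique hit.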

If there are any concerns with this approach or you have examples where this approach would break, please let us know.