Page MenuHomePhabricator

html2wt for links: Ignore data-parsoid and rel types more aggressively and generate the expected canonical forms
Open, MediumPublic

Description

Preamble: This updated description is a consolidation of various bug reports (the original description of this bug that @Catrope filed), T71207: Parsoid: Interwiki links are halfway converted to external links, and completely broken and T102556: VE should indicate when it wants external links "WikiLink-ified"

The problem with Parsoid's html2wt for links (which is already complex because wikitext and its overloaded link forms and expectations around roundtripping, dirty diffs, expected canonical form output for HtTML) is best illustrated with this html2wt session:

[subbu@earth:~/work/wmf/parsoid] cat /tmp/html
<h2>New content (i.e no data-parsoid), rel-attribute is irrelevant, link serializes to appropriate canonical syntax</h2>
<a href="https://en.wikipedia.org/wiki/Hospet">Hospet</a>
<a rel="mw:WikiLink" href="https://en.wikipedia.org/wiki/Hospet">Hospet</a>
<a rel="mw:ExtLink" href="https://en.wikipedia.org/wiki/Hospet">Hospet</a>
<a href="https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)">on Mediawiki</a>
<a rel="mw:WikiLink/Interwiki" href="https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)">on Mediawiki</a>
<a rel="mw:WikiLink" href="https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)">on Mediawiki</a>
<a rel="mw:ExtLink" href="https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)">on Mediawiki</a>

<h2>Orig unedited content, rel & data-parsoid produced by wt2html, link serialized to original (not always canonical) syntax</h2>
<a rel="mw:WikiLink/Interwiki" href="https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)" data-parsoid='{"stx":"piped","a":{"href":"https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)"},"sa":{"href":"mw:User talk:Whatamidoing (WMF)"},"isIW":true}'>on Mediawiki</a>
<a rel="mw:WikiLink" href="./Hospet" data-parsoid='{"stx":"simple","a":{"href":"./Hospet"},"sa":{"href":"Hospet"}}'>Hospet</a>
<a rel="mw:ExtLink" href="https://en.wikipedia.org/wiki/Hospet" data-parsoid='{}'>Hospet</a>

<h2>Orig unedited content, data-parsoid edited (not possible with production clients), rel-type absent, link serialized to canonical syntax (accidentally)</h2>
<a rel="mw:WikiLink" href="./Hospet" data-parsoid='{"stx":"simple"}'>Hospet</a>

<h2>Orig unedited content, data-parsoid edited (not possible with production clients), rel-type absent, link serialized to non-canonical extlink but valid syntax</h2>
<a href="https://en.wikipedia.org/wiki/Hospet" data-parsoid='{}'>Hospet</a>
<a href="https://en.wikipedia.org/wiki/Hospet" data-parsoid='{"stx":"simple"}'>Hospet</a>
<a href="https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)" data-parsoid='{}'>on Mediawiki</a>

<h2>Orig unedited content, data-parsoid edited (not possible with production clients), rel-value irrelevant, link serialized to non-canonical extlink but valid syntax</h2>
<a rel="mw:ExtLink" href="https://en.wikipedia.org/wiki/Hospet" data-parsoid='{"stx":"simple"}'>Hospet</a>
<a rel="mw:WikiLink" href="https://en.wikipedia.org/wiki/Hospet" data-parsoid='{}'>Hospet</a>
<a rel="mw:WikiLink" href="https://en.wikipedia.org/wiki/Hospet" data-parsoid='{"stx":"simple"}'>Hospet</a>
<a rel="mw:WikiLink/Interwiki" href="https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)" data-parsoid='{"stx":"piped"}'>on Mediawiki</a>
<a rel="mw:ExtLink" href="https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)" data-parsoid='{"stx":"piped"}'>on Mediawiki</a>
<a rel="mw:WikiLink" href="https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)" data-parsoid='{"stx":"piped"}'>on Mediawiki</a>
<a rel="mw:WikiLink/Interwiki" href="https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)" data-parsoid='{}'>on Mediawiki</a>

<h2>Orig unedited content, data-parsoid edited (not possible with production clients), rel-attribute is irrelevant, link serialized to invalid syntax (because of relative links or bad "a"/"sa" data-parsoid values)</h2>
<a href="./Hospet" data-parsoid='{"stx":"simple"}'>Hospet</a>
<a rel="mw:ExtLink" href="./Hospet" data-parsoid='{"stx":"simple"}'>Hospet</a>
<a href="./Hospet" data-parsoid='{"stx":"simple","a":{"href":"./Hospet"},"sa":{"href":"Hospet"}}'>Hospet</a>
<a rel="mw:ExtLink" href="./Hospet" data-parsoid='{"stx":"simple","a":{"href":"./Hospet"},"sa":{"href":"Hospet"}}'>Hospet</a>
<a rel="mw:WikiLink/Interwiki" href="https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)" data-parsoid='{"stx":"piped","a":{"href":"https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF)"},"sa":{"href":"mw:User talk:Whatamidoing (WMF)"}}'>on Mediawiki</a>

Here is the generated output

subbu@earth:~/work/wmf/parsoid] php bin/parse.php --html2wt < /tmp/html
== New content (i.e no data-parsoid), rel-attribute is irrelevant, link serializes to appropriate canonical syntax ==
[[Hospet]]
[[Hospet]]
[[Hospet]]
[[mw:User talk:Whatamidoing (WMF)|on Mediawiki]]
[[mw:User talk:Whatamidoing (WMF)|on Mediawiki]]
[[mw:User talk:Whatamidoing (WMF)|on Mediawiki]]
[[mw:User talk:Whatamidoing (WMF)|on Mediawiki]]

== Orig unedited content, rel & data-parsoid produced by wt2html, link serialized to original (not always canonical) syntax ==
[[mw:User talk:Whatamidoing (WMF)|on Mediawiki]]
[[Hospet]]
[https://en.wikipedia.org/wiki/Hospet Hospet]

== Orig unedited content, data-parsoid edited (not possible with production clients), rel-type absent, link serialized to canonical syntax (accidentally) ==
[[Hospet]]

== Orig unedited content, data-parsoid edited (not possible with production clients), rel-type absent, link serialized to non-canonical extlink but valid syntax ==
[https://en.wikipedia.org/wiki/Hospet Hospet]
[https://en.wikipedia.org/wiki/Hospet Hospet]
[https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF) on Mediawiki]

== Orig unedited content, data-parsoid edited (not possible with production clients), rel-value irrelevant, link serialized to non-canonical extlink but valid syntax ==
[https://en.wikipedia.org/wiki/Hospet Hospet]
[https://en.wikipedia.org/wiki/Hospet Hospet]
[https://en.wikipedia.org/wiki/Hospet Hospet]
[https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF) on Mediawiki]
[https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF) on Mediawiki]
[https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF) on Mediawiki]
[https://www.mediawiki.org/wiki/User%20talk:Whatamidoing%20(WMF) on Mediawiki]

== Orig unedited content, data-parsoid edited (not possible with production clients), rel-attribute is irrelevant, link serialized to invalid syntax (because of relative links or bad "a"/"sa" data-parsoid values) ==
[./Hospet Hospet]
[./Hospet Hospet]
[Hospet Hospet]
[Hospet Hospet]

The good thing is that except for edge case scenarios, the broken behavior illustrated in the last segment should not be possible with current versions of Parsoid and editing clients since data-parsoid is never exposed to clients directly. However, Flow with its old stored HTML is an exception to this rule and it probably has content from way back when.

As for as Flow is concerned, one solution that Flow could adopt would be to strip data-parsoid and rel info entirely from links and Parsoid does the right thing just by analyzing the href.

We can fix this in Parsoid itself (and one early attempt via https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/292078 had to be reverted because it over-aggressively used interwiki forms. A solution for that is captured in T102556#3841768 and the next Parsoid attempt at this should use that interwiki forwarding property. But, this does mean that Parsoid will no longer preserve extlink forms of wikilinks seen in wikitext (except via selser). We resolved in T102556 that is an acceptable compromise.

While at it, Parsoid should also stop generating the "a"/"sa" data-parsoid properties for links altogether since that is not needed anymore, and at worse, when data-parsoid is corrupted / manipulated in bad ways, Parsoid uses that info without validating that information.

Event Timeline

Let us talk next week in person, but yes, Parsoid won't be able to guarantee backward compatibility (wrt serialization) of HTML versions indefinitely. But, we should think of ways to upgrade the stored content, especially when we make breaking / incompatible changes. html -> wt -> html is one way of doing that, but there might be cheaper solutions than that via the html2html endpoint.

Wikitext as a storage format is also reasonable but I don't understand the implications for Flow (given that you don't use selser right now).

But, worth considering this in conjunction with the Services team which will have version upgrade concerns to tackle as well (in the long run).

Let us talk next week in person, but yes, Parsoid won't be able to guarantee backward compatibility (wrt serialization) of HTML versions indefinitely. But, we should think of ways to upgrade the stored content, especially when we make breaking / incompatible changes. html -> wt -> html is one way of doing that, but there might be cheaper solutions than that via the html2html endpoint.

html2html would only work if the html->wt step understands the old format. That doesn't seem to be the case right now. Running my example through html2html would break it.

Wikitext as a storage format is also reasonable but I don't understand the implications for Flow (given that you don't use selser right now).

We'd probably want to use selser if we used wikitext as a storage format (although we could possibly get away with not using it, because the content tends to be small and simple, and often just a single paragraph). We don't currently use selser because it doesn't make sense to do so when the storage format is HTML AIUI.

Let us talk next week in person, but yes, Parsoid won't be able to guarantee backward compatibility (wrt serialization) of HTML versions indefinitely. But, we should think of ways to upgrade the stored content, especially when we make breaking / incompatible changes. html -> wt -> html is one way of doing that, but there might be cheaper solutions than that via the html2html endpoint.

html2html would only work if the html->wt step understands the old format. That doesn't seem to be the case right now. Running my example through html2html would break it.

Right, it won't work for old content that you currently have that might break, but I was talking about handling this in the future. And the html2html upgrade might be able to do the upgrade without going through serialization depending on the versions involved (or return a http error response if it cannot do the upgrade) in which case you have go the html -> wt and wt -> html route.

Separately, for this specific case, @cscott thinks this is a bug we have in Parsoid that should be fixed which will fix this for old content. But, we haven't really investigated yet to tell if that is the case or if this is indeed a problem with old content that newer content doesn't have.

@Catrope, in case this lets you "hack around the problem" while we fix this issue. We have progressively relaxed constraints on how a link should look like, but looks like some things still trip it up. Looks like adding https: should do the trick in your case.

[subbu@earth mocha] cat /tmp/html
<h2>Good</h2>

<a rel="mw:ExtLink" href="https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938" data-parsoid='{"stx":"piped","a":{"href":"https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938"},"sa":{"href":"Commons:Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938"},"isIW":true}'>+rel +orig-dp +https</a>

<a rel="mw:ExtLink" href="https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938" data-parsoid='{"a":{"href":"https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938"},"sa":{"href":"Commons:Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938"},"isIW":true}'>+rel +modified-dp(no-stx), +https</a>

<a rel="mw:ExtLink" href="https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938" data-parsoid='{"isIW":true}'>+rel +modified-dp(only isIW) +https</a>

<a rel="mw:ExtLink" href="https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938">+rel -dp +https</a>

<a href="https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938">-rel -dp +https</a>

<h2>Bad</h2>

<a rel="mw:ExtLink" href="//commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938" data-parsoid='{"stx":"piped","a":{"href":"//commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938"},"sa":{"href":"Commons:Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938"},"isIW":true}'>+rel +orig-dp -https</a>

<a href="//commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938">-rel -dp -https</a>

<a rel="mw:ExtLink" href="https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938" data-parsoid='{"a":{"href":"https://commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938"},"sa":{"href":"Commons:Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938"}}'>+rel +modified-dp(no-stx, no-isIW), +https</a>
[subbu@earth mocha] parse.js --html2wt < /tmp/html
== Good ==

[[Commons:Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938|+rel +orig-dp +https]]

[[Commons:Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938|+rel +modified-dp(no-stx), +https]]

[[c:Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938|+rel +modified-dp(only isIW) +https]]

[[c:Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938|+rel -dp +https]]

[[c:Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938|-rel -dp +https]]

== Bad ==

[Commons:Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938 +rel +orig-dp -https]

[//commons.wikimedia.org/wiki/Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938 -rel -dp -https]

[Commons:Commons:Village_pump/Copyright/Archive/2012/12#Advertisement_from_1938 +rel +modified-dp(no-stx, no-isIW), +https]
ssastry renamed this task from Buggy html2wt serialization of old HTML to html2wt: Further relax link handler requirements around expecations of input HTML for a link.Nov 6 2017, 7:19 PM
ssastry moved this task from Needs Triage to Link syntax (links & media) on the Parsoid board.
ssastry renamed this task from html2wt: Further relax link handler requirements around expecations of input HTML for a link to html2wt for links: Ignore data-parsoid and rel types more aggressively and generate the expected canonical forms.Jun 26 2020, 11:23 PM
ssastry updated the task description. (Show Details)